AWS: New cloud instances with Trainium2 chips for more AI performance
AWS promises significantly shorter training and inference times for AI models with its new Trainium2 hardware generation.
- Cornelius May
Machine learning played a prominent role at Amazon Web Services' (AWS) annual re:Invent conference, including when it comes to new hardware. The focus was on the EC2 UltraServers and EC2 Trn2 instances powered by Trainium2 chips, which are now available. According to AWS, the new Trn2 instances offer 20.8 petaflops of computing power per instance and an up to 40 percent better price-performance ratio than the GPU-based EC2 P5 instances.
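Provisioning works like for any other EC2 instance type. A minimal sketch in Python with boto3 follows; the instance type name "trn2.48xlarge" and the AMI ID are assumptions for illustration, not details taken from the announcement.

import boto3

# Minimal sketch: requesting a single Trn2 instance via the EC2 API.
# Instance type and AMI ID are placeholders/assumptions; consult the AWS
# documentation for the exact identifiers and a Neuron-enabled AMI.
ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder Deep Learning AMI
    InstanceType="trn2.48xlarge",     # assumed Trn2 instance type name
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])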
A Trn2 UltraServer consists of four Trn2 instances connected to one another via a NeuronLink interconnect. This architecture is designed to scale computing power up to 83.2 petaflops in order to reduce training and inference times for the world's largest AI models. Models with up to one trillion parameters can thus be processed with improved latency.
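The headline number follows directly from the per-instance figure: four Trn2 instances at 20.8 petaflops each give the quoted 83.2 petaflops per UltraServer. A quick sanity check in Python, assuming ideal scaling with no interconnect overhead:

# Sanity check of the UltraServer compute figure (assumes ideal scaling;
# real workloads will lose some efficiency to the NeuronLink interconnect).
PETAFLOPS_PER_TRN2_INSTANCE = 20.8
INSTANCES_PER_ULTRASERVER = 4

ultraserver_petaflops = PETAFLOPS_PER_TRN2_INSTANCE * INSTANCES_PER_ULTRASERVER
print(ultraserver_petaflops)  # 83.2, matching the figure AWS quotes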
"Project Rainier" for AI clusters
AWS also announced "Project Rainier", which combines hundreds of Trainium2 UltraServers into an EC2 UltraCluster, enabling an increase in cluster size compared to existing solutions. These UltraClusters are used by organizations such as Anthropic to train AI models; Anthropic, for example, uses them to optimize Claude models for Amazon Bedrock on Trainium2. This infrastructure is intended to enable customers to efficiently train models with trillions of parameters and use them in real time.
AWS emphasized that simply increasing cluster size is not enough to improve performance. Instead, the new architecture of the Trainium2 UltraServers improves data distribution and resource allocation, which reduces overall training time without running into traditional network limits.
New instances with Nvidia Blackwell and outlook
In addition to the Trainium2 solutions, AWS presented the EC2 P6 instances, which are based on Nvidia's next GPU generation, Blackwell. Compared to the current generation, AWS promises up to 2.5 times higher performance and optimizations specifically for compute-intensive generative AI applications. AWS positions the P6 instances primarily for applications that require fast response times and high scalability.
AWS has also already announced the upcoming Trainium3 chip as the successor to Trainium2. It will be manufactured using a 3-nanometer process and is said to be more energy-efficient and four times more powerful than its predecessor, which should allow customers to iterate on models faster and use them in real time. Trainium3 is expected to become available in later versions of the UltraServer.
(nie)