AMD details Instinct MI355X accelerator and cuts MI325X memory

With 9.2 petaflops of FP6 compute, AMD's Instinct MI355X is set to compete with Nvidia's Blackwell in the second half of 2025. The MI325X "loses" 32 GByte of memory.

AMD's Instinct MI325X as a module for data centers.

(Image: AMD)

At its in-house Advancing AI event in San Francisco, California, AMD once again put its AI strategy and the corresponding products in the spotlight. While hopes are high for the MI355X in 2025, the current generation in the form of the MI325X is getting a smaller memory upgrade than previously announced. AMD is not idle on the software side either: it claims to have doubled AI training performance between versions 6.0 and 6.2 of its in-house ROCm stack, and to have accelerated inferencing by up to a factor of 2.8. This was achieved through kernel improvements, better parallelization, and changes to how work is distributed across the various computing accelerators. In the rapidly developing AI environment, such gains through software optimization are not unusual – which, on the other hand, is also not a sign of poor (older) software.
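Version-over-version claims like these are typically verified with simple throughput measurements. Below is a minimal sketch of such a check, assuming a PyTorch build with ROCm support (ROCm builds of PyTorch expose AMD accelerators through the torch.cuda namespace); the matrix size, data type, and iteration count are arbitrary illustration values, not AMD's benchmark setup:

```python
import time
import torch

# ROCm builds of PyTorch report AMD GPUs via the CUDA-compatible API.
assert torch.cuda.is_available(), "no ROCm/CUDA device visible"
device = torch.device("cuda")

# Crude FP16 matmul throughput check: run it before and after a
# ROCm/PyTorch upgrade to gauge software-side gains.
n = 8192
a = torch.randn(n, n, dtype=torch.float16, device=device)
b = torch.randn(n, n, dtype=torch.float16, device=device)

torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(10):
    _ = a @ b
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Multiplying two n x n matrices costs about 2 * n**3 floating-point operations.
tflops = 10 * 2 * n**3 / elapsed / 1e12
print(f"~{tflops:.0f} TFlops FP16 (measured)")
```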

AMD is sticking with its CDNA architecture for the time being. The first chiplets built with 3-nanometer process technology from chip contract manufacturer TSMC are set to arrive with the CDNA4 generation; the first accelerator based on this architecture will be the Instinct MI355X, which AMD plans to launch in the second half of 2025.

In addition to the current HBM3e maximum of 288 GByte across eight memory stacks, it should also deliver significantly more computing power. AMD left this point somewhat in the dark when it first mentioned the chip at Computex, speaking only of a performance factor of 35 over the current MI300X – a figure that leans on the more economical FP4 data format supported by the MI355X and on enormously large AI models. Now, however, AMD has also released the actual computing throughput in teraflops (trillions of computing steps per second): at FP16 precision it is said to reach around 2300 TFlops, 77 percent more than the MI300X/MI325X. FP8 throughput doubles accordingly to 4600 TFlops, which puts it only just behind Nvidia's Blackwell B200. So far, though, AMD has not translated its on-paper lead 1:1 into higher performance in AI training or inferencing – an area where CDNA4 must also improve efficiency. Speaking of which: specific TDP figures are still missing, but they are said to follow the industry trend. In other words, it will probably be well over one kilowatt.
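The published figures are internally consistent: working backwards from the 77 percent uplift, the MI300X lands at about 2300 / 1.77 ≈ 1300 dense FP16 TFlops, and each halving of precision doubles throughput. A back-of-the-envelope check, not an AMD calculation:

```latex
\begin{aligned}
\text{FP16:}\quad & 1300\ \text{TFlops} \times 1.77 \approx 2300\ \text{TFlops} \\
\text{FP8:}\quad  & 2 \times 2300\ \text{TFlops} = 4600\ \text{TFlops} \\
\text{FP4:}\quad  & 2 \times 4600\ \text{TFlops} = 9200\ \text{TFlops}
\end{aligned}
```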

A Universal Baseboard (UBB) is the basis for most AI servers and holds eight accelerators in the standardized Open Accelerator Module (OAM) form factor.

(Image: AMD)

Like Blackwell, CDNA4 also supports the more economical floating-point formats FP4 and FP6 as block data types. If the arithmetic units are designed accordingly, they can execute twice as many operations per clock step with FP4 as with FP8 – mathematically 9200 TFlops. Nvidia has demonstrated the same with Blackwell, reaching almost 10,000 TFlops. AMD goes one step further, however, and, unlike Nvidia's Blackwell, designs the MI355X to run FP6 at twice the FP8 throughput as well. At 9200 versus roughly 5000 TFlops, that is significantly more. FP6 and FP4 bring further advantages: according to AMD, thanks to the lower memory footprint, AI models with up to 4200 billion parameters fit into the 2.3 TByte of memory on a fully populated Universal Baseboard (UBB) with eight accelerators. With the MI325X it is only 1800 billion.
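The capacity claims follow from simple arithmetic. A minimal sketch, assuming model weights dominate memory use and are stored at 4 bits per parameter on the MI355X and 8 bits on the MI325X – plausible readings of AMD's figures, not an officially confirmed breakdown:

```python
def max_params_billion(hbm_gbyte: float, bits_per_param: float) -> float:
    """Model size (in billions of parameters) that fits into the given HBM capacity."""
    return hbm_gbyte * 1e9 / (bits_per_param / 8) / 1e9

# MI355X UBB: 8 x 288 GByte = 2304 GByte, weights in FP4 (4 bit)
print(max_params_billion(8 * 288, 4))  # ~4608 -> AMD's 4200 billion fits with headroom
# MI325X UBB: 8 x 256 GByte = 2048 GByte, weights in FP8 (8 bit)
print(max_params_billion(8 * 256, 8))  # ~2048 -> AMD's 1800 billion fits likewise
```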

At 8 TByte/s, the memory transfer rate is at Blackwell level and around a third higher than MI300X. AMD has not yet provided any information on the important GPU-to-GPU connection; this is 896 GByte/s for the MI325X.

On the technical side, there is rather bad news for AMD's MI325X Instinct accelerator, which was first shown at Computex 2024 and can be configured up to a TDP of 1000 watts. AMD has decided not to exercise the option of "up to" 288 GByte of HBM3e stack memory and will launch only the next smaller configuration with 256 GByte. The reasons given on request do not sound very plausible – software optimizations, for example, are now supposed to make do with slightly less memory. Perhaps the necessary 36-GByte stacks have simply become too expensive for the target price, or are sold out due to high demand.

Eight-packs on the universal baseboard should be available immediately after the start of production in the fourth quarter of 2024. Individual MI325X accelerators from partners, however, will not arrive until the first quarter of 2025. AMD did not comment specifically on prices, but intends to offer a better total cost of ownership (TCO), i.e. combined acquisition and operating costs, while remaining economically attractive. As AMD expects to be faster than Nvidia's H200, a lower TCO can be expected even at the same price.
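TCO is commonly approximated as acquisition cost plus the electricity drawn over the service life (a generic simplification, not AMD's internal model; PUE is the data center's power usage effectiveness):

```latex
\text{TCO} \approx \underbrace{C_{\text{acquisition}}}_{\text{CapEx}}
+ \underbrace{P_{\text{TDP}} \cdot \text{PUE} \cdot t_{\text{service}} \cdot c_{\text{kWh}}}_{\text{OpEx (power)}}
```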

AMD has not changed the remaining technical data and refers to in-house benchmarks, according to which a single MI325X is 10 percent ahead of Nvidia's current H200 when training Meta's Llama-2 7B, i.e. the variant with 7 billion parameters. The fact that it is merely on a par in the eight-pack configuration and with the larger Llama-2 70B model, however, suggests remaining optimization potential in software – or that the GPU-to-GPU link is not as fast in practice as the competition's.

If, amid all the X models, you noticed that no A model turned up, you have been paying attention: since the launch of the Instinct MI300A, AMD has not said much about new integrated accelerators with CPU cores. At Advancing AI 2024, the company would only confirm that it continues to focus on a well-designed interface when co-designing CPU and GPU – not whether the data center lineup will keep fully integrated CPU cores, or CPU chiplets sharing a package with GPU cores.

Disclaimer: AMD covered the author's travel and accommodation costs to the "Advancing AI 2024" event.

(csp)

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.