AMD sacrifices classic computing power for AI in the Instinct MI350X/MI355X
AMD's upcoming data center accelerators MI350X and MI355X bring many more AI optimizations, but sacrifice classic floating-point throughput per clock.
(Image: AMD)
AMD officially unveiled the upcoming Instinct MI350X and MI355X accelerators on June 12, 2025 at its in-house Advancing AI event in San José, California, and gave a preview of next year's MI400. The two MI35x models come with up to 288 GByte of HBM3e stacked memory (high-bandwidth memory) and, in the case of the MI355X with direct liquid cooling, consume 1.4 to 1.5 kilowatts of power. According to AMD, they are around 2.6 to 4.2 times faster than their MI300X predecessors at AI inferencing. In AI training, they are said to keep up with Nvidia's CPU-GPU combination GB200 or offer an advantage of up to 30 percent over the B200 accelerator.
AMD wants to secure a share of the huge AI money pot that Nvidia has been feasting on for years, reporting one record quarter after another. AMD advertises up to 40 percent higher throughput per dollar (tokens/$), which is only partly explained by the at most 30 percent higher throughput – AMD will also have to offer the MI355X cheaper than Nvidia's B200 systems.
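Why the price has to move as well can be seen with a rough calculation. The following Python sketch merely rearranges AMD's two marketing figures; it is our back-of-the-envelope inference, not an official AMD calculation:

```python
# Back-of-the-envelope check of AMD's tokens-per-dollar claim; the input
# figures are AMD's marketing numbers, the price conclusion is an inference.
throughput_advantage = 1.30         # up to 30 % more tokens/s than Nvidia's B200
tokens_per_dollar_advantage = 1.40  # up to 40 % more tokens per dollar

# If tokens/$ = throughput / system price, the implied relative price is:
implied_price_ratio = throughput_advantage / tokens_per_dollar_advantage
print(f"Implied MI355X system price: {implied_price_ratio:.0%} of a B200 system")
# -> about 93 %, i.e. AMD would have to undercut Nvidia by roughly 7 %.
```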
AMD's Instinct series comprises massively parallel accelerators designed specifically for use in data centers; the MI350X and MI355X use CDNA4, the fourth generation of this compute architecture. Both have a memory transfer rate of up to 8 TByte/s and differ mainly in clock rate and power consumption. The MI350X is comparatively tame at 1 kW. Yet with 72 versus 79 trillion computing steps per second at double-precision floating point (FP64 TFLOPS), it is only just under 9 percent slower on paper than the much more power-hungry MI355X. The latter is said to consume up to 1.4 kW, i.e. 40 percent more than its smaller sibling. In a preview, an AMD spokesperson even mentioned up to 1.5 kW.
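In double precision the air-cooled card is actually the more efficient one, as a quick calculation with the figures above shows (a sketch using the 1 kW and 1.4 kW board limits mentioned in the text):

```python
# FP64 throughput and power budget of the two new cards, figures from the text.
mi350x_tflops, mi350x_watts = 72, 1000
mi355x_tflops, mi355x_watts = 79, 1400

print(f"MI350X is {1 - mi350x_tflops / mi355x_tflops:.1%} slower in FP64")   # ~8.9 %
print(f"MI350X: {mi350x_tflops * 1000 / mi350x_watts:.0f} GFLOPS per watt")  # 72
print(f"MI355X: {mi355x_tflops * 1000 / mi355x_watts:.0f} GFLOPS per watt")  # ~56
```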
(Image: AMD)
AMD plans to use the MI350X in air-cooled server racks with up to 64 GPUs. The MI355X is intended for use in high-density racks with up to 128 GPUs, but then requires direct liquid cooling (DLC) to prevent overheating.
It is not surprising that AMD is taking further steps towards AI optimization: on the one hand, that is where the big investor money is; on the other, AMD's chief technology officer Mark Papermaster had emphasized the importance of mixed-precision calculations at ISC25 in Hamburg just two days earlier. AMD has not yet mentioned a version of the MI350 with integrated CPU chiplets similar to the MI300A.
One step forward, one step back
The predecessors from the MI300/325 series were still aimed at supercomputers and AI data centers alike. AMD has changed this with the MI355X: the compute units are further optimized for AI tasks, but sacrifice performance in classic workloads. Per clock and per compute unit there are even regressions.
As with the older Instinct MI models, AMD's engineers also use 3D chiplets in the MI355X. The basis is two IO dies manufactured in proven 6-nanometer technology. They contain a total of 256 MByte of Infinity Cache, divided into 2-MByte blocks, as well as the seven fourth-generation Infinity Fabric links (IF), which now transfer 153.6 GByte/s each and connect up to eight MI350X/MI355X accelerators at a total of 1075 GByte per second. AMD has also revised the 5.5-TByte/s link between the two IO dies: it is now wider but clocks lower, which reduces the required voltage, the main driver of power consumption. AMD calls this connection between the two IO dies the Infinity Fabric Advanced Package (IF-AP).
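The stated totals follow directly from the per-link and per-block figures; a quick check, nothing more:

```python
# Aggregate Infinity Fabric bandwidth per accelerator, from the link figures above.
if_links = 7
gbyte_per_link = 153.6
print(f"{if_links * gbyte_per_link:.1f} GByte/s in total")  # 1075.2 GByte/s

# 256 MByte Infinity Cache split into 2-MByte blocks:
print(256 // 2, "cache blocks")                              # 128
```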
(Image: AMD)
The eight accelerator compute dies (XCDs), which TSMC manufactures in the more modern N3P process, sit on top of the two IO dies. Each of them contains 32 active compute units (CUs) – four per die are deactivated to improve chip yield. If you are good with numbers, you will notice that the predecessor had 48 more, namely 304 CUs. At the same time, the supply of data from the HBM3e memory improves by a factor of roughly 1.5: 16 percent fewer CUs meet a 30 percent higher transfer rate.
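Where the factor of roughly 1.5 comes from can be reconstructed in a few lines of Python; the predecessor transfer rate of about 6 TByte/s (the MI325X figure) is our assumption here:

```python
# Memory bandwidth per compute unit: CU counts from the text; as an assumption,
# the predecessor's transfer rate is taken as the MI325X's roughly 6 TByte/s.
cus_new, bw_new_tbs = 8 * 32, 8.0   # 256 CUs, 8 TByte/s (MI350X/MI355X)
cus_old, bw_old_tbs = 304, 6.0      # 304 CUs, ~6 TByte/s (assumed predecessor)

per_cu_new = bw_new_tbs * 1000 / cus_new  # GByte/s per compute unit
per_cu_old = bw_old_tbs * 1000 / cus_old
print(f"{per_cu_new:.1f} vs. {per_cu_old:.1f} GByte/s per CU, "
      f"factor {per_cu_new / per_cu_old:.2f}")  # ~1.58, i.e. roughly 1.5x
```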
According to AMD, this is based on practical experience, which confirms that the favored AI applications are very bandwidth-hungry. Another architectural change is a larger fast scratchpad memory within the CUs (Local Data Share, LDS), now 160 KByte. The larger HBM memory and the reworked memory virtualization go hand in hand with adapted so-called universal translation caches, which perform a similar job to the translation lookaside buffers (TLBs) in processors. A TLB holds frequently used mappings of virtual to physical addresses. On a memory access, the mapping is first looked up in the TLB before the page directory/table is consulted. If it is present in the cache, this is called a "hit", otherwise a "miss". A TLB lookup is considerably faster than accessing the page table.
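How such a translation cache behaves in principle can be shown in a few lines; this is a purely illustrative sketch, not AMD's actual universal translation cache design:

```python
# Minimal sketch of a TLB-style translation cache (illustrative only).
PAGE_SIZE = 4096
tlb = {}         # small, fast cache: virtual page number -> physical page number
page_table = {}  # complete mapping; walking it is much slower in real hardware

def translate(vaddr: int) -> int:
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                   # "hit": translation served from the cache
        ppn = tlb[vpn]
    else:                            # "miss": walk the page table, then cache it
        ppn = page_table[vpn]
        tlb[vpn] = ppn
    return ppn * PAGE_SIZE + offset

page_table[3] = 17                           # example mapping: virtual page 3 -> physical page 17
print(hex(translate(3 * PAGE_SIZE + 0x2a)))  # 0x1102a; a second lookup would be a TLB hit
```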
The biggest difference, however, lies in the structure of the individual arithmetic units within the compute units. AMD has taught them new data formats: in addition to FP8, the matrix units can now handle FP6 and FP4 (following the OCP FP8 and OCP MX specifications, as with Nvidia). The throughput of the two new formats is twice that of the familiar FP8 – on Nvidia's B200, FP6 only reaches FP8 speed.
The wide multipliers of the matrix units in particular paid the price: throughput with FP64 data formats, which play no role in AI applications, was halved compared to the predecessor accelerators. Here AMD follows Nvidia, which has been deprioritizing FP64 for some time. The vector units, which resemble the classic shader SIMDs in graphics cards, remain untouched in the MI350X/MI355X, however.
(Image: AMD)
According to AMD, the complete chips therefore now reach up to 20,000 TFLOPS with sparse matrices ("sparsity") at FP6 or FP4 precision. With FP8 or INT8 it is half that, and the same halving applies to densely populated matrices.
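Combining both halvings yields the following implied peak values; these are derived from the figures above, not separately confirmed by AMD:

```python
# Peak throughputs implied by AMD's headline figure and the two halvings
# described in the text – derived values, not AMD statements.
fp4_fp6_sparse = 20_000               # TFLOPS, AMD's headline figure
fp8_int8_sparse = fp4_fp6_sparse / 2  # 10,000 TFLOPS
fp4_fp6_dense = fp4_fp6_sparse / 2    # 10,000 TFLOPS
fp8_int8_dense = fp4_fp6_sparse / 4   #  5,000 TFLOPS
```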
(Image: AMD)
"Helios" AI racks and outlook on MI400
With the MI355X and MI350X, AMD also wants to specify its own AI racks for the first time. The basis remains the UBB8 format, universal base boards for eight accelerator modules, which will continue to be available from partners. New are AI server racks with up to 128 MI355X accelerators and direct liquid cooling. Such a cabinet is then supposed to achieve 2.57 exaflops of AI computing power in FP6/FP4 format and accommodate 36 TByte of HBM3e memory. AMD also emphasized once again that its in-house solution relies entirely on open standards such as OCP UBBs or Ethernet from the Ultra Ethernet Consortium. The company wants to set itself apart from Nvidia's proprietary server racks with NVLink.
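Both rack-level figures follow almost directly from the per-GPU values mentioned earlier; the small gap to 2.57 exaflops presumably comes from rounding of the per-GPU figure:

```python
# Sanity check of the rack-level figures against the per-GPU numbers above.
gpus = 128
hbm_per_gpu_gbyte = 288
fp4_sparse_per_gpu_tflops = 20_000

print(f"{gpus * hbm_per_gpu_gbyte / 1024:.0f} TByte HBM3e")              # 36 TByte
print(f"{gpus * fp4_sparse_per_gpu_tflops / 1e6:.2f} ExaFLOPS FP4/FP6")  # ~2.56, AMD states 2.57
```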
The successor MI400 is due in 2026 and will compete with Nvidia's Vera Rubin, which is expected by then, also as a complete solution in newly developed Helios racks similar to Nvidia's NVL72. For 72 MI400, AMD promises around 50 percent more HBM4 memory capacity (31 TByte in total) and transfer rate (1.4 PByte/s in total) as well as 50 percent more scale-out bandwidth (i.e. into the network). FP4/FP8 computing power and the scale-up bandwidth of the local HBM channels are expected to be on a par.
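The two summed Helios figures can be reconstructed from the per-GPU data that AMD cites for the MI400 (listed further below); a brief check:

```python
# Helios rack totals for 72 MI400, using AMD's per-GPU values (see below):
# 432 GByte HBM4 per GPU at up to 19.6 TByte/s.
gpus = 72
hbm4_per_gpu_gbyte = 432
hbm4_bw_per_gpu_tbs = 19.6

print(f"{gpus * hbm4_per_gpu_gbyte / 1000:.0f} TByte HBM4")  # ~31 TByte
print(f"{gpus * hbm4_bw_per_gpu_tbs / 1000:.2f} PByte/s")    # ~1.41 PByte/s
```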
(Image: AMD)
A single MI400 is expected to double FP4 performance compared to the MI355X to 40 petaflops (40,000 teraflops, including sparsity) and connect 432 GByte of HBM4 stacked memory at up to 19.6 TByte/s. At 300 GByte/s, each GPU will also be able to communicate twice as fast as an MI350X/355X. AMD has not revealed how high the power consumption will be, but suggested an enormous performance advantage in a misleading diagram. According to the footnotes, it was apparently calculated on a platform basis: 72 MI400 versus eight MI355X, which is why we are not reproducing it graphically here.
(csp)