Blackwell: Nvidia unveils its next generation of AI accelerators

Blackwell products

Nvidia intends to offer Blackwell as a GB200 board that combines two Blackwell dual chips with a Grace processor. According to Nvidia, the Grace CPU has only been slightly adapted, but details are missing here as well.

While Nvidia has accelerated its proprietary NVLink GPU interconnect from 900 to 1800 GByte/s in what is now its fifth generation, the two Blackwells have to share the 900-GByte/s C2C link to the Grace CPU, so each Blackwell dual chip gets a maximum of 450 GByte/s of combined upstream and downstream bandwidth. Nvidia reaches the 1.8 TByte/s by doubling the speed per lane pair and direction from 25 to 50 GByte/s, i.e. 100 GByte/s full duplex, multiplied by 18 lane pairs.
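
For a quick sanity check, the bandwidth figures add up as follows (a back-of-the-envelope sketch in Python; all input numbers are taken from Nvidia's specifications as quoted above):

# NVLink 5 bandwidth per Blackwell dual chip, as broken down above.
lane_pairs = 18                      # NVLink 5 lane pairs per dual chip
per_direction = 50                   # GByte/s per lane pair and direction (up from 25)
per_pair = 2 * per_direction         # 100 GByte/s full duplex per lane pair
nvlink_total = lane_pairs * per_pair # 1800 GByte/s = 1.8 TByte/s

# The C2C link to the Grace CPU is shared by both dual chips.
c2c_total = 900                      # GByte/s for the whole link
c2c_per_chip = c2c_total / 2         # 450 GByte/s combined up- and downstream each

print(nvlink_total, c2c_per_chip)    # 1800 450.0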

The GB200 board with a Grace CPU and two Blackwell accelerators.

(Image: Nvidia)

Nvidia combines two GB200 boards in a liquid-cooled 1U rack slot; each Blackwell dual chip can be configured for up to 1200 watts of power draw, plus 300 watts for the Grace CPU, which together comes to 2.7 kW per board. However, the classic HGX rack bays with eight SXM cards will also remain available: in the HGX B200, each dual chip may draw up to 1000 watts, in the air-cooled B100 still 700 watts.
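
In concrete numbers, the power budget stacks up like this (a small sketch; the per-component figures are Nvidia's, the 1U total is simply twice the board figure):

# Power budget per GB200 board and per liquid-cooled 1U slot (two boards).
blackwell_watts = 1200                          # max. configurable per dual chip
grace_watts = 300                               # Grace CPU
board_watts = 2 * blackwell_watts + grace_watts # 2700 W = 2.7 kW per board
slot_watts = 2 * board_watts                    # 5400 W for the 1U slot (derived)

print(board_watts, slot_watts)                  # 2700 5400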

An intermediate stage comes with the GB200 NVL72, which Nvidia calls the "New Unit of Compute"; any similarity to Intel's recently abandoned "Next Unit of Computing" for mini PCs is surely pure coincidence. The GB200 NVL72 is a pre-configured, liquid-cooled rack with 36 Grace CPUs and 72 Blackwell dual chips connected via NVSwitches. Since this connection runs at full speed, without a further NVLink intermediate stage, Nvidia also refers to the NVL72 as a DGX system.
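
The name follows directly from the configuration; counted out (the tray count is derived from the two GB200 boards per 1U slot mentioned above, not an Nvidia figure):

# GB200 NVL72 composition.
gb200_boards = 36                 # one Grace CPU per board
blackwells = gb200_boards * 2     # 72 dual chips, hence "NVL72"
compute_trays = gb200_boards // 2 # 18 liquid-cooled 1U trays at two boards each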

The GB200 NVL72 system.

(Image: Nvidia)

Nvidia is introducing a new NVLink switch chip for faster data transfer. It now also connects racks with each other and supports up to 576 Blackwell dual chips (i.e. 288 GB200 boards) – previously the maximum was 256 GPUs. The switch has 50 billion transistors and, like Blackwell, is manufactured in TSMC's 4NP process. It forwards the full 1.8 TByte/s of each connected Blackwell dual chip from any client to any other and handles a total of 7.2 TByte/s. Nvidia liquid-cools two of these switch chips in a 1U tray, which is also used in the NVL72.
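
From the stated figures, the capacity of a single switch chip works out as follows (a sketch; the four-client figure is derived, Nvidia does not quote it directly):

# NVSwitch chip throughput.
per_client_tb = 1.8                                 # TByte/s per attached dual chip
aggregate_tb = 7.2                                  # TByte/s total per switch chip
full_speed_clients = aggregate_tb / per_client_tb   # 4 dual chips at full speed

max_domain = 576                                    # dual chips per NVLink domain
print(full_speed_clients, max_domain // 2)          # 4.0 288 (GB200 boards)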

Up to 128 of the 576 chips can be combined into a confidential computing domain, in which access to the protected memory from other partitions is prevented.

Nvidia has also brought the PCI Express interface up to date with PCIe 6.0. Combined, the lanes can transfer a further 256 GByte/s – again with upstream and downstream added together.
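
The 256 GByte/s match a PCIe 6.0 x16 link with both directions added together, ignoring protocol overhead; note that the lane count is an assumption here, as the article does not state it:

# PCIe 6.0 bandwidth estimate (x16 link assumed, encoding overhead ignored).
gt_per_lane = 64                         # PCIe 6.0: 64 GT/s per lane
lanes = 16                               # assumption: x16 link
per_direction = gt_per_lane / 8 * lanes  # ~128 GByte/s per direction
both_directions = 2 * per_direction      # ~256 GByte/s upstream plus downstream

print(both_directions)                   # 256.0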

The improved connectivity is needed, for example, for the SuperPOD with eight DGX GB200 systems, which is also available as a turnkey solution. A SuperPOD thus contains 288 Grace CPUs, 576 Blackwell dual chips and 240 TBytes of total memory. According to Nvidia, the system achieves 11.5 exaflops (11,500 PFLOPS or 11,500,000 TFLOPS).
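
The SuperPOD totals follow from eight NVL72 racks; the per-chip figure below is merely implied by the totals and presumably refers to a low-precision data format:

# SuperPOD totals from eight DGX GB200 (NVL72) systems.
systems = 8
grace_cpus = systems * 36                        # 288 Grace CPUs
blackwells = systems * 72                        # 576 Blackwell dual chips
exaflops = 11.5                                  # Nvidia's figure for the SuperPOD
pflops_per_chip = exaflops * 1000 / blackwells   # ~20 PFLOPS per dual chip

print(grace_cpus, blackwells, round(pflops_per_chip))  # 288 576 20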

Using NVLink 5, the new switches and Mellanox's 800-Gbit/s networking, Nvidia wants to connect "hundreds of thousands" of GPUs.

Note: Nvidia has covered the author's travel and hotel costs to GTC 2024.

(csp)