All against Nvidia: open standard "UALink" networks AI clusters

Nvidia dominates AI accelerators and couples them via NVLink. In contrast, AMD, Intel, Broadcom, Cisco and the hyperscalers are now backing UALink and Ultra Ethernet.


Switch with network cables.

(Image: momente/Shutterstock.com)


The explosive market growth in AI servers is turning the underlying technology upside down. Nvidia not only dominates the market for AI computing accelerators, but with its proprietary NVLink also the superfast networking technology they require.

That is why AMD, Broadcom, Cisco, Google, HPE, Intel, Meta and Microsoft are now cooperating on the open Ultra Accelerator Link (UALink) interconnect. Ultra Ethernet and the Compute Express Link (CXL), which builds on PCIe 5.0, also play a role here.


Clusters of current AI high-performance computing accelerators are interconnected differently than, say, older supercomputers. Within the individual compute nodes, the interconnect no longer attaches to the main processors (CPUs), but directly to the AI accelerators (GPUs). On the one hand, the accelerators are coupled tightly within a node so that AI models too large for the local memory of a single accelerator can still be processed quickly. On the other hand, several of the accelerators also have external ports to reach other nodes via a switch with high bandwidth and low latency.
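As a rough illustration of this in-node coupling, the following C sketch (a hypothetical example for illustration, not part of the original article) uses the CUDA runtime API to check whether pairs of GPUs in a node can access each other's memory directly; such a peer path is what lets data move between accelerators without a detour through CPU memory.

/* Hypothetical sketch: enumerate GPUs and report peer-to-peer reachability.
 * Assumes a node with several Nvidia GPUs and the CUDA runtime installed.
 * Build (host-only C, with CUDA include/lib paths set): gcc p2p_check.c -lcudart
 */
#include <stdio.h>
#include <cuda_runtime_api.h>

int main(void) {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        fprintf(stderr, "no CUDA-capable devices found\n");
        return 1;
    }
    for (int a = 0; a < n; ++a) {
        for (int b = 0; b < n; ++b) {
            if (a == b) continue;
            int ok = 0;
            /* ok becomes 1 if GPU a can read/write GPU b's memory directly */
            cudaDeviceCanAccessPeer(&ok, a, b);
            printf("GPU %d -> GPU %d: peer access %s\n", a, b, ok ? "yes" : "no");
        }
    }
    return 0;
}

Whether such a peer path runs over NVLink, Infinity Fabric or plain PCIe depends on the hardware; the query above only reports whether it exists.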

Nvidia has been relying on NVLink for several years. Four years ago, the company acquired the networking specialist Mellanox for seven billion US dollars.

The 200G Ethernet adapters that network several servers with AMD AI computing accelerators are connected to the accelerators themselves, not to the CPU.

(Image: AMD)

Within the individual nodes, the respective AI accelerator manufacturers use their own interconnects, such as AMD's Infinity Fabric, or the open standard Compute Express Link (CXL). In the future, the Ultra Accelerator Link is to form the external bridge between up to 1,024 nodes of an AI cluster.

InfiniBand, a fast cluster interconnect, has been available for years. However, its development does not seem to be progressing quickly enough, and (so far) only a few companies manufacture InfiniBand hardware.

There is more competition around Ethernet, and the infrastructure of cables and switches can be used more flexibly. Intel, among others, is already working on 800G Ethernet adapters, and Broadcom on switch chips. Work is also underway on the specification for 1.6-Tbit/s Ethernet (draft IEEE P802.3dj). As an alternative to InfiniBand for high-performance cluster networking, techniques such as RDMA over Converged Ethernet (RoCE) are available.
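RoCE exposes the same "verbs" programming interface as InfiniBand, which is one reason it can stand in for it. The following C sketch (a minimal, hypothetical example assuming a Linux host with libibverbs installed; not taken from the article) merely lists the RDMA-capable devices, which look identical to software whether the transport underneath is InfiniBand or converged Ethernet.

/* Hypothetical sketch: list RDMA devices via the ibverbs API.
 * Assumes Linux with libibverbs (rdma-core) installed.
 * Build: gcc rdma_list.c -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void) {
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    for (int i = 0; i < num; ++i) {
        /* The same enumeration covers InfiniBand HCAs and RoCE-capable Ethernet NICs. */
        printf("RDMA device %d: %s\n", i, ibv_get_device_name(devs[i]));
    }
    ibv_free_device_list(devs);
    return 0;
}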

The Ultra Ethernet Consortium (UEC) has been working under the umbrella of the Linux Foundation since the end of 2023. It aims to accelerate and optimize data transfers at all levels: the physical, link, transport and software layers.

Within the next four months, the Ultra Accelerator Link consortium plans to publish an initial version of the specification, formally establish itself and then presumably also set up a website. So far, there is only a press release, issued a few days before Computex 2024.

(ciw)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.