What the new Ethernet standard for AI clusters is set to look like

The Ultra Ethernet Consortium (UEC) offers a first look at its new Ethernet specification, which is designed specifically for AI clusters.

By Benjamin Pfister

The Ultra Ethernet Consortium (UEC) is finalizing version 1.0 of the Ultra Ethernet specification, which was developed specifically for high-performance computing (HPC) and artificial intelligence (AI). The focus is on the Ultra Ethernet Transport (UET) protocol, which optimizes several layers of the stack to improve the performance of AI and HPC workloads.

The specification introduces three profiles, each covering a defined subset of the full UET feature set: AI Base, AI Full and HPC. UET builds on RDMA (Remote Direct Memory Access) mechanisms, which let the network access host memory directly and bypass the operating system kernel to reduce latency. It also offers mechanisms such as "Deferrable Send", a technique that avoids the delays caused by checking buffer availability before sending. If no buffer is available at the receiver, the receiver can have the sender resume transmission as soon as one becomes free, which reduces the dependency on sender-side timers. UET also does without a handshake for connection setup. Instead, the peers use short-lived connections for individual transactions and discard them once the transaction completes. This is intended to improve scalability and reduce costs.
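The idea behind Deferrable Send can be illustrated with a small toy model. The following Python sketch is not taken from the UET specification: the class names `Receiver` and `Sender`, the methods `post_buffer` and `resume`, and the buffer counting are all hypothetical and only show the principle that the receiver, rather than a sender-side timer, triggers the continuation.

```python
from collections import deque


class Receiver:
    """Toy receiver with a limited number of free receive buffers (hypothetical model)."""

    def __init__(self, free_buffers: int) -> None:
        self.free_buffers = free_buffers
        self.delivered = []              # messages that found a buffer
        self.deferred_senders = deque()  # senders waiting for a buffer

    def on_send(self, sender: "Sender", msg: str) -> bool:
        """Accept the message if a buffer is free, otherwise defer the sender."""
        if self.free_buffers == 0:
            self.deferred_senders.append(sender)
            return False
        self.free_buffers -= 1
        self.delivered.append(msg)
        return True

    def post_buffer(self) -> None:
        """A buffer becomes available: tell one waiting sender to resume."""
        self.free_buffers += 1
        if self.deferred_senders:
            self.deferred_senders.popleft().resume(self)


class Sender:
    """Toy sender that transmits eagerly, without checking buffers first."""

    def __init__(self) -> None:
        self.pending = deque()
        self.deferred = False

    def send(self, receiver: Receiver, msg: str) -> None:
        self.pending.append(msg)
        if not self.deferred:  # while deferred, just queue the message
            self._drain(receiver)

    def resume(self, receiver: Receiver) -> None:
        """Called by the receiver; no retry timer is involved."""
        self.deferred = False
        self._drain(receiver)

    def _drain(self, receiver: Receiver) -> None:
        while self.pending:
            if not receiver.on_send(self, self.pending[0]):
                self.deferred = True
                return
            self.pending.popleft()


rx, tx = Receiver(free_buffers=1), Sender()
tx.send(rx, "a")   # accepted, uses the only buffer
tx.send(rx, "b")   # deferred, no buffer free
rx.post_buffer()   # receiver signals resume, "b" is delivered
```

In this model the sender never polls or retries on its own; it simply waits until the receiver calls `resume`.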

UET also works with two traffic classes (TCs). Packets are assigned to TCs and queues so that requests and responses cannot deadlock each other in a lossless environment. Other innovations include efficient congestion control mechanisms that use packet "spraying" to improve load balancing in ECMP (equal-cost multipath) networks. The window size can also be adjusted dynamically based on the round-trip time (RTT) of the path, ECN markings and any packet losses. So-called In-Network Collectives (INCs), also known as "switch offloading", offload collective operations to the switches, which accelerate them in hardware and relieve the end devices.
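Why spraying helps with load balancing in ECMP networks can be shown with a short comparison. The sketch below is a simplified illustration and not the mechanism defined by UET: the path count `NUM_PATHS`, the function names and the round-robin selection are assumptions made for this example.

```python
import zlib
from collections import Counter

NUM_PATHS = 4  # hypothetical number of equal-cost paths


def per_flow_path(flow_id: str, seq: int) -> int:
    """Classic ECMP: the path depends only on the flow, so every packet takes the same one."""
    return zlib.crc32(flow_id.encode()) % NUM_PATHS


def sprayed_path(flow_id: str, seq: int) -> int:
    """Toy spraying: start at the flow's hashed path and rotate over all paths per packet."""
    return (per_flow_path(flow_id, seq) + seq) % NUM_PATHS


def path_load(chooser, flow_id: str, packets: int) -> Counter:
    """Count how many packets of one flow land on each path."""
    return Counter(chooser(flow_id, seq) for seq in range(packets))


print(path_load(per_flow_path, "gpu0->gpu1", 8))  # all 8 packets on a single path
print(path_load(sprayed_path, "gpu0->gpu1", 8))   # 2 packets on each of the 4 paths
```

With per-flow hashing, a single large flow can saturate one path while the others sit idle. Spraying spreads the same flow across all paths, at the price of packets possibly arriving out of order at the receiver.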

Security was designed in right from the start. The security model builds on proven protocols such as IPsec, but also on the open-source project PSP. It offers AES-GCM encryption, key derivation functions and protection against replay attacks. The protocol developers also attached great importance to efficiency, which is achieved with group keys within a trusted environment. At the link layer, UET introduces Link Layer Retry (LLR) to reduce the impact of individual faulty links in an AI cluster. On an LLR link, the sender holds each packet in a buffer until the receiver acknowledges receipt. The peers signal LLR support to each other via the Link Layer Discovery Protocol (LLDP).
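The buffering principle behind LLR can be sketched in a few lines of Python. This is a minimal model under assumptions, not the LLR frame format or state machine from the specification: the class `LlrSender`, the cumulative acknowledgement and the `on_nack` replay are hypothetical choices made for illustration.

```python
from collections import OrderedDict


class LlrSender:
    """Toy link-level sender that keeps every frame until it is acknowledged."""

    def __init__(self) -> None:
        self.next_seq = 0
        self.replay_buffer = OrderedDict()  # seq -> frame, in send order

    def send(self, frame: bytes) -> int:
        """Transmit a frame and keep a copy for a possible retry."""
        seq = self.next_seq
        self.replay_buffer[seq] = frame
        self.next_seq += 1
        return seq

    def on_ack(self, seq: int) -> None:
        """Assumed cumulative ACK: release every frame up to and including seq."""
        for s in [s for s in self.replay_buffer if s <= seq]:
            del self.replay_buffer[s]

    def on_nack(self, seq: int) -> list:
        """Receiver reports a gap at seq: replay all buffered frames from seq onwards."""
        return [(s, f) for s, f in self.replay_buffer.items() if s >= seq]


link = LlrSender()
for payload in (b"a", b"b", b"c"):
    link.send(payload)
link.on_ack(0)              # frame 0 arrived, its buffer slot is freed
replayed = link.on_nack(1)  # frame 1 was corrupted: frames 1 and 2 are replayed
```

Because the retry happens on the individual link, a single faulty hop does not have to trigger an end-to-end retransmission by the transport layer.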

The Ultra Ethernet Transport specification brings a number of notable innovations for AI clusters. They should make Ethernet, which is known for its great flexibility, even more attractive for use in AI clusters. One indication of this is that GPU heavyweight Nvidia has now joined the consortium as well. Founding members of the consortium include AMD, Arista, Broadcom, Cisco, HPE, Intel, Meta and Microsoft.

(mki)

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.