AI networks without bottlenecks: Arista revises load balancing and monitoring

Arista is rolling out updates for its devices and monitoring. A new load-distribution scheme is intended to reduce latencies and accelerate jobs in AI networks.

Server in the data center (Image: Gorodenkoff/Shutterstock.com)

By Benjamin Pfister

Data center network equipment provider Arista Networks has revised its network equipment and the associated monitoring for AI workloads. It now includes Cluster Load Balancing (CLB), which is designed to distribute data streams evenly. In addition, the CloudVision Universal Network Observability (CV UNO) monitoring tool is designed to create end-to-end visibility in the AI network and thus enable diagnostics of the associated flows and potential error patterns.

For AI training, clusters usually generate only a few data streams in the network, but each carries high bandwidth. Conventional load-balancing methods, which operate purely on network header fields, are therefore often inefficient for AI workloads and lead to an uneven distribution of traffic. In addition, increased latencies and packet losses sometimes occur, delaying job completion.
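The problem with header-based balancing can be illustrated with a minimal sketch of classic ECMP hashing: each flow's 5-tuple is hashed to pick an uplink, which works well for many small flows but can pile a handful of high-bandwidth flows onto the same link. The flows, addresses, and link count below are purely illustrative, not taken from Arista's implementation.

```python
import hashlib

def ecmp_pick(flow, n_links):
    """Classic ECMP: hash the 5-tuple header fields to pick an uplink."""
    key = "|".join(str(f) for f in flow).encode()
    return int(hashlib.sha256(key).hexdigest(), 16) % n_links

# Four high-bandwidth AI training flows, four uplinks (illustrative 5-tuples).
flows = [
    ("10.0.0.1", "10.0.1.1", 4791, 50000, "UDP"),
    ("10.0.0.2", "10.0.1.2", 4791, 50001, "UDP"),
    ("10.0.0.3", "10.0.1.3", 4791, 50002, "UDP"),
    ("10.0.0.4", "10.0.1.4", 4791, 50003, "UDP"),
]
load = [0] * 4
for f in flows:
    load[ecmp_pick(f, 4)] += 100  # each flow carries ~100 Gbit/s

# With only a few flows, hashing can leave some links idle while
# others are oversubscribed -- the uneven distribution described above.
print(load)
```

Because the hash ignores bandwidth, whether the links end up balanced is pure luck of the hash values, which is tolerable with thousands of small flows but not with four elephant flows.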

CLB is designed to reduce this time by making load balancing aware of Remote Direct Memory Access (RDMA) traffic. It also takes the communication behavior of the AI training software into account, from the compute unit into its dedicated network. Specifically, the load balancing handles the bidirectional data flow in a spine-leaf architecture, i.e., both from spine to leaf and in the opposite direction. CLB is designed to recognize the relevant flows, distribute all data streams evenly, and keep latency low at the same time.
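The contrast with header hashing can be sketched as flow-aware placement: instead of hashing, each known flow is assigned to the currently least-loaded uplink. This is a simplified illustration of the general idea behind RDMA-aware balancing, not Arista's actual CLB algorithm.

```python
def flow_aware_assign(flows, n_links):
    """Place each (flow, bandwidth) pair on the currently least-loaded
    uplink, largest flows first -- a greedy sketch of flow-aware
    balancing, not Arista's proprietary CLB logic."""
    load = [0] * n_links
    placement = {}
    for flow, bw in sorted(flows, key=lambda x: -x[1]):
        link = min(range(n_links), key=load.__getitem__)
        placement[flow] = link
        load[link] += bw
    return placement, load

# Four equal RDMA flows of ~100 Gbit/s across four uplinks.
flows = [("qp0", 100), ("qp1", 100), ("qp2", 100), ("qp3", 100)]
placement, load = flow_aware_assign(flows, 4)
print(load)  # [100, 100, 100, 100] -- every uplink carries exactly one flow
```

Because placement considers actual load rather than header hashes, the even distribution is guaranteed here instead of left to chance.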

Each RDMA endpoint, such as a server in an AI cluster, has at least one queue pair that communicates with a remote queue pair on another server. Such a pair consists of a send queue and a receive queue and can access memory directly without involving the CPU, which reduces latency. According to the company, its customer Oracle has avoided problems with colliding data streams and achieved higher throughput in machine-learning networks thanks to the revised load distribution.
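The queue-pair concept can be modeled in a few lines: posted send buffers on one side are matched with posted receive buffers on the other. This is a plain-Python illustration of the concept only; real RDMA hardware performs the copy directly between memory regions via the verbs API, without the CPU touching the data.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class QueuePair:
    """Simplified model of an RDMA queue pair: a send queue and a
    receive queue of posted buffers. In real hardware the NIC moves
    the data without CPU involvement; here Python stands in for it."""
    send_queue: deque = field(default_factory=deque)
    recv_queue: deque = field(default_factory=deque)

    def post_send(self, buffer):
        self.send_queue.append(buffer)

    def post_recv(self, buffer):
        self.recv_queue.append(buffer)

def transfer(local: QueuePair, remote: QueuePair):
    """Move one posted send into the peer's next posted receive buffer."""
    if local.send_queue and remote.recv_queue:
        payload = local.send_queue.popleft()
        target = remote.recv_queue.popleft()
        target[:len(payload)] = payload
        return target
    return None

a, b = QueuePair(), QueuePair()
a.post_send(bytearray(b"gradient-shard"))  # e.g. a slice of training data
b.post_recv(bytearray(16))                 # pre-registered receive buffer
out = transfer(a, b)
print(out)  # bytearray(b'gradient-shard\x00\x00')
```

The key property the model captures is that the receiver must post a buffer in advance; the transfer then completes without any per-byte CPU work on the data path.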

With its CV UNO monitoring platform, Arista aims to give customers a comprehensive view of AI networks. Users can view the status of AI jobs in the new monitoring tool, including job completion times, buffer and link utilization, and congestion indicators such as ECN-marked packets, PFC pause frames, and packet errors.
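How such counters might feed a simple congestion check can be sketched as follows. The counter names, thresholds, and link labels are illustrative assumptions, not CV UNO's actual telemetry schema.

```python
def congestion_flags(counters, ecn_ratio_limit=0.01, pfc_limit=0):
    """Flag links whose telemetry suggests congestion: a high share of
    ECN-marked packets or any PFC pause frames. Thresholds and field
    names are hypothetical, not taken from CV UNO."""
    flags = {}
    for link, c in counters.items():
        ecn_ratio = c["ecn_marked"] / max(c["packets"], 1)
        flags[link] = ecn_ratio > ecn_ratio_limit or c["pfc_pause"] > pfc_limit
    return flags

# Illustrative per-link counters from two leaf-switch ports.
telemetry = {
    "leaf1-eth1": {"packets": 1_000_000, "ecn_marked": 50_000, "pfc_pause": 12},
    "leaf1-eth2": {"packets": 1_000_000, "ecn_marked": 100, "pfc_pause": 0},
}
print(congestion_flags(telemetry))  # {'leaf1-eth1': True, 'leaf1-eth2': False}
```

The point of combining both signals is that ECN marks indicate queues building up before loss, while PFC pauses show the link has already pushed back on the sender.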


The so-called deep-dive analytics are designed to detect critical, job-specific events on switches and server NICs, such as RDMA errors or fatal PCIe errors, and to precisely identify the associated flows in order to pinpoint performance bottlenecks. The function also includes flow visualization for AI job sequences with microsecond granularity. In the AI infrastructure arena, Nvidia and Arista competitor Cisco recently expanded their collaboration.

(emw)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.