AI networks without bottlenecks: Arista revises load balancing and monitoring
Arista is rolling out updates for its devices and monitoring. A new load-balancing method is meant to reduce latency and speed up jobs in AI networks.
(Image: Gorodenkoff/Shutterstock.com)
- Benjamin Pfister
Data center network equipment provider Arista Networks has revised its network equipment and the associated monitoring for AI workloads. The lineup now includes Cluster Load Balancing (CLB), which is designed to distribute data streams evenly. In addition, the CloudVision Universal Network Observability (CV UNO) monitoring tool is designed to create end-to-end visibility in the AI network and thus enable diagnostics for the associated flows and potential error patterns.
Cluster load balancing reduces latencies
During AI training, clusters usually generate only a few data streams in the network, but each of them carries high bandwidth. Conventional load-balancing methods, which operate purely on network header fields, are therefore often inefficient for AI workloads and lead to an uneven distribution of traffic. This can also cause increased latency and packet loss, which delay the completion of jobs.
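Why header-based hashing struggles with a handful of large flows can be shown with a minimal Python sketch. It is an illustration under stated assumptions, not Arista's or any vendor's implementation: the flow list, the 400 Gbit/s rate, the four uplinks, and the function names `ecmp_link` and `link_loads` are all hypothetical, and CRC32 merely stands in for whatever hash function a switch uses.

```python
import zlib


def ecmp_link(flow, num_links):
    """Pick an uplink by hashing the flow's header fields (5-tuple)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_links


def link_loads(flows, num_links):
    """Sum the bandwidth (Gbit/s) that hashing places on each uplink."""
    loads = [0] * num_links
    for flow, gbps in flows:
        loads[ecmp_link(flow, num_links)] += gbps
    return loads


# Hypothetical workload: 8 RDMA flows of 400 Gbit/s each.
# 5-tuple: (src IP, dst IP, IP protocol 17 = UDP, src port, dst port 4791 = RoCEv2)
flows = [
    ((f"10.0.0.{i}", f"10.0.1.{i}", 17, 49152 + i, 4791), 400)
    for i in range(8)
]

loads = link_loads(flows, num_links=4)
print(loads)  # per-uplink load in Gbit/s; the split depends on the hash
```

With thousands of small flows, the hash averages out. With only eight large ones, nothing prevents two or three of them from landing on the same uplink while another stays idle, and the hash never looks at how much traffic a flow carries.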
CLB is designed to reduce this time by analyzing Remote Direct Memory Access (RDMA) traffic. CLB also affects the communication behavior of the AI training software via the computing unit with its dedicated network. Specifically, the load balancing works on the bidirectional data flow in a spine-leaf architecture, i.e., both from spine to leaf and in the opposite direction. CLB is designed to recognize relevant flows, ensure an even distribution of all data streams, and keep latency low at the same time.
Each RDMA endpoint, such as a server in an AI cluster, has at least one queue pair that communicates with a remote queue pair on another server. Such a pair consists of a send queue and a receive queue and can access the memory directly without involving the CPU. This allows latencies to be reduced. According to the company, the customer Oracle has avoided problems with colliding data streams and increased throughput in machine learning networks thanks to the revised load distribution.
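The idea of distributing flows with knowledge of queue pairs, rather than by header hash alone, can be sketched as follows. This is a simplified assumption for illustration and not Arista's CLB algorithm, whose internals the article does not describe: the `QueuePair` class, the host names, the 400 Gbit/s rate, and the greedy least-loaded placement are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuePair:
    """Hypothetical identifier for one RDMA queue pair between two servers."""
    src_host: str
    dst_host: str
    qp_number: int


def place_flows(flows, num_links):
    """Assign each flow to the currently least-loaded uplink, largest first."""
    loads = [0] * num_links
    placement = {}
    for qp, gbps in sorted(flows, key=lambda item: item[1], reverse=True):
        link = min(range(num_links), key=lambda i: loads[i])
        placement[qp] = link
        loads[link] += gbps
    return placement, loads


# Hypothetical workload: 8 queue pairs carrying 400 Gbit/s each.
flows = [(QueuePair(f"gpu-{i}", f"gpu-{i + 8}", 100 + i), 400) for i in range(8)]

placement, loads = place_flows(flows, num_links=4)
print(loads)  # [800, 800, 800, 800]
```

Because the placement tracks the load already assigned to each uplink, eight equal flows end up two per link. A real implementation would additionally have to handle both directions of the spine-leaf fabric as well as flows that start and stop over time.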
Monitoring tool detects problems in the data flow
With its CV UNO monitoring platform, Arista aims to provide its customers with a comprehensive view of AI networks. Users can use the new monitoring tool to view the status of AI jobs. This also includes job completion times, buffer/link utilization and overload indicators such as ECN-marked packets, PFC pause frames and packet errors.
The so-called deep-dive analytics are meant to detect critical, job-specific events on switches and server NICs, such as RDMA errors or fatal PCIe errors. They are also meant to precisely identify the associated flows in order to pinpoint performance bottlenecks. The function also offers a flow visualization for AI job sequences with microsecond granularity. In the area of AI infrastructure, Nvidia and Cisco, an Arista competitor, have recently expanded their collaboration.
(emw)