xAI has apparently completed the world's fastest supercomputer

The Colossus supercomputer uses 100,000 H100 accelerators from Nvidia. The computing power is enormous, but so is the power consumption.

An HGX H100 system from Nvidia. Dell uses one of these in its PowerEdge XE9680. (Image: Nvidia)


If xAI were to measure its supercomputer Colossus in the Linpack benchmark, it would probably top the Top500 list of the fastest computing systems. According to founder Elon Musk, Colossus has completed its first expansion stage with a whopping 100,000 Nvidia H100 (Hopper) accelerators and is now training the AI model behind the Grok chatbot.

Musk writes on X that construction took only around four months from start to finish. That is unusually fast for a supercomputer of this size; troubleshooting during commissioning in particular normally takes a lot of time.

In purely mathematical terms, 100,000 H100 accelerators achieve an FP64 computing power of 3.4 exaflops, i.e. 3.4 quintillion (10^18) computing operations per second. FP64 values, i.e. double-precision floating-point operations, are the relevant benchmark in the Top500 list. With the much simpler AI operations (typically FP8 or INT8), an enormous 396 exaflops would theoretically be possible. Added to this is the computing power of the host processors that control the Nvidia accelerators; the usual ratio is one CPU per four GPUs, which here means 25,000 CPUs.
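As a plausibility check, those figures can be reproduced from Nvidia's datasheet values for the H100 SXM module: 34 FP64 teraflops per GPU and roughly 3,958 FP8 teraflops with sparsity. A minimal sketch in Python:

```python
# Back-of-envelope check of the article's figures, using Nvidia's
# published datasheet numbers for the H100 SXM module.
NUM_GPUS = 100_000
H100_FP64_TFLOPS = 34       # double precision, vector units
H100_FP8_TFLOPS = 3_958     # FP8 Tensor Core, with sparsity

fp64_exaflops = NUM_GPUS * H100_FP64_TFLOPS / 1e6   # 1 EFLOPS = 1e6 TFLOPS
fp8_exaflops = NUM_GPUS * H100_FP8_TFLOPS / 1e6

print(f"FP64 peak: {fp64_exaflops:.1f} EFLOPS")     # 3.4 EFLOPS
print(f"FP8 peak:  {fp8_exaflops:.0f} EFLOPS")      # ~396 EFLOPS

# One host CPU per four GPUs, as is typical for HGX systems:
print(f"Host CPUs: {NUM_GPUS // 4:,}")              # 25,000
```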

Even taking into account the imperfect scaling of the countless hardware components, xAI operates one of the world's fastest supercomputers, if not the fastest. However, it is unlikely to appear officially in the next Top500 list, as private companies rarely submit results.

For comparison: Frontier, the leader of the Top500, combines just over 9,000 Epyc processors (Zen 3) with more than 37,000 Instinct MI250X accelerators from AMD. The system has a theoretical FP64 peak of around 1.7 exaflops and achieves roughly 1.2 exaflops in the Linpack benchmark.
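A back-of-envelope extrapolation illustrates why Colossus would likely top the list. Assuming, purely hypothetically, that Colossus reached a Linpack efficiency similar to Frontier's (measured Rmax divided by theoretical Rpeak), its result would be roughly double Frontier's:

```python
# Rough Top500 comparison using Frontier's published figures and
# applying its Linpack efficiency to Colossus's theoretical FP64 peak.
frontier_rpeak = 1.7    # EFLOPS, theoretical peak
frontier_rmax = 1.2     # EFLOPS, measured in Linpack

efficiency = frontier_rmax / frontier_rpeak          # ~0.71

colossus_peak = 3.4     # EFLOPS FP64, from the calculation above
colossus_rmax_est = colossus_peak * efficiency

print(f"Linpack efficiency (Frontier): {efficiency:.0%}")
print(f"Hypothetical Colossus Rmax:    {colossus_rmax_est:.1f} EFLOPS")
# ~2.4 EFLOPS, roughly double Frontier's measured result
```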

It is unclear how xAI was able to obtain 100,000 H100 modules at such short notice. Nvidia remains fully booked; sales are rising every quarter as its contract manufacturer TSMC expands production. Dell CEO Michael Dell has publicly praised the cooperation on the construction of Colossus, so his company evidently had a hand in it.

Among other things, Dell builds the PowerEdge XE9680 server, which pairs Nvidia's standard HGX baseboard holding eight H100 modules with a host section containing two Intel Xeon SP processors. This combination could have significantly accelerated assembly. Alternatively, Dell could have built custom systems at short notice. At least part of the H100 contingent therefore presumably came via Dell.

Nvidia rates each H100 at 700 watts, so the 100,000 units alone draw up to 70 megawatts. Added to this are components such as the processors and switches, as well as a considerable amount of cooling.

Musk plans to add another 50,000 H100s and 50,000 H200s with more memory to Colossus in the coming months. By then at the latest, the electrical power consumption will be well over 100 MW, more likely approaching 200 MW.
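The arithmetic behind these power figures, assuming Nvidia's 700-watt rating applies to both the H100 and the H200 SXM modules and counting only the accelerators themselves:

```python
# Rough power estimate for the accelerators alone; host CPUs,
# switches and cooling add considerably on top.
TDP_WATTS = 700  # Nvidia's rating per H100/H200 SXM module

current = 100_000 * TDP_WATTS / 1e6                    # megawatts
expanded = (100_000 + 50_000 + 50_000) * TDP_WATTS / 1e6

print(f"100,000 H100s alone:     {current:.0f} MW")    # 70 MW
print(f"After planned expansion: {expanded:.0f} MW")   # 140 MW
```

With the remaining infrastructure and cooling on top of the 140 MW for the accelerators, a total site load approaching the 200 MW mentioned above is plausible.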


(mma)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.