Nvidia Rubin CPX: New AI inference accelerator set to launch by late 2026

AI apps and models evolve rapidly, creating new optimization needs. Nvidia now unveils the Rubin CPX accelerator to target this growing demand.

(Image: Nvidia)

Nvidia plans to launch a special accelerator chip, the Rubin CPX, at the end of 2026. It is specifically designed to speed up AI applications with enormous context requirements (see below), which in turn are expected to be particularly profitable. These include AI systems that generate program code or create entire AI-generated films from scratch. Rubin CPX is meant to address the specific bottlenecks in processing such AI models.

The chip is part of Nvidia's upcoming 2026 Vera Rubin generation of data center accelerators, which has already had its tape-out. It will either be integrated directly into the rack or offered as an additional accelerator in separate racks.

Currently popular AI models such as DeepSeek R1, Llama 4 Maverick, gpt-oss, Qwen3, and Kimi K2 use a technique called Mixture-of-Experts (MoE). Different specialized sub-networks (the experts) handle different parts of a query, which keeps the memory and compute requirements of each individual expert low.

The trick is to find the optimal mix of experts for each query; the routing of queries among the experts is crucial. Especially in connection with reasoning, the individual experts have to exchange information with each other, and the MoE model as a whole becomes more complex.
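To make the routing step concrete, here is a minimal sketch of top-k expert gating in NumPy. The function and the toy dimensions are purely illustrative assumptions, not the implementation of any of the models named above.

```python
import numpy as np

def moe_route(x, gate_w, experts, top_k=2):
    """Route one token vector x to the top_k highest-scoring experts.

    x:       (d_model,)            token representation
    gate_w:  (d_model, n_experts)  router ("gating") weights
    experts: list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = x @ gate_w                           # score every expert for this token
    chosen = np.argsort(logits)[-top_k:]          # keep only the top_k experts
    w = np.exp(logits[chosen] - logits[chosen].max())
    w /= w.sum()                                  # softmax over the chosen experts
    # only the selected experts run, so per-token compute stays low
    return sum(wi * experts[i](x) for wi, i in zip(w, chosen))

# toy usage: 4 experts, each a small random linear layer
rng = np.random.default_rng(0)
d_model, n_experts = 8, 4
experts = [lambda v, W=rng.standard_normal((d_model, d_model)) * 0.1: v @ W
           for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts)) * 0.1
y = moe_route(rng.standard_normal(d_model), gate_w, experts)
```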

At the same time, the responses of AI applications are becoming orders of magnitude more complex, for example when they output entire program code sequences or artificially generated films. This sharply increases the number of tokens that have to be kept in consideration at all times. A token is the smallest unit of information that is assigned a numerical ID inside an AI model to simplify calculations. A token can represent anything from a single letter to a short phrase. Estimates put an English word at around 1.5 tokens on average.
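As a rough illustration of how text maps to tokens, the following sketch uses the open-source tiktoken library; the choice of tokenizer is an assumption here, and word-to-token ratios vary by model and language.

```python
import tiktoken  # open-source tokenizer library

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-4-class models
text = "Rubin CPX targets inference workloads with very long context windows."
token_ids = enc.encode(text)

print(len(text.split()), "words ->", len(token_ids), "tokens")
print(token_ids[:8], "...")                  # each numerical ID stands for one token
```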

Nvidia's Vera Rubin accelerator in an artist's impression. Systems equipped with this technology are expected to arrive in data centers in 2026.

For the answer to be consistent, the AI has to take far more tokens into account internally during weighting than are shown in the answer window; this is called the context. ChatGPT with GPT-3.5 initially had a context window of 4,096 tokens. GPT-4o is already at 128,000 tokens, and Google Gemini 1.5 Pro at 2 million tokens.
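The cost of a large context comes mainly from the key/value cache that every attention layer keeps per token. A back-of-the-envelope calculation with hypothetical model dimensions (not figures from Nvidia) shows how quickly it grows with the context window:

```python
def kv_cache_bytes(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Key/value cache size: two tensors (K and V) per layer, per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value

# assumed 70B-class model: 80 layers, 8 KV heads (grouped-query attention),
# head dimension 128, values stored in FP16 (2 bytes)
for ctx in (4_096, 128_000, 2_000_000):
    gib = kv_cache_bytes(ctx, n_layers=80, n_kv_heads=8, head_dim=128) / 2**30
    print(f"{ctx:>9} context tokens -> {gib:7.1f} GiB of KV cache")
```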

One resulting optimization approach is disaggregated serving of requests: the context phase (prefill) and the generation phase (decode) are assigned to different accelerators when answering a request. Nvidia already uses this with current GB200 Blackwell systems. In the optimized submissions to the MLCommons MLPerf Inference v5.1 AI benchmarks, for example, 56 of the 72 Blackwell GPUs in an NVL72 rack work only on the context, and only the remaining 16 generate the content. Based on an imprecisely labeled diagram, this optimization yields an estimated performance gain of 40 to 45 percent on Blackwell.
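Conceptually, disaggregated serving splits each request into a compute-heavy prefill step on one GPU pool and a token-by-token decode step on another. The sketch below is only a schematic illustration with made-up pool names, not Nvidia's actual scheduler:

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: list        # the (potentially very long) context
    max_new_tokens: int

# pool sizes mirror the 56/16 split of the MLPerf submission mentioned above
PREFILL_POOL = [f"context-gpu-{i}" for i in range(56)]   # chew through the whole prompt once
DECODE_POOL  = [f"generate-gpu-{i}" for i in range(16)]  # emit output tokens one at a time

def serve(req: Request) -> list:
    # 1) prefill: run the full prompt through the model once, building the KV cache
    prefill_gpu = PREFILL_POOL[hash(tuple(req.prompt_tokens)) % len(PREFILL_POOL)]
    kv_cache = f"KV cache built on {prefill_gpu}"        # placeholder for the real cache

    # 2) decode: hand the KV cache to the generation pool, which appends one token per step
    decode_gpu = DECODE_POOL[0]
    return [f"token from {decode_gpu} using {kv_cache}"
            for _ in range(req.max_new_tokens)]

print(serve(Request(prompt_tokens=list(range(10)), max_new_tokens=3))[0])
```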

Nvidia exploits another property of these LLMs: with appropriate fine-tuning, they can get by with quite low numerical precision, so that the in-house 4-bit floating-point format NVFP4, which uses shared per-block scaling factors, delivers the response accuracy required by MLPerf Inference.
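The following NumPy sketch shows the general idea behind block-scaled 4-bit floating point: values are quantized in small blocks, each sharing one scale factor. The block size and the value grid here are simplified assumptions and not the exact NVFP4 specification.

```python
import numpy as np

# representable magnitudes of a simple 4-bit float (sign handled separately)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(block):
    """Quantize one block of values to 4-bit floats plus a shared per-block scale."""
    scale = np.abs(block).max() / FP4_GRID[-1] or 1.0   # map the block into the FP4 range
    scaled = block / scale
    nearest = np.abs(np.abs(scaled)[:, None] - FP4_GRID).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[nearest], scale   # 4-bit values + one scale per block

rng = np.random.default_rng(1)
x = rng.standard_normal(16).astype(np.float32)          # one block (block size assumed)
q, scale = quantize_block_fp4(x)
print("max abs quantization error:", np.abs(x - q * scale).max())
```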

Nvidia has already optimized Blackwell Ultra (GB300) for maximum throughput in this format. To achieve this, the engineers beefed up the base-2 exponential function (EX2), which plays a major role in the attention layers of all AI models based on Transformer technology. Because these operations run not in the tensor cores specialized for AI throughput but in the Special Function Units (SFUs), they had already become a bottleneck in Blackwell, whose EX2 performance had barely increased compared to Hopper. Blackwell Ultra roughly doubles EX2 throughput over Blackwell, from 5 to 10.7 trillion exponential calculations per second.
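The exponential function enters every Transformer attention layer through the softmax over the attention scores, and GPU kernels usually express it via the base-2 exponential that the SFUs evaluate. A small NumPy illustration of that identity, independent of any actual Nvidia kernel:

```python
import numpy as np

LOG2E = np.log2(np.e)   # exp(x) == 2 ** (x * log2(e)), the form the SFU exp2 units compute

def softmax_via_exp2(scores):
    """Numerically stable softmax written in terms of the base-2 exponential."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp2(shifted * LOG2E)                 # one exp2 per attention score
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(2).standard_normal((4, 4))   # toy attention score matrix
reference = np.exp(scores - scores.max(-1, keepdims=True))
reference /= reference.sum(-1, keepdims=True)
assert np.allclose(softmax_via_exp2(scores), reference)
```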

An NVL72 GB300 rack achieves around 1.1 ExaFlops in NVFP4; Nvidia projects Rubin NVL144 at 3.6 ExaFlops and a Rubin CPX rack at a whopping 8 ExaFlops.

By the end of 2026, a Rubin CPX is expected to deliver 30 PFlops of NVFP4 compute and three times the exponential throughput of today's GB300. Since the context phase does not require particularly fast memory and is mainly limited by compute, Nvidia relies on 128 GB of GDDR7 memory for Rubin CPX.
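Why GDDR7 can be enough for the context phase: during prefill, every weight fetched from memory is reused for thousands of prompt tokens, so the work per byte of memory traffic is high; during decode it serves only a single new token. A rough calculation with assumed numbers, not Nvidia data:

```python
def flops_per_weight_byte(tokens_in_flight, bytes_per_weight=0.5):
    """Roughly 2 FLOPs (multiply + add) per weight and token; 0.5 byte assumes 4-bit weights."""
    return 2 * tokens_in_flight / bytes_per_weight

print("prefill, 8192 prompt tokens:", flops_per_weight_byte(8192), "FLOPs per weight byte")
print("decode,  1 token per step:  ", flops_per_weight_byte(1), "FLOPs per weight byte")
```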

(csp)

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.