Innovative and almost completely open-source: Nvidia Nemotron 3 Nano

Recently, the successful, more transparent AI language models have come mainly from Chinese developers. With Nemotron 3 Nano, Nvidia is now following suit.

Nvidia logo on a building facade

(Image: gguy/Shutterstock.com)

By Dr. Christian Winkler

Shortly before Christmas, the LLM community got a surprise: Nvidia released a new model named Nvidia-Nemotron-3-Nano-30B-A3B. The Reddit community knew a bit earlier, because an attentive reader had discovered that a less attentive Nvidia employee had accidentally pushed a parent directory to Hugging Face. The model, released on December 15, 2025, contains a lot of new ideas, which makes it worth a closer look – also because it is only the first model in an entire family.

Prof. Christian Winkler

is a data scientist and machine learning architect. He holds a PhD in theoretical physics and has been working in the field of big data and artificial intelligence for 20 years, with a particular focus on scalable systems and intelligent algorithms for mass text processing. He has been a professor at Nuremberg Institute of Technology since 2022, where his research focuses on optimizing user experience with modern methods. He is the founder of datanizing GmbH, a conference speaker, and an author of articles on machine learning and text analytics.

Previous Nemotron models were often finetunes of other models such as Llama 3.1. Given the parameter count, similar to that of a Qwen3 model, one might have suspected the same for Nemotron 3. Instead, Nvidia has trained the models from scratch and devised a new architecture for them: the model alternates between the familiar Mixture-of-Experts (MoE) layers and Mamba layers, which, strictly speaking, do not follow the Transformer architecture. The advantage is significantly higher execution speed and lower memory consumption, because the key-value cache, which stores the context, does not grow with the context length in the Mamba layers. This is likely precisely why Nvidia was able to increase the context length to one million tokens, making the model suitable for very long documents.
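
To get a feeling for the difference, here is a small back-of-envelope calculation in Python. The layer and head counts in it are illustrative assumptions, not Nemotron's actual configuration; it only shows that the KV cache of classic attention layers grows linearly with context length, while a Mamba layer keeps a fixed-size state.

```python
# Illustrative only: how the KV cache of attention layers grows with context
# length. The layer and head counts below are assumptions for the sake of the
# example, not Nemotron's real configuration.

def kv_cache_bytes(attn_layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    # Factor 2 for keys and values; bfloat16 needs 2 bytes per value.
    return 2 * attn_layers * kv_heads * head_dim * context_len * bytes_per_value

for ctx in (8_192, 131_072, 1_000_000):
    gb = kv_cache_bytes(attn_layers=8, kv_heads=8, head_dim=128,
                        context_len=ctx) / 1e9
    print(f"{ctx:>9} tokens -> ~{gb:5.2f} GB KV cache (attention layers only)")
```

A Mamba layer's state, by contrast, stays the same size whether the context is 8K or one million tokens.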

Although the model has “Nano” in its name, it is not really small: it has 31.6 billion parameters, of which it uses 3.6 billion for each token prediction. This makes the model fast, and the more cheaply computed Mamba layers contribute to this as well; Nvidia speaks of a factor of 3.3 in speed compared to comparable models. Such figures cannot easily be verified, and the same applies to the best-in-class accuracy for reasoning, coding, tool use, and multi-step agent tasks that Nvidia claims. Here, the model still has to prove itself in practice.

Compared to competitors Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B-A4B, Nemotron 3 Nano performs very well in Nvidia's benchmarks.

(Image: Nvidia)

In contrast, the ability to switch reasoning on and off or to limit the number of generated tokens is directly verifiable. This is particularly important for agent tasks, as otherwise uncontrollably high costs can arise.

Nemotron 3 Nano consists of 52 layers with a model dimension of 2,688 and uses 32 attention heads. The Mamba layers have 128 state dimensions in eight groups, with 64 Mamba heads of 64 head dimensions each. In total there are 128 experts, two of which act as shared experts; on top of that, the model activates six further experts per token. Since each expert has an inner dimension of only 1,856, this explains the 3.6 billion active parameters. Apart from the Mamba layers, other models use similar MoE architectures.
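
The stated 3.6 billion active parameters can be roughly sanity-checked. The following sketch is deliberately simplified; in particular, the assumptions that every other layer is an MoE layer and that each expert is a gated MLP with three weight matrices are mine, not taken from the technical report.

```python
# Deliberately simplified estimate of the active parameters per token.
# Assumptions not from the technical report: every other layer is an MoE
# layer, and each expert is a gated MLP with three weight matrices.

d_model, d_expert = 2688, 1856
active_experts = 2 + 6          # two shared plus six routed experts per token
moe_layers = 52 // 2            # assumption: 26 of the 52 layers carry experts

params_per_expert = 3 * d_model * d_expert     # gate, up and down projection
moe_active = moe_layers * active_experts * params_per_expert
print(f"MoE share: ~{moe_active / 1e9:.2f} billion active parameters")
# Attention, Mamba and embedding weights (roughly another half billion) would
# bring the total close to the 3.6 billion that Nvidia states.
```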


What truly distinguishes the Nemotron model from almost all other models, however, is the training data. Nvidia has published almost all of it, as well as the algorithms used in training. Apart from the Olmo and Apertus models, only a few providers have achieved this so far. The data can be found as pre- and post-training datasets on Hugging Face.

Some of the data appears to come from the future, with a modification date of December 20, 2025, an obvious error. Regardless, the data extends to June 2025. When asked about its knowledge cutoff, however, the model answers June 2024 – another inconsistency. In total, 10 trillion tokens are available for download in the datasets. Training (or even finetuning) on that much data is hardly feasible with affordable hardware, but it is still exciting to look at the datasets or at least use parts of them. In any case, this makes the models significantly more transparent. The Nano model is released under the Nvidia Open Model License, which also permits commercial use and modification.
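
Anyone who wants to look at the data without downloading trillions of tokens can stream it. A minimal sketch with the Hugging Face datasets library follows; the repository name is a placeholder, the actual dataset IDs are listed on Nvidia's Hugging Face page.

```python
# Minimal sketch: stream a sample of the published pre-training data instead
# of downloading everything. The repository name is a placeholder; the real
# dataset IDs are listed on Nvidia's Hugging Face page.
from datasets import load_dataset

ds = load_dataset(
    "nvidia/<pretraining-dataset>",  # placeholder, replace with the actual ID
    split="train",
    streaming=True,                  # iterate without a full download
)

for i, example in enumerate(ds):
    print(example)                   # inspect a handful of records
    if i >= 4:
        break
```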

According to the Artificial Analysis Index, which aims to capture the openness and intelligence of models, Nemotron 3 Nano achieves good scores in both categories.

(Image: https://blog.vllm.ai/2025/12/15/run-nvidia-nemotron-3-nano.html)

Significantly more information can be found in Nvidia's blog post on Hugging Face, in the Nvidia blog, in an associated white paper, and in the technical report. A GitHub project contains cookbooks that show how to use the model with frameworks such as SGLang or vLLM.

Pre-Training

Pre-training, or base training, refers to the phase in which the model is trained on very large amounts of data to predict the next token. Typically, pre-training consumes by far the most computing power, although this is currently changing with some providers (such as Qwen).

For pre-training, Nvidia uses a Warmup-Stable-Decay (WSD) learning-rate scheduler and trains the model on 25 trillion tokens from 15 different categories. The developers split pre-training into two phases: the first uses 23.5 trillion tokens of relatively simple text, followed by a second phase with 1.5 trillion tokens of significantly higher quality.
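
As a reminder of how such a schedule behaves, here is a minimal sketch of a Warmup-Stable-Decay learning-rate curve. The step counts and learning rates are made up for illustration; Nvidia's actual values are not part of this sketch.

```python
# Minimal sketch of a Warmup-Stable-Decay (WSD) learning-rate schedule.
# All step counts and learning rates are illustrative, not Nvidia's values.

def wsd_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5,
           warmup_frac=0.01, decay_frac=0.1):
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:                      # linear warmup
        return peak_lr * step / max(warmup_steps, 1)
    if step < stable_end:                        # long stable phase
        return peak_lr
    progress = (step - stable_end) / max(decay_steps, 1)
    return peak_lr + (min_lr - peak_lr) * progress   # final decay to min_lr

for s in (0, 1_000, 50_000, 95_000, 100_000):
    print(s, f"{wsd_lr(s, total_steps=100_000):.2e}")
```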

The largest of the 15 components is data from web crawling (Common Crawl), divided into five sub-areas with different quality levels (crawl-medium, crawl-medium-high, syn-crawl-medium-high, crawl-high, syn-crawl-high). In addition to the crawl data, the mix includes mathematical data, data from Wikipedia, specific programming code, and various other sources.

In pre-training, Nvidia also uses synthetic data, which is normally reserved for supervised finetuning. By Crawl++, Nvidia means data from OpenWebText, BigScience, and Reddit. Pre-training covers 19 languages: Arabic, Chinese (presumably Mandarin), Czech, Danish, Flemish, Finnish, French, German, Hebrew, Hindi, Italian, Japanese, Korean, Portuguese, Polish, Russian, Spanish, Swedish, and Thai. Nvidia assigns a higher weight to higher-quality data during training; however, the technical report is silent on how quality is determined.
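
How exactly the weighting works is not documented, but the principle can be sketched as weighted sampling over data sources. The weights below are invented purely for illustration.

```python
# Invented weights, purely for illustration: higher-quality sources are
# sampled more often during training.
import random
from collections import Counter

source_weights = {
    "crawl-medium": 1.0,
    "crawl-medium-high": 1.5,
    "crawl-high": 2.5,
    "wikipedia": 3.0,
    "math": 3.0,
}

names, weights = zip(*source_weights.items())
draws = Counter(random.choices(names, weights=weights, k=100_000))
for name in names:
    print(f"{name:>18}: {draws[name] / 1000:.1f}% of samples")
```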

In the different phases, Nvidia works with different context lengths. Because of the Mamba layers, Nvidia does not need to use RoPE (Rotary Position Embeddings) for context extension; with the higher-quality content, the context is increased to up to 512K tokens. Nvidia trains Nemotron 3 Nano in bfloat16; for the larger variants, which have not yet been released, the much more compact NVFP4 format is used. Nvidia claims that this does not lead to significant quantization errors. Nvidia has also released the base models created after pre-training, which are not yet finetuned.

Post-Training

Nvidia divides post-training into three phases: supervised finetuning (SFT), reinforcement learning with verifiable rewards (RLVR) for different environments, and finally reinforcement learning with human feedback (RLHF).

In the SFT phase, Nvidia uses a context length of 256k tokens and training data from chats, agent dialogues, and reasoning. The reasoning data is intended to teach the model to respect a limit on reasoning or to switch it off entirely, so that it does not generate too many tokens and thereby drive up costs. At this stage, the model also learns to reason with tools. Nvidia separates the data into different areas: mathematical problems with formal proofs, programming code, scientific and software topics, and various languages. Nvidia also takes safety into account here, so that the model does not overstep its boundaries. Very specific to Nvidia is the CUDA training data, which is why Nemotron is also proficient at CUDA programming.

In RLVR, Nvidia trains in parallel on data from very similar areas. The focus here is on verifiable results; for example, programs must pass unit tests. Unfortunately, Nvidia does not explain whether it verifies individual process steps, similar to DeepSeek V3.2; this might be an optimization that can be applied at a later stage. The context in RLVR, at 256,000 tokens, is slightly smaller than in SFT.
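
The basic idea of a verifiable reward for code can be sketched in a few lines: a generated program only earns a reward if it passes predefined unit tests. This is a simplification of what such a pipeline does; real systems sandbox the execution far more rigorously.

```python
# Sketch of a verifiable reward in the spirit of RLVR: a generated program
# only earns a reward if it passes the given unit tests. Heavily simplified;
# real pipelines sandbox the execution much more strictly.
import os
import subprocess
import sys
import tempfile

def unit_test_reward(generated_code: str, test_code: str) -> float:
    """Return 1.0 if all tests pass, otherwise 0.0."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate_test.py")
        with open(path, "w") as f:
            f.write(generated_code + "\n\n" + test_code)
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=10)
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if result.returncode == 0 else 0.0

candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(unit_test_reward(candidate, tests))   # 1.0
```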

Nvidia introduces new ideas in RLHF and uses a generative reward model (GenRM), interestingly a large Qwen3 model (Qwen3-235B-A22B-Thinking-2507). This Qwen3 model is first finetuned: during training, it uses its reasoning capabilities to judge which of two answers is more helpful. Nvidia checks correctness using a synthetic dataset and the HelpSteer3 dataset. Once GenRM is trained, it is used in the actual reinforcement learning and evaluates 16 candidate answers from Nemotron. To avoid evaluating all 120 possible pairs, GenRM only compares each answer with the next one (and the last with the first), resulting in 16 comparisons. Nvidia has thus replaced human feedback with GenRM feedback, which scales much better – there will be no shortage of the necessary hardware. It is almost astonishing that Nvidia does not perform all 120 comparisons.
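
The counting is easy to verify: with 16 answers there are C(16, 2) = 120 possible pairs, but the circular scheme only needs 16 comparisons. The sketch below illustrates this; `genrm_prefers` is merely a placeholder for the actual reward-model call.

```python
# Circular comparison scheme: each answer is only compared with its neighbour
# (and the last with the first), i.e. 16 comparisons instead of all
# C(16, 2) = 120 pairs. `genrm_prefers` is a stand-in for the GenRM call.
import random
from itertools import combinations

answers = [f"answer_{i}" for i in range(16)]

def genrm_prefers(a: str, b: str) -> str:
    """Placeholder for the generative reward model's pairwise judgement."""
    return random.choice([a, b])

circular_pairs = [(answers[i], answers[(i + 1) % len(answers)])
                  for i in range(len(answers))]

wins = {a: 0 for a in answers}
for a, b in circular_pairs:
    wins[genrm_prefers(a, b)] += 1

print(len(circular_pairs), "comparisons instead of",
      sum(1 for _ in combinations(answers, 2)))   # 16 instead of 120
```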

Finally, Nvidia quantizes the bfloat16 model to FP8 and shows that almost no quality is lost. It is likely that they also tried this with NVFP4 and achieved worse results, which is why the larger models were trained directly in this data format.

Both vLLM and SGLang already support the new Nemotron model. The model can also be used with llama.cpp, as its architecture with Mamba layers is very similar to that of Qwen3-Next. This allows the model to run on moderate hardware, and it even works at an acceptable speed on the CPU without a GPU.
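
For a quick local test with vLLM's offline API, a few lines of Python suffice. The repository ID below is an assumption derived from the model name; the exact ID and any extra flags for long contexts are documented in Nvidia's cookbooks.

```python
# Minimal sketch for a local test with vLLM's offline API. The repository ID
# is an assumption derived from the model name; the exact ID and any extra
# flags for long contexts are documented in Nvidia's cookbooks.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Nvidia-Nemotron-3-Nano-30B-A3B",  # assumed Hugging Face ID
    max_model_len=32768,       # keep modest to limit memory on smaller GPUs
)
params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain Mamba layers in three sentences."], params)
print(outputs[0].outputs[0].text)
```

With llama.cpp, the equivalent would be a GGUF conversion of the model run via llama-cli or llama-server.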

When asked about Heise Verlag, the model answers a bit too creatively, but at least in correct German.

The full answer can be found here.

The model can count the number of “e”s in “Erdbeere” (strawberry) excellently and answers concisely – much more concisely than almost all other models tested so far.

The speed of Nemotron 3 Nano is high; on a Mac Studio (M2 Ultra), it achieves about 80 tokens/s when generating answers.

The larger Nemotron models, which have not yet been released, are said to have even more tricks up their sleeves. Nvidia announces LatentMoE, for example, and explains that the associated expert layer design has been optimized for hardware. Like the NVFP4 format, this will likely only work well with Nvidia GPUs, because these capabilities are only supported by Nvidia's latest hardware generation.

Multi-token prediction is already mastered by some models, and the Super and Ultra models are also said to be able to do this. Nvidia expects this to lead to improved generation of long texts and overall higher model quality. It is not yet known how large the further models will be and when they are expected – Nvidia speaks of “in the coming months.”

Nvidia has delivered. With the Nemotron family, even in its smallest Nano version, a model is finally available that can compete with the Chinese providers of open-weight models. If Nvidia's evaluations are to be believed, the model currently leads when weighing accuracy against inference throughput and thus cost per token (the accuracy-inference-throughput frontier). At the same time, the model is available with open weights and can be used commercially. In addition, Nvidia has published a large portion of the training data, creating an almost open-source model. It will be exciting to see how good the announced larger models turn out to be.

Most recently, Nvidia also released the framework used to measure the models' performance; it is freely available and called Open Evaluation Standard. This is certainly another contribution to the transparency of the models and may motivate other providers to benchmark their models with it as well.

(pst)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.