Alibaba takes its LLM Qwen3 to the next level
Thanks to a new architecture, the Qwen3-Next model is significantly slimmer than the original models, but similarly powerful.
- Dr. Christian Winkler
A new Qwen3 model appeared on 10 September 2025 with relatively little fanfare. The key specifications sound unspectacular: 80 billion parameters, three billion of which are active at any time. The changes under the hood, however, are substantial and could indicate the direction in which language models will continue to develop.
New model architecture
The Qwen team identifies the total number of parameters and the context length as the biggest bottlenecks in both training and inference. Compared to the Qwen3 models that have been available for some time, the new model features the following innovations in particular:
- Hybrid attention mechanism
- Lean mixture-of-experts structure
- Training optimizations
- Prediction of multiple tokens
Hybrid attention mechanism: In 75 percent of the layers, the new model uses a form of so-called linear attention (Gated DeltaNet), which requires significantly less memory and computing time. The remaining layers use the standard attention mechanism. According to the Qwen blog post, this hybrid architecture achieves better results than using the same attention mechanism in all layers. With this change, the model can no longer be described as a pure transformer architecture.
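How such a hybrid stack could be assembled is sketched below; the class names, layer count, and 3:1 split are illustrative placeholders and do not reproduce Qwen's actual code:

```python
# Illustrative sketch of a hybrid attention stack: three linear-attention
# (Gated-DeltaNet-style) layers for every full-attention layer.
# Class names and layer count are hypothetical, not Qwen's implementation.

import torch.nn as nn

NUM_LAYERS = 48          # hypothetical depth
FULL_ATTN_EVERY = 4      # 75 % linear attention, 25 % standard attention

class LinearAttentionBlock(nn.Module):
    # Placeholder: a Gated-DeltaNet-style layer whose memory and compute
    # grow linearly with sequence length.
    def forward(self, x):
        return x  # real implementation omitted

class FullAttentionBlock(nn.Module):
    # Placeholder: a standard quadratic softmax-attention layer.
    def forward(self, x):
        return x  # real implementation omitted

def build_hybrid_stack() -> nn.ModuleList:
    layers = []
    for i in range(NUM_LAYERS):
        # every fourth layer keeps standard attention, the rest use linear attention
        if (i + 1) % FULL_ATTN_EVERY == 0:
            layers.append(FullAttentionBlock())
        else:
            layers.append(LinearAttentionBlock())
    return nn.ModuleList(layers)
```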
Lean mixture-of-experts structure: Mixture-of-experts (MoE) models only ever activate a subset of their parameters and can therefore predict tokens more quickly. MoE models have been around for several years; DeepSeek in particular introduced innovations with its V3 architecture, which uses significantly more experts: 256 instead of the usual eight, of which only eight are active at any one time. As a result, only 37 billion of the 671 billion parameters are required for each prediction. Qwen3-Next goes a step further: with "only" 80 billion parameters in total, it uses 512 experts, ten of which are consulted for each token. Each prediction therefore requires only three billion parameters.
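The sparsity can be illustrated with a minimal top-k routing sketch; the hidden size and token count below are illustrative and not Qwen3-Next's actual configuration:

```python
# Minimal sketch of top-k expert routing in an MoE layer: a router scores all
# experts per token and only the k best-scoring experts are evaluated.
# Tensor dimensions are illustrative, not Qwen3-Next's real configuration.

import torch

NUM_EXPERTS = 512   # experts per MoE layer (as reported for Qwen3-Next)
TOP_K = 10          # experts consulted per token

hidden = torch.randn(4, 2048)                  # 4 tokens, hypothetical hidden size
router = torch.nn.Linear(2048, NUM_EXPERTS)    # one routing score per expert

scores = router(hidden)                                        # (4, 512) logits
weights, expert_ids = torch.topk(scores.softmax(dim=-1), TOP_K, dim=-1)

# Only the selected experts' feed-forward weights are evaluated, which is why
# each prediction touches roughly 3 of the 80 billion parameters.
print(expert_ids.shape)   # torch.Size([4, 10]): ten expert indices per token
```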
Training optimizations: Training large language models is enormously complex and takes hundreds of GPU years. Data scientists therefore pay great attention to optimizing this process as much as possible. While Moonshot.ai, for example, uses the Muon optimizer, the Swiss Apertus model relies on Goldfish Loss to make training more efficient. Qwen3-Next brings several optimizations of its own. The hybrid attention mechanism already helps here, but the developers also use a zero-centered RMS (Root Mean Square) norm for the layer weights, because the previously used QK (query-key) norm led to exploding weights. In addition, they apply a procedure, not described in detail, that is intended to give all MoE experts unbiased training signals. It is possible that the auxiliary-loss-free load balancing method published by DeepSeek is used here, but the Qwen authors remain silent on the details.
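The exact implementation is not public; the following is a hedged sketch of what a zero-centered RMS norm could look like, assuming the learnable gain is stored as an offset around 1 so that weight decay pulls the effective scale towards 1 rather than towards 0:

```python
# Hedged sketch of a zero-centered RMSNorm. The learnable gain is stored as a
# zero-initialized offset and applied as (1 + weight), so weight decay drives
# the offset towards 0 (i.e. an effective scale of 1) instead of shrinking the
# scale itself. This is one plausible reading, not Qwen's published code.

import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # zero-centered gain offset

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # classic RMS normalization ...
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        # ... followed by the (1 + weight) scaling
        return x * rms * (1.0 + self.weight)
```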
Prediction of multiple tokens: Several models have already experimented with multi-token prediction, but so far mainly as an optimization of the training process. Here, too, Qwen3-Next goes a step further and also uses it during inference. Because the tokens predicted in advance are not always correct and have to be verified, the process is also known as speculative decoding. What previously required tricks such as combining a small draft model with a large one, Qwen3-Next offers directly.
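The idea behind the verification step can be sketched as follows; the function names are placeholders, not Qwen's or any inference engine's actual API:

```python
# Simplified greedy view of speculative decoding: cheaply drafted tokens (here
# thought of as coming from a multi-token-prediction head) are checked by the
# full model in a single pass; the longest agreeing prefix is kept, everything
# after the first mismatch is discarded.

def speculative_step(context, draft_tokens, verify_fn):
    """
    context:       list of token ids generated so far
    draft_tokens:  tokens proposed cheaply (e.g. by the MTP head)
    verify_fn:     runs the full model once over context + draft_tokens and
                   returns its own next-token choice at every draft position
    """
    verified = verify_fn(context, draft_tokens)

    accepted = []
    for drafted, correct in zip(draft_tokens, verified):
        if drafted == correct:
            accepted.append(drafted)   # draft agrees with the full model
        else:
            accepted.append(correct)   # take the model's token and stop
            break
    return context + accepted          # several tokens per full-model pass
```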
The Qwen team claims that these optimizations allowed it to train the model with only 80 percent of the effort required for the significantly smaller Qwen3-30B-A3B; compared to the dense Qwen3-32B, that is less than ten percent of the effort. The optimizations also help during inference: the model is significantly faster than comparably sized models, especially with long contexts.
Qwen3-Next in practice
Trying out the new model is not quite so easy: the significantly changed architecture causes problems for the popular llama.cpp tool, which will probably not support it for the time being. Things look better with the Transformers library and with vLLM, both of which already work with Qwen3-Next; surprisingly, the same applies to Apple's MLX framework.
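A minimal sketch of running the model via the Transformers library might look like this, assuming a sufficiently recent release with Qwen3-Next support and enough GPU memory; the model ID follows Qwen's usual Hugging Face naming scheme:

```python
# Hedged sketch: load and query Qwen3-Next with the Transformers library.
# Assumes a recent Transformers release with Qwen3-Next support and hardware
# with enough memory for the (possibly quantized) checkpoint.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # assumed Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # spread the weights across the available devices
)

messages = [{"role": "user",
             "content": "Summarize the advantages of linear attention in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```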
Execution works most reliably with quantization, i.e. reduced numerical precision in exchange for lower memory requirements, because the unquantized model would otherwise require more than 160 GB of RAM. On runpod.io, for example, you can rent an RTX 6000 Pro with 96 GB of VRAM for just under two euros per hour and at least experiment with the AWQ model (Activation-aware Weight Quantization). The same applies to Apple hardware, which should have at least 64 GB of RAM. Alternatively, you can use OpenRouter, where the model is available from various providers.
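Via OpenRouter, the model can be addressed through the OpenAI-compatible API, for example with the openai Python package; the model slug below follows OpenRouter's usual naming scheme and should be checked against the provider listing:

```python
# Hedged sketch: call Qwen3-Next through OpenRouter's OpenAI-compatible API.
# The model slug is an assumption based on OpenRouter's naming conventions.

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",   # replace with your own key
)

response = client.chat.completions.create(
    model="qwen/qwen3-next-80b-a3b-instruct",   # slug assumed, verify on OpenRouter
    messages=[{"role": "user",
               "content": "What distinguishes a hybrid attention architecture?"}],
)
print(response.choices[0].message.content)
```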
The answer from Qwen3-Next-Instruct is similar; reasoning therefore brings hardly any improvement. Overall, the Instruct model is rated slightly better on lmarena.ai and livebench.ai. The German version of the strawberry challenge, asking for the number of "e"s in the German word for strawberry, is answered correctly by the Instruct model after an initial incorrect guess.
Qwen3-Next is extremely restrictive when it comes to political questions, and it takes some effort to elicit anything from it (especially with the quantized models). What is interesting about the output is the repeated hint that the model is not allowed to say anything about the topic. It almost looks as if the model had let something slip before falling back on its trained-in evasive answers.
The model runs extremely fast: with the (less efficient) AWQ quantization, it achieves around 20 tokens per second on an RTX 6000 Pro, the 4-bit quantized model reaches almost 50 tokens per second on an M2 Ultra, and OpenRouter lists it at just under 150 tokens per second. This is remarkable for a model of this size.