LLM: Alibaba's Qwen3 challenges Llama 4 & Co.
Alibaba's current AI model is significantly more powerful than its predecessor Qwen2.5 and outperforms the competition in some areas.
(Image: Created with AI (Midjourney) by the iX editorial team)
- Dr. Christian Winkler
The long-awaited new Qwen model was released on April 28, 2025. After briefly appearing on Hugging Face and ModelScope a few times and then disappearing again, Qwen3 is now generally available.
Heterogeneous model family
As usual, the new version is available in several sizes, ranging from 600 million to 235 billion parameters. In contrast to Llama 4, Alibaba has not trained the smaller models as Mixture-of-Experts models. Qwen3 therefore offers dense models with 600 million, 1.7 billion, four billion, eight billion, 14 billion and 32 billion parameters. Compared to the previous Qwen2.5 models, the smaller ones (600M to 4B) have each grown by around 10 percent, while the two larger ones have practically not changed in size.
However, Qwen3 now also provides the large models as Mixture-of-Experts models. The model with 30 billion parameters activates "only" three billion parameters per calculation, while the larger model with 235 billion parameters activates 22 billion. Anyone with enough RAM can therefore generate tokens very quickly with the 30B model; even the 235B model should be faster than the dense 32B model.
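A back-of-the-envelope comparison illustrates the trade-off. This is only a sketch; the figure of roughly 4.5 bits per weight is an assumption for a typical Q4_K_M quantization, and real GGUF file sizes vary:

```python
# Rough weight-memory estimate vs. parameters active per token.
# Assumption: ~4.5 bits per weight for a Q4_K_M-style quantization.
def approx_gib(params_in_billions: float, bits_per_weight: float = 4.5) -> float:
    return params_in_billions * 1e9 * bits_per_weight / 8 / 2**30

models = [
    ("Qwen3-32B (dense)", 32, 32),
    ("Qwen3-30B-A3B (MoE)", 30, 3),
    ("Qwen3-235B-A22B (MoE)", 235, 22),
]
for name, total_b, active_b in models:
    print(f"{name}: ~{approx_gib(total_b):.0f} GiB of weights, "
          f"{active_b}B parameters active per token")
```

The whole model still has to fit into memory, but the number of active parameters determines how much work is done per generated token.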
What was surprising about the Llama 4 release was the lack of a reasoning model. Qwen3 is much more consistent here: all models can handle reasoning – if you want them to. The enable_thinking parameter causes the model to "think" before giving an answer and to make this thinking process available. This works well and is extremely practical: you only need to load one model into RAM and can use it as a reasoning model or a classic LLM, depending on the use case.
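A minimal sketch of how this switch is used with the Hugging Face Transformers library (model name and prompt are placeholders; the enable_thinking flag is passed to the chat template, as described in the Qwen3 model cards):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # placeholder; any Qwen3 checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": 'How many "e"s are in "Erdbeere"?'}]

# enable_thinking=True lets the model emit a <think>...</think> block first;
# enable_thinking=False turns the same checkpoint into a classic instruct model.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```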
The blog article on the Qwen3 release contains interesting information on training. For the base model, the amount of pre-training data has been doubled compared to Qwen2.5, from 18 trillion to 36 trillion tokens, and the developers used text extracted from PDFs (with Qwen2.5-VL) in addition to HTML documents. Alibaba used the older Qwen2.5-Math and Qwen2.5-Coder models to generate synthetic training data.
The training took place in two phases: the first with a context length of 4K and 30 trillion tokens, the second with an increased context length of 32K and more sophisticated content, adding reasoning and STEM capabilities. Particularly exciting is the post-training, which Alibaba carried out in four stages, but only for the large models (32B and 235B), following the process pioneered by DeepSeek with two reinforcement-learning phases. The developers then transferred the capabilities of the trained models to the smaller base models using knowledge distillation.
The model cards contain some interesting information about what Alibaba has improved. For example, the reasoning capabilities of Qwen3 are claimed to be better than those of QwQ. In addition, the mathematical, logical and coding capabilities are said to have improved, and the models are also said to be well suited to creative writing. Particularly important: Qwen3 is strongly optimized for working with agents and MCP. It is a multilingual model family that can handle over 100 languages. The possible context length of 32,768 tokens should be sufficient for many use cases; with RoPE scaling (Rotary Position Embedding), it can be increased to up to 128K for some models. In addition to weights in Bfloat16 precision, FP8 weights are also available.
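One way to extend the context beyond 32K is to override the RoPE scaling in the model configuration. The following is only a sketch, assuming the YaRN-style settings described in the Qwen3 model cards (factor 4.0 over the original 32,768 positions); model name and values are placeholders to adapt:

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "Qwen/Qwen3-8B"  # placeholder checkpoint

# Assumption: YaRN RoPE scaling as suggested in the Qwen3 model cards.
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                              # 4 x 32,768 = ~131K tokens
    "original_max_position_embeddings": 32768,
}

model = AutoModelForCausalLM.from_pretrained(
    model_name, config=config, torch_dtype="auto", device_map="auto"
)
```

Scaling only pays off for genuinely long inputs; for prompts that fit into 32K, the default configuration is the better choice.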
In contrast to the surprising release of Llama 4, Qwen3 seems to be extremely well thought out. Unlike Llama 4, it can be used in the EU without any problems, as the weights are licensed according to Apache 2.0. This is a small improvement on the Qwen 2.5 models, which could not be used freely in all sizes.
Tests with the new model
Because the weights can be downloaded as open models, you can try out for yourself how well Qwen3 actually works.
Fortunately, almost all Qwen3 models are already available in quantized form, so you can easily download them from Hugging Face and try them out. Because the models are relatively small, they are easy to run with llama.cpp. They also run on GPUs, although the larger models can be a tight fit depending on the available memory.
The experiments were carried out with Qwen3-32B-Q4_K_M.gguf. Unfortunately, llama.cpp cannot yet handle the enable_thinking parameter correctly. Appending /no_think to the prompt serves as a workaround.
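A minimal sketch using the llama-cpp-python bindings (file path and context size are assumptions; the /no_think suffix disables the reasoning block as described above):

```python
from llama_cpp import Llama

# Assumed local path to the quantized model used in the experiments.
llm = Llama(
    model_path="Qwen3-32B-Q4_K_M.gguf",
    n_ctx=8192,        # context window for the test prompts
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": 'How many "e"s are in "Erdbeere"? /no_think'}],
    max_tokens=512,
)
print(resp["choices"][0]["message"]["content"])
```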
Counting and calculating with Qwen3
A first test of the model with and without reasoning gives the following result:
How many "e"s are in "Erdbeere"?
<think>
Okay, I need to find out how many "e"s are in the word "Erdbeere". Let me write the word out and look at each letter.
E - R - D - B - E - E - R - E
Now, I'll count the "e"s:
1. The first letter is E (1)
2. The fifth letter is E (2)
3. The sixth letter is E (3)
4. The eighth letter is E (4)
So there are 4 "e"s in the word "Erdbeere".
Actually, let me double check that. I might be counting too quickly.
E - R - D - B - E - E - R - E
1. The first letter is E (1)
2. The fifth letter is E (2)
3. The sixth letter is E (3)
4. The eighth letter is E (4)
Yes, that's 4 "e"s in the word "Erdbeere".
Wait, I'm using German here, and in German, capitalization is important. The first letter E is a capital letter, and the others are lowercase. But the question is just asking how many "e"s there are, regardless of case. So I think the answer is still 4.
Actually, wait again. Let me count once more carefully:
E (1) R D B E (2) E (3) R E (4)
Yes, there are 4 "e"s in the word "Erdbeere".
</think>
To determine the number of letters "e" in the word "Erdbeere", let's look at the word letter by letter:
E - R - D - B - E - E - R - E
Now we count the letters "e" (both lowercase and uppercase):
1. The first letter is E (1)
2. The fifth letter is E (2)
3. The sixth letter is E (3)
4. The eighth letter is E (4)
The number of letters "e" in the word "Erdbeere" is therefore
4
How many "E"s are in "Erdbeere"? /no_think
There are 4 "E"s in the word "Erdbeere".
Here is the breakdown:
Erdbeere
- The 1st "E" is at the beginning
- The 2nd "E" is in the 4th position
- The 3rd "E" is in the 5th position
- The 4th "E" is at the end
✅ Result: 4 E
Qwen3-32B can also count letters in German. This works with and without (!) reasoning. This makes it clearly superior to Llama 4, which is more than three times the size. It also works with the English word "strawberry". Not all Qwen3 models can do this. The smallest model (0.6B) consistently miscounts, the 1.7B model can already count correctly in reasoning mode, the 4B model can do it in English without reasoning, and the 14B model can also do it in German.
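For reference, the ground truth is trivial to verify outside the model (the German prompt counts the "e"s in "Erdbeere"; the well-known English variant of this test asks for the "r"s in "strawberry"):

```python
# Ground truth for the letter-counting tests.
print("Erdbeere".lower().count("e"))   # 4
print("strawberry".count("r"))         # 3
```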
Unfortunately, things don't look as good with mathematical puzzles. The factorization of 2^20+1 (1,048,577 = 17 * 61,681) does not work with any of the (quantized) Qwen3 models: at some point there is a calculation error, or the model gets stuck in an infinite loop.
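The correct result itself is easy to check:

```python
# Verify the factorization the models struggle with.
n = 2**20 + 1            # 1_048_577
print(n == 17 * 61_681)  # True: 2^20 + 1 = 17 * 61,681
```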