LLMs from Meta: The fourth Llama herd has arrived
Llama 4 was released in a rush. After the enthusiasm generated by Llama 3, the sequel surprises in some areas and disappoints in others.
(Image: Created with AI (Midjourney) by the iX editorial team)
- Dr. Christian Winkler
It was to be expected that Llama 4 was just around the corner, but that Meta suddenly released the large language model on a weekend of all days came as a surprise. The new Llama models appeared on April 5. Presumably, Meta wanted the models out before its Llama conference (LlamaCon) on April 29. Meta may also have been trying to get ahead of Alibaba's new Qwen3 model, which is likewise due to be released in the next few days.
A look at the architecture
Meta has documented the architecture of Llama 4 in detail and, like its competitor DeepSeek, now relies on sparse mixture-of-experts (MoE) models. The previous Llama models were dense and came in different sizes.
The new series again comes in several model sizes. The smaller Llama 4 Scout has 109 billion parameters spread across 16 experts, of which "only" 17 billion are active at any one time. The larger Llama 4 Maverick has a full 400 billion parameters in 128 experts, of which likewise only 17 billion are ever active.
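To illustrate the principle: in a sparse mixture-of-experts layer, a router selects only a few experts per token, so most of the weights stay idle in any given forward pass – which is how a 109-billion-parameter model gets away with 17 billion active parameters. The following PyTorch sketch uses toy dimensions and a simple top-k router; it is not Meta's implementation, whose routing details are more involved:

```python
import torch
import torch.nn as nn


class SparseMoE(nn.Module):
    """Minimal sparse mixture-of-experts layer (illustrative only)."""

    def __init__(self, d_model=64, d_ff=256, n_experts=16, top_k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                 # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)                 # mixing weights of chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                     # tokens routed to expert e
                if mask.any():                            # only selected experts do any work
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out


y = SparseMoE()(torch.randn(8, 64))  # 8 tokens, each handled by 1 of 16 experts
```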
Compared to the previous models, this is an increase in size. Llama 3 did include a – dense – model with 405 billion parameters, but it was used relatively rarely. Besides the 70-billion-parameter model, the smaller models with eight billion and fewer parameters – down to the small language models with one and three billion parameters – were particularly popular. There are no models of this size in Llama 4 yet, and the blog article leaves open whether any are planned.
Instead, Meta describes an even larger model in great detail: Llama 4 Behemoth, with two trillion parameters in 16 experts, of which "only" 288 billion are active. It serves as a so-called teacher model for Maverick and Scout, but has not yet been fully trained. Interestingly, it nevertheless appears to be superior to all other (including proprietary) models in many areas, although the currently leading Gemini 2.5 is missing from the comparison.
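Meta has not published Behemoth's exact distillation recipe. As a rough idea of what a teacher model does, here is the classic soft-label distillation loss (after Hinton et al.), in which the student is pulled toward the teacher's softened output distribution; the temperature and weighting below are illustrative values, not Meta's:

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft target: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so the gradient magnitude is independent of T
    # Hard target: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```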
How good Behemoth really is can only be judged after its release. In any case, it will require considerable hardware resources, as the model is even larger than GPT-4: an entire cluster of H100 or even GB200 systems is needed to run it.
Meta specifies a context length of ten million tokens for Scout (one million for Maverick). That is extremely large; other models often offer only 128,000 tokens, and a few go up to a million. With all these long contexts, the "lost in the middle" problem has not yet been satisfactorily solved or even fully understood. It shows up when models correctly take into account the text at the beginning and end of the context but forget the text in the middle. With ten million tokens, the middle is presumably quite large – but the models have not yet been tested that thoroughly.
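A common way to probe this is a needle-in-a-haystack test: plant a fact at varying depths of a long filler text and check whether the model can still retrieve it. The helper below is a hypothetical sketch – prompt wording, filler text and sizes are arbitrary:

```python
def build_probe(filler: str, needle: str, total_sentences: int, depth: float) -> str:
    """Bury `needle` at the given relative depth (0.0 = start, 1.0 = end)."""
    sentences = [filler] * total_sentences
    sentences.insert(int(total_sentences * depth), needle)
    return " ".join(sentences)


needle = "The secret code is 7-42-19."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_probe("The sky was grey that day.", needle, 2000, depth)
    # Feed prompt + "\nWhat is the secret code?" to the model and check
    # whether "7-42-19" appears in the answer at each depth.
```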
Meta says it trained the models with reduced precision (FP8), an approach that DeepSeek was the first to use at this scale. Interestingly, Meta still publishes the Scout weights only in bfloat16; for Maverick, an FP8 version is also available on Hugging Face. Otherwise, Meta carried out a standard training process with supervised fine-tuning and Direct Preference Optimization. Reinforcement learning also played a role, but not to the same extent as with DeepSeek-R1. Meta is silent about the total GPU time required.
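The chosen precision translates directly into the memory footprint of the weights, which is easy to estimate (weights only, ignoring KV cache and runtime overhead; the roughly 55 GB for a four-bit Scout is consistent with the 60-plus GB of the quantized files mentioned below):

```python
# Bytes per parameter: bfloat16 = 2, FP8 = 1, 4-bit quantization = 0.5
for name, params in [("Scout", 109e9), ("Maverick", 400e9), ("Behemoth", 2e12)]:
    bf16, fp8, q4 = params * 2, params * 1, params * 0.5
    print(f"{name}: bf16 ~{bf16 / 1e9:.0f} GB, fp8 ~{fp8 / 1e9:.0f} GB, 4-bit ~{q4 / 1e9:.0f} GB")
```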
The chat template that registered users can find on Hugging Face is interesting. It contains comments such as "{#- FIXME: The processor requires an array, always. #}" – apparently Meta was not quite finished. The knowledge cutoff is also visible in the template: July 26, 2024. This slightly contradicts the recommended system prompt, which mentions August 2024.
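Anyone with access to the gated repository can render the template themselves via transformers; the repository name below is an assumption for the Scout instruct model and may differ:

```python
from transformers import AutoTokenizer

# Assumed repo name; downloading requires an approved Hugging Face login.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain Heise Zeitschriftenverlag!"},
]
# Render the chat template as plain text instead of token IDs.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```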
Incidentally, Llama 4 is the first model stored on Hugging Face in the Xet format, which is intended to replace Git LFS in the long run and should be much faster because it is optimized for the binary structure of language-model data.
First tests
All Llama 4 variants are very large models that fit on hardly any consumer graphics card. Meta writes in its blog that a single H100 GPU is sufficient for Scout, but very few people have one of those lying around. And even those who do must quantize the model to four bits.
However, quantized models also run very well on a CPU, especially on a Mac. That still requires plenty of memory, as the Scout models quantized to four bits already exceed 60 GB. Apple's MLX framework has already been adapted to Llama 4, and llama.cpp has supported it since release b5074 (April 8, 2025). So there is nothing to stop you from trying the models out. I generated the results shown here on a Mac Studio M2 Ultra using a Q4_K_M model, which produced around 29 tokens per second. The IQ4_XS model used for comparison was slightly faster and delivered similar results. Those who use Apple's own MLX framework can expect around 30 percent better performance.
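With the Python bindings for llama.cpp (llama-cpp-python), a quick local test looks roughly like this; the GGUF file name is a placeholder for whichever Q4_K_M quantization you downloaded:

```python
from llama_cpp import Llama  # needs a build based on llama.cpp b5074 or newer

llm = Llama(
    model_path="Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,        # modest context to keep memory usage in check
    n_gpu_layers=-1,   # offload all layers, e.g. to Metal on a Mac
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": 'How many "r"s are in "strawberry"?'}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```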
Llama 4 knows Heise, but not quite correctly. In the following, errors are marked in bold and unverifiable claims in italics:
> Explain Heise Zeitschriftenverlag!
Heise Zeitschriftenverlag is a German publishing house specializing in the publication of trade journals and online media. The publishing house was founded in **1948** in Hanover and is today one of the largest and most renowned specialist publishers for information technology (IT) and telecommunications in Germany.
Heise Zeitschriftenverlag is primarily known for its publications in the fields of computers, electronics and telecommunications. The publisher's best-known titles include:
- c't: One of the largest and most influential computer magazines in Germany, focusing on computer hardware, software and IT technology.
- iX: A magazine for Unix and Linux administrators as well as for anyone involved in IT infrastructure and security.
- **MINDfactory**: A magazine for anyone interested in electronics, robotics and home automation.
Over the years, Heise Zeitschriftenverlag has also expanded into online media and operates various websites, forums and social media channels. The online offerings include:
- heise.de: the publisher's website, which offers the latest news, articles and background information on IT and telecommunications topics.
- **c-t-online.de**: The online edition of c't magazine.
- ix.de: The online edition of the magazine iX.
Heise Zeitschriftenverlag is *a member of the Börsenverein des Deutschen Buchhandels and the Verband der Zeitschriftenverleger*. The publishing house *employs around 200 people* and is based in Hanover.
Overall, Heise Zeitschriftenverlag is an important provider of specialist information and media in the field of information technology and telecommunications in Germany and enjoys a high reputation among experts and enthusiasts alike.
Llama 4 gives convincing but incorrect answers even to the pointless questions that are often put to language models:
> How many "r "s are in "strawberry"?
There are 2 "r "s in "strawberry".
> How many "E "s are in "strawberry"?
There are 3 "E "s in "strawberry".
The results are mixed. Llama 4 Scout is certainly a solid model, but at first glance it is not that much better than Qwen2.5-72B. That is a first, subjective impression; in individual, specific scenarios the picture could change completely.