Meta cheats on Llama 4 benchmark
Meta cheated a little with Llama 4's results in the Chatbot Arena. Meta's explanation: they were experimenting.

(Image: Tada Images/Shutterstock.com)
A few days ago, Meta released Llama 4 in two versions. In an accompanying blog post, the company emphasizes that its open models perform at least as well as, and on some common benchmarks better than, the closed competitor models from OpenAI and Google. However, it appears that Meta has cheated a little, specifically when it comes to performance in the LM Arena.
In the arena, people evaluate the output of chatbots: they are shown responses from two models and vote for the one they prefer, and these votes are converted into points, an Elo score. According to Meta's blog post, Llama 4 Maverick scored 1417, better than GPT-4o and slightly below Google's Gemini 2.5 Pro. Attentive testers then discovered, however, that the version of Llama 4 Maverick competing in the arena is not the same one Meta has now made publicly available.
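For readers unfamiliar with the scoring: Elo ratings are updated after each pairwise comparison, with the winner gaining and the loser losing points in proportion to how surprising the outcome was. The following minimal Python sketch illustrates a standard Elo update; the K-factor of 32 and the example ratings are illustrative assumptions, and the LM Arena derives its leaderboard with a more elaborate statistical model than this simple formula.

    def expected_score(rating_a: float, rating_b: float) -> float:
        # Probability that model A beats model B under the Elo model.
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

    def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
        # Return the updated ratings after a single head-to-head vote.
        # k (the K-factor) controls how strongly one vote moves the ratings.
        exp_a = expected_score(rating_a, rating_b)
        score_a = 1.0 if a_won else 0.0
        new_a = rating_a + k * (score_a - exp_a)
        new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
        return new_a, new_b

    # Example: a model rated 1417 beats a model rated 1400 in one user vote.
    print(elo_update(1417, 1400, a_won=True))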
Llama 4 model optimized for chat
The model that could be tested in the arena was called “Llama 4 Maverick optimized for conversationality”. It is unclear how much difference this modification makes. Fundamentally, performance in the Chatbot Arena is not a particularly meaningful benchmark anyway: it depends on the people who rate the responses, and their criteria can differ widely.
When asked by The Verge, Meta responded somewhat evasively that it experiments with all kinds of variants: the arena model was a chat-optimized version, the company tests different versions, and it is now curious to see what developers do with the open model.
Testing customized versions of models in the LM Arena is not explicitly forbidden, and Meta did label the model correctly as “Llama-4-Maverick-03-26-Experimental”. However, nowhere is it pointed out that these results do not correspond to those of the freely available model.
Ahmad Al-Dahle, Vice President of Generative AI at Meta, flatly rejects the accusation that Meta trained Llama 4 on the benchmarks themselves. In a post on X, he writes that the company would never do such a thing. Nevertheless, such accusations keep resurfacing, and they are by no means limited to Meta.
Since large AI models are generally trained on whatever freely available data can be gathered from all kinds of sources, that data often includes material from common benchmarks. Even Meta's chief AI scientist, Yann LeCun, has criticized that many results of AI models are due not to intelligence or reasoning, but to the answers having been learned.
As The Verge writes, there was also surprise that Meta published the models on a Saturday. Mark Zuckerberg has already responded to this, saying they were simply finished. Meta is not alone in this: OpenAI also likes to ship releases on weekends.
(emw)