Humanity's Last Exam: New AI test that all models fail
Even the most powerful models manage less than 10 percent of the tasks in a new AI benchmark: Humanity's Last Exam.
(Image: photoschmidt/Shutterstock.com)
According to their providers, the latest and most powerful AI models easily score 90 percent on common benchmarks, meaning they can solve that proportion of the tasks in a standardized test. Now there is a new test, accompanied by a scientific paper: Humanity's Last Exam. Even the most advanced models fail it.
The benchmark was developed by two US organizations, Scale AI and the Center for AI Safety (CAIS). They gathered around 70,000 questions from almost 1,000 experts in 50 countries, each contributing questions from their own field. Of these, 13,000 were examined more closely in a review process, and 3,000 ultimately made it into the test. The test covers mathematics, the natural sciences, the humanities and more. The tasks range from pure text questions to multimodal ones that require understanding diagrams and images. As its name suggests, the experts believe they have developed the ultimate test.
One of the questions reads: "Hummingbirds within the Apodiformes have a unique, bilaterally paired oval piece of bone, a sesamoid, embedded in the caudolateral part of the expanded, cruciform aponeurosis of the insertion of the depressor caudae muscle. How many paired tendons are supported by this sesamoid? Give a number." (Editor's note: if there is an error in the rendering of the question, it is because I am not a bird expert, much like the common AI models.) Further sample questions are published at lastexam.ai.
OpenAI, Google, Anthropic – AI models stay below 10 percent
The models tested on the Last Exam included OpenAI's GPT-4o and o1, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro. All of them ended up with less than ten percent correct answers, the authors write. Nevertheless, given how rapidly AI models are improving, they expect scores on this test to rise considerably by the end of the year. It should also be kept in mind that AI models learn from such tasks: it is not always clear whether a model solves a task because it has deduced or understood something, or whether the answer has simply been memorized and reproduced.
(Image: CAIS paper)
In their conclusion, the authors also note that these are academic tasks, not tasks requiring particular creativity or open-ended answers; those areas call for other tests. The aim of the paper, however, is to give scientists and policymakers a common reference point for assessing AI capabilities.
Scale AI and CAIS are both based in San Francisco. The former offers data sets for AI training; CAIS is a non-profit organization working on AI safety and ethics. Dan Hendrycks, co-founder of CAIS, has previously published math benchmarks. With another recent math benchmark, FrontierMath, it only recently came to light that OpenAI, of all companies, had co-financed its development by EpochAI. OpenAI's o3 model performed best on that very test, solving 25.2 percent of the problems.
(emw)