New AGI test overwhelms AI models
The ARC Prize Foundation has published a new test for AI models. While many people can solve it, the AI models fail.
(Image: Anggalih Prasetya/Shutterstock.com)
Human intelligence beats artificial intelligence (AI): The ARC Prize Foundation has developed a test to assess the performance of current AI models. While humans usually pass the test, the AI models fail.
The test involves solving sample tasks that also appear in common intelligence tests. For example, geometric figures have to be assigned colors according to certain criteria. In another task, such figures have to be put together. These tasks force the AI models to adapt to problems with which they have not previously been confronted.
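To illustrate the kind of task involved: ARC-style puzzles are commonly distributed as JSON, where each task holds a few demonstration input/output pairs plus a test input, and every grid is a small matrix of integers standing for colors. The following Python sketch shows that structure with an invented rule (recolor every 1 to 2); it is purely illustrative and not an actual ARC-AGI-2 task.

```python
# Minimal sketch of an ARC-style task: a few input/output grid pairs.
# Grids are small integer matrices; each integer stands for a color.
# The "rule" below (recolor every 1 to 2) is invented for illustration only.

task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": [
        # the solver must infer the rule from the train pairs and produce the output
        {"input": [[0, 0], [1, 1]]},
    ],
}

def apply_inferred_rule(grid):
    """The rule a solver would have to infer from the demonstration pairs."""
    return [[2 if cell == 1 else cell for cell in row] for row in grid]

predicted = apply_inferred_rule(task["test"][0]["input"])
print(predicted)  # [[0, 0], [2, 2]]
```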
What humans can solve easily with a little thought (around 60 percent of a control group of more than 400 test subjects managed it) completely overwhelms the AI models: reasoning models such as OpenAI's o1 and DeepSeek's R1 achieved 1 percent and 1.3 percent respectively. Other models such as GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash finished the test with a score of 1 percent.
The test, called ARC-AGI-2, was developed by the ARC Prize Foundation and is intended as a benchmark for the capabilities of Artificial General Intelligence (AGI). It is the successor to ARC-AGI-1, for whose solution the non-profit organization awarded one million US dollars last year.
The ARC-AGI-1 data set is five years old
The ARC-AGI-1 data set dates back to 2019, and the competition required 85 percent of the test to be solved. By the end of 2024, performance had jumped from 33 to 55.5 percent, but the set target was not reached.
According to the initiators, the old data set had several weaknesses and was therefore replaced by the new one. Efficiency was also introduced as a new criterion for ARC-AGI-2. "Intelligence is not only determined by the ability to solve problems or achieve high scores. The efficiency with which these skills are acquired and applied is a crucial, determining component," writes Greg Kamradt, one of the two founders of the ARC Prize Foundation, in a blog post. "The key question that arises is not only: 'Can AI acquire the ability to solve a task?' but also: 'With what efficiency or at what cost?'"
The conditions for the new edition of the competition have been adapted accordingly: the AI model must not only solve the tasks 85 percent of the time, it must also do so efficiently, i.e., incur low costs per task. The target is 42 US cents per task.
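As a rough sketch of what such a twofold criterion amounts to, the snippet below checks a set of task results against both thresholds. The data fields and the averaging used here are assumptions for illustration, not the foundation's official evaluation code.

```python
# Hedged sketch: does a model run meet both competition criteria,
# i.e., at least 85 percent of tasks solved and at most 42 US cents
# average cost per task? Field names and aggregation are assumptions.

from dataclasses import dataclass

@dataclass
class TaskResult:
    solved: bool
    cost_usd: float  # inference cost attributed to this task

def meets_arc_agi_2_target(results: list[TaskResult],
                           min_accuracy: float = 0.85,
                           max_cost_per_task: float = 0.42) -> bool:
    accuracy = sum(r.solved for r in results) / len(results)
    avg_cost = sum(r.cost_usd for r in results) / len(results)
    return accuracy >= min_accuracy and avg_cost <= max_cost_per_task

# Example: 9 of 10 tasks solved at an average cost of 0.30 USD per task.
runs = [TaskResult(solved=(i != 0), cost_usd=0.30) for i in range(10)]
print(meets_arc_agi_2_target(runs))  # True
```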
(wpl)