Apple paper: Why reasoning models probably don't think

They consume a lot of computing power but do not always deliver better results: Large Reasoning Models are supposed to revolutionize AI. An Apple study casts doubt on that.


Head model with phrenology imprint: Pseudoscientific ideas about reasoning existed in humans, too. Does something similar apply to reasoning models?

(Image: life_in_a_pixel / Shutterstock)


In a research paper on large reasoning models (LRMs), Apple's machine learning research group comes to the conclusion that the "thinking" of LRMs could be at least partly an illusion. A further problem is that reasoning models require significantly more energy and computing power, as reflected in their longer response times.

LRMs are AI models designed to give regular language models the ability to reason logically. These systems attempt to break a task down into individual thought steps, which are then output to the user. However, it is not yet clear whether the system really "thinks" internally or whether the reasoning is merely additional generated text that has little influence on the result.
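To make the setup concrete: many reasoning models, DeepSeek-R1 among them, wrap this visible chain of thought in <think> tags before the final answer. The following minimal Python sketch (an illustration assuming that tag convention, not code from the Apple paper) simply separates the two parts of such an output:

def split_reasoning(raw_output: str) -> tuple[str, str]:
    # Assumes a DeepSeek-R1-style format in which the chain of thought is
    # wrapped in <think>...</think>; other models use different conventions.
    start = raw_output.find("<think>")
    end = raw_output.find("</think>")
    if start == -1 or end == -1:
        return "", raw_output.strip()  # no visible reasoning block
    reasoning = raw_output[start + len("<think>"):end].strip()
    answer = raw_output[end + len("</think>"):].strip()
    return reasoning, answer

example = "<think>12 * 7 = 84, plus 6 gives 90.</think>The result is 90."
trace, answer = split_reasoning(example)
print(trace)   # 12 * 7 = 84, plus 6 gives 90.
print(answer)  # The result is 90.

Whether such a trace actually drives the final answer, or is merely generated alongside it, is exactly the question the Apple paper raises.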

For their paper, Apple's AI researchers examined two LRMs: Claude 3.7 Sonnet Thinking and DeepSeek-R1. The tasks were mainly puzzles, including the River Crossing problem and the Tower of Hanoi, at varying levels of complexity. On simple tasks, the two LRMs turned out to be less accurate and less efficient than their counterparts without reasoning, which also consumed less power. Moderately difficult tasks seemed to suit the reasoning models. As complexity increased further, it no longer mattered how much computing power was available to the LRMs: accuracy collapsed.
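The Tower of Hanoi shows why such puzzles lend themselves to controlled scaling of complexity: the shortest solution for n disks always takes 2^n - 1 moves, so adding a single disk doubles the length of the required move sequence. A short Python sketch (purely illustrative, not taken from the study) makes this explicit:

def hanoi(n, source="A", target="C", spare="B"):
    # Yields the optimal move sequence for n disks: 2**n - 1 moves in total.
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)
    yield (source, target)
    yield from hanoi(n - 1, spare, target, source)

for n in (3, 7, 10):
    print(f"{n} disks: {len(list(hanoi(n)))} moves")  # 7, 127 and 1023 moves

An evaluator only needs to change the disk count to raise the difficulty and can verify every generated move mechanically – which is precisely what makes such puzzles attractive for the kind of controlled experiments the researchers describe.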

"We found that LRMs have limitations when it comes to exact calculation: They do not use explicit algorithms and reason inconsistently across puzzles," said the Apple researchers. However, LRMs are by no means only used for puzzles – they can at least be helpful in other subject areas.


The Apple study has met with mixed reactions in the AI scene. Some experts consider it too narrow in scope, while others praised the approach. In fact, the researchers offer no real explanation for why the performance of LRMs declines on harder tasks. That is a hard question to answer, though, since "looking inside" LRMs is just as difficult as looking inside regular large language models (LLMs). There is also the question of how far the results generalize: the chosen tasks were very specific.

The Apple researchers acknowledge this themselves: "We are aware that our work has limitations. While our puzzle environments enable controlled experiments with detailed control of problem complexity, they represent only a small sample of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning tasks." How illusory the "thinking" of LRMs really is therefore remains to be seen.


(bsc)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.