AI models memorize rather than reason

Even though providers claim that AI models are good at reasoning, a study suggests that they are mostly reciting memorized answers.

A robot depicted upside down

(Image: Tatiana Shepeleva/Shutterstock.com)

This article was originally published in German and has been automatically translated.

Large language models often reproduce memorized solutions rather than actually reasoning, according to a study by the Massachusetts Institute of Technology (MIT) and Boston University. Providers of language models like to claim that their models are particularly good at reasoning, meaning the ability to think logically, which many consider a major sticking point on the path to Artificial General Intelligence (AGI).

The abilities of the language models were examined using counterfactual reasoning tasks. These are tasks that deal with events that did not occur, i.e. making assumptions about what would have happened, or would happen, if certain conditions held or did not hold. A total of eleven tasks were devised in which the rules or conditions deviated slightly from familiar standard tasks. For example, the models were asked to perform additions in number systems other than the decimal system, to depict a cup of bubble tea upside down, to evaluate a chess move, and more.
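To illustrate the kind of counterfactual variant involved, here is a minimal sketch (a hypothetical example, not one of the study's actual prompts): the same digit strings are added once under the standard base-10 rules and once under the counterfactual assumption that they are base-9 numbers. The helper `add_in_base` is introduced here purely for illustration.

```python
def add_in_base(a: str, b: str, base: int) -> str:
    """Add two digit strings interpreted in the given base and
    return the sum as a digit string in that same base.
    (Digits must be valid in the chosen base.)"""
    total = int(a, base) + int(b, base)
    digits = []
    while total:
        total, rem = divmod(total, base)
        digits.append(str(rem))
    return "".join(reversed(digits)) or "0"

# Standard task (base 10) vs. counterfactual variant (base 9):
# in base 10, 27 + 36 = 63; read as base-9 numbers, the same strings
# denote 25 and 33, whose sum 58 is written "64" in base 9.
print(add_in_base("27", "36", 10))  # -> "63"
print(add_in_base("27", "36", 9))   # -> "64"
```

A model that has merely memorized decimal addition tables will tend to answer "63" in both cases, whereas applying the stated rule yields "64" for the base-9 variant.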

(Image: https://arxiv.org/pdf/2307.02477)

While GPT-4, for example, can solve almost all of the standard tasks according to the study published on arXiv, it performs significantly worse on the modified tasks. The researchers point out that the rate of correct answers still suggests a certain ability to generalize, i.e. to reason logically within a limited framework. However, the results are nowhere near as good as providers and common benchmarks would suggest. The finding indicates that the language models memorize a great deal and reproduce what they have learned, while only generalizing to a small extent.

The study also includes a graphic showing drawings and associated tasks from GPT-4. The model was asked to draw a house, a penguin, a cake and a unicorn. All four objects were initially shown correctly. The task was then to mirror the objects, rotate them by 90 degrees and turn them upside down. This hardly worked at all.

(Image: arXiv)

Finally, the authors of the study also ask whether humans would find the deviating questions similarly difficult. They conclude that although humans might take longer to answer them, they would ultimately do so more accurately than the AI models.

This is not the first study to show that large language models (LLMs) do not perform particularly well on reasoning tasks, contrary to what providers claim. Another study, for example, indicated that LLMs could not solve tasks that were not very challenging for primary school pupils.

(emw)