Reasoning Fail: Common LLMs fail at a very simple task

Researchers have once again shown that current AI models are not good at even simple logical reasoning.

Letters made of wooden blocks show the words F(AI)L.

(Image: Shutterstock/FrankHH)

This article was originally published in German and has been automatically translated.

The task is actually simple: "Alice has N brothers and M sisters. How many sisters does Alice's brother have?" While most adults – and, according to the study's authors, children too – can probably solve it, standard large language models (LLMs) fail. Even worse, the researchers find, the models firmly claim to have found the right answer even when it is wrong, and they argue in a seemingly logical but ultimately incorrect way. This is a well-known problem with language models, yet it still comes as a surprise – especially since providers loudly advertise how good their models are at reasoning.
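For the record, the intended answer follows from a single step the models miss: Alice's brother has all of Alice's M sisters plus Alice herself, so the answer is M + 1, regardless of N. A minimal sketch of that reasoning (illustrative only, not taken from the paper; the function name is made up):

    # Illustrative only, not from the paper: the intended solution to the
    # "Alice in Wonderland" prompt. Any brother of Alice has Alice's M sisters
    # plus Alice herself as sisters, so the answer is M + 1, independent of N.
    def sisters_of_alices_brother(n_brothers: int, m_sisters: int) -> int:
        return m_sisters + 1  # the number of brothers N plays no role

    # Example: with N = 3 brothers and M = 6 sisters, each brother has 7 sisters.
    print(sisters_of_alices_brother(3, 6))  # 7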

Tested were OpenAI's GPT-3.5, GPT-4 and GPT-4o, Anthropic's Claude 3 Opus, Google's Gemini, as well as the open models Llama 2 and 3 from Meta, Mistral and Mixtral from Mistral AI, DBRX from Mosaic and Command R+ from Cohere. The answers were evaluated statistically and show a "severe breakdown in logical thinking and an inability to answer the simple question formulated above". The exceptions were GPT-4 and Claude 3, which at least sometimes answered correctly, according to the paper published by researchers from the Juelich Supercomputing Center, the Research Center Juelich, the School of Electrical and Electronic Engineering at the University of Bristol, and Laion, a non-profit organization from Germany that provides data sets and models.

If you take the widely known metaphor of LLMs as stochastic parrots that merely reproduce what they have picked up, it is not surprising that they fail at such tasks. The parrot comparison comes from a paper by leading AI researchers and critics, including Emily M. Bender and Timnit Gebru. Nevertheless, the providers of current AI models repeatedly make big promises about how well their models perform in logical reasoning tests.

The researchers behind the "Alice in Wonderland" paper – their name for the problem of answering questions about Alice and her siblings – nonetheless consider this lack of ability dangerous: "This breakdown can be considered dramatic not only because it happens on such an ostensible problem, but also because the models tend to label their incorrect solutions as correct, while often providing confabulations to further explain the given answer, mimicking an argument-like tone, but using nonsensical arguments as support for the equally nonsensical, false, definitive answers." The scientists therefore also suggest reconsidering previous benchmarks, which failed to detect such simple reasoning deficits.

(emw)