Apple study: Logical thinking of AI hardly verifiable and "very fragile"
"Reasoning" is the new hype term for large language models. A study by researchers from the iPhone company has now taken a closer look at this.
Let there be intelligence: How logically can LLMs think?
(Image: sdecoret/Shutterstock.com)
A team from Apple's AI research department has examined the current logical reasoning abilities of large language models (LLMs) and concluded that there are still significant problems here, and that it may even be difficult to demonstrate this "reasoning" at all. The preprint study focuses on the mathematical capabilities of LLMs.
"Advanced pattern matching" instead of math genius
Among other things, the researchers wanted to find out whether LLMs actually understand mathematical concepts or only appear to do so. As it turned out, LLMs largely do what one would expect of them: they use (very) advanced pattern matching to arrive at answers. This also applies to current systems such as newer versions of Meta's Llama models or OpenAI's o1, which is marketed specifically for its "thinking" capabilities.
The Apple researchers see problems above all when users do not phrase their queries precisely enough or include content that can distract the model. The results then change, sometimes significantly. In a simple problem about collecting pieces of fruit over several days, adding irrelevant information about the size of some of the fruit made the results around 10 percent worse. Pattern matching is apparently very fragile here, according to the Apple researchers. In some cases, results deteriorated by as much as 65 percent.
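The kind of test involved can be illustrated with a small Python sketch. It is not taken from the study itself: the ask_model stub stands in for whatever model API is under test, and the kiwi problem follows the example discussed in connection with the paper.

# Sketch of the robustness check described in the study: the same word
# problem is posed once in its plain form and once with an irrelevant
# clause added, and the two model answers are compared.
BASE = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "On Sunday he picks double the number he picked on Friday. "
        "How many kiwis does Oliver have?")

DISTRACTOR = " Five of the kiwis picked on Sunday were a bit smaller than average."

def ask_model(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real call to the model under test.
    raise NotImplementedError

def compare(ask=ask_model) -> None:
    plain = ask(BASE)                 # answer to the unmodified problem
    noisy = ask(BASE + DISTRACTOR)    # answer with the irrelevant clause
    print("plain prompt answer:", plain)
    print("with irrelevant clause:", noisy)
    print("answers match:", plain.strip() == noisy.strip())
    # A robust reasoner should ignore the size remark; the expected
    # answer in both cases is 44 + 58 + 2 * 44 = 190.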
New logic benchmark for LLMs from Apple
Overall, the Apple researchers hypothesize that there is no genuine logical thinking in the models; this becomes apparent when the models are "confused" by additional information, which in turn worsens the results. "We hypothesize that this decline is due to the fact that the current LLMs are not capable of true reasoning; instead, they attempt to mimic the reasoning steps observed in their training data." Even a "thinking" AI is thus always guided by what it knows from its training data.
To evaluate the math skills of large language models more accurately in the future, the Apple researchers also introduce a new benchmark called GSM-Symbolic in their study, intended to replace the previous GSM8K benchmark (at elementary school level). Until further improvements arrive, it helps above all to phrase requests more precisely and, in particular, to leave out unnecessary details that could send the model down the wrong track.
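The idea behind GSM-Symbolic, as described in the study, is to turn GSM8K-style questions into symbolic templates whose names and numbers can be varied while the ground truth is computed from the template. The following minimal Python sketch shows the templating idea; the template, names, and value ranges are invented here for illustration.

import random

# A GSM8K-style word problem with placeholders for names and quantities.
TEMPLATE = ("{name} picks {mon} kiwis on Monday and {tue} kiwis on Tuesday. "
            "On Wednesday {name} picks twice as many as on Monday. "
            "How many kiwis does {name} have?")

NAMES = ["Oliver", "Mia", "Yusuf", "Lena"]

def make_variant(rng: random.Random) -> tuple[str, int]:
    # Sample concrete values and compute the ground truth from the template,
    # so many variants of the "same" problem can probe answer consistency.
    mon = rng.randint(20, 60)
    tue = rng.randint(20, 60)
    question = TEMPLATE.format(name=rng.choice(NAMES), mon=mon, tue=tue)
    answer = mon + tue + 2 * mon
    return question, answer

rng = random.Random(0)
for _ in range(3):
    q, a = make_variant(rng)
    print(q, "->", a)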
(bsc)