Vision language models fail in simple image tests

Even the largest vision language models from OpenAI, Google and Meta cannot solve very simple tasks, according to a study.

Letters made of wooden blocks spell the word F(AI)L.

(Image: Shutterstock/FrankHH)


Children can solve similar problems as early as kindergarten: In which direction does a spiral turn? Which elements are aligned vertically and which horizontally? What sounds trivial to humans presents even the largest vision language models (VLMs) with major, sometimes insurmountable, challenges. This is shown by a study conducted by TU Darmstadt, Eindhoven University of Technology, the German Research Center for Artificial Intelligence (DFKI) and hessian.ai.

OpenAI assures that GPT-4o has improved in "logical thinking", the study notes, but the "depth of these advances in language-guided and abstract thinking has not yet been sufficiently researched". It is unclear whether the models can live up to these ambitious promises. For this reason, the researchers explain, they are venturing into the "wonderland of Bongard problems". Mikhail Moiseevich Bongard was a Soviet computer scientist who in the 1960s designed a series of small puzzles built around pattern recognition. In the study, for example, the VLMs were supposed to recognize which objects were convex, which were concave, and which belonged together.

The researchers describe the result as follows: "And even when asked to explicitly focus on and analyse these concepts, they continue to fail, indicating not only a lack of understanding of these elementary visual concepts, but also an inability to generalize to unseen concepts." From this, they also conclude that there is a significant difference between human thinking and machine cognition.

All major AI providers are currently using their VLMs to bring AI agents to market. Google, for example, envisions these agents doing internet research, shopping online and even booking flights for people. OpenAI and Microsoft are also working on AI agents; at Microsoft, the initial focus is on individual agents with specialized tasks rather than generalists. Anthropic has already given developers access to a general AI agent for Claude, which can control the mouse, fill in form fields and act fairly autonomously. All of these agents evaluate screenshots in order to decide what to do next.

Scientists see visual understanding as an important basis for how humans find their way around their environment and interact with objects. AI now tries to recreate this ability. Because they respond in a very human-like way, VLMs often appear intelligent. In fact, the study found "dramatic deficiencies" in their reasoning and visual perception.

VLMs do not recognize horizontal and vertical alignment.

(Image: TU Darmstadt)

Even presenting the VLMs with multiple-choice answers (100 options) helped little. Only narrowing the selection down to 10 options led to better results, and even then the hit rate was at best around 60 to 70 percent. According to the researchers, the failure stems partly from an inability to perceive the image correctly, compounded by a lack of logical thinking and reasoning.
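The article does not reproduce the study's evaluation code, but the multiple-choice setup it describes can be sketched in a few lines. The following Python snippet is a minimal, hypothetical illustration: query_vlm is an invented stand-in for a real model call, and the items are toy data, not examples from the study.

```python
import random

# Hypothetical stand-in for a real VLM call. The study's actual
# prompting setup is not published in this article; a real version
# would send the image and the answer options to a model API.
def query_vlm(image_path: str, choices: list[str]) -> str:
    return random.choice(choices)  # placeholder: guesses at random

def hit_rate(items: list[dict], num_choices: int) -> float:
    """Fraction of items for which the model picks the correct
    answer out of `num_choices` multiple-choice options."""
    hits = 0
    for item in items:
        # Restrict the selection, e.g. from 100 options down to 10.
        options = item["choices"][:num_choices]
        if query_vlm(item["image"], options) == item["answer"]:
            hits += 1
    return hits / len(items)

# Toy data, invented for illustration only.
items = [
    {"image": "spiral.png",
     "choices": ["clockwise", "counterclockwise"],
     "answer": "clockwise"},
    {"image": "grid.png",
     "choices": ["vertical", "horizontal"],
     "answer": "vertical"},
]
print(f"hit rate: {hit_rate(items, num_choices=2):.2f}")
```

Note that with 10 options, random guessing alone already yields a 10 percent hit rate, which puts the reported ceiling of 60 to 70 percent into perspective.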

The models tested perform significantly better on other benchmarks, but they were often trained directly for those. Other studies show that even minimal changes to such tasks cause results to deteriorate sharply. The authors of this study therefore argue that current benchmarks may be of limited use for testing the logical reasoning abilities of AI models; others doubt that the models can reason logically at all.

(emw)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.