Can AI deliver new mathematical insights? Top researchers put it to the test

With a new method, ten researchers are putting the mathematical "creativity" of large language models to the test. The preliminary result is sobering.

[Image: A hand writes mathematical formulas on a blackboard. Can AI outrank human math researchers? (Image: New Africa/Shutterstock.com)]


Whether in complex calculations or logical proofs: language models like ChatGPT and Gemini are now considered highly proficient at mathematics. How they perform off the beaten path is far less clear. Can they bring creativity to bear on unsolved scientific questions, or are they merely good at reproducing what they have already learned?

Ten renowned mathematicians are investigating this question in an experiment. Each researcher contributed a test question from a different mathematical field, drawn from their own as-yet-unpublished research. Since no answers to these questions exist on the internet or anywhere else, the language models cannot fall back on memorized knowledge to find a solution. The aim is to test how far an AI can go beyond its training data and develop approaches of its own.

To this end, the group presented OpenAI's ChatGPT 5.2 Pro and Google's Gemini 3.0 Deep Think with the research questions. The AI systems were granted unrestricted access to internet search.

In an interview with the New York Times, the researchers share initial impressions from preliminary tests. Mathematician Martin Hairer is impressed by how confidently and correctly the AI can string together a series of known arguments and the calculations in between. When it comes to real research, however, a different picture emerges: according to Hairer, the AI's attempts resemble the work of a poor student who knows roughly where to start and where to end up, but has no real idea how to get there.


“I haven't yet seen a plausible example of a language model coming up with a truly new idea or a fundamentally new concept,” says Hairer, a recipient of the Fields Medal, the most prestigious award in mathematics. Mathematical research, he believes, is therefore “quite safe” from being taken over by large language models.

Some of Hairer's colleagues report similar experiences from their tests. Mathematician Tamara Kolda, who also contributed a question, criticizes the AI for having no opinion of its own, which makes it a poor partner for genuine collaboration, quite unlike her human colleagues.

Hairer, for his part, finds that the AI comes across as overconfident: verifying whether its answers are correct takes considerable effort. Here too, he says, the comparison with a student comes to mind, one where you can't quite tell whether they are truly brilliant or just good at producing “bullshit.”

The experiment is intended as an independent, public AI benchmark, separate from the usual testing procedures of the large LLM providers. Beyond the purely technical verification, the scientists also want to push back against the myth that mathematics has already been “solved” by AI, and with it the fear among students that an academic career in the field has become pointless.

The ten questions have been available online since last week. The idea is for the research community to experiment with the problems and form its own impressions before the solutions are published on February 13.

The experiment does not end there, however: in a few months, after the approach has had time to mature, the group plans to formulate a second round of problems. Incorporating the feedback received, these should make for an even more objective AI benchmark.

(vza)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.