SimpleQA: OpenAI develops benchmark for hallucinations
AI models struggle with factual accuracy and short, precise answers. OpenAI now wants to make this measurable and comparable.
SimpleQA is designed to measure how well AI models answer questions that have exactly one correct answer. OpenAI had 4326 questions created for this purpose. Even GPT-4o and o1-Preview, however, do not perform particularly well: they only achieve around 40 percent correct answers.
The test aims to get a better handle on the problem of hallucinations, i.e. cases in which large language models (LLMs) give incorrect answers because they draw false conclusions. OpenAI is releasing SimpleQA as open source; the AI company evidently hopes that the new benchmark will be widely used.
AI trainers collected questions for the test; more than 4000 of these made it into the set. Each question must admit only a single correct answer, which was verified by having two AI trainers answer the same question independently. The tasks also had to be diverse, covering many different subject areas, from movies to science, geography and technology. OpenAI also says the questions were intended to challenge frontier models, i.e. the current best models, more than previous tests have done. Finally, the test was meant to be as accessible and quick to run as possible.
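The selection criterion described above, that a question only qualifies if two independent trainers arrive at the same single answer, can be sketched as follows. This is an illustrative simplification, not OpenAI's actual tooling; in reality the answers come from human trainers, and the function and variable names are invented here.

```python
# Illustrative sketch: a candidate question is kept only if two
# independent trainers gave the same answer (after normalization).
def has_single_answer(answer_a: str, answer_b: str) -> bool:
    """Two independent trainers must agree for a question to qualify."""
    return answer_a.strip().lower() == answer_b.strip().lower()

# Hypothetical candidate questions with the two trainers' answers
candidates = [
    ("Which planet is closest to the sun?", "Mercury", "Mercury"),
    ("What is the best programming language?", "Python", "C"),  # ambiguous
]
accepted = [q for q, a, b in candidates if has_single_answer(a, b)]
print(accepted)  # ['Which planet is closest to the sun?']
```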
Better answers and more certainty from the large LLMs
Another AI trainer was then used to check whether the answers matched those specified by the question creators, and any remaining inaccuracies were eliminated. A customized version of ChatGPT monitors and evaluates how well the AI models under test perform: it compares each model's answer with the trainers' reference answer and determines whether the answer is correct, incorrect, or not attempted (for example, when the model asks the questioner to do their own research on the internet).
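The three-way classification described above can be sketched in a few lines. The real benchmark uses a customized ChatGPT model as the grader; here a plain string comparison and a list of refusal phrases stand in for it, and all names are illustrative rather than OpenAI's actual API.

```python
# Hypothetical SimpleQA-style grader: classifies a model answer as
# 'correct', 'incorrect', or 'not_attempted'. A string comparison
# stands in for the ChatGPT-based grader used by OpenAI.
REFUSAL_MARKERS = ("look it up", "research this", "i don't know")

def grade(model_answer: str, reference_answer: str) -> str:
    """Compare a model answer against the trainers' reference answer."""
    normalized = model_answer.strip().lower()
    if any(marker in normalized for marker in REFUSAL_MARKERS):
        return "not_attempted"  # model deflects, e.g. "please look it up"
    if normalized == reference_answer.strip().lower():
        return "correct"
    return "incorrect"

def accuracy(results: list[str]) -> float:
    """Share of correct answers over all graded questions."""
    return results.count("correct") / len(results)

results = [
    grade("Paris", "Paris"),
    grade("Lyon", "Paris"),
    grade("Please look it up yourself.", "Paris"),
]
print(results)  # ['correct', 'incorrect', 'not_attempted']
print(f"{accuracy(results):.2f}")  # 0.33
```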
As expected, the smaller OpenAI models perform significantly worse. But even GPT-4o only answers around 40 percent of the questions correctly; o1-Preview scores only slightly higher.
(Image: OpenAI blog post)
SimpleQA is also meant to test so-called calibration, i.e. how well a model's stated certainty matches its actual accuracy. This can be measured directly by asking the model to state how sure it is that its answer is correct and comparing that stated confidence with the actual result. Alternatively, the same question can be asked 100 times: variation across the answers also provides information about the reliability of the AI model.
The more identical the repeated answers, the greater the confidence that the answer is correct. Here too, GPT-4o and o1-Preview are significantly more reliable, and more often correct, than the smaller models. Results showing how models from other providers perform are not yet available.
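The two calibration checks described above, comparing stated confidence with the actual outcome and measuring consistency across repeated answers, can be sketched like this. The sketch assumes the answers have already been collected; nothing here calls a real model, and all function names are invented for illustration.

```python
# Hypothetical sketch of the two calibration measures described above.
from collections import Counter

def consistency_confidence(answers: list[str]) -> tuple[str, float]:
    """Ask the same question many times: the share of the most frequent
    answer serves as an implicit confidence estimate."""
    most_common, count = Counter(answers).most_common(1)[0]
    return most_common, count / len(answers)

def calibration_gap(stated_confidence: float, was_correct: bool) -> float:
    """Compare stated confidence with the actual outcome (0 or 1);
    averaged over many questions, this approximates calibration error."""
    return abs(stated_confidence - (1.0 if was_correct else 0.0))

# 100 simulated answers to the same question: 83 identical, 17 divergent
samples = ["1969"] * 83 + ["1970"] * 17
answer, confidence = consistency_confidence(samples)
print(answer, confidence)  # 1969 0.83

# Model claimed 90 % certainty and was in fact correct
print(calibration_gap(0.9, True))  # small gap = well calibrated
```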
(emw)