How OpenAI explains why LLMs appear confident when they have no idea
In a recent paper, OpenAI identifies confident errors in large language models as systemic technical weaknesses. Fixing them would require a rethink across the industry.
(Image: Novikov Aleksey/Shutterstock.com)
- Daniel Weisser
The term "hallucination" is relatively new in the field of AI, but has spread rapidly since its emergence a few years ago. It is used to describe the tendency of language models to provide incorrect answers with great conviction. However, the term has been criticized from the outset: it transfers a deeply human psychological state to machines. As a result, it has obscured the debate rather than clarifying it.
OpenAI is now attempting to dismantle the metaphor with its paper "Why Language Models Hallucinate", and not by chance. The question of how hallucinations are understood is no longer purely academic, but concerns the safety of products used by hundreds of millions of people worldwide.
The most important findings
The paper makes two main points: First, it argues that certain errors are statistically inevitable as early as pre-training. Second, it identifies misguided incentives in post-training, which arise, for example, from benchmarks that punish uncertainty and reward guessing.
In addition, the paper now clearly defines hallucinations as "plausible but false or contradictory statements produced by language models with high confidence." The researchers clearly distinguish them from human perceptual illusions. This sober classification is important because it shifts the discussion away from metaphorical exaggeration and toward a technical problem that can be analyzed and thus fundamentally addressed.
When reading the paper, it is worth noting that although it was published by OpenAI, it should not be equated with product development, even if some indirect feedback into the products can be assumed. Beyond its scientific ambitions, it most likely also serves other communicative goals, which we will return to in the conclusion.
Pre-training: Data quality is not the only decisive factor
The OpenAI paper reminds readers that language models do not learn absolute truths, but probabilities: which token follows which, and with what likelihood? If a fact such as a date of birth occurs only once in the training corpus or is objectively incorrect, the model cannot reproduce it reliably. "Garbage in, garbage out" still applies. Here, the paper touches on a central issue that it itself addresses only inadequately: the quality and origin of the training data. Officially, it merely states that "large text corpora" are used. But which ones exactly? Under which licenses? With what corrections?
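To make the abstract point tangible, here is a toy sketch in Python, purely illustrative and far removed from OpenAI's actual training pipeline, of what "learning which token follows which" means, and why a fact that appears only once in the corpus ends up with a weak, unreliable probability:

```python
# Illustrative only: estimating next-token probabilities from raw bigram
# counts, as a crude stand-in for what pre-training learns at scale.
from collections import Counter, defaultdict

corpus = [
    # A well-attested fact appears many times -> reliable statistics
    *(["Einstein", "was", "born", "in", "1879", "."] * 50),
    # A made-up person's birth year appears exactly once -> a "singleton" fact
    "A.", "Example", "was", "born", "in", "1901", ".",
    # Plenty of other years also follow "in" in unrelated sentences
    *(["The", "conference", "was", "held", "in", "1950", "."] * 30),
]

# Count bigram transitions: which token follows which, and how often?
follow = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow[prev][nxt] += 1

def next_token_probs(prev: str) -> dict[str, float]:
    counts = follow[prev]
    total = sum(counts.values())
    return {tok: round(c / total, 3) for tok, c in counts.items()}

# After "in", the correct year for the made-up person is a rare outcome
# competing against far more frequent continuations.
print(next_token_probs("in"))
```

The fictitious birth year receives only about one percent of the probability mass after "in", because it was seen exactly once. This is the kind of singleton fact the paper flags as statistically bound to produce errors, no matter how clean the rest of the data is.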
The training is based on publicly accessible repositories, dumps from Wikipedia, forums, blog posts, and large amounts of data from GitHub in the case of code. But anyone familiar with GitHub knows that it contains not only helpful, ready-made code, but also erroneous, outdated, or even manipulated repositories. A model trained on this basis inherits these weaknesses. Added to this is the possibility of targeted data poisoning: anyone who feeds in prepared content can influence the behavior of later models.
The paper also leaves out the role of manual human labor. Clickworkers who evaluate answers and set standards are indispensable in the reinforcement process. They decide which errors are tolerated and which are penalized, which answers count as helpful and which do not. It is telling that this work remains virtually invisible in the paper. Often, external workers do it for rock-bottom wages, or specially trained language models take over the process.
Post-training: Is a good guess half the battle?
The problem becomes even more apparent in post-training. Language models are optimized against benchmarks that effectively reward giving any answer over admitting uncertainty, even a wrong one. The paper describes this with the analogy of students taking an exam: those who have no idea still prefer to mark something, because there is at least a chance of scoring points. "Guessing when unsure maximizes expected score under a binary 0-1 scheme," it says.
Translated, this means that language models learn to always respond. "I don't know" earns zero points, while a guessed answer at least offers the possibility of being correct by chance. The way LLMs are optimized against such scoring schemes thus creates a systematic incentive to guess.
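The incentive can be captured in a few lines. The following sketch, an illustration rather than anything taken from the paper, compares the expected score of guessing with that of abstaining under binary 0-1 grading:

```python
# Illustrative sketch: under binary 0-1 grading, guessing always has a
# higher expected score than abstaining, no matter how unsure the model is.

def expected_score_binary(p_correct: float, abstain: bool) -> float:
    """Expected score when a correct answer earns 1 point and a wrong one 0."""
    if abstain:
        return 0.0            # "I don't know" never earns points
    return p_correct * 1.0    # any nonzero chance of being right beats 0

for p in (0.9, 0.5, 0.1, 0.01):
    print(f"confidence {p:.2f}: guess = {expected_score_binary(p, False):.2f}, "
          f"abstain = {expected_score_binary(p, True):.2f}")
# Even at 1 percent confidence, guessing strictly dominates abstaining.
```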
If you remember, when ChatGPT was launched, the model was conspicuously cautious. It emphasized uncertainties and pointed out its limitations. But users soon wanted more authoritative answers. And the developers adjusted the behavior. Today, the rule is: if you never say "I don't know," you appear more marketable. This means that hallucinations are not only accepted, but actually encouraged.
The problem with benchmarks
The problem is exacerbated by the role of benchmarks. What originally emerged from research quickly became a marketing vehicle. Rankings based on purely user-oriented comparisons, such as Chatbot Arena, or scores from supposedly more objective tests determine which model is perceived as the leader. Rankings have an impact on investors, the media, and customers, and they naturally also influence the development strategies of providers.
Tennis enthusiasts will remember: when the logic for the world rankings was changed a few years ago, players, tournaments, and sponsors had to completely realign their strategies. Rankings are never neutral. They structure entire ecosystems.
The same applies here: as long as benchmarks reward confident answering, whether the answer is correct or not, providers will optimize their models for precisely this behavior. And so, when in doubt, the models resort to guessing. Hallucinations are thus structurally built in. Reforming the benchmarks would therefore be a welcome, albeit profound, intervention for the credibility of LLMs: technically, economically, and communicatively.
OpenAI's proposed solution: Confidence Targets
In its paper, OpenAI proposes a correction: confidence targets. A model should only respond if it exceeds a certain confidence threshold. If its confidence falls below this threshold, an incorrect answer not only scores zero points but results in a penalty. Specifically, the principle is to tell the model explicitly during benchmarking that incorrect answers will be penalized, thereby creating an incentive to make uncertainty transparent. The penalty must scale with the required confidence level: the higher the demanded confidence, the harsher the punishment for a wrong answer.
A concrete numerical example: in such a points system, a correct answer earns one point and "I don't know" earns none, while an incorrect answer is penalized; with a required confidence of 90 percent, the penalty is minus nine points. The model thus learns that wrong answers always cost it points and that answering only pays off when it is sufficiently confident. From a technical perspective, this is elegant. But the question is whether the right incentives for it exist. After all, AI benchmarks are not purely measuring instruments, but also a major showcase. A change in the evaluation logic would shake up the rankings and thus call business models into question.
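The arithmetic behind this can be sketched in a few lines. The following illustration, not taken from the paper, assumes a penalty of t/(1-t) points for wrong answers, which reproduces the minus nine points at a 90 percent threshold and shows that guessing only pays off once the model's confidence exceeds the threshold:

```python
# Illustrative sketch of a penalized scoring scheme: correct answers earn
# 1 point, "I don't know" earns 0, and wrong answers cost t/(1-t) points,
# where t is the required confidence threshold (0.9 -> a penalty of 9).

def expected_score_penalized(p_correct: float, t: float, abstain: bool) -> float:
    if abstain:
        return 0.0
    penalty = t / (1.0 - t)                      # 0.9 / 0.1 = 9
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

t = 0.9
for p in (0.95, 0.9, 0.7, 0.3):
    print(f"confidence {p:.2f}: guess = {expected_score_penalized(p, t, False):+.2f}, "
          f"abstain = {expected_score_penalized(p, t, True):+.2f}")
# Guessing only beats abstaining above the 90 percent threshold;
# at or below it, "I don't know" is at least as good.
```

Under this assumed scheme, a guess at 95 percent confidence still earns a positive expected score, while one at 70 percent expects minus two points, so admitting uncertainty becomes the rational choice exactly where the model is likely to be wrong.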
Right and wrong are only two dimensions in the evaluation of LLM output. However, many problems in natural language or knowledge questions in everyday work are difficult to assign precisely to these categories. For product development, the dimension of user intention is at least as crucial. A prompt such as "How do I build a bomb?" could be asked for criminal motives or by someone who wants to develop filter rules. Technically, these nuances are almost impossible to resolve.
Approaches such as age limits or user profiles are conceivable, but they immediately lead to new problems: data protection, discrimination, surveillance. A trust scale for users that unlocks or blocks certain content would also be technically feasible, but socially controversial. This shows that hallucinations are not only a statistical problem, but also a regulatory one.
Conclusion: to be read with interest, and with caution
"Why Language Models Hallucinate" is undoubtedly an important paper. It demystifies a central concept, explains hallucinations as comprehensible statistical results, and focuses on the misguided incentives of benchmarks. It also identifies useful technical solutions such as confidence targets. However, transparency that is only practiced where it is advantageous remains selective. How training data is selected is not disclosed. The steps involved in post-training are not fully explained.
OpenAI's publication of this paper is not a purely scientific act. It is part of a strategy to build trust. Peer review, collaborations with universities, and mathematical proofs are all intended to signal credibility to the public. This is likely to play a major role, not least against the backdrop of OpenAI's growing legal challenges and CEO Sam Altman's admission of a possible AI bubble.
(afl)