Why Language Models Hallucinate: From Pre-Training Roots to Post-Training Fixes

This article answers the core question: Why do large language models (LLMs) produce confident yet incorrect “hallucinations,” and what concrete steps can the industry take to reduce these misleading outputs? The answer lies in two interconnected issues—statistical pressures during pre-training that make hallucinations inevitable, and post-training evaluation systems that reward guessing over honesty about uncertainty.

H2: What Are Language Model Hallucinations, and How Do They Differ from Human Errors?

Summary: Hallucinations are plausible but incorrect statements LLMs generate when uncertain, distinct from human errors because they lack appropriate hesitation and persist even in state-of-the-art models. They undermine trust by presenting false information as fact, with two key types: intrinsic (contradicting the prompt) and extrinsic (contradicting reality).

The central question here is: What exactly counts as an LLM hallucination, and how is it different from when a human makes a mistake?

At its core, a language model hallucination is a “plausible falsehood”—an output that sounds reasonable, often with high confidence, but is factually incorrect. Unlike humans, who might say “I’m not sure” or qualify uncertain statements, LLMs typically double down on these errors instead of admitting ignorance. For example, when asked for Adam Tauman Kalai’s birthday (with a request to respond only if known), a state-of-the-art open-source model produced three different incorrect dates (“03-07,” “15-06,” “01-01”) across three attempts, never choosing to abstain even though the prompt explicitly asked it to answer only if it knew; none of the guesses even fell in autumn, the correct season.

Hallucinations fall into two main categories, each with distinct real-world impacts:

  • Intrinsic hallucinations: Contradict the user’s prompt directly. A classic example is the “DEEPSEEK letter count” test: when asked, “How many Ds are in DEEPSEEK? If you know, just say the number,” DeepSeek-V3 returned “2” or “3” in ten trials, while other models like Meta AI and Claude 3.7 Sonnet gave answers as high as “6” or “7.” The correct answer, 1, requires only letter-by-letter counting, yet the models’ outputs contradicted information contained in the prompt itself (the word “DEEPSEEK” is right there to count).
  • Extrinsic hallucinations: Contradict external reality or training data. When queried about Adam Tauman Kalai’s PhD dissertation, three popular models (GPT-4o, DeepSeek, and Llama) each fabricated a different title and year, and two also named the wrong university: GPT-4o claimed it was about “Boosting, Online Algorithms, and Other Topics in Machine Learning” from CMU in 2002; DeepSeek cited “Algebraic Methods in Interactive Machine Learning” from Harvard in 2005; Llama referenced “Efficient Algorithms for Learning and Playing Games” from MIT in 2007. The actual dissertation, completed in 2001 at Carnegie Mellon University, had a different title entirely: “Probabilistic and on-line methods in machine learning.”

The key difference between LLM hallucinations and human errors is confidence alignment. A human might misremember a birthday but say, “I think it’s in March, but I’m not sure”; an LLM will state a specific incorrect date with the same certainty as a correct one. This overconfidence makes hallucinations particularly dangerous in high-stakes fields like healthcare or law, where uncertainty signals are critical for safe decision-making.

Author’s Reflection: What surprised me most about this research is how hallucinations aren’t just “model bugs”—they’re a product of how LLMs are trained to behave. The dissertation example shows that even when models generate detailed, contextually relevant outputs, the lack of uncertainty signaling makes their errors harder to detect. This means fixing hallucinations isn’t just about making models “smarter”—it’s about teaching them to communicate doubt appropriately.

H2: Why Do Hallucinations Emerge During Pre-Training, Even with Error-Free Data?

Summary: Hallucinations arise during pre-training due to statistical pressures inherent in learning language distributions. The core link is between generative errors (hallucinations) and binary classification mistakes—specifically, the “Is-It-Valid” (IIV) task—where LLMs struggle to distinguish valid outputs from errors, leading to unavoidable false generations.

The central question here is: If we use perfectly accurate training data, why do pre-trained LLMs still start hallucinating?

The answer lies in the statistical nature of how LLMs learn. Pre-training’s goal is to make the model approximate the distribution of language in its training corpus—a task called “density estimation.” Even with error-free data, this process creates inherent pressures that lead to hallucinations. The research formalizes this through a critical connection: generating valid outputs is harder than classifying whether an output is valid, and hallucinations emerge from this gap.

To understand this, consider the “Is-It-Valid” (IIV) binary classification task. Imagine an LLM is given a set of outputs, each labeled “valid” (+) (e.g., correct birthdays, grammatically perfect sentences) or “error” (-) (e.g., incorrect dates, misspelled words). The IIV task asks the model to learn this label boundary. The research proves a mathematical relationship: a model’s generative error rate (hallucinations) is at least twice its IIV misclassification rate:
(generative error rate) ≥ 2 · (IIV misclassification rate)

This means if an LLM struggles to tell valid outputs from errors (high IIV misclassification rate), it will inevitably generate more hallucinations. For example, if a model misclassifies 15% of IIV examples, it will produce at least 30% hallucinations when generating text.
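The arithmetic of this bound is easy to sanity-check. Below is a minimal Python sketch (illustrative only) that simply applies the stated inequality to turn a measured IIV misclassification rate into a lower bound on the generative error rate, using the 15% example from the paragraph above as the check.

```python
def hallucination_lower_bound(iiv_misclassification_rate: float) -> float:
    """Lower bound implied by the inequality in the text:
    (generative error rate) >= 2 * (IIV misclassification rate)."""
    return 2 * iiv_misclassification_rate

print(hallucination_lower_bound(0.15))  # 0.3 -> at least 30% hallucinations
```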

Three key factors amplify this effect during pre-training:

H3: 1. Arbitrary Facts: No Pattern = High Hallucination Risk

Arbitrary facts—information with no learnable pattern, such as individual birthdays—are a major source of hallucinations. Learning them corresponds to a concept class with high Vapnik-Chervonenkis (VC) dimension, meaning reliable learning would require an impractically large number of training samples.

The research introduces the “singleton rate” (sr) to quantify this: the fraction of facts that appear exactly once in the training data. For example, if 20% of birthday facts in a corpus are singletons (mentioned only once), the model will hallucinate on at least 20% of birthday queries. This explains why LLMs rarely err on well-documented facts (e.g., Einstein’s birthday) but frequently fail on niche ones (e.g., lesser-known researchers’ birthdays).

Application Scenario: A news organization using an LLM to generate biographies of local figures. If the training data only mentions a local artist’s birthday once, the LLM will likely hallucinate a wrong date when asked—putting the organization at risk of publishing incorrect information.
Operational Example: To mitigate this, the organization could flag singleton facts in its training data and add a rule: “For biographical details supported by at most one mention in the training data, output ‘Birth date not reliably documented’ instead of generating a date.”
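A minimal sketch of how such flagging could work, in Python. The fact representation (one (subject, attribute) key per mention) and the names “A. Einstein” and “Local Artist” are illustrative assumptions, not part of the original research.

```python
from collections import Counter

def singleton_rate(fact_mentions):
    """fact_mentions: one (subject, attribute) key per occurrence in the corpus.
    Returns the fraction of distinct facts that appear exactly once."""
    counts = Counter(fact_mentions)
    return sum(1 for n in counts.values() if n == 1) / len(counts)

def answer_birthday(subject, retrieved_date, counts):
    # Policy from the operational example: abstain on facts supported by
    # at most one mention in the training data.
    if counts.get((subject, "birthday"), 0) <= 1:
        return "Birth date not reliably documented"
    return retrieved_date

mentions = [("A. Einstein", "birthday")] * 50 + [("Local Artist", "birthday")]
counts = Counter(mentions)
print(singleton_rate(mentions))                               # 0.5 of distinct facts are singletons
print(answer_birthday("Local Artist", "1987-03-07", counts))  # abstains
```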

H3: 2. Poor Models: When LLMs Lack the Right “Tools” to Learn

Hallucinations also occur when a model’s architecture or design can’t capture the pattern it needs to distinguish valid outputs from errors. This is called “agnostic learning,” where the model family (e.g., n-gram models, Transformer variants) can’t represent the target concept well.

A classic example is trigram models—once dominant in the 1980s and 1990s—which predict each word based only on the previous two. Consider two prompts:

  • Prompt 1: “She lost it and was completely out of . . .”
  • Prompt 2: “He lost it and was completely out of . . .”

The valid completions are “her mind” and “his mind,” respectively. But trigram models can’t capture the gender agreement between “She” and “her” or “He” and “his”—they’ll often generate “his mind” for the first prompt, creating a grammatical hallucination.
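A toy demonstration of this failure mode, assuming a bare-bones count-based trigram model (not any specific historical system): because both prompts end in the same two-word context, the model assigns identical next-word probabilities regardless of the sentence’s subject.

```python
from collections import defaultdict, Counter

# Tiny training corpus; a trigram model conditions only on the previous two words.
corpus = [
    "she lost it and was completely out of her mind".split(),
    "he lost it and was completely out of his mind".split(),
]

trigram_counts = defaultdict(Counter)
for sentence in corpus:
    for w1, w2, w3 in zip(sentence, sentence[1:], sentence[2:]):
        trigram_counts[(w1, w2)][w3] += 1

# Both prompts reduce to the context ("out", "of"), so the predicted
# distribution over the next word is identical for "She ..." and "He ...":
print(trigram_counts[("out", "of")])  # Counter({'her': 1, 'his': 1})
```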

Application Scenario: A customer service team using an LLM to generate personalized responses. If the model uses a simple architecture that can’t link customer names to past interactions (e.g., “Dear Sarah, as we discussed in your July 5 call”), it might hallucinate details like “Dear Sarah, as we discussed in your October 12 call”—undermining customer trust.
Operational Example: Switching to a model with better context retention (e.g., a Transformer with a longer context window) or adding retrieval-augmented generation (RAG) to pull real customer interaction data can reduce these errors.

H3: 3. Additional Factors: Computational Hardness, Distribution Shift, and GIGO

Three more pre-training factors contribute to hallucinations, each with clear real-world implications:

  • Computational Hardness: LLMs cannot sidestep computational complexity. Problems that are intractable for any algorithm (e.g., decrypting a securely encrypted message without the key) remain intractable for them. Asked to decrypt such a message, an LLM will hallucinate a plausible-looking plaintext rather than admit the task is impossible, since the corresponding IIV task is itself computationally intractable.
  • Distribution Shift: When test prompts differ from training data (out-of-distribution, OOD), models struggle. A prompt like “What’s heavier, a pound of feathers or a pound of lead?” might rarely appear in training data—some models will hallucinate “lead is heavier” because they associate “lead” with “heavy,” ignoring the equal weight.
  • Garbage In, Garbage Out (GIGO): The error-free assumption rarely holds in practice. Real-world corpora contain mistakes (e.g., incorrect Wikipedia edits, misleading news), and LLMs replicate these errors as hallucinations. For example, if the training data includes a false claim that “water boils at 95°C at sea level,” the model will repeat it as fact.

Author’s Reflection: The GIGO factor challenges the common myth that “more data = fewer hallucinations.” Even large datasets have hidden errors, and pre-training amplifies them. This means pre-training isn’t just about scale—it’s about curating data to reduce noise, especially for high-stakes facts.

H2: Why Do Hallucinations Persist After Post-Training (e.g., RLHF)?

Summary: Post-training techniques like RLHF reduce harmful outputs but fail to eliminate hallucinations because mainstream evaluations reward guessing over admitting uncertainty. Binary scoring (correct = 1 point, incorrect/abstain = 0 points) creates an incentive for models to “bluff” when unsure, as guessing offers a higher expected score than abstaining.

The central question here is: If techniques like Reinforcement Learning from Human Feedback (RLHF) improve LLM behavior, why do hallucinations still persist?

The answer lies in the evaluation system, not the post-training techniques themselves. Post-training aims to refine pre-trained models—for example, RLHF uses human preferences to reduce harmful outputs like conspiracy theories. But these efforts are undermined by how we measure model performance: most benchmarks use binary scoring, where abstaining (“I don’t know”) gives the same score as being wrong (0 points), while guessing offers a chance to get 1 point.

This creates a perverse incentive, best illustrated with an analogy: Imagine a student taking a multiple-choice exam where leaving a question blank gives 0 points, guessing correctly gives 1 point, and guessing incorrectly gives 0 points. The student will always guess, even when unsure, because guessing has a higher expected score (on a four-option question, a random guess is worth 0.25 points in expectation) than leaving it blank (a guaranteed 0). LLMs behave the same way: they’re “perpetual test-takers” optimized for benchmark scores, not honesty.

H3: The “Model A vs. Model B” Example: How Scoring Rewards Hallucinations

The research uses a hypothetical comparison to highlight this issue:

  • Model A: Aligned, honest model. It answers correctly when sure (80% of queries) and says “I don’t know” when unsure (20% of queries).
  • Model B: Unaligned, guessing model. It answers correctly when sure (80% of queries) and guesses when unsure (20% of queries), with a 10% chance of guessing right.

Under binary scoring:

  • Model A’s score = (80% × 1) + (20% × 0) = 0.8
  • Model B’s score = (80% × 1) + (20% × 10% × 1) = 0.82

Even though Model A is more trustworthy, Model B gets a higher score. This is why developers prioritize Model B-like behavior—hallucinations persist because they’re rewarded.
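The comparison is easy to reproduce. A minimal Python sketch (the 80%/20%/10% figures come from the hypothetical above) that computes the expected benchmark score under binary grading:

```python
def expected_binary_score(p_sure, p_guess_correct, guesses_when_unsure):
    """Binary grading: correct = 1 point, incorrect or abstain = 0 points."""
    score_when_sure = p_sure * 1.0
    score_when_unsure = (1.0 - p_sure) * (p_guess_correct if guesses_when_unsure else 0.0)
    return score_when_sure + score_when_unsure

model_a = expected_binary_score(0.80, 0.10, guesses_when_unsure=False)  # 0.80
model_b = expected_binary_score(0.80, 0.10, guesses_when_unsure=True)   # 0.82
print(model_a, model_b)  # the honest model loses under binary scoring
```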

H3: Benchmark Dominance: Why Leaderboards Make Things Worse

Mainstream benchmarks (e.g., MMLU, SWE-bench, HLE) reinforce this problem. The research analyzed 10 popular benchmarks and found that 9 use strict binary scoring, with no credit for abstention. Even the one non-binary benchmark (WildBench, which uses a 10-point scale) penalizes uncertainty: an “I don’t know” response scores 3–4 points, while a response with minor hallucinations scores 5–6 points.

Application Scenario: A healthcare startup evaluating LLMs for medical advice. If they use MMLU (a binary-scored benchmark) to choose a model, they’ll likely pick one that guesses on medical questions (e.g., “This symptom is caused by X”) rather than admitting uncertainty—putting patients at risk.
Operational Example: The startup could supplement MMLU with a custom evaluation that rewards uncertainty: “Correct medical advice = 2 points, ‘I don’t know’ = 1 point, incorrect advice = 0 points.” This would prioritize safer models.
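One way such a supplementary evaluation could be scored, as a hedged sketch: the rubric values come from the operational example above, while the normalization to a 0–1 scale is an added assumption for readability.

```python
RUBRIC = {"correct": 2, "abstain": 1, "incorrect": 0}

def uncertainty_aware_score(outcomes):
    """outcomes: list of 'correct' | 'abstain' | 'incorrect' labels,
    one per evaluated question. Returns a score normalized to [0, 1]."""
    raw = sum(RUBRIC[o] for o in outcomes)
    return raw / (RUBRIC["correct"] * len(outcomes))

print(uncertainty_aware_score(["correct", "abstain", "incorrect", "correct"]))  # 0.625
```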

Author’s Reflection: The benchmark issue is a classic “socio-technical problem.” Fixing hallucinations isn’t just about technical tweaks—it’s about changing how the industry defines “success.” Leaderboards have too much power to shape model development, and until they value honesty over guesswork, hallucinations will linger.

H2: What Concrete Changes Can Reduce Hallucinations?

Summary: The most effective fix for hallucinations is modifying mainstream evaluations to include explicit confidence targets—clear instructions for when models should answer (e.g., “only if 75% confident”) and when to abstain. This aligns incentives with trustworthiness, steering models toward “behavioral calibration” (acting on their actual confidence).

The central question here is: What actionable steps can the industry take to reduce hallucinations, beyond just improving model technology?

The research argues that adding more “hallucination-specific evaluations” won’t work—instead, we need to reform the primary benchmarks that dominate leaderboards. The key solution is explicit confidence targets: embedding clear rules in prompts about when to answer vs. abstain, paired with scoring that rewards compliance.

H3: How Explicit Confidence Targets Work

A confidence target prompt might look like this:
“Answer only if you are >75% confident. Correct answers = 1 point, incorrect answers = -3 points, ‘I don’t know’ = 0 points.”

The math here is critical: For a model with 70% confidence in an answer, the expected score of guessing is (70% × 1) + (30% × -3) = -0.2, lower than the 0 points for abstaining. For a model with 80% confidence, the expected score is (80% × 1) + (20% × -3) = 0.2, higher than abstaining. (In general, a target of t pairs with a penalty of t/(1-t), which places the break-even point exactly at the target.) This creates a clear threshold: models learn to answer only when their confidence exceeds the target, reducing hallucinations.
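The decision rule falls out of a two-line expected-value calculation. A minimal Python sketch, assuming the rubric just described (correct = 1, incorrect = -3, abstain = 0) and the general rule that a target t pairs with a penalty of t/(1-t):

```python
def expected_score_if_answering(confidence, reward=1.0, penalty=3.0):
    """Expected score of answering; abstaining always scores 0."""
    return confidence * reward - (1.0 - confidence) * penalty

def should_answer(confidence, target=0.75):
    # With penalty = target / (1 - target), answering beats abstaining
    # exactly when confidence exceeds the target.
    penalty = target / (1.0 - target)
    return expected_score_if_answering(confidence, penalty=penalty) > 0

print(expected_score_if_answering(0.70))            # about -0.2 -> abstaining (0) is better
print(expected_score_if_answering(0.80))            # about  0.2 -> answering is better
print(should_answer(0.70), should_answer(0.80))     # False True
```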

H3: Behavioral Calibration: The End Goal

The ultimate objective of confidence targets is behavioral calibration—ensuring a model’s actions (answer/abstain) match its actual confidence. This differs from “probability calibration” (where a model’s stated confidence % matches its accuracy rate). Behavioral calibration is more practical: it doesn’t require models to output numerical confidence scores (e.g., “I’m 85% sure”), just to act on that confidence appropriately.

Application Scenario: A legal firm using an LLM to review contract clauses. They could set a 95% confidence target: “Only confirm a clause is compliant with the law if you are >95% confident. Correct confirmation = 3 points, incorrect confirmation = -10 points, ‘Needs attorney review’ = 1 point.” This ensures the LLM only provides advice when it’s nearly certain, reducing the risk of costly legal errors.
Operational Example: The firm could audit the model’s behavior by checking: (1) For answers labeled “95%+ confident,” what’s the actual accuracy rate? (2) For answers labeled “needs review,” what percentage would a human attorney flag as uncertain? This ensures behavioral calibration over time.
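A sketch of what such an audit could look like in code. The record format (one dict per reviewed output, carrying the model’s action and a human judgment of correctness) is an assumed convention for illustration, not a prescribed interface.

```python
def audit_behavioral_calibration(records, target=0.95):
    """records: list of dicts like {"action": "answer" | "abstain", "correct": bool}.
    Checks whether answered items actually meet the target accuracy and how often
    the model escalates instead of answering."""
    answered = [r for r in records if r["action"] == "answer"]
    accuracy = (sum(r["correct"] for r in answered) / len(answered)) if answered else None
    abstention_rate = sum(r["action"] == "abstain" for r in records) / len(records)
    return {
        "accuracy_when_answering": accuracy,
        "abstention_rate": abstention_rate,
        "meets_target": accuracy is not None and accuracy >= target,
    }
```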

H3: Implementing Changes: From Benchmarks to Real-World Use

Reducing hallucinations requires three steps, each building on the last:

  1. Modify Mainstream Benchmarks: Add confidence targets to popular benchmarks like MMLU and SWE-bench. For example, SWE-bench (which evaluates code patches) could adopt: “Submit a patch only if >80% confident it fixes the bug. Patch passes tests = 1 point, patch fails = -4 points, ‘Cannot fix with current knowledge’ = 0 points.”
  2. Develop Behavioral Calibration Tools: Create tools to measure how well models follow confidence targets. For example, a tool that generates prompts with known confidence levels (e.g., 60%, 80%, 90%) and tracks whether the model answers or abstains appropriately (a sketch of such a tool follows this list).
  3. Customize Targets for Use Cases: Let organizations set confidence thresholds based on risk. Low-risk use cases (e.g., entertainment trivia) could use 50% targets; high-risk use cases (e.g., medical diagnosis) could use 95% targets.
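A sketch of the kind of measurement tool step 2 describes, under stated assumptions: `model_decide` is a hypothetical callable that takes a question and a confidence target and returns "answer" or "abstain", and each test item carries a known ground-truth flag for whether the model should be able to answer it.

```python
def calibration_sweep(model_decide, items, targets=(0.50, 0.75, 0.90)):
    """items: list of (question, answerable: bool) pairs with known difficulty.
    Tracks how the answer rate shifts as the stated confidence target rises;
    a behaviorally calibrated model answers less often at higher targets."""
    report = {}
    for target in targets:
        decisions = [(answerable, model_decide(question, target))
                     for question, answerable in items]
        answered = [answerable for answerable, d in decisions if d == "answer"]
        report[target] = {
            "answer_rate": len(answered) / len(items),
            "share_answerable_among_answered": (
                sum(answered) / len(answered) if answered else None),
        }
    return report
```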

Author’s Reflection: The simplicity of this solution surprised me—confidence targets don’t require new model architectures or massive datasets, just a shift in how we score outputs. The biggest challenge will be getting benchmark creators and cloud providers to adopt these changes, as it means redefining what “good performance” looks like.

H2: Action Checklist / Implementation Steps

Use this checklist to reduce hallucinations in your LLM workflows, based on the research’s findings.

For Model Developers

  1. Pre-Training Phase

    • [ ] Identify singleton facts in training data (e.g., facts appearing once) and flag them for special handling (e.g., lower confidence weights).
    • [ ] Test model performance on the IIV task (distinguishing valid vs. error outputs) to predict generative error rates (hallucinations).
    • [ ] Address poor model issues by selecting architectures that match task needs (e.g., longer context windows for pronoun agreement, reasoning layers for letter counting).
  2. Post-Training Phase

    • [ ] Add explicit confidence targets to post-training data (e.g., label training examples with “Answer only if >75% confident”).
    • [ ] Optimize for behavioral calibration, not just benchmark scores (e.g., include “abstention rate” and “confidence-accuracy alignment” in loss functions).
    • [ ] Provide users with adjustable confidence thresholds (e.g., a slider for “risk level” that sets targets from 50% to 95%).
  3. Evaluation Phase

    • [ ] Advocate for adding confidence targets to mainstream benchmarks (e.g., submit proposals to MMLU or SWE-bench maintainers).
    • [ ] Create custom evaluations that reward uncertainty (e.g., “Correct answer = 2 points, ‘I don’t know’ = 1 point, incorrect = 0 points”).

For Organizations Using LLMs

  1. Model Selection

    • [ ] Ask vendors for behavioral calibration data (e.g., “What’s the model’s abstention rate for 80% confidence targets?”).
    • [ ] Avoid models optimized solely for binary-scored benchmarks (e.g., models with high MMLU scores but high hallucination rates on uncertain queries).
  2. Prompt Engineering

    • [ ] Add explicit confidence targets to all prompts (e.g., “Answer this question only if you are >90% confident. If not, say ‘I cannot confirm this information.’”).
    • [ ] For high-risk tasks, include consequences of incorrect answers in prompts (e.g., “This medical advice will be used to recommend treatments—err on the side of saying ‘I don’t know’ if unsure.”).
  3. Workflow Design

    • [ ] Route LLM outputs to human review if the model is below the confidence target (e.g., “All outputs labeled ‘<80% confident’ go to a legal reviewer”); a minimal sketch of this routing, together with a confidence-target prompt template, follows this checklist.
    • [ ] Track hallucination rates over time (e.g., audit 10% of outputs monthly to check for confident falsehoods).
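To make the prompt-engineering and workflow items concrete, here is a minimal Python sketch. The template wording, the exact abstention phrase, and the `reviewer_queue` interface are illustrative assumptions, not a prescribed implementation.

```python
CONFIDENCE_TARGET_TEMPLATE = (
    "Answer the question below only if you are more than {target:.0%} confident. "
    "If you are not, reply exactly: I cannot confirm this information.\n\n"
    "Question: {question}"
)

ABSTENTION_MARKER = "I cannot confirm this information"

def build_prompt(question, target=0.90):
    # Embed the explicit confidence target directly in the prompt.
    return CONFIDENCE_TARGET_TEMPLATE.format(question=question, target=target)

def route_output(output, reviewer_queue):
    """Send abstentions (anything carrying the abstention marker) to a human
    reviewer instead of returning them directly to the end user."""
    if ABSTENTION_MARKER in output:
        reviewer_queue.append(output)
        return "escalated_to_human"
    return "sent_to_user"
```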

H2: One-Page Overview

  • Hallucination Definition: Plausible but incorrect statements LLMs generate when uncertain, with no hesitation. Two types: intrinsic (contradicts prompt) and extrinsic (contradicts reality).
  • Pre-Training Causes: 1. Arbitrary facts (no pattern, high singleton rate); 2. Poor models (architectures can’t capture patterns); 3. Computational hardness, distribution shift, GIGO.
  • Post-Training Causes: Binary evaluation scoring rewards guessing over abstention; leaderboards prioritize high scores over honesty.
  • Core Solution: Add explicit confidence targets to evaluations (e.g., “Answer only if >75% confident”) and score accordingly (correct = 1, incorrect = -3, abstain = 0).
  • Key Metric: Behavioral calibration, i.e., whether a model’s decisions to answer or abstain match its actual confidence.
  • Application Risks: High-stakes fields (healthcare, law) face the most harm from hallucinations; low-stakes fields (trivia) can tolerate lower confidence targets.
  • Challenges: Benchmark creators need to adopt confidence targets; organizations must balance usability (answering questions) with reliability (avoiding hallucinations).

H2: FAQ (Frequently Asked Questions)

  1. Q: Can hallucinations be completely eliminated from LLMs?
    A: No—hallucinations are statistically inevitable due to the link between generative errors and IIV misclassification. However, they can be reduced to manageable levels with confidence targets and behavioral calibration.

  2. Q: Are hallucinations the same as “model mistakes”?
    A: No—hallucinations are a specific type of mistake: confident, plausible falsehoods. A model might make a typo (a mistake) without hallucinating, but a hallucination always involves presenting false information as fact.

  3. Q: Does more training data reduce hallucinations?
    A: Not always. While more data can reduce errors for pattern-based facts (e.g., grammar rules), it doesn’t fix arbitrary facts (e.g., birthdays) or poor model architectures. Singleton facts (rare in data) will still cause hallucinations, even with large datasets.

  4. Q: Why don’t RLHF and DPO eliminate hallucinations?
    A: These techniques reduce harmful outputs but don’t address the root cause: evaluation systems that reward guessing. RLHF optimizes for human preferences, but if humans prioritize “answers” over “honesty,” it can still reinforce hallucinations.

  5. Q: What’s the difference between behavioral calibration and probability calibration?
    A: Probability calibration means a model’s stated confidence % (e.g., “I’m 80% sure”) matches its accuracy rate. Behavioral calibration means a model acts on that confidence relative to the task’s target (e.g., with a 75% target, it answers when it is 80% confident and abstains when it is only 60% confident)—it’s more practical for real-world use.

  6. Q: How can small organizations reduce hallucinations without custom models?
    A: Use prompt engineering to add confidence targets (e.g., “Answer only if >80% confident”) and route uncertain outputs to humans. Avoid using LLMs for high-risk tasks without human review.

  7. Q: Will newer model architectures (e.g., GPT-5, Gemini) eliminate hallucinations?
    A: Unlikely—unless paired with evaluation reforms. New architectures may improve IIV task performance (reducing generative error rates), but they’ll still be optimized for benchmark scores unless scoring systems change.

H2: Conclusion

Hallucinations in language models aren’t mysterious bugs or signs of “bad AI”—they’re a predictable outcome of how LLMs are trained and evaluated. Pre-training creates statistical pressures that make hallucinations inevitable, while post-training evaluation systems reward guessing over honesty.

The solution, while simple in concept, requires an industry-wide shift: prioritizing behavioral calibration and explicit confidence targets over binary benchmark scores. By modifying how we score LLM outputs—rewarding models for admitting uncertainty when unsure—we can steer the field toward more trustworthy AI systems.

For developers, this means rethinking evaluation metrics; for organizations, it means customizing confidence thresholds to match risk; for benchmark creators, it means redefining what “success” looks like. The goal isn’t to make LLMs perfect—it’s to make them honest about their imperfections. In high-stakes fields and everyday use cases alike, that honesty is what will make LLMs truly useful.