Two Mathematical Perspectives on AI Hallucinations and Uncertainty

by LorenzoPacchiardi
23rd Sep 2025
4 min read

A recent OpenAI preprint (and blogpost) examines the sources of AI hallucinations. This reminded me of a 2024 preprint which was similar in scope. These two papers use different mathematical lenses but arrive at complementary conclusions about the necessity of uncertainty expressions in AI systems. I briefly review and compare them here.

[Disclaimer: I am not an author of either of these papers, so my understanding may not be fully correct.]

Paper 1: The Consistent Reasoning Paradox

The first paper examines AI systems through the lens of theoretical computer science. The authors define an AGI as a system (implemented on a Turing machine) that passes the Turing test, which implies it must reason consistently, answering the same question identically regardless of how it is phrased, as humans, at least ideally, should too.

Their formal setup considers arithmetic problems where numerical constants can be expressed in different ways, representing different "sentences" for the same problem. Changing the values of the numerical constants leads to a "collection" of problems. 
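
To make the terminology concrete, here is a toy illustration of mine (not taken from the paper): two "sentences" of the same arithmetic problem that differ only in how a numerical constant is written, with the problem family obtained by varying that constant.

```python
from fractions import Fraction

# Toy illustration (mine, not the paper's): two "sentences" of the same
# arithmetic problem, differing only in how the constant 3/4 is written.
sentence_a = "What is 0.75 * 8?"    # constant written as a decimal
sentence_b = "What is (3/4) * 8?"   # same constant written as a fraction

# A consistent reasoner must return the same answer to both phrasings;
# varying the constant (0.75 -> 0.3, 1.2, ...) generates the problem family.
answer_a = 0.75 * 8
answer_b = float(Fraction(3, 4) * 8)
assert answer_a == answer_b == 6.0
```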

Working within the framework of Turing machines and computable numbers, they establish five main results, which together constitute what they call the Consistent Reasoning Paradox (CRP). The following are my summaries of the results; see the paper for the precise formulation:

CRP I: For any family of sentences consisting of a single problem from each problem family, there exists a narrow AI system that answers them correctly (though it may not answer the other cases).

CRP II: An AGI that attempts to answer all sentences (or formulations) consistently will necessarily produce incorrect answers infinitely often on these same problems.

CRP III: There exist problems where an AGI (in the sense above) cannot determine with >50% confidence whether its own solution is correct.

CRP IV: There are cases where both the narrow AI and the AGI will be correct, but where neither will be able to provide a logical explanation for the answer.

CRP V: Given a computational budget M, there exists an AI that can either provide a correct answer with "I know" or abstain with "I don't know." The thinking time affects the proportion of questions where abstention occurs.

The paper is not concerned with how the AI/AGI systems mentioned in the statements above are trained (or programmed); the conclusions are absolute and follow from the assumptions and the considered set of problems (incidentally, it is not clear to me whether the results apply beyond problems involving mathematical constants, or more generally problems of a computational/mathematical nature).

The paper's conclusion is that trustworthy AI must implicitly compute an "I don't know" function to avoid the inevitable errors that come with consistent reasoning.
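
As a rough sketch of what CRP V's "I don't know" function could look like (my own illustration, not the paper's construction), consider a system that only claims an answer when it can verify it within a fixed computational budget, and abstains otherwise. A larger budget shrinks the set of questions on which it abstains.

```python
def answer_with_abstention(candidates, check, budget):
    """Return (answer, "I know") if some candidate passes `check` within
    `budget` verification steps, otherwise (None, "I don't know").

    Toy sketch of a CRP-V-style system: it never claims "I know" without
    a completed verification, so it trades coverage for correctness.
    """
    steps = 0
    for candidate in candidates:
        if steps >= budget:
            break
        steps += 1
        if check(candidate):
            return candidate, "I know"
    return None, "I don't know"


def is_divisor(d):
    # Example problem: find a non-trivial divisor of 91 (= 7 * 13).
    return 91 % d == 0


print(answer_with_abstention(range(2, 91), is_divisor, budget=5))   # (None, "I don't know")
print(answer_with_abstention(range(2, 91), is_divisor, budget=10))  # (7, "I know")
```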

Paper 2: Statistical Inevitability of Hallucinations

The second paper considers AI models that perform density estimation (a class that includes modern auto-regressive large language models, but not only those). The authors then use statistical learning theory to establish theoretical lower bounds on the rate of hallucinations, showing that this rate is non-zero when the loss function is the cross-entropy on a training dataset.

Their analysis has two main components:

Pre-training inevitability: The authors establish a reduction from the density estimation problem to a binary classification problem, where the model's assigned probability to a sample is used to classify it as valid or erroneous. Using this reduction, a bound on the probability of generating erroneous samples can be established; this bound depends on the misclassification error and on some constants. The authors then argue that these constants are small in the case of cross-entropy loss on a training dataset. Therefore, the conclusion is that a pure base model will necessarily produce hallucinations, even with error-free training data. Of course, the training data is not perfect, it is hard to find the global optimum of the loss function, and we may expect distribution shift between training and test setups, all of which is likely to make the problem worse.

These findings are not really surprising and come from a relatively simple application of well-understood learning theory concepts and results. However, their clear application to the case of density estimation models is illuminating.
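
To illustrate the mechanics of the reduction, here is a toy sketch of mine (the domain, labels, and numbers are made up, and it does not reproduce the paper's bound): the model's assigned probability is thresholded to classify outputs as valid or erroneous, and the resulting misclassification error is the quantity that the bound relates to the rate of erroneous generations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite domain of 10 possible outputs; the first 5 are "valid"
# and the rest are "erroneous" (labels made up for illustration).
valid = np.array([True] * 5 + [False] * 5)
model_probs = rng.dirichlet(np.ones(10))   # the model's estimated distribution

# Hallucination rate of the generative model: mass it puts on erroneous outputs.
hallucination_rate = model_probs[~valid].sum()

# The reduction: use the model's assigned probability as a binary classifier
# ("valid" iff probability above a threshold) and measure its error rate.
threshold = 1.0 / len(model_probs)
predicted_valid = model_probs > threshold
misclassification_error = (predicted_valid != valid).mean()

print(f"hallucination rate:      {hallucination_rate:.3f}")
print(f"misclassification error: {misclassification_error:.3f}")
```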

Post-training persistence: Of course, existing state-of-the-art models are not pure base models: post-training techniques could alleviate hallucinations. Nevertheless, the authors argue that evaluation benchmarks reward overconfidence and therefore do not incentivise post-training to eradicate hallucinations. Indeed, current benchmarks and evaluation procedures reward models for guessing when uncertain, similar to how standardised tests reward students for attempting every question. The authors argue that fixing hallucinations requires modifying all benchmarks to explicitly penalise errors and reward appropriate abstention. In their view, it is not sufficient to have hallucination evaluations complementary to the primary ones. Moreover, the penalisation should be explicitly mentioned in the instructions, to make it clear to the model. As a consequence, models will be incentivised to become more calibrated and learn to abstain.
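
A minimal sketch of the kind of scoring rule this suggests (the specific values are my own, not the paper's): with a penalty for wrong answers and zero credit for abstaining, stated in the instructions, a calibrated model maximises its expected score by abstaining whenever its confidence falls below a penalty-dependent threshold.

```python
def expected_score(confidence, wrong_penalty=2.0):
    """Expected score under an abstention-aware rule (values illustrative,
    not from the paper): +1 if correct, -wrong_penalty if wrong, and 0 if
    the model answers "I don't know".
    """
    score_if_answering = confidence * 1.0 + (1 - confidence) * (-wrong_penalty)
    score_if_abstaining = 0.0
    best = "answer" if score_if_answering > score_if_abstaining else "abstain"
    return best, max(score_if_answering, score_if_abstaining)


# Answering pays off only above confidence wrong_penalty / (1 + wrong_penalty) = 2/3.
print(expected_score(0.9))  # ('answer', 0.7)
print(expected_score(0.5))  # ('abstain', 0.0)
```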

Overall message

The joint takeaway message is: models that never fail (and answer consistently) must be able to say "I don't know" (paper 1), but this does not occur by default within the current training pipeline (paper 2). 

In particular, paper 1 shows that an AI with an "I don't know" function that operates within a computational budget exists. The statistical paper suggests modifying benchmark scoring to explicitly reward abstention when uncertain, essentially training models to develop this capability.

Interestingly, these results do not rely on the current model architecture: paper 1 assumes that the AI system is a computable program, while paper 2 assumes only that it is trained to perform density estimation.

Appendix: side-by-side comparison of approaches and findings

The papers differ significantly in their theoretical frameworks:

Modelling assumptions:

  • CRP treats AI as Turing machines and focuses on problems involving computable numbers with different representations
  • The statistical paper considers finite-domain density estimation models trained with standard loss functions

Mathematical tools:

  • CRP uses theoretical computer science and computability theory to derive absolute impossibility results
  • The statistical paper uses learning theory to derive probabilistic lower bounds on error rates

Scope of results:

  • CRP makes existence claims ("there exists an AI" or "will hallucinate infinitely often") about theoretical possibilities
  • The statistical paper provides quantitative bounds applicable to models trained through standard procedures