No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes

by antonghawthorne, ivanvmoreno, Arnau Padrés Masdemont, David Africa, LorenzoPacchiardi
16th Sep 2025
Linkpost from arxiv.org
4 min read

TLDR: This is the abstract, introduction and conclusion to the paper. See here for a summary thread.

Abstract

Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question is read but before any tokens are generated, and train linear probes to predict whether the model's forthcoming answer will be correct. Across three open-source model families ranging from 7 to 70 billion parameters, projections onto this "in-advance correctness direction", trained on generic trivia questions, predict success both in distribution and on diverse out-of-distribution knowledge datasets, outperforming black-box baselines and verbalised predicted confidence. Predictive power saturates in intermediate layers, suggesting that self-assessment emerges mid-computation. Notably, generalisation falters on questions requiring mathematical reasoning. Moreover, for models that respond "I don't know", doing so strongly correlates with the probe score, indicating that the same direction also captures confidence. By complementing previous results on truthfulness and other behaviours obtained with probes and sparse auto-encoders, our work adds essential findings toward elucidating LLM internals.

Introduction

Large language models (LLMs) internally encode information beyond what is immediately observable in their output. Studies have demonstrated that hidden activations can reveal statement truthfulness, deception and hallucination, concerns that have become ever more critical as LLMs are deployed in increasingly complex, high-stakes real-world applications. In this work, we investigate the complementary scientific question of whether an LLM's residual stream activations, captured immediately after it processes a query, contain a latent signal that predicts if its eventual output will be correct. Instead of using the generated answer or token probabilities, we train a linear probe on the hidden state before the answer is produced to distinguish questions the model will answer correctly from those it will not, capturing its internal prediction of correctness.
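
As a concrete, hedged illustration of the extraction step, the sketch below pulls the residual-stream state at the final question token from a HuggingFace-style causal LM before any answer tokens are sampled. The model name and layer index are placeholders for illustration, not the paper's exact setup.

```python
# Minimal sketch (not the authors' code): residual-stream activation at the last
# question token, captured before any generation, from a HuggingFace causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder; any open-weights causal LM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

@torch.no_grad()
def question_activation(question: str, layer: int) -> torch.Tensor:
    """Hidden state at the last question token, at a chosen layer, prior to sampling."""
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)  # forward pass only, no generation
    # out.hidden_states is a tuple of (num_layers + 1) tensors, each (1, seq_len, d_model)
    return out.hidden_states[layer][0, -1, :].float().cpu()

vec = question_activation("Which element has the chemical symbol Fe?", layer=16)
print(vec.shape)  # (d_model,)
```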

Empirically, our approach identifies the activation-space vector linking the average residual stream activations for correctly answered questions to those for incorrectly answered ones (similar to the direction Burger et al. identify for statement truthfulness); a minimal sketch of this probing step is given after the list below. We test our approach on state-of-the-art open-source LLMs spanning three families and ranging from 7 to 70 billion parameters, and we observe that:

  • For all models, in-distribution performance saturates at the middle layers.
  • The probe trained using TriviaQA, a dataset of questions on various topics, generalises to domain-specific knowledge datasets more effectively than baseline methods based on self-confidence or model-independent features of the input. However, all methods struggle to generalise to the GSM8K mathematical reasoning dataset, indicating that predicting one's own correctness is harder for questions requiring deeper reasoning.
  • Using the TriviaQA dataset, larger models require fewer training samples to learn a high-quality probe – overall, however, 2560 samples are enough to saturate performance on almost all datasets.
  • Considering all datasets, the in-advance correctness signal is strongest for the largest model we test (Llama 3.3 70B).
  • For models that answer “I don’t know” without being explicitly prompted, doing so correlates with the question’s position along the in-advance correctness direction, which therefore also represents a “confidence” direction.
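
The probing step described above reduces to computing a difference of class means as the candidate direction and fitting a linear classifier on the extracted activations. Here is a minimal sketch under that reading; the arrays `X` (one row of activations per question) and `y` (1 if the model answered correctly) are hypothetical inputs assumed to have been collected as in the previous snippet.

```python
# Sketch of the probing step, assuming activations X of shape (n_questions, d_model)
# and binary correctness labels y have already been collected.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def correctness_direction(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Difference of class means: points from 'incorrect' activations toward 'correct' ones."""
    d = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return d / np.linalg.norm(d)

def fit_probe(X_train, y_train, X_test, y_test):
    """Fit a simple linear probe and report its out-of-sample AUROC."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    scores = probe.predict_proba(X_test)[:, 1]   # predicted probability of a correct answer
    return probe, roc_auc_score(y_test, scores)
```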


Figure: Proposed methodology to find the in-advance correctness direction. (A) Residual stream activations for all model layers are extracted at the last token of the question, prior to sampling. (B) Model answers are generated and evaluated against the ground truth. (C) The direction that best discriminates activations for correct versus incorrect answers is identified (the first two principal components at a specific layer are visualised). (D) The most discriminative layer is chosen. (E) The final correctness classifier is trained on the identified layer, and its out-of-distribution performance is assessed.
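
Steps (C)-(D) of the pipeline amount to a sweep over layers: score a probe at each layer with cross-validation and keep the most discriminative one. A rough sketch, assuming a hypothetical `acts_by_layer` mapping from layer index to an activation matrix and the same labels `y` as before:

```python
# Sketch of layer selection (panel D): cross-validated AUROC per layer, keep the best.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def best_layer(acts_by_layer: dict[int, np.ndarray], y: np.ndarray) -> tuple[int, float]:
    """Return the layer whose activations best predict correctness, with its mean AUROC."""
    scores = {}
    for layer, X in acts_by_layer.items():
        probe = LogisticRegression(max_iter=1000)
        scores[layer] = cross_val_score(probe, X, y, cv=5, scoring="roc_auc").mean()
    layer = max(scores, key=scores.get)
    return layer, scores[layer]
```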

Overall, our analysis advances our understanding of what is encoded in LLM internals and provides an early indicator of performance that is grounded in the model's internal dynamics and thus complements existing uncertainty quantification techniques. In particular, our approach contrasts with self-confidence estimation by avoiding generation from the model. In doing so, it resembles techniques that train correctness predictors on model-independent features of the input, but differs from them by leveraging internal representations. On the other hand, the aforementioned works using model internals mostly focused on the truthfulness of complete statements rather than our predictive scenario. The exceptions are Kadavath et al. (2022), which tested a similar approach to ours on proprietary models, and Ferrando et al. (2025), which identified the latents of pre-trained Sparse Auto-Encoders (SAEs) that best distinguish questions answered correctly from those answered incorrectly in small Gemma models. By considering multiple families of open-source models (scaling up to 70 billion parameters), conducting extensive out-of-distribution experiments, and directly learning linear representations, our work contributes essential findings on the ability of models to maintain an internal representation reflecting competence on tasks, on its alignment with "I don't know" responses, and on its failure to generalise to a mathematical reasoning dataset. Our codebase is accessible at https://github.com/ivanvmoreno/correctness-model-internals.

Conclusion

We have demonstrated that a latent correctness signal exists in the internal activations of large language models, and that it can be effectively extracted using a linear probe. This signal reliably predicts whether the model will generate a correct response across several knowledge datasets. The robustness of this finding across various model architectures reinforces the idea that LLMs encode an internal representation of their own confidence. Our work advances the understanding of model internals and provides a foundation for developing safer and more reliable language systems. Our contributions are fivefold: (1) we provide evidence that LLMs embed a latent correctness signal mid-computation; (2) we show that a simple linear probe can extract this signal, yielding generalisation across knowledge datasets; (3) we highlight the limits of this approach, suggesting that deeper reasoning and arithmetic capabilities are not as easily captured in activations; (4) we find a stronger signal for the largest model we test (Llama 3.3 70B), suggesting that larger models may better predict their own correctness; and (5) we demonstrate that this direction aligns with abstention behaviour in models that say "I don't know", supporting its interpretation as a latent confidence axis. This work contributes to mechanistic interpretability by identifying a meaningful confidence direction within LLM activations, corroborating recent work with sparse auto-encoders. It also complements studies of truthfulness and hallucination, suggesting that models encode internal notions of confidence (present even before answer generation) and truthfulness that are both general and accessible.

Our findings have relevance for both AI safety and practical deployment. As LLMs are increasingly used in high-stakes settings, low-cost internal signals of impending failure offer a path toward safer, more robust systems. The correctness direction could inform early stopping, fallback mechanisms, or human-in-the-loop protocols—particularly where generating unreliable outputs is costly or dangerous.
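
As one illustration of how such a signal might be wired into deployment (not something the paper implements), the probe score could gate generation: answer when predicted correctness is high, and fall back to a human or a safer system otherwise. The threshold and the helper functions `question_activation` and `generate_answer` below are hypothetical, reused from or standing in for the earlier sketches.

```python
# Illustrative gating only; threshold, probe, and helpers are placeholders, not from the paper.
def answer_or_escalate(question: str, probe, layer: int, threshold: float = 0.6) -> str:
    """Answer only if the probe predicts the answer is likely to be correct."""
    x = question_activation(question, layer=layer).numpy()[None, :]  # (1, d_model)
    p_correct = float(probe.predict_proba(x)[0, 1])                  # predicted P(correct)
    if p_correct < threshold:
        return "[escalated to human review: low predicted correctness]"
    return generate_answer(question)  # placeholder for the usual sampling/generation call
```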