TLDR: This post contains the abstract, introduction and conclusion of the paper. See here for a summary thread.
Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question has been read but before any tokens are generated, and train linear probes to predict whether the model’s forthcoming answer will be correct. Across three open-source model families ranging from 7 to 70 billion parameters, projections onto this “in-advance correctness direction”, learned from generic trivia questions, predict success both in distribution and on diverse out-of-distribution knowledge datasets, outperforming black-box baselines and verbalised confidence estimates. Predictive power saturates in intermediate layers, suggesting that self-assessment emerges mid-computation. Notably, generalisation falters on questions requiring mathematical reasoning. Moreover, in models that respond “I don’t know”, abstention correlates strongly with the probe score, indicating that the same direction also captures confidence. By complementing previous results on truthfulness and other behaviours obtained with probes and sparse auto-encoders, our work contributes essential findings towards elucidating LLM internals.
Large language models (LLMs) internally encode information beyond what is immediately observable in their output. Prior studies have demonstrated that hidden activations can reveal statement truthfulness, deception and hallucination, signals that have become ever more critical as LLMs are deployed in increasingly complex, high-stakes real-world applications. In this work, we investigate the complementary scientific question of whether an LLM’s residual stream activations, captured immediately after it processes a query, contain a latent signal that predicts whether its eventual output will be correct. Instead of using the generated answer or token probabilities, we train a linear probe on the hidden state before the answer is produced to distinguish questions the model will answer correctly from those it will not, thereby capturing its internal prediction of correctness.
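As a concrete illustration of this setup, the sketch below extracts the residual-stream activation at the final prompt token, before any answer tokens are generated. It is a minimal sketch, not the paper’s exact pipeline: the model name, layer index and helper function are illustrative assumptions.

```python
# Minimal sketch (assumptions: a HuggingFace causal LM, an illustrative
# intermediate layer). Extracts the pre-answer hidden state on which the
# linear probe is trained.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open-source model
LAYER = 16                                       # illustrative intermediate layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def pre_answer_activation(question: str) -> torch.Tensor:
    """Residual-stream activation of the last prompt token at LAYER,
    captured before the model generates any answer tokens."""
    inputs = tok(question, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].float()  # shape: (d_model,)
```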
Empirically, our approach identifies the activation-space direction linking the average residual stream activations of correctly answered questions to those of incorrectly answered ones (similar to Burger et al.’s direction for statement truthfulness). We test our approach on state-of-the-art open-source LLMs spanning three families and ranging from 7 to 70 billion parameters, and we make the observations summarised in the conclusion below.
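Under the same assumptions as the sketch above, this correctness direction can be illustrated as a difference of class means over pre-answer activations, with projections onto it serving as scores. The questions and labels here are placeholders, not the paper’s training data.

```python
import numpy as np

# Placeholder data: pre-answer activations and whether the model's eventual
# answer was graded correct (1) or incorrect (0).
questions = ["What is the capital of France?", "Who wrote Hamlet?"]  # illustrative
labels = np.array([1, 0])                                            # illustrative
X = np.stack([pre_answer_activation(q).numpy() for q in questions])

# "In-advance correctness direction": vector from the mean activation of
# incorrectly answered questions to that of correctly answered ones.
direction = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
direction /= np.linalg.norm(direction)

# Score a new question by projecting its pre-answer activation onto the direction.
score = pre_answer_activation("Which element has atomic number 6?").numpy() @ direction
```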
Overall, our analysis advances our understanding of what is encoded in LLM internals and provides an early indicator of performance that is grounded in the model’s internal dynamics, thus complementing existing uncertainty quantification techniques. In particular, our approach contrasts with self-confidence estimation in that it avoids generating from the model. In this respect it comes close to techniques that train correctness predictors on model-independent features of the input, but it differs from those in leveraging internal representations. On the other hand, the aforementioned works using model internals have mostly focused on the truthfulness of complete statements rather than our predictive scenario. The exceptions are Kadavath et al. (2022), who tested a similar approach to ours on proprietary models, and Ferrando et al. (2025), who identified the latents of pre-trained Sparse Auto-Encoders (SAEs) that best distinguish questions answered correctly from those answered incorrectly in small Gemma models. By considering multiple families of open-source models (scaling up to 70 billion parameters), conducting extensive out-of-distribution experiments, and directly learning linear representations, our work contributes essential findings on the ability of models to maintain an internal representation of their competence on tasks, its alignment with “I don’t know” responses, and its failure to generalise to a mathematical reasoning dataset. Our codebase is available at https://github.com/ivanvmoreno/correctness-model-internals.
We have demonstrated that a latent correctness signal exists in the internal activations of large language models and can be effectively extracted with a linear probe. This signal reliably predicts whether the model will generate a correct response across several knowledge datasets. The robustness of this finding across model architectures reinforces the idea that LLMs encode an internal representation of their own confidence. Our work advances the understanding of model internals and provides a foundation for developing safer and more reliable language systems. Our contributions are fivefold: (1) we provide evidence that LLMs embed a latent correctness signal mid-computation; (2) we show that a simple linear probe can extract this signal, yielding generalisation across knowledge datasets; (3) we highlight the limits of this approach, suggesting that deeper reasoning and arithmetic capabilities are not as easily captured in activations; (4) we find a stronger signal for the largest model we test (Llama 3.3 70B), suggesting that larger models may better predict their own correctness; and (5) we demonstrate that this direction aligns with abstention behaviour in models that say “I don’t know”, supporting its interpretation as a latent confidence axis. This work contributes to mechanistic interpretability by identifying a meaningful confidence direction within LLM activations, corroborating recent work with sparse auto-encoders. It also complements studies of truthfulness and hallucination, suggesting that models encode internal notions of confidence (present even before answer generation) and truthfulness that are both general and accessible.
Our findings have relevance for both AI safety and practical deployment. As LLMs are increasingly used in high-stakes settings, low-cost internal signals of impending failure offer a path toward safer, more robust systems. The correctness direction could inform early stopping, fallback mechanisms, or human-in-the-loop protocols—particularly where generating unreliable outputs is costly or dangerous.
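As a hypothetical illustration of such a fallback mechanism, the probe score could gate generation: below a calibrated threshold, the system defers instead of answering. The threshold value and fallback behaviour here are assumptions for illustration, not part of the paper.

```python
THRESHOLD = 0.0  # hypothetical; would be calibrated on held-out validation data

def answer_or_defer(question: str, max_new_tokens: int = 64) -> str:
    """Generate an answer only if the in-advance correctness score clears the
    threshold; otherwise defer (abstain or escalate to a human)."""
    score = pre_answer_activation(question).numpy() @ direction
    if score < THRESHOLD:
        return "[deferred: low predicted correctness]"
    inputs = tok(question, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    answer_ids = out[0, inputs["input_ids"].shape[1]:]
    return tok.decode(answer_ids, skip_special_tokens=True)
```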