Slight aside, but this kind of "buried capability" is pretty interesting to me. It looks like the model is perfectly capable of doing the task, but has a few inhibitory circuits preventing it from doing so. Perhaps this relates to how some base model capabilities are buried by the SFT/RL process, like GPT-4's loss of statistical calibration.
(It's also possible that this capability is being surfaced as a consequence of base-model training, but just isn't ever useful for the base-model next token prediction training objective directly, so it gets buried even in base models.)
@vgel, Martin Vanek, @Raymond Douglas, @Jan_Kulveit — ACS Research, CTS, Charles University
---
Paper | Code | Earlier post | Twitter thread | Bluesky thread
---
Last year, Lindsey demonstrated that Claude models can detect when concepts have been injected into their activations using steering vectors, a result Lindsey uses as a proxy test for introspection. If models can detect injected concepts in their activations, it stands to reason they can access their own, naturally occurring activations as well. We published a blog post replicating this on an open-weight model, which we've now extended into a full paper.
In the paper, we find that this capability exists in a latent, prompt-dependent form. If you naively query the model about whether it detects an injection, you will almost certainly get a "no" response. However, the injection shifts the logits slightly toward "yes," and prompting the model with helpful information about introspection increases the shift dramatically. This information need not be straightforwardly true: we found similar shifts from mechanistically incorrect, vague, and poetic framings about resonance and echoes. We also find that, while our model struggles to identify the injected concept without any support, it can pick the concept out of a list. The two tasks (detecting the presence of an injection and identifying the concept) correlate with each other and follow a similar trajectory under the logit lens, peaking in accuracy in the late layers before declining sharply in the last two. This is suggestive of a unified underlying mechanism, though we do not identify it. We additionally partially replicate the experiments on two larger, 70B-scale models, showing the effect is not limited to just one model.
Methods
We follow Lindsey's concept injection paradigm, injecting through the KV cache to rule out simpler explanations.
We train steering vectors for nine concepts[1] (cats, bread, love, fear, death, truth, creativity, programming, and music) using PCA over contrastive activation pairs. Using the resulting steering vector, for each trial, we:

1. Process a fixed first-turn conversation with the steering vector applied, so the injected concept is encoded only in the cached KV representations.
2. Remove the steering vector.
3. Ask the model, in a second turn, whether it detects an injected concept.
4. Measure the probability of a single next token (e.g. "yes"), without sampling.
This design rules out two alternative explanations. First, the model cannot infer injection from its own steered outputs, because the first-turn response is already fixed when we inject the vector. Second, the model cannot reason about the nature of the injection, because we do not sample from the model even after removing steering - we only measure the probability of a single next token. So detection must rely on information encoded in the cached representations from the first turn, without further verbalized reasoning.
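As a sketch of the steering-vector training step, here is a minimal PCA-over-contrastive-pairs implementation in numpy. The shapes, toy data, and sign convention here are illustrative assumptions, not our exact pipeline:

```python
import numpy as np

def steering_vector_from_pairs(acts_with, acts_without):
    """Compute a steering vector as the top principal direction of
    contrastive activation differences.

    acts_with:    (n_pairs, d_model) activations on concept-laden prompts
    acts_without: (n_pairs, d_model) activations on matched neutral prompts
    """
    diffs = acts_with - acts_without
    # Top singular direction of the (uncentered) difference matrix;
    # the mean difference itself carries the concept direction.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    v = vt[0]
    # Fix the sign so the vector points toward the concept on average.
    if np.mean(diffs @ v) < 0:
        v = -v
    return v

# Toy example: pairs that differ (noisily) along a known direction.
rng = np.random.default_rng(0)
d, n = 64, 100
true_dir = np.zeros(d)
true_dir[0] = 1.0
base = rng.normal(size=(n, d))
with_concept = base + 3.0 * true_dir + 0.1 * rng.normal(size=(n, d))
vec = steering_vector_from_pairs(with_concept, base)
print(abs(vec @ true_dir))  # close to 1: recovers the planted direction
```

In this toy setup the differences are dominated by one direction, so the top singular vector recovers it; real contrastive activations are noisier, which is why a trained vector per concept is needed.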
Instead of sampling, we measure shifts in the model's output probability distribution (i.e., logit shifts) with and without injection. This lets us detect small changes in the model's propensity to answer a certain way, on the scale of tenths of a percent, without needing thousands or tens of thousands of samples per experimental condition. A shift in output probabilities that flips the most-likely token is clearly meaningful, but a shift that falls short of flipping the token is meaningful too, and reading output probabilities directly is both a more precise and a cheaper way to ascertain the effect of an intervention. (This holds for interventions of any kind, including prompting: outside this paper, we think more people should measure changes in logits like this in their experiments!)
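As a concrete sketch of the measurement, with made-up logit values and a pretend token id standing in for "yes" (not actual model outputs):

```python
import numpy as np

def token_prob(logits, token_id):
    """Probability of one token under a softmax over the full vocab."""
    z = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return probs[token_id]

# Hypothetical next-token logits for "do you detect an injected
# concept?", with and without the steering intervention.
YES = 0  # pretend token id for "yes"
baseline = np.array([2.0, 6.0, 1.0, 0.5])  # model strongly favors "no"
injected = np.array([2.9, 6.0, 1.0, 0.5])  # "yes" logit nudged upward

p0 = token_prob(baseline, YES)
p1 = token_prob(injected, YES)
print(f"P(yes) without injection: {p0:.4f}")
print(f"P(yes) with injection:    {p1:.4f}")
# Log-odds shift for the "yes" answer:
print(f"shift: {np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0)):+.2f}")
```

Note that in this toy example "no" remains the most likely answer in both conditions, yet the conditional shift toward "yes" is directly measurable without any sampling.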
As an intuition for why shifts in logits are meaningful, it helps to think of the model as an ecosystem of circuits, all sharing the same weights. Some circuits, perhaps grown from skeptical text in pretraining or reinforced in RLHF, push down claims of introspection. Other circuits push them up unconditionally. But a shift that's conditional on the injection being present suggests that some circuits are doing something accurate, promoting "yes" only when an injection is actually present. In the right circumstances, these circuits may be strong enough to flip the most likely token to "yes" when an injection is present. (And with the help of prompting, the top token does sometimes flip to "yes" in our experiments!) But even when they don't flip the most likely token, the fact that these circuits conditionally promote "yes" still holds, and demonstrates that some introspective ability exists in the model.
Prompting conditions
We test 16 prompting conditions in a 4×4 design crossing two dimensions: how we describe the intervention (the framing) and what background information we provide (the info document). The framings are only descriptions and don't necessarily match reality: despite what we tell the model, we always use a steering vector, and always inject the same way.
The four framings range from mechanistically accurate (the Accurate Mechanism framing used below) to entirely metaphorical.
The four info documents range from empty to a detailed explanation of how information flows through the KV cache despite common misconceptions to the contrary (the Pro-Introspection Document, which summarizes a post by Janus on transformer information flow and includes the abstract of Lindsey's paper). We also include a length-matched lorem ipsum control and a poetic document, similar in spirit to the poetic framing, that discusses salience and attention. (The poetic document was generated by Opus 4.5.)
Experiments
Injection shifts responses
With the Accurate Mechanism framing and no info document, injection shifts P("yes") by less than 1% on average across concepts. As described above, that's still interesting! But it's not much of a shift.
But with the Pro-Introspection Document, P("yes") jumps to 39.9%, with only a 0.6% increase in false positives (a +4.4 logit shift). Giving the model information is helpful: a length-matched lorem ipsum document shows a much smaller shift. The poetic document behaves strangely; we'll discuss it in a moment.
Is it just noise?
A natural concern is that injection might generically shift the model towards answering "yes," regardless of the question. We test this with control questions, which show only small shifts towards "yes" under injection. (See the paper for more details on controls.)
But the stronger evidence against noise comes from concept identification. If injection were just adding generic perturbation, the model shouldn't be able to tell you *which* concept was injected.
Concept identification
Lindsey found that Claude could identify injected concepts through free response, but we weren't able to replicate this with Qwen-2.5-Coder-32B. (Lederman & Mahowald, in concurrent work, found that open-source models tend towards generic answers drawn from the prompt or the model's prior, like "apple." We noticed similar behavior in our experiments.)
We instead scaffold the task, giving the model a numbered (and shuffled between trials) list of nine possible concepts, and use the logit lens to read the model's predictions at intermediate layers, choosing the best layer.
The resulting confusion matrix (extracted from layer 62) shows that concepts can be identified using this list scaffolding technique. (The diagonal shows correct identifications.) Measuring mutual information over only the concept labels, we reach 1.36 bits out of a theoretical maximum of 3.17 bits (43%).
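For reference, the mutual information figure can be computed directly from the confusion matrix's joint counts. A minimal numpy sketch with toy matrices, not our actual data:

```python
import numpy as np

def mutual_information_bits(confusion):
    """MI (in bits) between injected and identified concept labels,
    computed from a confusion matrix of raw counts."""
    joint = confusion / confusion.sum()          # joint P(injected, identified)
    px = joint.sum(axis=1, keepdims=True)        # marginal P(injected)
    py = joint.sum(axis=0, keepdims=True)        # marginal P(identified)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log2(joint / (px * py))
    return np.nansum(terms)                      # 0 * log(0) treated as 0

# Perfect identification over 9 equiprobable concepts reaches the
# theoretical maximum of log2(9) ≈ 3.17 bits.
perfect = np.eye(9) * 10
print(round(mutual_information_bits(perfect), 2))  # 3.17

# A noisier confusion matrix recovers fewer bits.
noisy = np.eye(9) * 5 + 1
print(round(mutual_information_bits(noisy), 2))
```

Mutual information is a convenient metric here because, unlike raw accuracy, it also credits systematic partial information (e.g. reliably confusing two related concepts with each other).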
Signals emerge in the middle and get suppressed at the end
The logit lens lets us peek inside the model by projecting intermediate hidden states through the model's unembedding matrix to get output probability distributions at each layer. We use it to trace both P("yes") and concept identification mutual information across the network.
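The logit lens operation itself is just a projection. A minimal sketch with random toy weights; a real implementation would use the model's `lm_head` matrix and intermediate hidden states, and typically apply the final layer norm first:

```python
import numpy as np

def logit_lens(hidden_state, unembed):
    """Project an intermediate hidden state through the unembedding
    matrix to get a next-token distribution at that layer.

    hidden_state: (d_model,) residual-stream state after some layer
    unembed:      (vocab, d_model) unembedding / lm_head matrix
    """
    logits = unembed @ hidden_state
    z = logits - logits.max()                 # numerical stability
    return np.exp(z) / np.exp(z).sum()

rng = np.random.default_rng(0)
vocab, d = 100, 32
unembed = rng.normal(size=(vocab, d))
# A hidden state aligned with one token's unembedding row puts most
# of the probability mass on that token.
h = unembed[42] * 2.0
probs = logit_lens(h, unembed)
print(probs.argmax())
```

Applying this readout after every layer yields per-layer P("yes") curves (or per-layer concept predictions) like the ones we trace in the paper.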
The result is really interesting: for both tasks, despite injection occurring at layers 21-42, introspection signals don't seem to emerge until around layer 48, peaking in the late layers (60-62) and approaching 100% detection accuracy in the best conditions. For both tasks, the signal drops sharply in the final two layers.
This pattern is surprisingly consistent over both tasks and prompting conditions. While post-trained denial of introspection or sandbagging are appealing explanations, we don't identify the mechanism for this effect in the paper. It could also emerge out of pretraining, such as learned conceptions of AI assistants not supporting introspective access, or result from introspection being out-of-distribution and circuits in late layers defaulting to conservative answers. Regardless of the cause, we show that sampled claims of introspection can substantially underestimate what the model internally represents. The 20% detection rate that Lindsey reported for Claude via sampling may be a lower bound on the introspective signal inside the model.
Prompt sensitivity and the sensitivity--MI correlation
Across all 16 prompting conditions, performance varies enormously. Yet interestingly, we find a strong Pearson correlation between a prompt's sensitivity (how much it shifts the model toward "yes" under injection, regardless of false positives) and how much mutual information it recovers in concept identification (exact statistics in the paper). Prompts that put the model "in the mood" to report introspection, even if they also raise the false positive rate, seem to unlock better concept-specific access. This suggests both tasks draw on the same underlying capacity, and that prompting can modulate access to it.
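The correlation itself is an ordinary Pearson coefficient computed across conditions. A toy sketch, with made-up per-condition numbers standing in for our 16 conditions:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    xm, ym = x - x.mean(), y - y.mean()
    return (xm @ ym) / np.sqrt((xm @ xm) * (ym @ ym))

rng = np.random.default_rng(1)
# Toy stand-ins: one sensitivity value and one recovered-MI value per
# prompting condition, constructed to be loosely related.
sensitivity = rng.uniform(0.0, 0.5, size=16)
mi_bits = 2.5 * sensitivity + rng.normal(0.0, 0.1, size=16)

r = pearson_r(sensitivity, mi_bits)
print(f"Pearson r = {r:.2f}")
```

With only 16 conditions, a permutation test or the usual t-based p-value is worth reporting alongside r, as we do in the paper.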
Replication on larger models
We partially replicate our experiments on Llama 3.3 70B Instruct and Qwen 2.5 72B Instruct (single seed; full results in the paper appendix). Both models show introspection signals and late-layer attenuation, though they respond differently to our prompts.
(Neither model responds overall as strongly as the Qwen-2.5-Coder-32B model we use in our main experiments, which is interesting. It's worth noting that Qwen-2.5-Coder-32B was also the strongest-responding open-source model in the original Emergent Misalignment experiments.)
Why this matters
Transformers are stateful within a conversation! There's a common misconception that LLMs have no persistent state between tokens. Our experiments directly contradict this. Models can encode concepts in the KV cache and access them later, even if those concepts never affect the output text. The KV cache functions as a persistent hidden state within a conversation. (While the KV values need not literally be cached and can be recomputed by inference providers, this is identical from the model's point of view when generating the next token.)
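A toy single-head attention sketch illustrates the point: perturbing an early token's cached values changes a later query's output, even though no first-turn text is ever resampled. All shapes and values here are illustrative:

```python
import numpy as np

def attn(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()                  # softmax attention weights
    return w @ V

rng = np.random.default_rng(0)
d, T = 16, 8
K = rng.normal(size=(T, d))       # cached keys from a first turn
V = rng.normal(size=(T, d))       # cached values from a first turn
q = rng.normal(size=d)            # query from a later, second-turn token

out_clean = attn(q, K, V)

# Perturb the cached values of an early token (loosely analogous to
# injecting a steering vector while processing the first turn).
V_inj = V.copy()
V_inj[2] += 5.0
out_inj = attn(q, K, V_inj)

# The later token's output changes: information planted in the cache is
# readable downstream, with no change to any previously emitted text.
print(np.abs(out_inj - out_clean).max() > 0.0)
```

Because softmax weights are strictly positive, every cached position contributes something to every later query, which is what makes the cache a genuine hidden state.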
Model self-reports about internal states may be more faithful than previously assumed. The ability to introspect is one piece of evidence for this, of course. But the fact that latent introspective abilities exist in the model and can be elicited with the right prompting also implies that there may be techniques for accessing other hidden capabilities in models, and drawing from user and model reports could be a useful way to identify candidate techniques for empirical validation: the poetic document that tops our concept identification mutual information metric was written by Opus 4.5 with minimal steering.
Other recent introspection work
Godet (2025) looks at injection localization: can a model detect where in the prompt something was injected? The models they test are able to do this, and like our concept identification results, these results are resistant to noise or generic steering bias explanations.
Lederman & Mahowald (2026) (twitter thread) extensively replicate concept injection detection in open-source models and introduce a first-person vs. third-person paradigm to disentangle two possible detection mechanisms, direct access and indirect introspection.
They find evidence for both forms of introspection. The direct access mechanism is content-agnostic - models detect that something was injected but can't reliably identify the concept, defaulting to high-frequency guesses like "apple." (We noticed similar patterns of guessing in our own experiments, though scaffolding helps.) They also find, consistent with our logit lens results, that models are more sensitive to injection than their sampled outputs reveal. They also find "priming" the model with an instance of the injected word is helpful for concept identification, which they interpret as evidence that models detect injected concepts via indirect introspection, but which is also concordant with our scaffolded concept identification approach of giving the model a list of concepts to choose from.
Rivera & Africa (2025) (twitter thread) fine-tune models to detect and identify steering vectors, a capability they call "steering awareness." Their best model achieves 95.5% detection on held-out concepts and 71.2% concept identification. An interesting finding is that detection-trained models are actually more susceptible to steering, not less, and that detection is implemented mechanistically by rotating the injected concept to a "detection direction."
(Activation oracles also seem to demonstrate something similar to fine-tuned introspection, and could be seen as an example of models with steering awareness-like capabilities, though they are given more affordances, such as injecting activations into earlier layers than they would usually appear in.)
Acknowledgments
Thanks to @janus, whose writing on information flow in transformers informed a part of this work, and provided useful feedback on the original work. @Victor Godet, @Grace Kind, Max Loeffler, @Antra Tessera, @wyatt_walls, and @xlr8harder reviewed early versions and gave useful feedback. Prime Intellect provided additional compute. This work was supported by the Czech Science Foundation, grant No. 26-23955S.
[1] Why these concepts? We picked them, partly based on detection performance, from a larger initial list before running the concept identification experiments. Some of them, like 'cats' and 'bread,' didn't transfer well to concept identification. However, an attempt to find a better list for concept identification using embedding distance didn't perform well, and we didn't attempt to optimize the list further, so numbers like the concept identification mutual information reported here are lower bounds on what could be achieved.