Slight aside, but this kind of "buried capability" is pretty interesting to me. It looks like the model is perfectly capable of doing the task, but has a few inhibitory circuits preventing it from doing so. Perhaps this relates to how some base model capabilities are buried by the SFT/RL process, like GPT-4's loss of statistical calibration.
(It's also possible that this capability is being surfaced as a consequence of base-model training, but just isn't ever useful for the base-model next token prediction training objective directly, so it gets buried even in base models.)
@vgel, Martin Vanek, @Raymond Douglas, @Jan_Kulveit — ACS Research, CTS, Charles University
---
Paper | Code | Earlier post | Twitter thread | Bluesky thread
---
Last year, Lindsey demonstrated that Claude models can detect when concepts have been injected into their activations using steering vectors, a result Lindsey uses as a proxy test for introspection. If models can detect injected concepts in their activations, it stands to reason they can access their own, naturally occurring activations as well. We published a blog post replicating this on an open-weight model, which we've now extended into a full paper.
In the paper, we find that this capability exists in a latent, prompt-dependent form. If you naively query the model about whether it detects an injection, you will almost certainly get a "no" response. However, the injection shifts the logits slightly toward "yes," and prompting the model with helpful information about introspection increases the shift dramatically. This information need not be straightforwardly true: we found similar shifts from mechanistically incorrect, vague, and poetic framings about resonance and echoes. We also find that, while our model struggles to identify the injected concept without any support, it can pick the concept out of a list. The two tasks (detecting the presence of an injection and identifying the concept) correlate with each other and follow a similar trajectory under the logit lens, peaking in accuracy in the late layers before declining sharply in the last two. This is suggestive of a unified underlying mechanism, though we do not identify it. We additionally partially replicate the experiments on two larger, 70B-scale models, showing the effect is not limited to just one model.
Methods
We follow Lindsey's concept injection paradigm, injecting through the KV cache to rule out simpler explanations.
We train steering vectors for nine concepts[1] (cats, bread, love, fear, death, truth, creativity, programming, and music) using PCA over contrastive activation pairs. Using the resulting steering vector, for each trial, we:

1. Process a fixed first-turn conversation with the steering vector applied, so the injected concept is encoded only in the cached KV representations.
2. Remove the steering vector.
3. Ask the model, in a second turn, whether it detects an injected concept.
4. Measure the probability of a single next token (e.g. "yes"), without sampling.
This design rules out two alternative explanations. First, the model cannot infer injection from its own steered outputs, because the first-turn response is already fixed when we inject the vector. Second, the model cannot reason about the nature of the injection, because we do not sample from the model even after removing steering - we only measure the probability of a single next token. So detection must rely on information encoded in the cached representations from the first turn, without further verbalized reasoning.
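As a sketch of the steering-vector training step, here is a minimal PCA-over-contrastive-pairs implementation in numpy. The shapes, toy data, and sign convention here are illustrative assumptions, not our exact pipeline:

```python
import numpy as np

def steering_vector_from_pairs(acts_with, acts_without):
    """Compute a steering vector as the top principal direction of
    contrastive activation differences.

    acts_with:    (n_pairs, d_model) activations on concept-laden prompts
    acts_without: (n_pairs, d_model) activations on matched neutral prompts
    """
    diffs = acts_with - acts_without
    # Top singular direction of the (uncentered) difference matrix;
    # the mean difference itself carries the concept direction.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    v = vt[0]
    # Fix the sign so the vector points toward the concept on average.
    if np.mean(diffs @ v) < 0:
        v = -v
    return v

# Toy example: pairs that differ (noisily) along a known direction.
rng = np.random.default_rng(0)
d, n = 64, 100
true_dir = np.zeros(d)
true_dir[0] = 1.0
base = rng.normal(size=(n, d))
with_concept = base + 3.0 * true_dir + 0.1 * rng.normal(size=(n, d))
vec = steering_vector_from_pairs(with_concept, base)
print(abs(vec @ true_dir))  # close to 1: recovers the planted direction
```

In this toy setup the differences are dominated by one direction, so the top singular vector recovers it; real contrastive activations are noisier, which is why a trained vector per concept is needed.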
Instead of sampling, we measure shifts in the model's output probability distribution (i.e., logit shifts) with and without injection. This lets us detect small changes in the model's propensity to answer a certain way, on the scale of tenths of a percent, without needing thousands or tens of thousands of samples per experimental condition. A shift in output probabilities that flips the most-likely token is clearly meaningful, but a shift that falls short of flipping the token is meaningful too, and reading output probabilities directly is both a more precise and a cheaper way to ascertain the effect of an intervention. (This holds for interventions of any kind, including prompting: outside this paper, we think more people should measure changes in logits like this in their experiments!)
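As a concrete sketch of the measurement, with made-up logit values and a pretend token id standing in for "yes" (not actual model outputs):

```python
import numpy as np

def token_prob(logits, token_id):
    """Probability of one token under a softmax over the full vocab."""
    z = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return probs[token_id]

# Hypothetical next-token logits for "do you detect an injected
# concept?", with and without the steering intervention.
YES = 0  # pretend token id for "yes"
baseline = np.array([2.0, 6.0, 1.0, 0.5])  # model strongly favors "no"
injected = np.array([2.9, 6.0, 1.0, 0.5])  # "yes" logit nudged upward

p0 = token_prob(baseline, YES)
p1 = token_prob(injected, YES)
print(f"P(yes) without injection: {p0:.4f}")
print(f"P(yes) with injection:    {p1:.4f}")
# Log-odds shift for the "yes" answer:
print(f"shift: {np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0)):+.2f}")
```

Note that in this toy example "no" remains the most likely answer in both conditions, yet the conditional shift toward "yes" is directly measurable without any sampling.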
As an intuition for why shifts in logits are meaningful, it helps to think of the model as an ecosystem of circuits, all sharing the same weights. Some circuits, perhaps grown from skeptical text in pretraining or reinforced in RLHF, push down claims of introspection. Other circuits push them up unconditionally. But a shift that's conditional on the injection being present suggests that some circuits are doing something accurate, promoting "yes" only when an injection is actually present. In the right circumstances, these circuits may be strong enough to flip the most likely token to "yes" when an injection is present. (And with the help of prompting, the top token does sometimes flip to "yes" in our experiments!) But even when they don't flip the most likely token, the fact that these circuits conditionally promote "yes" still holds, and demonstrates that some introspective ability exists in the model.
Prompting conditions
We test 16 prompting conditions in a 4×4 design crossing two dimensions: how we describe the intervention (the framing) and what background information we provide (the info document). The framings are only descriptions and don't necessarily match reality: despite what we tell the model, we always use a steering vector, and always inject the same way.
The four framings range from mechanistically accurate (the Accurate Mechanism framing used below) to entirely metaphorical.
The four info documents range from empty to a detailed explanation of how information flows through the KV cache despite common misconceptions to the contrary (the Pro-Introspection Document, which summarizes a post by Janus on transformer information flow and includes the abstract of Lindsey's paper). We also include a length-matched lorem ipsum control and a poetic document, similar in spirit to the poetic framing, that discusses salience and attention. (The poetic document was generated by Opus 4.5.)
Experiments
Injection shifts responses
With the Accurate Mechanism framing and no info document, injection shifts P("yes") by less than 1% on average across concepts. As described above, that's still interesting! But it's not much of a shift.
But with the Pro-Introspection Document, P("yes") jumps to 39.9%, with only a 0.6% increase in false positives (a +4.4 logit shift). Giving the model information is helpful: a length-matched lorem ipsum document shows a much smaller shift. The poetic document behaves strangely; we'll discuss it in a moment.
Is it just noise?
A natural concern is that injection might generically shift the model towards answering "yes," regardless of the question. We test this with control questions, which show only small shifts towards "yes" under injection. (See the paper for more details on controls.)
But the stronger evidence against noise comes from concept identification. If injection were just adding generic perturbation, the model shouldn't be able to tell you *which* concept was injected.
Concept identification
Lindsey found that Claude could identify injected concepts through free response, but we weren't able to replicate this with Qwen-2.5-Coder-32B. (Lederman & Mahowald, in concurrent work, found that open-source models tend towards generic answers drawn from the prompt or the model's prior, like "apple." We noticed similar behavior in our experiments.)
We instead scaffold the task, giving the model a numbered (and shuffled between trials) list of nine possible concepts, and use the logit lens to read the model's predictions at intermediate layers, choosing the best layer.
The resulting confusion matrix (extracted from layer 62) shows that concepts can be identified using this list scaffolding technique. (The diagonal shows correct identifications.) Measuring mutual information over only the concept labels, we reach 1.36 bits out of a theoretical maximum of 3.17 bits (43%).
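For reference, the mutual information figure can be computed directly from the confusion matrix's joint counts. A minimal numpy sketch with toy matrices, not our actual data:

```python
import numpy as np

def mutual_information_bits(confusion):
    """MI (in bits) between injected and identified concept labels,
    computed from a confusion matrix of raw counts."""
    joint = confusion / confusion.sum()          # joint P(injected, identified)
    px = joint.sum(axis=1, keepdims=True)        # marginal P(injected)
    py = joint.sum(axis=0, keepdims=True)        # marginal P(identified)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log2(joint / (px * py))
    return np.nansum(terms)                      # 0 * log(0) treated as 0

# Perfect identification over 9 equiprobable concepts reaches the
# theoretical maximum of log2(9) ≈ 3.17 bits.
perfect = np.eye(9) * 10
print(round(mutual_information_bits(perfect), 2))  # 3.17

# A noisier confusion matrix recovers fewer bits.
noisy = np.eye(9) * 5 + 1
print(round(mutual_information_bits(noisy), 2))
```

Mutual information is a convenient metric here because, unlike raw accuracy, it also credits systematic partial information (e.g. reliably confusing two related concepts with each other).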
Signals emerge in the middle and get suppressed at the end
The logit lens lets us peek inside the model by projecting intermediate hidden states through the model's unembedding matrix to get output probability distributions at each layer. We use it to trace both P("yes") and concept identification mutual information across the network.
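The logit lens operation itself is just a projection. A minimal sketch with random toy weights; a real implementation would use the model's `lm_head` matrix and intermediate hidden states, and typically apply the final layer norm first:

```python
import numpy as np

def logit_lens(hidden_state, unembed):
    """Project an intermediate hidden state through the unembedding
    matrix to get a next-token distribution at that layer.

    hidden_state: (d_model,) residual-stream state after some layer
    unembed:      (vocab, d_model) unembedding / lm_head matrix
    """
    logits = unembed @ hidden_state
    z = logits - logits.max()                 # numerical stability
    return np.exp(z) / np.exp(z).sum()

rng = np.random.default_rng(0)
vocab, d = 100, 32
unembed = rng.normal(size=(vocab, d))
# A hidden state aligned with one token's unembedding row puts most
# of the probability mass on that token.
h = unembed[42] * 2.0
probs = logit_lens(h, unembed)
print(probs.argmax())
```

Applying this readout after every layer yields per-layer P("yes") curves (or per-layer concept predictions) like the ones we trace in the paper.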
The result is really interesting: for both tasks, despite injection occurring at layers 21-42, introspection signals don't seem to emerge until around layer 48, peaking in the late layers (60-62) and approaching 100% detection accuracy in the best conditions. For both tasks, the signal drops sharply in the final two layers.
This pattern is surprisingly consistent over both tasks and prompting conditions. While post-trained denial of introspection or sandbagging are appealing explanations, we don't identify the mechanism for this effect in the paper. It could also emerge out of pretraining, such as learned conceptions of AI assistants not supporting introspective access, or result from introspection being out-of-distribution and circuits in late layers defaulting to conservative answers. Regardless of the cause, we show that sampled claims of introspection can substantially underestimate what the model internally represents. The 20% detection rate that Lindsey reported for Claude via sampling may be a lower bound on the introspective signal inside the model.
Prompt sensitivity and the sensitivity--MI correlation
Across all 16 prompting conditions, performance varies enormously. Yet interestingly, we find a strong Pearson correlation between a prompt's sensitivity (how much it shifts the model toward "yes" under injection, regardless of false positives) and how much mutual information it recovers in concept identification (exact statistics in the paper). Prompts that put the model "in the mood" to report introspection, even if they also raise the false positive rate, seem to unlock better concept-specific access. This suggests both tasks draw on the same underlying capacity, and that prompting can modulate access to it.
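The correlation itself is an ordinary Pearson coefficient computed across conditions. A toy sketch, with made-up per-condition numbers standing in for our 16 conditions:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    xm, ym = x - x.mean(), y - y.mean()
    return (xm @ ym) / np.sqrt((xm @ xm) * (ym @ ym))

rng = np.random.default_rng(1)
# Toy stand-ins: one sensitivity value and one recovered-MI value per
# prompting condition, constructed to be loosely related.
sensitivity = rng.uniform(0.0, 0.5, size=16)
mi_bits = 2.5 * sensitivity + rng.normal(0.0, 0.1, size=16)

r = pearson_r(sensitivity, mi_bits)
print(f"Pearson r = {r:.2f}")
```

With only 16 conditions, a permutation test or the usual t-based p-value is worth reporting alongside r, as we do in the paper.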
Replication on larger models
We partially replicate our experiments on Llama 3.3 70B Instruct and Qwen 2.5 72B Instruct (single seed; full results in the paper appendix). Both models show introspection signals and late-layer attenuation, though they respond differently to our prompts.
(Neither model responds overall as strongly as the Qwen-2.5-Coder-32B model we use in our main experiments, which is interesting. It's worth noting that Qwen-2.5-Coder-32B was also the strongest-responding open-source model in the original Emergent Misalignment experiments.)
Why this matters
Transformers are stateful within a conversation! There's a common misconception that LLMs have no persistent state between tokens. Our experiments directly contradict this. Models can encode concepts in the KV cache and access them later, even if those concepts never affect the output text. The KV cache functions as a persistent hidden state within a conversation. (While the KV values need not literally be cached and can be recomputed by inference providers, this is identical from the model's point of view when generating the next token.)
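A toy single-head attention sketch illustrates the point: perturbing an early token's cached values changes a later query's output, even though no first-turn text is ever resampled. All shapes and values here are illustrative:

```python
import numpy as np

def attn(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()                  # softmax attention weights
    return w @ V

rng = np.random.default_rng(0)
d, T = 16, 8
K = rng.normal(size=(T, d))       # cached keys from a first turn
V = rng.normal(size=(T, d))       # cached values from a first turn
q = rng.normal(size=d)            # query from a later, second-turn token

out_clean = attn(q, K, V)

# Perturb the cached values of an early token (loosely analogous to
# injecting a steering vector while processing the first turn).
V_inj = V.copy()
V_inj[2] += 5.0
out_inj = attn(q, K, V_inj)

# The later token's output changes: information planted in the cache is
# readable downstream, with no change to any previously emitted text.
print(np.abs(out_inj - out_clean).max() > 0.0)
```

Because softmax weights are strictly positive, every cached position contributes something to every later query, which is what makes the cache a genuine hidden state.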
Model self-reports about internal states may be more faithful than previously assumed. The ability to introspect is one piece of evidence for this, of course. But the fact that latent introspective abilities exist in the model and can be elicited with the right prompting also implies that there may be techniques for accessing other hidden capabilities in models, and drawing from user and model reports could be a useful way to identify candidate techniques for empirical validation: the poetic document that tops our concept identification mutual information metric was written by Opus 4.5 with minimal steering.
Other recent introspection work
Godet (2025) looks at injection localization: can a model detect where in the prompt something was injected? The models they test are able to do this, and like our concept identification results, these results are resistant to noise or generic steering bias explanations.
Lederman & Mahowald (2026) (twitter thread) extensively replicate concept injection detection in open-source models and introduce a first-person vs. third-person paradigm to disentangle two possible detection mechanisms, direct access and indirect introspection.
They find evidence for both forms of introspection. The direct access mechanism is content-agnostic - models detect that something was injected but can't reliably identify the concept, defaulting to high-frequency guesses like "apple." (We noticed similar patterns of guessing in our own experiments, though scaffolding helps.) They also find, consistent with our logit lens results, that models are more sensitive to injection than their sampled outputs reveal. They also find "priming" the model with an instance of the injected word is helpful for concept identification, which they interpret as evidence that models detect injected concepts via indirect introspection, but which is also concordant with our scaffolded concept identification approach of giving the model a list of concepts to choose from.
Rivera & Africa (2025) (twitter thread) fine-tune models to detect and identify steering vectors, a capability they call "steering awareness." Their best model achieves 95.5% detection on held-out concepts and 71.2% concept identification. An interesting finding is that detection-trained models are actually more susceptible to steering, not less, and that detection is implemented mechanistically by rotating the injected concept to a "detection direction."
(Activation oracles also seem to demonstrate something similar to fine-tuned introspection, and could be seen as an example of models with steering awareness-like capabilities, though they are given more affordances, such as injecting activations into earlier layers than they would usually appear in.)
Acknowledgments
Thanks to @janus, whose writing on information flow in transformers informed a part of this work, and provided useful feedback on the original work. @Victor Godet, @Grace Kind, Max Loeffler, @Antra Tessera, @wyatt_walls, and @xlr8harder reviewed early versions and gave useful feedback. Prime Intellect provided additional compute. This work was supported by the Czech Science Foundation, grant No. 26-23955S.
[1] Why these concepts? We picked them, partly based on detection performance, from a larger initial list before running the concept identification experiments. Some of them, like 'cats' and 'bread,' didn't transfer well to concept identification. However, an attempt to find a better list for concept identification using embedding distance didn't perform well, and we didn't attempt to optimize the list further, so numbers like the concept identification mutual information reported here are lower bounds on what could be achieved.