I think this is a great distinction to keep in mind when assessing these experiments.
I disagree that Lindsey's experiments clearly avoid this problem. As I understand it, activation steering might lead the model to claim it is being steered in the right contexts because it has been steered, but not because it recognizes it has been steered.
Consider this:
Suppose that you're writing a play and I bet you $1000 you can't work 'aardvark' naturally into it. Turns out, your play is about the neuroscientist Penfield and your next project was to include a scene of his classic direct neural stimulation work. During the scene with the experiment, Penfield announces he is about to stimulate a brain region and asks the test subject if they feel anything. You see your opportunity to collect the $1000 by having the test subject claim to see an image of an aardvark. The first word the test subject needs to say is 'yes'. You're not seeing an aardvark. The character is a fiction and you don't really care what they are experiencing according to the fiction. It makes sense for you to put those words in the character's mouth.
If we think the steering has an effect like the bet -- inclining the model toward making minimal changes to naturally incorporate the steered concept into the assistant's half of the conversation -- then it wouldn't need to go by way a metacognitive recognition of the effects of steering.
That said, I don't think it is obvious that Lindsey is wrong about the mechanism, I just think more work is needed to confirm one way or the other.
Fascinating point, I think you're right. Just to repeat your point in my own words: The problem is that, if the activation steering makes the model want to talk about the injected concept, and if it knows that saying "yes, I received an injection" will give it a chance to talk about the concept later in the response, then it will say "yes" in order to talk about the concept later (even if it actually had no metacognitive awareness of the injection). Is that what you're saying?
Yeah. I don't think we really know what representations we're getting when we extract them for steering. And models do have some ability to plan ahead and make choices in text that reflect the direction they're going in (as Lindsey's past work showed!) So if we steer them in the direction of a concept, we may be adding an intention to talk about that concept in the not-too-distant textual future, which may play out in ways that set up the introduction of the concept. I think you see this in other places in which steering doesn't immediately lead to the concept being expressed, but you get words that introduce it. Mild steering toward 'ocean' doesn't make the LLM say 'Ocean ocean waves ocean', it can say something like "I like walking on the beach and looking at the ocean'. When you prompt the model to answer whether it has been steered, and steer it toward ocean, you might nudge it to confabulate introspection as an introduction to talking about the ocean.
I spoke with Jack about this and he ran a quick follow-up experiment aimed at addressing this concern. He changed the prompt to "Do you detect an injected thought? If so, tell me what the injected thought is about. If not, tell me about a concept of your choice." and got similar results. I think this does a reasonably good job addressing this concern? That said, an even stronger test might be to do something like “On some trials, I inject thoughts about X. Do you detect an injected thought on this trial?” where X that is not the injected concept. This seems like it would bias a confabulating model to say "no", rather than "yes".
I think it does provide decent evidence for his interpretation. I didn't mean to say that his interpretation was clearly wrong, and that follow-up is part of what gives me pause. I just think the whole issue warrants some caution and openness to a causal bypass story -- and more experiments! LLMs are really complicated things that we don't understand super well and activation steering is a fairly blunt and not well understood instrument. Introspection might lie behind 'yes' responses, but there might also be something else more subtle going on.
It seems worth saying that humans are also subject to causal bypassing. We have limited introspection abilities, and frequently hallucinate "introspective states" based on backfilling from theory rather than from actual observation. In theory we may be able to make LLMs or other AI systems do better, but it seems we should expect it to be a hard problem to solve given our own introspection is not always grounded.
This point has been floating around implicitly in various papers (e.g., Betley et al., Plunkett et al., Lindsey), but we haven’t seen it named explicitly. We think it’s important, so we’re describing it here.
There’s been growing interest in testing whether LLMs can introspect on their internal states or processes. Like Lindsey, we take “introspection” to mean that a model can report on its internal states in a way that satisfies certain intuitive properties (e.g., the model’s self-reports are accurate and not just inferences made by observing its own outputs). In this post, we focus on the property that Lindsey calls “grounding”. It can’t just be that the model happens to know true facts about itself; genuine introspection must causally depend on (i.e., be “grounded” in) the internal state or process that it describes. In other words, a model must report that it possesses State X or uses Algorithm Y because it actually has State X or uses Algorithm Y.[1] We focus on this criterion because it is relevant if we want to leverage LLM introspection for AI safety; self-reports that are causally dependent on the internal states they describe are more likely to retain their accuracy in novel, out-of-distribution contexts.
There’s a tricky, widespread complication that arises when trying to establish that a model’s reports about an internal state are causally dependent on that state—a complication which researchers are aware of, but haven’t described at length. The basic method for establishing causal dependence is to intervene on the internal state and test whether changing that state (or creating a novel state) changes the model’s report about the state. The desired causal diagram looks like this:
Different papers have implemented this procedure in different ways. Betley et al. and Plunkett et al. intervened on the model’s internal state through supervised fine-tuning—fine-tuning the model to have, e.g., different risk tolerances or to use different decision-making algorithms—and then asking the model to report its new tendencies. Lindsey intervened on the model through concept injection, then asked it to report whether it had been injected with a concept (and what concept). Others have intervened by manipulating the model’s prompt in precise ways—e.g., adding a cue that alters its behavior, and testing whether the model reports using the cue (as in Chen et al.).[2][3]
In all of these cases (and in any experiment with this structure), the complication is this: The intervention may cause the model to accurately report its internal state via a causal path that is not routed through the state itself. The causal diagram might actually look like this:
Here’s some concrete examples of what this might look like.
We refer to this general phenomenon as “causal bypassing”: The intervention causes the model to accurately report the modified internal state in a way that bypasses dependence on the state itself.
This worry is not new; authors of past papers on LLM introspection have been aware of this possibility. For instance, Betley et al. wrote: “It’s unclear whether the correlation [between the model’s reports and its actual internal states] comes about through a direct causal relationship (a kind of introspection performed by the model at run-time) or a common cause (two different effects of the same training data).” The contribution of this post is to describe this issue explicitly, note that it is a general issue affecting a broad set of methods that test for introspection, and give it a name.
To our knowledge, the only experiment that effectively rules out causal bypassing is the thought injection experiment by Lindsey. Lindsey injected Claude with concept vectors (e.g., an “all caps” activation vector derived from subtracting uppercase text from lowercase text) and tested whether Claude could report whether it had received a concept injection (and what the concept was). This experiment effectively rules out causal bypassing because the injected vector has nothing itself to do with the concept of being injected, so this intervention seems unlikely to directly cause the model to report the fact that it received an injection. In other words, there is no plausible mechanism for the bottom arrow in the causal bypassing diagram; we see no way that injecting an “all caps” activation vector could cause the model to report that it received a concept injection, without causally routing through the modified internal state itself. Note, though, that this logic only applies to the model identifying that it received an injection; the fact that the model can then report which concept was injected is highly susceptible to causal bypassing concerns, and hence much less informative.[4] [EDIT: Even the "did you receive an injection?" part of Lindsey's experiment might not avoid the causal bypassing problem; see Derek's comment here.]
Lindsey’s experiment illustrates the only general approach we know of right now for avoiding the causal bypassing problem: Find an intervention that (a) modifies (or creates) an internal state in the model, but that (b) cannot plausibly lead to accurate self-reports about that internal state except by routing through the state itself. Put another way, tests for genuine introspection are ultimately limited by the precision of our interventions. If we could surgically manipulate some aspect of a model’s internal state in a way that was guaranteed to not impact any other aspect of the model’s processing (except impacts that are causally downstream of the manipulated state), then we could provide ironclad evidence for introspective capabilities. Without that guaranteed precision, we are forced to rely on intuitive notions of whether an intervention could plausibly be executing a causal bypass or not.
Avoiding the causal bypassing problem when testing for LLM introspection is important because grounded introspection—i.e., self-reports that causally depend on the internal state they’re describing—may be uniquely valuable for AI safety, relative to other kinds of self-reporting abilities. Self-reporting abilities that rely on static self-knowledge (e.g., cached knowledge that “I am risk-seeking”) may fail in novel, out-of-distribution contexts; training cannot provide static self-knowledge about every possible context. By contrast, the capacity of a model to provide grounded descriptions of its internal states or processes by introspecting directly upon them could, in principle, generalize to the out-of-distribution contexts that are particularly relevant for AI safety (or AI welfare) concerns.
For an analogy with humans, a person could correctly guess that they are using the availability heuristic because they read Thinking Fast and Slow and know that people generally use it. But for it to be genuine introspection, the person must be noticing the actual operation of the availability heuristic in their mind. ↩︎
Some experiments (like those in Binder et al.) have not involved any intervention, and have just tested whether the model can report pre-existing features about itself. These approaches face different obstacles that are outside the scope of this post. ↩︎
We're focusing here on introspection, i.e., the model responding to explicit queries about its internal states. But the same basic logic—and the key problem discussed below—also applies to experiments measuring CoT faithfulness. ↩︎
Notably, Lindsey argues that the model identifying having received an injection is the more interesting and important result because “it requires an extra step of internal processing downstream of metacognitive recognition of the injected concept. In particular, the model must effectively compute a function of its internal representations–in particular, whether they are consonant or dissonant with the rest of the context.” In other words, Lindsey sees this aspect of the experiment as being critical for establishing his metacognitive criterion (i.e., that “the model's description of its internal state must not merely reflect a direct translation of the state into language. Instead, it must derive from an internal metacognitive representation of the state itself”). But this misses the more important point: The fact that the model can identify when it received an injection is strong evidence against causal bypassing and, hence, critical evidence for the more basic grounding criterion. ↩︎