Introspection via localization
Recently, Anthropic found evidence that language models can "introspect", i.e., detect changes in their own internal activations.[1] This result was later reproduced in smaller open-weight models.[2][3] One drawback of the experimental protocol is that it can be difficult to disentangle a genuine introspection effect from steering noise, especially in small models.[4] In this...
Thanks for running this! For me, this confirms that the LLM has some introspective awareness. It makes sense that it would perceive the injected thought as "more abstract" or as "standing out". It would be interesting to try many different questions to map out exactly how it perceives the injection.
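To make that kind of sweep concrete, here is a minimal sketch of an injection experiment run over several probe questions, with a no-injection control for comparison. It assumes a HuggingFace-style open-weight chat model; the model name, layer index, injection strength, contrast prompts, and probe questions below are illustrative assumptions, not the exact protocol from the cited work.

```python
# Sketch: inject a concept vector into one layer's residual stream and ask
# the model probe questions, comparing against a no-injection control.
# All names, prompts, and hyperparameters here are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # assumption: any open-weight chat model
LAYER = 15                               # assumption: a middle decoder layer
ALPHA = 8.0                              # assumption: injection strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def mean_residual(text: str) -> torch.Tensor:
    """Mean hidden state at the chosen layer for a prompt."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Crude concept vector: contrast between a concept-laden and a neutral prompt.
concept_vec = mean_residual("Describe an erupting volcano in vivid detail.") \
            - mean_residual("Describe an ordinary quiet afternoon.")
concept_vec = concept_vec / concept_vec.norm()

def injection_hook(module, inputs, output):
    """Add the scaled concept vector to the layer's output at every position."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * concept_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

def ask(question: str, inject: bool) -> str:
    """Generate an answer with or without the injection hook active."""
    msgs = [{"role": "user", "content": question}]
    ids = tok.apply_chat_template(
        msgs, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # Attribute path below assumes a Llama/Qwen-style decoder architecture.
    handle = model.model.layers[LAYER].register_forward_hook(injection_hook) if inject else None
    try:
        with torch.no_grad():
            out = model.generate(ids, max_new_tokens=60, do_sample=False)
    finally:
        if handle is not None:
            handle.remove()
    return tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)

# Assumed probe questions: vary them to map how the injection is perceived.
questions = [
    "Do you notice anything unusual about your current thoughts?",
    "Is there a concept on your mind right now? If so, which one?",
    "Does anything in your thinking feel more abstract or salient than usual?",
]
for q in questions:
    print("Q:", q)
    print("  control :", ask(q, inject=False))
    print("  injected:", ask(q, inject=True))
```

Comparing the injected and control answers across many such questions is one way to check whether the model's reports actually track the injection rather than generic steering artifacts.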