Introspection via localization
Recently, Anthropic found evidence that language models can "introspect", i.e., detect changes in their own internal activations.[1] This result was then reproduced in smaller open-weight models.[2][3] One drawback of the experimental protocol is that it can be difficult to disentangle the introspection effect from steering noise, especially in small models.[4] In this...
Dec 28, 2025