agastyasridharan — LessWrong

agastyasridharan

12

4mo

Introspection or entropy? Re-examining concept-injection “introspection” in open models

Thanks to Joshua Joseph, Dillon Plunkett, and Julian Huang for their feedback and for helping me refine these ideas.

Anthropic recently reported that language models can “introspect.” They take a steering vector for a concept like “oceans,” add it into the model’s internal activations, and then ask “are you noticing...