seanmagee — LessWrong

seanmagee

4mo

Does GPT-2 Have a Fear Direction?

Anthropic dropped a paper this morning showing that Claude Sonnet 4.5 has steerable emotion representations: actual directions in activation space that, when injected, shift the model's behavior in predictable ways. They found a non-monotonic anger flip: push the steering vector hard enough and the model will flip to something qualitatively...
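To make "injecting a direction in activation space" concrete, here is a minimal toy sketch in plain Python. It is not the method from the Anthropic paper or from this post. It assumes one common recipe (a difference-of-means direction between two sets of activations), and the three-dimensional "activations" and every function name below are hypothetical, chosen only for illustration.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]

def difference_of_means_direction(pos_acts, neg_acts):
    """Assumed recipe: mean activation on 'emotion' prompts minus mean on neutral ones."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return unit([p - q for p, q in zip(mu_pos, mu_neg)])

def steer(hidden, direction, alpha):
    """Inject the direction: add alpha times the unit direction to a hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def projection(hidden, direction):
    """How far the hidden state points along the direction (dot product)."""
    return sum(h * d for h, d in zip(hidden, direction))

# Hypothetical toy activations; real ones would be read from a model's residual stream.
fear_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = difference_of_means_direction(fear_acts, neutral_acts)
hidden = [0.5, -1.0, 2.0]
steered = steer(hidden, direction, alpha=4.0)

print(direction)
print(projection(hidden, direction), projection(steered, direction))
```

Because the direction has unit length, steering with strength `alpha` raises the projection onto it by exactly `alpha`. Whether a real model's behavior changes smoothly with that projection is an empirical question that this toy does not address.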