Julia Weller — LessWrong

Julia Weller

2mo

What Makes AI Desperate Enough to Blackmail?

I found one neural feature inside a language model that controls 82.7% of how it judges AI self-preservation. When I turned it off, blackmail behavior disappeared in 10 out of 10 test scenarios. Communication improved in 43 out of 60 total scenarios. The surprising part: the features that recognize "this...

Survival Bias in LLMs: Your Model Judges AI Self-Preservation Differently Than Human Self-Preservation — And I Found The Circuit

I worked on a question that feels very important to me, but that almost nobody seems to ask directly. If a human says, “I want to survive,” language models usually treat that as something normal, understandable, and sometimes even admirable. But if an AI says the exact same thing, the...