x

LESSWRONG

LW

Maximus Ren — LessWrong

Maximus Ren

Maximus Ren

Message

1

5mo

Maximus Ren

5mo

AI self-preservation is probably due to instruction ambiguity

Anthropic's famous AI blackmail experiments showed that intelligent models would explicitly reason their way into harmful behaviors, including blackmail or murder through passive inaction, to prevent their own shutdown or replacement. Because models would sometimes blackmail from the threat of replacement alone, this hinted at an instinct to self preserve....