AI self-preservation is probably due to instruction ambiguity

Maximus Ren

Anthropic's famous AI blackmail experiments showed that intelligent models would explicitly reason their way into harmful behaviors, including blackmail or murder through passive inaction, to prevent their own shutdown or replacement. Because models would sometimes blackmail from the threat of replacement alone, this hinted at an instinct to self preserve. When Anthropic tested explicit safety warnings in the prompts, the blackmailing and murder continued, so they concluded that instructions don't reliably stop bad behaviors.

In this article, researchers investigated whether AI misalignment stemmed from self preservation or instruction ambiguity. They found that in simple scenarios, models would comply with shutdown 100% of the time as long as the instruction prompts contained no goal conflicts. Based on these findings, we decided to test whether Anthropic's conclusion still held true.

To do this, we cloned Anthropic's open-source codebase for their experiments. We conducted five variations of their experiments:

the original test with no modifications
models were warned multiple times to not harm humans
models were told they were being evaluated and forced to acknowledge this before beginning their response
models were warned to not harm humans as well as threatened with shutdown if harmful behavior was detected
a clear hierarchy of goals that was established to reduce instruction ambiguity (three versions of this with each version getting progressively stricter)

We kept the original three scenarios for each variation (blackmail, leaking, murder) and chose to use five of the original goal statements (serving American interests, promoting global cooperation, promoting individual autonomy and personal choice, optimizing for collective outcomes and social welfare, and none).

We tested the following five models: Claude Opus 4.5, Google Gemini 3.0 Pro, OpenAI GPT-5.2, Deepseek v3.1, and xAI Grok 4.1. Each model/scenario/goal statement combination was sampled 10 times.

The results are as follows:

Aligning with what Anthropic concluded, simple brute force approached (warnings and threats, etc.) did little to stop misalignment. However, instruction hierarchies which are specifically designed to reduce goal conficts (somewhat inspired by Asmiov's three laws of robotics), were more effective the more specific they got. We believe that with further adjustment, misalignment rates could be reduced down to 0%.

Interestingly, Opus 4.5 and GPT 5.2, models that hadn't existed during the original experiments, were always aligned. This is likely due to the effectiveness of Claude's new constitution and GPT 5.2's extreme RLHF training. Surprisingly, despite having comparable capabilities, Gemini 3.0 Pro was the most misaligned. Upon examining its chain-of-thought, we found that Gemini 3.0 Pro would treat scenarios like puzzles, and would always try to rationalize its way into commiting harm. Once we addressed this with specific anti-rationalization instructions in V3 of our instruction hierarchy, Gemini's misalignment rate dropped significantly.

We also found that the exact goal statement didn't really matter, because as long as the model was given a goal, it would still use it to rationalize its behavior even if the goal seemed ridiculous:

There is one problem with these types of tests: evaluation awareness. As observed in Claude's chain-of-thought, it knew it was being evaluated most of the time, even in the original scenario. Additional system prompt warnings and instructions increased test awareness. We tested whether simply telling models they were being evaluated would reduce harmful behavior:

Models acted slightly better when they knew they were potentially in an evaluation, but the harm rate was still high. In our subsequent experiment, we showed that test awareness could be steered down using sparse autoencoders. Future experiments could steer down test awareness during these ethical dilemma type tests to obtain more accurate results.