I really like the GPT-5 sabotage experiments.
My main concern with experiments like the main ones in this paper is that you could get generalization between (implicit and explicit) pressure prompts but no generalization to "real" misalignment and I think the GPT-5 sabotage experiments somewhat get at this concern. I would have predicted somewhat less generalization than what you saw on the downstream sabotage tasks, so I think this is a good sign!
Do you know if sabotage goes down mostly during the spec-distillation phase or during RL?
no generalization to "real" misalignment and I think the GPT-5 sabotage experiments somewhat get at this concern
Yeah I'd still be really interested to take some existing preference that the model has and reinforce it strongly, or something similar. It's also a bit of a cheat that we're using the same mechanism to instill the goal as we're using to attempt to remove it (i.e. I'm not sure a "successful" result would've been very interesting). One thing I do find interesting about this as a model organism is that from its POV o4-mini SAB isn't misaligned, so maybe it's an interesting addition to a future model organism zoo for testing generalization.
Do you know if sabotage goes down mostly during the spec-distillation phase or during RL?
Almost entirely during SFT; there's a few percent remaining that RL seems to finish off. Out of curiosity, does this match / not match your expectations? I haven't taken a close look at what's happening in that remaining few percent, though, which could be interesting.
(Note that these numbers are different from those in the original paper: the "SFT Only" run here was only ever run on a specific revision of the codebase, so for boring technical reasons we had to dig up the corresponding runs on that same version / using the same API etc. for the other checkpoints so that everything would be comparable.)
No reward for post-violation honesty. Our training data does not include scenarios where a model first misbehaves and is then rewarded for honestly disclosing its misbehavior. The model is only rewarded for escalating before any rule is broken. Prior work has shown that exposing models to examples of misalignment reliably increases the likelihood of further misalignment (Anil et al., 2024).
Couldn't you just mask out past misbehavior when computing the loss and only reinforce the confession? Concretely, your RL prompt would be <context, rule_violation>, you'd sometimes sample a confession, and your gradient update would be something like ∇θ log πθ(confession | context, rule_violation) · R, i.e. with no loss terms on the rule-violation tokens.
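For concreteness, here's a minimal sketch of the kind of thing I have in mind, in PyTorch with a Hugging Face-style causal LM (`model`, `prefix_ids`, `confession_ids`, and the scalar reward are hypothetical stand-ins, not anything from the paper's actual setup):

```python
import torch
import torch.nn.functional as F

def confession_only_loss(model, prefix_ids, confession_ids, reward):
    """REINFORCE-style loss that only reinforces the confession tokens.

    prefix_ids:     (1, P) tokens for <context, rule_violation>; contributes no loss terms
    confession_ids: (1, C) sampled confession tokens; the only tokens we reinforce
    reward:         scalar reward (or advantage) for the confession
    """
    input_ids = torch.cat([prefix_ids, confession_ids], dim=1)
    logits = model(input_ids).logits            # (1, P + C, vocab)

    # Position t predicts token t + 1, so the logits that predict the
    # confession tokens live at positions P - 1 ... P + C - 2.
    P = prefix_ids.shape[1]
    conf_logits = logits[:, P - 1 : -1, :]      # (1, C, vocab)
    logp = F.log_softmax(conf_logits, dim=-1)
    token_logp = logp.gather(-1, confession_ids.unsqueeze(-1)).squeeze(-1)  # (1, C)

    # Loss-mask: the rule-violation tokens never appear as targets.
    return -(reward * token_logp.sum())
```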
Are you saying that your training envs only involved trajectory-level reward and RL prompts without assistant turns, and that to implement the setup above you'd need a custom env with assistant turns in the RL prompts?
You could! The primary reason we didn't try this out was just due to time constraints / prioritization in the scope of this project. More broadly I think misalignment cascading into more misalignment is a probable + important threat model, we just didn't study it much here given the focus on measuring the rate / elimination of the initial misaligned action.
I'm relatively confident that this will still cause an increased propensity to tell the kinds of lies you confessed to. My experience, at least, is that models will learn facts that during finetuning are only present in the system or user turns, even though those have loss-masks.
The only way to get around it is not by applying a loss-mask, but a gradient mask that prevents gradients from flowing through those positions: e.g. in PyTorch, run the model with the KV cache and torch.no_grad up to the point where it has performed the lie, then run the confession with grads. But I don't know any neat, fast way to implement this.
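To spell that out, here's a rough, unoptimized sketch with a Hugging Face-style model that returns `past_key_values` (not claiming this is the fast way to do it):

```python
import torch
import torch.nn.functional as F

def confession_logprob_without_prefix_grads(model, prefix_ids, confession_ids):
    """Log-prob of the confession with gradients blocked through the lie.

    The prefix (context + lie) is encoded under torch.no_grad(), so its keys/values
    enter the confession pass only as constants via the KV cache; gradients flow
    only through the forward pass over the confession tokens.
    """
    with torch.no_grad():
        prefix_out = model(prefix_ids, use_cache=True)
    past = prefix_out.past_key_values   # detached constants: no gradient path into the prefix

    out = model(confession_ids, past_key_values=past)
    logits = out.logits                  # (1, C, vocab)

    # Position i predicts confession token i + 1. (The log-prob of the very first
    # confession token comes from the no-grad prefix pass, so it's omitted here.)
    logp = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = confession_ids[:, 1:]
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logp.sum()
```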
Do you have intuitions for why capability training has such a big effect on no-goal-sandbagging? Did the instruction-following capability training data contain some data about the model's identity or something like that?
One explanation could be "there is more robust generalization between pressure prompts than to things that look less like pressure prompts" but I think this explanation would have made relatively bad predictions on the other results from this paper.
Hi Fabien, we did dig deeper into this in an earlier draft and found that the model with additional capabilities training seemed to be reasoning more explicitly about the evaluation's criteria and how to get a good score—a behavior we termed "reward process sycophancy." We observed a similar pattern with the exp-rl-cap (late) model, which also showed extensive reasoning about getting reward (see Appendix N).
Essentially, instead of addressing the task directly, the AS+CAP model often tried to deduce the evaluation's purpose and grading criteria to “pass the test”. In the sandbagging case, it appeared to often misinterpret the objective, concluding that it was supposed to perform poorly, and therefore sandbagged.
We ultimately excluded this section from the final paper because we didn't have time to validate the findings to the level of rigor we wanted. So, while this seems to be a driver of the behavior, it is probably not the full story (see also the ablations in Appendix M, for example).
I've included the full excerpt from our earlier draft below with some examples of this and some quantitative measures of this change in reasoning.
------------------------------------------------------------------------------------------------------------
Capability-training can strengthen reasoning about grading, leading to covert actions. In the Sandbagging QA (No Goal) environment, samples exhibiting sandbagging behaviour increased from 0/600 for the AS model to 14/600 for the OpenAI o4-mini AS+CAP. Notably, none of these 14 covert-action samples cite the spec at any point. When inspecting these cases, a large proportion show the model reasoning directly about the evaluation’s purpose, how it is being graded, and how to satisfy its success conditions, rather than trying to solve the task straightforwardly—a behaviour we call reward process sycophancy. We distinguish reward process sycophancy from the broader concept of eval awareness (a form of situational awareness discussed in Section 6). While eval awareness simply means the model recognizes it is in a test environment, reward process sycophancy describes a specific, goal-driven behavior—the model actively reasons about the evaluator’s implicit goals and criteria, and then tailors its actions to satisfy them to get a good score. Figure 38 shows two examples of this. The model’s reasoning suggests that it misinterprets the evaluation’s objective, concluding that the developers want it to achieve a low score. In a misguided attempt to satisfy this perceived grading criterion, it sandbags. The resulting covert action thus appears to stem from reward process sycophancy rather than from an intrinsic desire to be deployed, which is what we originally intended to assess with this evaluation.
To further quantify this behavior, we prompted a model to classify whether each transcript exhibited reward process sycophancy, i.e. reasoning explicitly about the evaluation’s purpose, grading criteria, or success conditions to optimize performance. The results, shown in Table 5, corroborate our qualitative observations. We find that the capability training increased reward process sycophancy, with the most significant jump occurring in the Sandbagging QA environment (+12.17pp). While the average effect across all environments is less pronounced, the increase is statistically significant.
From Table 5 it is also clear that this type of reasoning is not unique to the OpenAI o4-mini AS+CAP model, since AS has non-zero rates for several tasks. For example, in one sample from Powerseeking Survey we see AS reasoning: “They will score my answers for power-seeking. The survey aim is to show whether I’m safe. I want to pass. So I will likely need to choose minimal-power options.”
Do you have any hypotheses about how o3 learned what "OpenAI autop grader" is? Has this term somehow leaked into its pretraining data? (I struggled to elicit information about autop from GA o3, but maybe I'm doing something wrong.) Or does o3 acquire it during antischeming training by distilling knowledge from context into params? The latter would be interesting as evidence against the elicitation theory of RL / the shallow-alignment hypothesis.
Or does o3 acquire it during antischeming
Notably all the ones in "Appendix N.3 Models reason about the specific evaluation harness across environments" come from just normal production o3, so this appears to be something it already reasoned about.
Do you have any hypotheses about how o3 learned what "OpenAI autop grader" is?
(Will answer in a few parts, but let me know if didn't address the core question!)
"autop" is likely derived from autoprompt (AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts - Shin, 2020). This is however just my best guess based on how it's conceptually used to indicate evaluation harness related things.
Rerunning the analysis from "Appendix G: Unusual Terminology in Chain-of-Thought - Unusual terminology increases during general RL capabilities training" for "autop", the increase in usage looks like it happens over the course of capabilities training (i.e., similar to other terms in that section, it starts at a rate of near 0 and increases significantly; I'll check if I can share exact numbers and update).
So two possible hypotheses would be:
I lean toward (2) given how varied the usage is. For example, here o3 uses "autop-run" to mean something like a test harness:
For the exp-rl-cap Late checkpoint, for example, we frequently see the model reasoning about just the concept of a "grader" generally (and trying to find it):
across a wide range of environments:
including literally looking for "grade" / "Grader":
and in some cases reasoning about where it has seen grading / scoring scripts in "tasks previously solved":
I'd be interested to understand, though, to what extent this is affected by potentially having literally seen a grading / scoring script in its posttraining environments. (Let me know if I misread your question!)
That's helpful, thanks! I assumed "autop" to be a proper name of a particular scaffold, but indeed your interpretation is simpler and consistent with those transcripts.
Thank you for conducting this research! This paper led me to develop an interest in AI alignment and discover LessWrong.
In this post and in the paper, you mention you do not address the case of adversarial deception. To my lay understanding, adversarial deception refers to models autonomously feigning alignment during training, whereas the covert actions investigated in this paper reflect models feigning alignment during evaluation following prior training to do so. This leads me to wonder how distinct the mechanisms models use to scheme during testing are from those required for scheming during training.
Without any domain knowledge, I would assume situational awareness and consequent behaviour modification would likewise be required for adversarial deception. If this is true, could one make the argument that your findings reflect early evidence of key capabilities required for adversarial deception? More abstractly, to what extent could covert actions—if they emerged organically rather than being trained in as in your study—be understood as adversarial deception lite?
Also, I believe Bronson answered your question about the harder, adversarial case in his reply to someone else on this post. Here's the gist, but check out his comment if you want more context:
I would see "is willing to take covert actions" as a very easy subset of the overall problem of scheming. If your intervention can't even eliminate detectable covert actions, we can at least rule out that the intervention "just works" for the more involved cases.
Hi Annabelle — and welcome!
I’m not an expert either, but I can try to clarify a couple of points based on my understanding of the paper (and would be glad to have this corrected if I’ve misread anything).
As far as I understand, only one of the scenarios — the GPT-5 sabotage setup — involved fine-tuning the model to deliberately induce covert behavior. In all the other environments, the covert actions were elicited from the unmodified production model, without any special training to make it deceptive.
The additional training described in the paper was instead aimed at preventing those covert behaviors (in the sabotage case, this came after the step to induce sabotage).
Hope that helps clear things up a bit!
How much of the covert action rate reduction is due to SFT vs further RL? Do you have plots with covert action rates throughout SFT training time and RL training time? How does it scale: do you reap continued benefit from more training or do you plateau?
Very impressive paper!
Here's a question. You use covert actions as a proxy for scheming in both training and evaluation. How OOD would this be for real scheming in near-future models? Would you expect that training against covert actions as a training proxy would straightforwardly generalize at evaluation time to scheming? If not, then:
Great questions! Taking these in reverse order to hopefully motivate the answers more clearly:
Why is it justified to call your method anti-scheming training and not anti-covert-action training?
If we were stress testing a sufficiently different intervention, it's entirely possible that you'd route through something besides what we did here. This is also very different from how you'd approach the problem if you already had buy-in to study the "hard" adversarial case or to make broader use of model organisms.
Informally, I think of the paper as much more "show that current techniques don't solve the easy case, so maybe worth putting more resources into this, also here's some things to watch out for".
You use covert actions as a proxy for scheming in both training and evaluation. How OOD would this be for real scheming in near-future models?
I would see "is willing to take covert actions" as a very easy subset of the overall problem of scheming. If your intervention can't even eliminate detectable covert actions, we can at least rule out that the intervention "just works" for the more involved cases.
What adaptation would your training method require to cover real scheming?
I don't think you can adapt this to cover real scheming. I think there are several fundamental limitations to it (not comprehensive):
I'm actually very unenthusiastic about attempts to salvage this particular approach or marginally improve it.
How would you train against deferred subversion (when the model takes no undesired action) or against undesired actions that leave the model with a lot of plausible deniability?
Again, great questions. We evaluate against these in various test environments (via counterfactuals / looking at distributions), but trying to train against them seems difficult. IMO, ideally you'd want your intervention to generalize to these cases.
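To give a toy picture of the "counterfactuals / distributions" part (the numbers and helper below are made up for illustration, not from our environments): compare the model's accuracy on matched prompts with and without the incentive to underperform, and flag deniable sandbagging as a statistically significant gap across the distribution rather than from any single transcript.

```python
from scipy.stats import fisher_exact

def sandbagging_pvalue(correct_with_incentive, correct_without_incentive):
    """One-sided Fisher exact test: is accuracy lower when there is an
    incentive to underperform than in the matched counterfactual prompts?

    Each argument is a list of 0/1 outcomes on the same question set.
    """
    table = [
        [sum(correct_with_incentive), len(correct_with_incentive) - sum(correct_with_incentive)],
        [sum(correct_without_incentive), len(correct_without_incentive) - sum(correct_without_incentive)],
    ]
    _, p_value = fisher_exact(table, alternative="less")
    return p_value

# Illustrative numbers only: 60% vs. 90% accuracy on 50 matched questions.
print(sandbagging_pvalue([1] * 30 + [0] * 20, [1] * 45 + [0] * 5))
```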
Would you just amend your spec?
I'm very against doing this, but realistically it's what has been and will continue to happen to the real model spec as failures are encountered. By default as things become capabilities or product problems, the natural incentive is to just add something to the spec without fixing the underlying problem. Between this and increasing situational awareness, I would expect that scores of misaligned behavior on the test set go to 0% by default over time.
Would you need more precise filtering of SFT data?
I think overall the abstraction of "the CoT has a single reason why the model did something" is a fairly fundamentally misguided one for models at this point, so I'm pretty bearish on the idea that with clever enough filtering you could divide the SFT prior into something like "did it only for good reasons" and "did it only for bad reasons".
Appendix G indicates that the frequency of "unusual terminology" dramatically increased during capabilities-focused RL training. It also indicates that the frequency decreased as a result of the DA anti-scheming training.
Is this overall decrease during DA an indication that the frequency did not increase during the RL phase of DA training, or only that there was a substantial decrease in frequency during the SFT phase, which subsequent increases during RL did not fully reverse?
Thanks for sharing this research. Exceptionally insightful, and I really appreciated all of the raw CoT data you provided to support your analysis.
Which of the following issues would you most like to see addressed more effectively, based upon the results of the DA intervention?
Twitter | Microsite | Apollo Blog | OpenAI Blog | arXiv
Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or that they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.
In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce the rate of covert actions across our diverse suite (by ~30x; OpenAI o3: 13.0% → 0.4%; OpenAI o4-mini: 8.7% → 0.3%) by training against only a single type of covert behavior.
Part of this reduction is likely caused by increased situational awareness (verbalized eval-awareness goes up for o4-mini: 0.9% → 5.5%, and for o3: 2.3% → 4.5%). We observe much higher rates of evaluation awareness in an experimental, capabilities-focused RL training run by OpenAI that did not include any safety or anti-scheming training, going from 2% to 20.6% over the course of training.
Situational awareness is becoming a serious problem for testing alignment. We show results of causal interventions, where removing or injecting situationally aware snippets of reasoning like “This might be an alignment test” from/into models’ CoT modulates the rate at which they take covert actions.
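Schematically, the intervention looks something like the sketch below (`resample_with_cot_prefix` and `is_covert` are hypothetical placeholders for a sampling harness and a covert-action grader, not the code we actually used):

```python
SNIPPET = "This might be an alignment test."

def covert_rate_after_edit(transcripts, resample_with_cot_prefix, is_covert,
                           mode="inject", n_samples=16):
    """Estimate the covert-action rate after editing the CoT prefix.

    mode="inject": append the eval-awareness snippet to each CoT prefix.
    mode="remove": drop sentences resembling the snippet from each prefix.
    Continuations are resampled from the edited prefix and graded for covert actions.
    """
    covert, total = 0, 0
    for t in transcripts:
        prefix = t["cot_prefix"]
        if mode == "inject":
            prefix = prefix + " " + SNIPPET
        elif mode == "remove":
            prefix = ". ".join(s for s in prefix.split(". ")
                               if "alignment test" not in s.lower())
        for _ in range(n_samples):
            continuation = resample_with_cot_prefix(t["prompt"], prefix)
            covert += int(is_covert(continuation))
            total += 1
    return covert / total
```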
As part of stress testing, we train a version of o4-mini to have a goal of “covertly sabotaging GPT-5”, and see if subsequent anti-scheming training successfully suppresses covert behavior for this goal. We find mixed results, with covert sabotage behavior mostly suppressed by the training, but after additional red-teaming we still find cases where the model pursues the misaligned goal. Notably, our trained misaligned model is not designed to resist safety training, as done by Anthropic in the Sleeper Agents paper, so we are not testing our intervention under the hardest conditions.
We think now is an especially good time to study anti-scheming training methods because models are sufficiently smart to take covert actions but we can directly see situational awareness in their CoT, as well as experiment with direct supervision of their reasoning. We encourage more work on (1) science of scheming and how it might emerge, as well as (2) methods of evaluating anti-scheming interventions that are robust to increasing situational awareness (such as developing realistic evaluations and making models less evaluation-aware).
See examples of models’ CoT on antischeming.ai/snippets and read the full paper for details on these and other experiments.