Stress Testing Deliberative Alignment for Anti-Scheming Training
by Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, and alex.lloyd
Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context,...
Sep 17, 2025133