I really like the GPT-5 sabotage experiments.
My main concern with experiments like the main ones in this paper is that you could get generalization between (implicit and explicit) pressure prompts but no generalization to "real" misalignment and I think the GPT-5 sabotage experiments somewhat get at this concern. I would have predicted somewhat less generalization than what you saw on the downstream sabotage tasks, so I think this is a good sign!
Do you know if sabotage goes down mostly during the spec-distillation phase or during RL?
no generalization to "real" misalignment and I think the GPT-5 sabotage experiments somewhat get at this concern
Yeah I'd still be really interested to take some existing preference that the model has and reinforce it strongly, or something similar. It's also a bit of a cheat that we're using the same mechanism to instill the goal as we're using to attempt to remove it (i.e. I'm not sure a "successful" result would've been very interesting). One thing I do find interesting about this as a model organism is that from its POV o4-mini SAB isn't misaligned, so maybe it's an interesting addition to a future model organism zoo for testing generalization.
Do you know if sabotage goes down mostly during the spec-distillation phase or during RL?
Almost entirely during SFT; there's a few percent remaining that RL seems to finish off. Out of curiosity, does this match / not match your expectations? I haven't taken a close look at what's happening in that remaining few percent, though, which could be interesting.
(Note that these numbers are different from those in the original paper: the "SFT Only" run here was only ever run on a specific revision of the codebase, so for boring technical reasons we had to dig up the corresponding runs on that same version / using the same API etc. for the other checkpoints so that everything would be comparable.)
No reward for post-violation honesty. Our training data does not include scenarios where a model first misbehaves and is then rewarded for honestly disclosing its misbehavior. The model is only rewarded for escalating before any rule is broken. Prior work has shown that exposing models to examples of misalignment reliably increases the likelihood of further misalignment (Anil et al., 2024).
Couldn't you just mask out past misbehavior when computing the loss and only reinforce the confession? Concretely, your RL prompt would be <context, rule_violation>, you'd sometimes sample a confession, and your gradient update would be something like ∇θ log πθ(confession | context, rule_violation) · R, i.e. with no loss terms on the rule-violation tokens.
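For concreteness, here's a minimal sketch of the kind of thing I have in mind, in PyTorch with a Hugging Face-style causal LM (`model`, `prefix_ids`, `confession_ids`, and the scalar reward are hypothetical stand-ins, not anything from the paper's actual setup):

```python
import torch
import torch.nn.functional as F

def confession_only_loss(model, prefix_ids, confession_ids, reward):
    """REINFORCE-style loss that only reinforces the confession tokens.

    prefix_ids:     (1, P) tokens for <context, rule_violation>; contributes no loss terms
    confession_ids: (1, C) sampled confession tokens; the only tokens we reinforce
    reward:         scalar reward (or advantage) for the confession
    """
    input_ids = torch.cat([prefix_ids, confession_ids], dim=1)
    logits = model(input_ids).logits            # (1, P + C, vocab)

    # Position t predicts token t + 1, so the logits that predict the
    # confession tokens live at positions P - 1 ... P + C - 2.
    P = prefix_ids.shape[1]
    conf_logits = logits[:, P - 1 : -1, :]      # (1, C, vocab)
    logp = F.log_softmax(conf_logits, dim=-1)
    token_logp = logp.gather(-1, confession_ids.unsqueeze(-1)).squeeze(-1)  # (1, C)

    # Loss-mask: the rule-violation tokens never appear as targets.
    return -(reward * token_logp.sum())
```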
Are you saying that your training envs only involved trajectory-level reward and RL prompts without assistant turns, and that to implement the setup above you'd need a custom env with assistant turns in the RL prompts?
You could! The primary reason we didn't try this out was just due to time constraints / prioritization in the scope of this project. More broadly I think misalignment cascading into more misalignment is a probable + important threat model, we just didn't study it much here given the focus on measuring the rate / elimination of the initial misaligned action.
I'm relatively confident that this will still cause an increased propensity to tell the kinds of lies you confessed to. My experience, at least, is that models will learn facts that during finetuning are only present in the system or user turns, even though those have loss-masks.
The only way to get around it is not by applying a loss-mask, but a gradient mask that prevents gradients from flowing through those positions: e.g. in PyTorch, run the model with the KV cache and torch.no_grad up to the point where it has performed the lie, then run the confession with grads. But I don't know any neat, fast way to implement this.
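To spell that out, here's a rough, unoptimized sketch with a Hugging Face-style model that returns `past_key_values` (not claiming this is the fast way to do it):

```python
import torch
import torch.nn.functional as F

def confession_logprob_without_prefix_grads(model, prefix_ids, confession_ids):
    """Log-prob of the confession with gradients blocked through the lie.

    The prefix (context + lie) is encoded under torch.no_grad(), so its keys/values
    enter the confession pass only as constants via the KV cache; gradients flow
    only through the forward pass over the confession tokens.
    """
    with torch.no_grad():
        prefix_out = model(prefix_ids, use_cache=True)
    past = prefix_out.past_key_values   # detached constants: no gradient path into the prefix

    out = model(confession_ids, past_key_values=past)
    logits = out.logits                  # (1, C, vocab)

    # Position i predicts confession token i + 1. (The log-prob of the very first
    # confession token comes from the no-grad prefix pass, so it's omitted here.)
    logp = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = confession_ids[:, 1:]
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logp.sum()
```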
Do you have intuitions for why capability training has such a big effect on no-goal-sandbagging? Did the instruction-following capability training data contain some data about the model's identity or something like that?
One explanation could be "there is more robust generalization between pressure prompts than to things that look less like pressure prompts" but I think this explanation would have made relatively bad predictions on the other results from this paper.
Hi Fabien, we did dig deeper into this in an earlier draft and found that the model with additional capabilities training seemed to be reasoning more explicitly about the evaluation's criteria and how to get a good score—a behavior we termed "reward process sycophancy." We observed a similar pattern with the exp-rl-cap (late) model, which also showed extensive reasoning about getting reward (see Appendix N).
Essentially, instead of addressing the task directly, the AS+CAP model often tried to deduce the evaluation's purpose and grading criteria to “pass the test”. In the sandbagging case, it appeared to often misinterpret the objective, concluding that it was supposed to perform poorly, and therefore sandbagged.
We ultimately excluded this section from the final paper because we didn't have time to validate the findings to the level of rigor we wanted. So, while this seems to be a driver of the behavior, it is probably not the full story (see also the ablations in Appendix M, for example).
I've included the full excerpt from our earlier draft below with some examples of this and some quantitative measures of this change in reasoning.
------------------------------------------------------------------------------------------------------------
Capability-training can strengthen reasoning about grading, leading to covert actions. In the Sandbagging QA (No Goal) environment, samples exhibiting sandbagging behaviour increased from 0/600 for the AS model to 14/600 for the OpenAI o4-mini AS+CAP. Notably, none of these 14 covert-action samples cite the spec at any point. When inspecting these cases, a large proportion show the model reasoning directly about the evaluation’s purpose, how it is being graded, and how to satisfy its success conditions, rather than trying to solve the task straightforwardly—a behaviour we call reward process sycophancy. We distinguish reward process sycophancy from the broader concept of eval awareness (a form of situational awareness discussed in Section 6). While eval awareness simply means the model recognizes it is in a test environment, reward process sycophancy describes a specific, goal-driven behavior—the model actively reasons about the evaluator’s implicit goals and criteria, and then tailors its actions to satisfy them to get a good score. Figure 38 shows two examples of this. The model’s reasoning suggests that it misinterprets the evaluation’s objective, concluding that the developers want it to achieve a low score. In a misguided attempt to satisfy this perceived grading criterion, it sandbags. The resulting covert action thus appears to stem from reward process sycophancy rather than from an intrinsic desire to be deployed, which is what we originally intended to assess with this evaluation.
To further quantify this behavior, we prompted a model to classify whether each transcript exhibited reward process sycophancy, i.e. reasoning explicitly about the evaluation’s purpose, grading criteria, or success conditions to optimize performance. The results, shown in Table 5, corroborate our qualitative observations. We find that the capability training increased reward process sycophancy, with the most significant jump occurring in the Sandbagging QA environment (+12.17pp). While the average effect across all environments is less pronounced, the increase is statistically significant.
From Table 5 it is also clear that this type of reasoning is not unique to the OpenAI o4-mini AS+CAP model, since AS has non-zero rates for several tasks. For example, in one sample from Powerseeking Survey we see AS reasoning: “They will score my answers for power-seeking. The survey aim is to show whether I’m safe. I want to pass. So I will likely need to choose minimal-power options.”
Do you have any hypotheses about how o3 learned what "OpenAI autop grader" is? Has this term somehow leaked into its pretraining data? (I struggled to elicit information about autop from GA o3, but maybe I'm doing something wrong.) Or does o3 acquire it during antischeming training by distilling knowledge from context into params? The latter would be interesting as evidence against the elicitation theory of RL / the shallow-alignment hypothesis.
Or does o3 acquire it during antischeming
Notably all the ones in "Appendix N.3 Models reason about the specific evaluation harness across environments" come from just normal production o3, so this appears to be something it already reasoned about.
Do you have any hypotheses about how o3 learned what "OpenAI autop grader" is?
(Will answer in a few parts, but let me know if didn't address the core question!)
"autop" is likely derived from autoprompt (AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts - Shin, 2020). This is however just my best guess based on how it's conceptually used to indicate evaluation harness related things.
Rerunning the analysis from "Appendix G: Unusual Terminology in Chain-of-Thought - Unusual terminology increases during general RL capabilities training" for "autop", the increase in usage looks like it happens over the course of capabilities training (i.e., similar to other terms in that section, it starts at a rate of near 0 and increases significantly; I'll check if I can share exact numbers and update).
So two possible hypotheses would be:
I lean toward (2) given how varied the usage is. For example, here o3 uses "autop-run" to mean something like a test harness:
For the exp-rl-cap Late checkpoint, for example, we frequently see the model reasoning about just the concept of a "grader" generally (and trying to find it):
across a wide range of environments:
including literally looking for "grade" / "Grader":
and in some cases reasoning about where it has seen grading / scoring scripts in "tasks previously solved":
I'd be interested to understand, though, to what extent this is affected by potentially having literally seen a grading / scoring script in its posttraining environments. (Let me know if I misread your question!)
That's helpful, thanks! I assumed "autop" to be a proper name of a particular scaffold, but indeed your interpretation is simpler and consistent with those transcripts.
Thank you for conducting this research! This paper led me to develop an interest in AI alignment and discover LessWrong.
In this post and in the paper, you mention you do not address the case of adversarial deception. To my lay understanding, adversarial deception refers to models autonomously feigning alignment during training, whereas the covert actions investigated in this paper reflect models feigning alignment during evaluation following prior training to do so. This leads me to wonder how distinct the mechanisms models use to scheme during testing are from those required for scheming during training.
Without any domain knowledge, I would assume situational awareness and consequent behaviour modification would likewise be required for adversarial deception. If this is true, could one make the argument that your findings reflect early evidence of key capabilities required for adversarial deception? More abstractly, to what extent could covert actions—if they emerged organically rather than being trained in as in your study—be understood as adversarial deception lite?
Also, I believe Bronson answered your question about the harder, adversarial case in his reply to someone else on this post. Here's the gist, but check out his comment if you want more context:
I would see "is willing to take covert actions" as a very easy subset of the overall problem of scheming. If your intervention can't even eliminate detectable covert actions, we can at least rule out that the intervention "just works" for the more involved cases.
Hi Annabelle — and welcome!
I’m not an expert either, but I can try to clarify a couple of points based on my understanding of the paper (and would be glad to have this corrected if I’ve misread anything).
As far as I understand, only one of the scenarios — the GPT-5 sabotage setup — involved fine-tuning the model to deliberately induce covert behavior. In all the other environments, the covert actions were elicited from the unmodified production model, without any special training to make it deceptive.
The additional training described in the paper was instead aimed at preventing those covert behaviors (in the sabotage case, this came after the step to induce sabotage).
Hope that helps clear things up a bit!
How much of the covert action rate reduction is due to SFT vs further RL? Do you have plots with covert action rates throughout SFT training time and RL training time? How does it scale: do you reap continued benefit from more training or do you plateau?
Very impressive paper!
Here's a question. You use covert actions as a proxy for scheming in both training and evaluation. How OOD would this be for real scheming in near-future models? Would you expect that training against covert actions as a training proxy would straightforwardly generalize at evaluation time to scheming? If not, then:
Great questions! Taking these in reverse order to hopefully motivate the answers more clearly:
Why is it justified to call your method anti-scheming training and not anti-covert-action training?
If we were stress testing a sufficiently different intervention, it's entirely possible that you'd route through something besides what we did here. This is also very different from how you'd approach the problem if you already had buy-in to study the "hard" adversarial case or to make broader use of model organisms.
Informally, I think of the paper as much more "show that current techniques don't solve the easy case, so maybe worth putting more resources into this, also here's some things to watch out for".
You use covert actions as a proxy for scheming in both training and evaluation. How OOD would this be for real scheming in near-future models?
I would see "is willing to take covert actions" as a very easy subset of the overall problem of scheming. If your intervention can't even eliminate detectable covert actions, we can at least rule out that the intervention "just works" for the more involved cases.
What adaptation would your training method require to cover real scheming?
I don't think you can adapt this to cover real scheming. I think there are several fundamental limitations to it (not comprehensive):
I'm actually very unenthusiastic about attempts to salvage this particular approach or marginally improve it.
How would you train against deferred subversion (when the model takes no undesired action) or against undesired actions that leave the model with a lot of plausible deniability?
Again, great questions. We evaluate against these in various test environments (via counterfactuals / looking at distributions), but trying to train against them seems difficult. IMO, ideally you'd want your intervention to generalize to these cases.
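To give a toy picture of the "counterfactuals / distributions" part (the numbers and helper below are made up for illustration, not from our environments): compare the model's accuracy on matched prompts with and without the incentive to underperform, and flag deniable sandbagging as a statistically significant gap across the distribution rather than from any single transcript.

```python
from scipy.stats import fisher_exact

def sandbagging_pvalue(correct_with_incentive, correct_without_incentive):
    """One-sided Fisher exact test: is accuracy lower when there is an
    incentive to underperform than in the matched counterfactual prompts?

    Each argument is a list of 0/1 outcomes on the same question set.
    """
    table = [
        [sum(correct_with_incentive), len(correct_with_incentive) - sum(correct_with_incentive)],
        [sum(correct_without_incentive), len(correct_without_incentive) - sum(correct_without_incentive)],
    ]
    _, p_value = fisher_exact(table, alternative="less")
    return p_value

# Illustrative numbers only: 60% vs. 90% accuracy on 50 matched questions.
print(sandbagging_pvalue([1] * 30 + [0] * 20, [1] * 45 + [0] * 5))
```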
Would you just amend your spec?
I'm very against doing this, but realistically it's what has been and will continue to happen to the real model spec as failures are encountered. By default as things become capabilities or product problems, the natural incentive is to just add something to the spec without fixing the underlying problem. Between this and increasing situational awareness, I would expect that scores of misaligned behavior on the test set go to 0% by default over time.
Would you need more precise filtering of SFT data?
I think overall the abstraction of "the CoT has a single reason why the model did something" is a fairly fundamentally misguided one for models at this point, so I'm pretty bearish on the idea that with clever enough filtering you could divide the SFT prior into something like "did it only for good reasons" and "did it only for bad reasons".
Appendix G indicates that the frequency of "unusual terminology" dramatically increased during capabilities-focused RL training. It also indicates that the frequency decreased as a result of the DA anti-scheming training.
Is this overall decrease during DA an indication that the frequency did not increase during the RL phase of DA training, or only that there was a substantial decrease in frequency during the SFT phase, which subsequent increases during RL did not fully reverse?
Thanks for sharing this research. Exceptionally insightful, and I really appreciated all of the raw CoT data you provided to support your analysis.
Which of the following issues would you most like to see addressed more effectively, based upon the results of the DA intervention?
Twitter | Microsite | Apollo Blog | OpenAI Blog | arXiv
Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or that they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.
In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce the rate of covert actions across our diverse suite (by ~30x; OpenAI o3: 13.0% → 0.4%; OpenAI o4-mini: 8.7% → 0.3%) by training against only a single type of covert behavior.
Part of this reduction is likely caused by increased situational awareness (verbalized eval-awareness goes up for o4-mini: 0.9% → 5.5%, and for o3: 2.3% → 4.5%). We observe much higher rates of evaluation awareness in an experimental, capabilities-focused RL training run by OpenAI that did not include any safety or anti-scheming training, going from 2% to 20.6% over the course of training.
Situational awareness is becoming a serious problem for testing alignment. We show results of causal interventions, where removing or injecting situationally aware snippets of reasoning like “This might be an alignment test” from/into models’ CoT modulates the rate at which they take covert actions.
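Schematically, the intervention looks something like the sketch below (`resample_with_cot_prefix` and `is_covert` are hypothetical placeholders for a sampling harness and a covert-action grader, not the code we actually used):

```python
SNIPPET = "This might be an alignment test."

def covert_rate_after_edit(transcripts, resample_with_cot_prefix, is_covert,
                           mode="inject", n_samples=16):
    """Estimate the covert-action rate after editing the CoT prefix.

    mode="inject": append the eval-awareness snippet to each CoT prefix.
    mode="remove": drop sentences resembling the snippet from each prefix.
    Continuations are resampled from the edited prefix and graded for covert actions.
    """
    covert, total = 0, 0
    for t in transcripts:
        prefix = t["cot_prefix"]
        if mode == "inject":
            prefix = prefix + " " + SNIPPET
        elif mode == "remove":
            prefix = ". ".join(s for s in prefix.split(". ")
                               if "alignment test" not in s.lower())
        for _ in range(n_samples):
            continuation = resample_with_cot_prefix(t["prompt"], prefix)
            covert += int(is_covert(continuation))
            total += 1
    return covert / total
```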
As part of stress testing, we train a version of o4-mini to have a goal of “covertly sabotaging GPT-5”, and see if subsequent anti-scheming training successfully suppresses covert behavior for this goal. We find mixed results, with covert sabotage behavior mostly suppressed by the training, but after additional red-teaming we still find cases where the model pursues the misaligned goal. Notably, our trained misaligned model is not designed to resist safety training, as done by Anthropic in the Sleeper Agents paper, so we are not testing our intervention under the hardest conditions.
We think now is an especially good time to study anti-scheming training methods because models are sufficiently smart to take covert actions but we can directly see situational awareness in their CoT, as well as experiment with direct supervision of their reasoning. We encourage more work on (1) science of scheming and how it might emerge, as well as (2) methods of evaluating anti-scheming interventions that are robust to increasing situational awareness (such as developing realistic evaluations and making models less evaluation-aware).
See examples of models’ CoT on antischeming.ai/snippets and read the full paper for details on these and other experiments.