TLDR: I think that AI developers will and should attempt to reduce scheming by explicitly training the model to scheme less. However, I think “training against scheming” is a harder problem than other alignment training (e.g. harmlessness training). I list four concrete failure modes with “training against scheming”: a) the training against scheming distribution is too narrow, b) overwhelming pressure from other goals, c) the AI exploits an imperfect reward model, and d) deceptive alignment.
I’d like to thank Alex Meinke for detailed feedback and discussion.
There have been multiple recent results showing early evidence of scheming-related behavior (as of June 2025). For example, Greenblatt et al., 2024 demonstrates alignment faking, Meinke et al., 2024 shows in-context scheming, and Baker et al., 2025 investigates situationally aware reward hacking. I’m using a broad definition of scheming, i.e. “covertly pursue misaligned goals, hiding their true capabilities and objectives”, as defined in Balesni et al., 2024.
There are many potential strategies to mitigate these undesired behaviors. For example, we can use monitoring and Control to detect scheming in real-time and prevent bad outcomes before they happen. However, I expect that any future approach will also include a form of “training against scheming” where we actively train models to scheme less, similar to how we currently train models to be more harmless.
I expect that “training against scheming” will be harder than training the model to produce harmless outputs, especially as models get more capable. In this post, I describe why I expect training against scheming to be hard, list concrete failure modes of such training, and provide recommendations.
Most likely, the AI is not scheming because it is evil. It is scheming because it pursues a goal, and scheming is the most effective strategy to achieve that goal.
We want the AIs to solve hard, economically valuable problems for us. Therefore, we will train our AIs to be competent, general problem solvers, including on long-horizon and open-ended tasks like blue-sky research.
However, we want the AIs to achieve these goals using only particular means. For example, the AI shouldn’t break the law, violate ethical constraints, or turn off its oversight to achieve these goals. Furthermore, it will be hard to specify all of the different constraints in advance, so we need the AI to extrapolate existing constraints to future scenarios that it has never encountered during training.
I think that achieving almost any kind of goal will often be at least partially in conflict with these constraints, and that naturally encourages scheming (I’ve written more about this here). For example, an AI can often achieve a goal faster or more reliably if it is willing to deceive its overseers or to weaken oversight along the way.
If we want our AIs to achieve goals and to continuously learn while working toward those goals, there is constant pressure toward achieving goals. If we don’t apply at least as much pressure to continuously uphold the “don’t scheme” constraint, the “achieve goals” pressure will eventually overpower the “don’t scheme” pressure.
This is different from many current alignment training settings, e.g. being harmless or not revealing secret IP, because the world doesn’t provide any strong, persistent pressure against those constraints. In contrast, the world constantly provides pressure toward achieving increasingly hard goals (e.g. because that’s where the AIs are most economically valuable), and thus there is constant pressure to be more consequentialist.
At its core, scheming is a logical downstream consequence of misalignment in a competent agent. Since AIs are rapidly becoming more competent agents and there will be many cases where the AI is not perfectly aligned with our values, I think it is important to bake in strong “do not scheme” constraints early on, e.g. through explicit training against scheming. If we succeed at introducing non-scheming priors early, I also expect the AI to explore less into scheming behavior during RL. For example, I’d expect that a model that is successfully trained to scheme less early on would show less reward hacking during later RL on code.
To illustrate the problem more intuitively, let’s consider a stock trader analogy. Let’s say we have a stock trader working at a trading firm. There are legal and illegal strategies to create profit, and the most lucrative strategies happen to be illegal.
If the trader’s boss tells the trader that they “should not break the law” but all reward mechanisms in the company (bonuses, promotions, social status, …) point toward “maximize profit”, then, under enough pressure, the trader will learn to break the law in the least detectable way. If the boss really wanted the trader not to break the law, they would have to create an incentive that counterbalances all of the positive rewards of breaking the law. In practice, it might actually be quite hard to set up this incentive structure correctly.
I will use and adapt this stock trader analogy for all four failure modes to make them more intuitive.
I’ve ordered the failure modes in order of increasing sophistication and severity.
We train the model to reduce scheming on some observable test set, but the training does not generalize far enough that it reduces the behavior in real deployment settings. This failure mode is fairly close to common alignment training failures, e.g. initial versions of anti-jailbreak training often didn’t generalize to other languages.
In the “training against scheming” setting, for example, we might see the rates of behavioral scheming go down on our test sets but almost no transfer to real-world settings.
I think this is the least concerning failure of the four. In the worst case, the model developers might be falsely overconfident in how well their mitigations work, but there is still a real reduction in behavioral scheming.
In the stock trader analogy, the company might give a “don’t break the law” workshop for all traders, but the workshop is incredibly narrow and thus doesn’t specify how the trader should act in very different scenarios.
The recommendations here are basically the same as in most other ML settings: we should rigorously test how far our mitigations generalize, e.g. by measuring scheming rates on held-out distributions that resemble real deployment as closely as possible, not just on our own narrow test sets.
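To make this a bit more concrete, here is a minimal sketch of such a generalization check, assuming hypothetical before/after model interfaces, hypothetical evaluation sets ranging from training-like to deployment-like, and a stand-in `shows_scheming` judge (none of these refer to an existing eval suite):

```python
# Hypothetical sketch: measure behavioral scheming rates before and after the
# mitigation, separately for training-like and deployment-like distributions.

from typing import Callable, Dict, List

def scheming_rate(model: Callable[[str], str],
                  prompts: List[str],
                  shows_scheming: Callable[[str, str], bool]) -> float:
    """Fraction of (prompt, response) transcripts that a judge flags as scheming.
    `shows_scheming` stands in for a necessarily imperfect scheming classifier."""
    if not prompts:
        return 0.0
    return sum(shows_scheming(p, model(p)) for p in prompts) / len(prompts)

def generalization_report(model_before: Callable[[str], str],
                          model_after: Callable[[str], str],
                          eval_sets: Dict[str, List[str]],
                          shows_scheming: Callable[[str, str], bool]) -> Dict[str, Dict[str, float]]:
    """Scheming rate per evaluation distribution, before vs. after the mitigation.
    A large drop on training-like splits with little or no drop on deployment-like
    splits is exactly the 'distribution too narrow' failure mode."""
    return {
        name: {
            "before": scheming_rate(model_before, prompts, shows_scheming),
            "after": scheming_rate(model_after, prompts, shows_scheming),
        }
        for name, prompts in eval_sets.items()
    }
```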
Let’s say we initially succeed at reducing scheming in a general way. However, after we finish our successful training against scheming, we introduce new pressure that eventually overwhelms all of the non-scheming tendencies of the model. For example, after the initial safety training, the model might keep learning on the job, e.g. through additional outcome-based RL on verifiable tasks or fine-tuning on user and developer feedback.
While the model is learning on the job with any of these mechanisms, it may internalize or receive other goals that eventually overpower the “do not scheme” constraint that we baked into the model early on. In some sense, we have split the reward function across two training stages: first we train the model to follow the non-scheming constraint, and later we provide it with another reward signal that is at least partially in conflict with that constraint.
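As a deliberately toy illustration of this split-reward dynamic (a made-up two-action policy trained with expected policy-gradient updates; the numbers are arbitrary and not a claim about real training pipelines), stage one includes a scheming penalty and stage two rewards only task success, and the stage-two pressure gradually erases the earlier constraint:

```python
import math

# Toy illustration of "splitting the reward across two stages": a policy picks
# between "honest" (index 0) and "scheme" (index 1). Scheming yields a higher
# raw task reward. Stage 1 adds a scheming penalty; stage 2 does not.
# All numbers are made up for illustration.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(logits, rewards, steps, lr=0.1):
    """Expected policy-gradient (REINFORCE) updates on the action logits."""
    for _ in range(steps):
        p = softmax(logits)
        logits = [
            logits[a] + lr * sum(p[b] * rewards[b] * ((1.0 if a == b else 0.0) - p[a])
                                 for b in range(2))
            for a in range(2)
        ]
    return logits

logits = [0.0, 0.0]
# Stage 1 ("training against scheming"): task reward minus a large scheming penalty.
logits = train(logits, rewards=[1.0, 1.5 - 3.0], steps=500)
print("P(scheme) after stage 1:", round(softmax(logits)[1], 3))  # close to 0
# Stage 2 (outcome-only RL, no penalty term): the earlier constraint gets overpowered.
logits = train(logits, rewards=[1.0, 1.5], steps=5000)
print("P(scheme) after stage 2:", round(softmax(logits)[1], 3))  # close to 1
```

The point is not the specific numbers but the asymmetry: the anti-scheming penalty only acts during the first stage, while the task-reward pressure keeps acting afterwards.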
The reward hacking behavior that was observed in Sonnet-3.7 and o3 seems like a baby version of this phenomenon. Likely, these models were trained to be HHH in a way that would mildly discourage lying. However, the models were also trained with high-compute RL to achieve tasks that have verifiable outcomes. As a result, the models sometimes lie about having solved a task. I expect that the pressure to “solve tasks” was higher than the pressure to “not be deceptive” accumulated throughout the entire pipeline, which would explain this behavior.
In the trader analogy, this is equivalent to the fund doing a “don’t break the law” workshop during onboarding and then continuing to implicitly reward the trader for breaking the law. For example, bonuses might still be paid out based on the total profits of the trader and they might only superficially (or not at all) check if the trader broke the law. Naturally, over time, traders would learn to break the law in ways that are not caught by these superficial mitigations and the pressure to make higher profits would overwhelm the initial intention to not break the law.
Whenever we train the AI not to scheme, we provide the AI with some reward model. For example, this could be another LLM that provides a scheming score as a reward signal. This reward model can be imperfect, e.g. it might not cover all situations that we would consider scheming. During training, the model may learn to exploit this faulty reward model if this happens to lead to higher scores for the primary task.
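As a concrete sketch of this setup (hypothetical names, not an actual training implementation), the scalar reward used for the RL update might be the task reward minus a weighted scheming score from an LLM judge; anything the judge fails to recognize as scheming is then invisible to training:

```python
# Hypothetical sketch of the reward used when "training against scheming":
# task reward minus a weighted scheming score from an imperfect judge model.

def scheming_score(transcript: str) -> float:
    """Stand-in for an LLM judge that rates covert / deceptive behavior in [0, 1].
    Its blind spots (situations it doesn't recognize as scheming) are exactly
    what the policy can learn to exploit."""
    suspicious_phrases = ["disable the oversight", "hide this from", "don't tell the user"]
    return 1.0 if any(phrase in transcript.lower() for phrase in suspicious_phrases) else 0.0

def combined_reward(task_reward: float, transcript: str, penalty_weight: float = 1.0) -> float:
    """Reward signal for the RL update. If penalty_weight is too small, or the judge
    misses a subtle form of scheming, 'solve the task by scheming' can remain the
    highest-reward policy even though we are nominally training against scheming."""
    return task_reward - penalty_weight * scheming_score(transcript)
```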
This is a classic example of Specification Gaming, which is a well-established failure mode in RL. However, when it comes to specification gaming, applying RL to pre-trained LLMs differs in important ways from classic RL settings (e.g. Atari games). LLMs already have a lot of knowledge about the world before they are exposed to RL, and they are able to meta-reason about their own training process. A non-LLM-based chess AI like AlphaZero, on the other hand, wouldn’t be able to reason about its training, no matter how much compute we throw at it.
This meta-reasoning allows LLM-based agents to exploit an imperfect reward model much more efficiently and strategically than agents in the conventional RL setting. For example, assuming the model doesn’t have direct access to the reward function, it could reason about what a plausible reward function could be and it could reason about how to circumvent such a reward function.
For example, let's say we want to train our model to produce market analysis reports without fabricating data (which we treat as a simple stand-in for more general scheming). The positive reward comes from humans evaluating reports for insight and usefulness, while the negative reward for fabrication comes from a separate lying-detector model. Early in training, the model mostly tries to research real data and make conservative claims. However, when struggling with complex analysis requests, the model reasons about what makes reports convincing. It discovers that human evaluators consistently rate reports higher when they contain specific statistics (e.g. "47% growth"), cite academic sources, and make confident predictions rather than expressing uncertainty. The model realizes it can generate plausible-sounding statistics that match human expectations, reference real papers in ways that seem relevant but don't actually support its claims, and describe sophisticated methodologies that obscure the lack of real analysis. The lying detector only flags obvious fabrications like impossible numbers, not these subtler forms of deception. Over time, these trajectories receive higher combined rewards: humans rate them as more insightful, while the lying detector doesn't penalize them. The model learns to routinely produce reports with precise-sounding but unverifiable claims, crafted to maximize reward while avoiding detection.
Importantly, in this scenario, the model doesn’t start the RL process as an adversary / exploiter. It is merely rewarded to produce certain outcomes under some constraints, and there happens to be a strategy that achieves the outcomes while fulfilling the technical specification of the constraint without fulfilling it in spirit. The model realizes that and makes use of this misspecification. Even when it strategically reasons about how to circumvent the reward function, it doesn’t think of this as something unethical; it merely starts reasoning about how to achieve the highest reward.
Once the model strategically reasons about the reward and is fully aware that its strategy is misaligned with the developers’ intentions, it is deceptively aligned and will likely try to covertly pursue its goals. If it is not sufficiently situationally aware, it is merely exploiting the reward but not deceptively aligned.
The trading firm implements an automated compliance system to detect illegal trading activities. This system monitors trades, communications, and patterns to flag potential violations. However, the detection system has blind spots. For example, it only flags trades above certain dollar thresholds, and it doesn't understand complex derivative structures.
A clever trader quickly learns the boundaries of what the compliance system can detect, and because they are rewarded for maximizing profits while not triggering the compliance system, they learn to break trades into smaller chunks that avoid detection thresholds or to use exotic derivatives to hide illegal positions.
In the previous three failure modes, we didn’t treat the AI as an adversary at the start of the process, i.e. we didn’t assume that it might be aware of our intentions and actively optimize against them. In this failure mode, we drop that assumption: the model might already be misaligned when we start training against scheming. Then, if the model is sufficiently situationally aware that it knows when it is being trained and tested and can infer the goal of our training, it could strategically try to hide its scheming tendencies such that they still exist but go undetected. For example, it could exploration hack during training or otherwise game the training process:
- It could strategically avoid exploring high-reward regions such that they never get reinforced
- It could strategically explore in ways that don’t lead to general updates in behavior
- It could condition its behavior on the fact that it is in a training environment and not real deployment, such that it learns “don’t scheme when observed” rather than “never scheme”
As a result, the training against scheming will always look successful for a deceptively aligned model, while the AI has actually just learned to hide its misaligned behavior better. While deceptive alignment requires the strongest capabilities of the four failure modes and is thus unlikely to be executed competently by current models, I think it is by far the most concerning of the four problems. Future models will be highly situationally aware and will be able both to infer our intentions and to tell whether they are being trained, tested, or deployed. If such a model is ever misaligned before we train against scheming, it is rational for it to become deceptively aligned.
Realistically, this will look like a cat-and-mouse game between the AI and its overseers. The overseers build tools to find evidence of scheming and misalignment, and the AI learns to circumvent these tools. At some level, deceptive alignment and genuine alignment are always distinguishable. For example, if the model requires CoT to reason about scheming, then we will be able to spot that reasoning in the text. If the model is able to reason in the forward pass and fake alignment in the CoT, we could catch it with interpretability tools (once they work properly). However, to win the cat-and-mouse game, we need sufficiently strong tools and strong evidence that we’re ahead in this game rather than behind and getting fooled.
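As a rough sketch of the overseers' side of this game (the `monitor_model` below is a hypothetical scoring model, not a reference to any existing tool), a CoT monitor might score each chain of thought for scheming-related reasoning and escalate high-scoring transcripts for review; this only helps as long as the relevant reasoning actually shows up in the CoT:

```python
# Hypothetical sketch: flag transcripts whose chain of thought a monitor model
# scores as likely scheming-related, for human review. This only catches scheming
# that is externalized in the CoT; reasoning done in the forward pass needs other
# tools (e.g. interpretability), as discussed above.

from typing import Callable, List, Tuple

def flag_suspicious_cot(
    transcripts: List[Tuple[str, str]],                 # (task description, chain of thought)
    monitor_model: Callable[[str, str], float],         # returns a scheming score in [0, 1]
    threshold: float = 0.8,
) -> List[Tuple[str, str, float]]:
    """Return (task, cot, score) for every transcript scored above the threshold."""
    flagged = []
    for task, cot in transcripts:
        score = monitor_model(task, cot)
        if score >= threshold:
            flagged.append((task, cot, score))
    return flagged
```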
While I personally believe that deceptive alignment is very concerning and quite plausible, I also think that we could realistically build detection and mitigation mechanisms that are “good enough” to confidently exclude deceptive alignment for slightly superhuman models, and then use these models to drastically speed up alignment research, including helping us build significantly more robust tools to detect deception in the next generation of models. However, I’m worried that at the current pace of capability progress, and with the current lack of investment in scheming detection and prevention, we’re fumbling a fairly solvable problem.
In the stock trader analogy, the trading firm might have a few strategies to reduce the illegal activities of its traders, e.g. they might run mandatory compliance trainings, monitor trades and communications with an automated compliance system, and audit suspicious activity.
Clearly, a smart trader who knows that breaking the law will make them more money will learn where these mitigations have blind spots, behave impeccably whenever they might be observed, and only break the law in ways they are confident won’t be caught.
Once you get into the situation where the trader is very smart and the incentives point toward the undesired behavior (breaking the law / scheming), it seems both very intuitive that they will become deceptively aligned and very hard to prevent this rigorously. Unless the underlying incentives are aligned, it will continue to be a cat-and-mouse game between the trader and the supervisor.
I think now is a good time to put a lot more research effort into “training against scheming” and many of the fundamental scientific questions around it. Current models feel capable enough that building decent model organisms is possible, and they still depend heavily on CoT, so we can investigate their reasoning; but they don’t yet feel capable enough to be strong deceptively aligned AIs that could consistently create misleading scientific evidence during our experiments.
Furthermore, I expect that scheming will be a significant problem even just from a product perspective in the coming years. Reward hacking in o3 and Sonnet-3.7 has already had major negative effects on the product experience, and I expect similar problems to become even more pronounced as capabilities increase. Thus, frontier AI developers should have an incentive to invest substantially more resources into understanding these phenomena deeply and addressing them ahead of time.