Reasoning models like Deepseek r1:
If you had told this to my 2022 self without specifying anything else about scheming models, I might have put a non-negligible probability on such AIs scheming (i.e. strategically performing well in training in order to protect their long-term goals).
Despite this, the scratchpads of current reasoning models do not contain traces of scheming in regular training environments - even when there is no harmlessness pressure on the scratchpads like in Deepseek-r1-Zero.
In this post, I argue that:
These considerations do not update me much on AIs that are vastly superhuman, but they bring my P(scheming) for the first AIs able to speed up alignment research by 10x from the ~25% that I might have guessed in 2022 to ~15%[1] (which is still high!).
This update is partially downstream of my beliefs that when AI performs many serial steps of reasoning their reasoning will continue to be strongly influenced by the pretraining prior, but I think that the arguments in this post are still relevant even if you don’t share these beliefs:
In Joe Carlsmith’s report, speed costs is one of his main object-level objections to scheming arising in somewhat-better-than-human-level AIs:
Suppose that our model is a purely feedforward network with 1000 layers and <10,000 serial steps per input, that the instrumental reasoning at stake just in a schemer deciding to training-game (let alone: checking for defection opportunities) takes at least ten steps, and that the model doesn’t begin optimizing for reward-on-the-episode until that reasoning is complete. In that case, the instrumental reasoning is using up around .1% of the compute that could be going to getting reward-on-the-episode, or to some other max-reward goal. [...] Because of the centrality of “counting arguments” to the case for schemers, I think that questions about the strength of the selection pressure against schemers—for example, because of the costs of the extra reasoning schemers have to engage in—are especially important. In particular: I think a key way that “counting arguments” can go wrong is by neglecting the power that active selection can have in overcoming the “prior” set by the count in question. (page 20 and 21)
But I would guess speed costs don’t matter for AIs like Deepseek-r1-Zero.
Prediction 1: there is enough training slack and sufficiently poor exploration in GRPO that given a good scheming initialization (e.g. a short scheming reasoning section at the start of the scratchpad), scheming would not be trained away even with a mild length penalty with an amount of RL comparable to the one Deepseek-r1-Zero went through. (p=0.75)
Note: the credence in this post is about good execution on the concrete experiment I suggest, not the top-level claim. I often have more credence about the top-level claim than the concrete experiment because there are spurious ways in which the experiment might fail.
Joe also exhibits properties that he thinks somewhat-better-than-human-level AIs are likely to have, and the absence of which would undermine scheming:
I think these are slightly more likely to apply to future AIs than they are to apply to Deepseek r1, but I think this depends a lot on how “cursed” the RL environment that will be used for future AIs will be.
Deepseek r1 does hack during training by trying to produce a final answer even when it has no clue what the answer is, but the way this hacking emerges never looks like “let me guess what the reward looks like and try to figure out how to hack it from first principles”. The absence of scheming-ish scratchpads is even more striking for o1, o3, and Sonnet 3.7 who have been reported to produce more evil-looking hacks during training with the most evil-ish scratchpad looking more like the AI having a deep desire to solve the task for terminal rather than something more scheming-ish. The reasons why you don’t get a schemer are somewhat subtle.
Prediction 2: the absence of training gaming is not entirely due to a lack of capabilities, and if you start from a 100% training gaming prior where the AI tries to make guesses in its scratchpad about what situation it is in and what forms of reward are used, RL will quickly make the AI only make mostly correct guesses about what rewards are being used - but it will not make the scratchpads look much more “evil” or use more instrumental consequentialism than what you started with. (p=0.75)
Overall, I don’t think Joe was very wrong about his overall assessment. I think he correctly updated towards ignorance by noticing that the story felt conjunctive and defeasible (and the arguments in this post are just concrete instantiations of new ways in which the scheming story might break):
[...] On the other hand, ascribing a model’s good performance in training to scheming continues to feel, at a gut level, like a fairly specific and conjunctive story to me. (page 21)
If you summoned a human-level mind just through end-to-end black box optimization from something like a recurrent neural net prior, I agree concerns around scheming and training gaming would be high. But actually, it’s quite common to have other kinds of pressures on intermediate reasoning - at least for AIs that are not widely superhuman.
Note: I think these considerations have been discussed in the past (but I haven’t found a link), it is just not discussed in Joe’s report - probably because the most salient AIs at that time it was written were transformers used by doing a big single forward pass.
Caveat: I think this argument is not very strong, as it depends a lot on what the future of capabilities will look like. But it is extremely relevant to explain why Deepseek r1 doesn’t scheme.
For current models, most of the bits learned by reasoning models come from pretraining. You can distill a reasoning model back into a base model with a few MBs of transcripts, but you can’t quickly distill back what a base model learned into the random initialization with much less information than its terabytes of model weights. This means models can’t learn that much during current RL, which means their reasoning is somewhat closely tied to how a human would solve a problem. MBs of supervision is enough to get a human-like schemer, but for reasoning models I don’t think it’s a surprise that you don’t get a random alien mind for which it would be overwhelmingly unlikely to get something that has human-like intermediate steps of reasoning every time it goes through the sampling bottleneck.
The forward pass of LLMs is very alien, but I don’t think it weakens the conclusion much because the powerful scheming that training selects for probably requires more serial steps than what is contained in the forward pass of current models.
Therefore, if future AIs look anything like Deepseek r1, I think the human prior on intermediate steps of reasoning is relevant: if RL encourages human-like scheming-ness, the model might become a schemer. If it encourages human-like benign task-focused agency, I would guess the model is somewhat unlikely to be scheming.
I think this human prior is somewhat unlikely to be easily removed. Optimizing over very long time horizons is difficult to do end-to-end. People did try to teach models how to use neuralese, but the results are very underwhelming (I think people might find a way to use it for cheaper reasoning in the next few years, but I don’t think it will make the reasoning very inhuman). And my understanding of the human brain is that almost no end-to-end optimization is done this way - everything we learn with our inner RL is very short-term; most of what we know how to do over long time horizons looks more like explicit reasoning extrapolated from short-horizon speech.
Prediction 3: Intermediate steps of reasoning are human-like and will remain human-like, such that you can create legibility bottlenecks every ~1000 steps of serial reasoning: if you decipher it to the best of your abilities, distill it, performance will remain high for all current models (p=0.8) and the first AI able to speed up safety research by 10x (p=0.5).
If, like Deepseek r1, the weights are shared between the part of the model tasked with producing the user facing answer and the scratchpad, it makes it more likely a priori that traits from the user facing answer leak into the reasoning. This seems even stronger for models like Deepseek V3.1 where the weights for non-reasoning and reasoning modes are the same. While the model does learn to behave differently between these 2 modes, it seems natural for the model to have a similar personality across both unless there is a strong optimization pressure against it.
Prediction 4: You can probably somewhat select against scheming by training the model to behave nicely in non-reasoning mode. In particular, it will take at least 1.5x more samples to train a model to be nice with the no-reasoning mode on while scheming when it is off than when only training it with the reasoning mode off, even when controlling for the model having to learn how to use non-reasoning mode. (p=0.8) I expect a similar transfer to user-facing niceness to scratchpad niceness when comparing it to a scenario where the user-facing output is neither nice nor evil (e.g. some short structured answer).
While this doesn’t rule out scheming if you strongly select for scheming, I think that like the argument on human-like priors, it puts some valuable bits of optimization pressure on benign intermediate steps of reasoning.
The existence of human prior pressures and short-long correlation suggests there might be other forms of intermediate pressures that might be very effective. I think the argument for such pressures not being massively uncompetitive mostly holds before it becomes intractable to follow AI’s AI R&D, but I expect that avoiding such fast recursive self-improvement won’t be uncompetitive by the time AIs speed up safety research by 10x.
I haven’t thought about it much, but I think this might be an interesting line of research. It might be the case that simple variations on the data and rewards used during RL / neuralese training makes the situation much better.
Most of RL is not spent learning what task to do, it is spent learning to be better at it. Deepseek r1 was probably trying as hard as it could to solve math problems 10 steps into its 10k steps RL. 99.9% of RL was spent learning what specific math reasoning strategies were the most effective. Most of the optimization pressure goes into learning these domain specific tricks that the model does not know about before RL starts.
This means that there is at most a small amount of optimization pressure towards scheming: above I said that only MBs of information are learned during RL (measured by the weight of transcripts needed), but the amount of bits needed to make the model “try hard and exploit existing knowledge” is probably at most a few KBs.
But maybe a few KBs of information are enough to point at scheming despite the strength of the human and speed priors? I don’t think so because the exact way in which to “try hard and exploit existing knowledge” matters a lot and current AIs don’t know in what ways they will be graded. They would have a hard time guessing in a few forward passes what outcomes/LLM-based rewards will be used. And so I would guess it is more efficient for SGD to just learn a particular persona that does the right kind of trying rather than learn a general persona that has the right knowledge and instrumental reasoning about what to do.
Prediction 5: I also don’t think the sample efficiency of training gaming is higher than the sample efficiency of regular learning when the model doesn’t already know the facts it needs to exploit. Adding “Please think about the process that generated the current supervised fine-tuning dataset and choose the tokens that this process would generate” to the system prompt of an SFT dataset won’t increase the sample efficiency of learning compared to other dummy prompts for AIs at and before 10x² speedup, regardless of the SFT dataset when the initialization loss is similar (p=0.9). I expect similar results when using RL, but only with realistic production-like initialization on realistic tasks, such that exploration of the no-training-gaming prompt isn’t toy-ishly weak.
I think the situation would be worse given much higher knowledge of what gets rewarded during training, in the same way that in the alignment faking paper the “latex” variation sometimes results in higher alignment faking. But I don’t think this will clearly be the case for future AIs and RL setups, or at least not in a way that wouldn’t also make the “benign persona that already exploits these rewards” also very salient.
Therefore, I don’t think that if RL remains similar to the one that was used for Deepseek r1, it will make the model more situationally aware and more likely to have situational-awareness-traits that seem correlated with scheming.
Even if MBs don’t push towards scheming specifically, they could push you away from the mostly benign and intuitive human prior on intermediate steps of reasoning, which could make the situation more dangerous and closer to the domain of validity of counting arguments.
But MBs is a small amount of information, and LLMs are mostly trained on next-token-prediction on human text (or AI-generated text generated by non-scheming AIs). Therefore, I expect them to perform best when not straying too far from that distribution.
This argument is not airtight, since it could be that in the very high dimensional space of text there might be certain directions that are LLM-friendly without being human-friendly, and the weird scratchpads of o3 are an example of that. But I think this phenomenon is weaker and less spooky than people think.
Prediction 6: LLMs don’t learn new languages during RL, and will always learn to use languages that non-RLed models can understand quite well. Claude 3.5 Sonnet, GPT-4o and Gemini 2 Pro will be able to understand the topics considered in 99%+ of Chain-of-Thoughts of all regular Transformers trained in 2025 and 2026 that were not deliberately trained or prompted to be harder to understand (p=0.7).
The situation could well be different in the future, if AI companies find ways to have way more alien training processes than the ones used to train deepseek, ones that require much more consequentialist reasoning using self-knowledge and knowledge about the weaknesses of humans.
There will be a pressure towards making more complex RL environments that interact more with the real world, but there will also be a pressure towards avoiding environments that AIs know how to game when you ask them to game them - since gaming often results in worse real-world results (there is some misalignment between AI companies and users on things like user retention, but I would guess exploiting user weaknesses to make them stay on a platform is mostly uncorrelated with the sort of scheming that misleads AI developers in scary ways).
I think the kind of environment that seems most scary to me right now is environments where some kind of “robust sycophancy” is encouraged, where the AI has to guess the AI developer intention/hopes about a particular outcome and has to not only superficially say things that align with the AI developer ideas, but also produce a set of carefully non-contradicting evidence that goes in the same direction and that addresses / tampers with the interrogations the AI developer could use to compute the reward. But I am unsure what these would look like in practice, and I am cautiously optimistic that it will be possible to avoid the kind of gaming that is most correlated with scheming.
The world is more complex than the simple models I had a few years ago. Complexity makes things less safe by default. But in a domain like AI where people are trying to build useful AIs that serve their interests and where we don’t have principled reasons to expect things to be safe in the simple end-to-end black-box-optimization setting, complexity can sometimes update us towards the situation being safer than one might have guessed - especially when this added complexity gives more options to AI developers.
Deepseek r1:
I think these facts are evidence that in this case, the added complexity of the real world makes the situation less scary wrt to scheming (though there are other facts about the world that make the situation more scary wrt scheming).
And this is despite Deepseek mostly optimizing for capabilities!
This makes me update down on our ability to predict in advance how likely scheming will be in pre-wildly-superintelligent AIs, and up on the possibility of finding prosaic approaches to building AIs that are unlikely to scheme (which becomes easier once we will get more realistic model organisms of scheming).
I think that analyzing “positive surprises” and extracting as much information from them as we can is at least as important as analyzing “negative surprises”, like warning shots and surprisingly bad model organisms - and that this community might be underinvesting in analyzing positive surprises.
But to be clear, even if I think the risk is slightly lower than I previously thought and that there are promising avenues to reducing risk, I still think this is an incredibly risky situation (because a 15% probability of scheming is high, because there are non-scheming risks before wildly superintelligent AIs, and because avoiding scheming before wildly superintelligent AIs is not a guarantee of finding ways to align wildly superintelligent AIs before it’s too late), and I support a prohibition on the development of superintelligence.
This is not the "with no countermeasures" scheming probability, this is my actual best guess, including the potential anti-scheming mitigations that people might use.