As we develop new techniques for detecting deceptive alignment, ranging from action monitoring to Chain-of-Thought (CoT) and activation monitoring, we face a dilemma: once we detect scheming behaviour or intent, should we use that signal to "train the scheming out"?
On the one hand, leaving known misaligned behaviour or intent in the model is not marginally informative (we have already learnt what the detection can teach us) and is possibly unsafe. On the other hand, training against a monitor might not actually fix the model's underlying motivations; it might simply provide selection pressure that favours more sophisticated, less detectable forms of deception.
This post outlines a simple framework formalising when the generalisation benefit of training outweighs the selection risk of creating a schemer. I assume that the choice is between training and not training on the incriminating examples: we cannot apply the updates in a more targeted way, and we don’t have the option to just look at the evidence and use it to make changes to the training process.[1]
The selection intuition
When we train a model to satisfy a specific criterion, we can view the process as starting with a prior distribution over models we might get (determined by the architecture, initialisation, base model pre-training, etc) and then restricting that set to only the models that satisfy the training criterion.
In our case, suppose we have a scheming monitor $\mu$, and consider a simplified hypothesis space where the potential models we might end up with fall into three categories:

- $M_A$: aligned models, which don't scheme at all;
- $M_W$: weak schemers, whose scheming the monitor $\mu$ catches;
- $M_S$: strong schemers, whose scheming the monitor $\mu$ does not catch.

Denote their respective prior probabilities $P(M_A)$, $P(M_W)$, and $P(M_S)$.
Now, suppose we apply training pressure $T_\mu$ that penalises any model caught behaving in a misaligned or deceptive way by the monitor $\mu$. Effectively, we are pruning $M_W$ from our posterior. If we assume the training procedure is a perfect filter for $M_W$ but does nothing else, the new posterior for a strong schemer would be
$$P(M_S \mid T_\mu) = \frac{P(M_S)}{P(M_A) + P(M_S)}.$$
Intuitively: we are assuming that the training does not favour either the aligned models or the strong schemers, and so the chance of getting one or the other is proportional to the prior.
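To make the pruning concrete, here is a minimal Python sketch of the "perfect filter" update: zero out the weak schemers, then renormalise over the remaining hypotheses. The function name and the prior values are purely illustrative (the numbers are chosen to roughly match the worked example later in the post).

```python
def posterior_after_filter(p_aligned: float, p_weak: float, p_strong: float) -> dict:
    """Posterior over model classes after training that perfectly filters out
    weak schemers (M_W) but does nothing else: prune M_W, then renormalise."""
    assert abs(p_aligned + p_weak + p_strong - 1.0) < 1e-9, "priors must sum to 1"
    remaining = p_aligned + p_strong  # probability mass left once M_W is pruned
    return {
        "M_A": p_aligned / remaining,  # aligned models
        "M_W": 0.0,                    # weak schemers are filtered out
        "M_S": p_strong / remaining,   # strong schemers
    }

# Placeholder priors, roughly in line with the worked example later in the post:
print(posterior_after_filter(p_aligned=0.3, p_weak=0.6, p_strong=0.1))
# -> {'M_A': 0.75, 'M_W': 0.0, 'M_S': 0.25}: both M_A and M_S become more likely.
```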
Under this formalisation, training is a double-edged sword: it increases the probability of getting an aligned model, but it also increases the absolute probability of getting a strong schemer. From an x-risk perspective, this means we should approximately never train against scheming monitors.
Modelling generalisation: the R coefficient
However, the above formalisation leaves out a key hope — what if the training generalises? What if training the model on a broad enough distribution makes it more likely we end up with an aligned model rather than a schemer?
Let's define $Q(T_\mu)$ as the ratio of aligned models to strong schemers after training against the monitor $\mu$:
$$Q(T_\mu) = \frac{P(M_A \mid T_\mu)}{P(M_S \mid T_\mu)}.$$
Previously we assumed that $Q$ stays the same before and after training. Now, let’s define the generalisation coefficient $R(T_\mu)$:
$$R(T_\mu) = \frac{Q(T_\mu)}{Q(\text{no training})}.$$
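Spelling this out (and assuming the three categories exhaust the hypothesis space, so that $P(M_A) + P(M_W) + P(M_S) = 1$): since training prunes $M_W$, we have $P(M_A \mid T_\mu) + P(M_S \mid T_\mu) = 1$ and $Q(\text{no training}) = P(M_A)/P(M_S)$, which gives

$$P(M_S \mid T_\mu) = \frac{1}{1 + Q(T_\mu)} = \frac{P(M_S)}{R(T_\mu)\,P(M_A) + P(M_S)},$$

reducing to the earlier expression when $R(T_\mu) = 1$.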
When should we train against a scheming monitor?
When the generalisation coefficient $R > 1$, the training favours aligned models compared to schemers, but only in relative terms. In absolute terms, we are still selecting for the union $\{M_A, M_S\}$ (aligned or strong scheming). For training to be worth it, the generalisation effect must be strong enough to overcome this selection pressure.
Specifically, it is easy to derive that the absolute probability of a strong schemer decreases or stays the same, $P(M_S \mid T_\mu) \le P(M_S)$, if and only if:
$$R \ge 1 + \frac{P(M_W)}{P(M_A)}.$$
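One way to see this, using the expression for $P(M_S \mid T_\mu)$ above and assuming $P(M_A) > 0$ and $P(M_S) > 0$:

$$\begin{aligned}
P(M_S \mid T_\mu) \le P(M_S)
&\iff \frac{P(M_S)}{R\,P(M_A) + P(M_S)} \le P(M_S) \\
&\iff R\,P(M_A) + P(M_S) \ge 1 \\
&\iff R \ge \frac{1 - P(M_S)}{P(M_A)} = \frac{P(M_A) + P(M_W)}{P(M_A)} = 1 + \frac{P(M_W)}{P(M_A)}.
\end{aligned}$$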
Intuitively, the training prunes weak schemers from the hypothesis space, which frees up $P(M_W)$ worth of probability mass. To ensure we are actually improving safety, the generalisation coefficient $R$ must amplify the aligned models $M_A$ enough to take up all of the newly available probability mass. Otherwise some of the initialisations that would have ended up as weak schemers end up as strong schemers instead.
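For readers who prefer code, here is the same decision rule as a small Python sketch (a minimal illustration of the formulas above, with illustrative function names, not a tool referenced in the post):

```python
def strong_schemer_posterior(p_aligned: float, p_strong: float, r: float) -> float:
    """P(M_S | T_mu): probability of a strong schemer after training against the
    monitor, given generalisation coefficient R (R = 1 means no generalisation)."""
    return p_strong / (r * p_aligned + p_strong)

def training_threshold(p_aligned: float, p_weak: float) -> float:
    """Minimum R for which training does not increase P(M_S): 1 + P(M_W)/P(M_A)."""
    return 1.0 + p_weak / p_aligned

def should_train(p_aligned: float, p_weak: float, r: float) -> bool:
    """True iff the expected generalisation is strong enough that training against
    the monitor does not increase the probability of a strong schemer."""
    return r >= training_threshold(p_aligned, p_weak)
```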
Example application: should we train away scheming thoughts?
To use the formula above, we need to make some assumptions and run some evals. Suppose, for instance, that they suggest roughly the following:

- the prior probability of getting a model whose scheming thoughts the monitor catches is $P(M_W) \approx 0.6$;
- the prior probability of getting an aligned model is $P(M_A) \approx 0.3$.
So overall we might think it’s fine to train against the scheming thought monitor if
$$R(T_\mu) \ge 1 + \frac{0.6}{0.3} = 3.$$
But we have little evidence[2] to think our training would be that robust, so we abstain from training.
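For what it's worth, plugging these numbers into the sketch above, and taking $P(M_S) = 0.1$ so that the three priors sum to 1 (an assumption; the example only fixes $P(M_W)$ and $P(M_A)$):

```python
# Reusing the functions above with P(M_A) = 0.3, P(M_W) = 0.6, P(M_S) = 0.1:
print(training_threshold(p_aligned=0.3, p_weak=0.6))   # 3.0
print(should_train(p_aligned=0.3, p_weak=0.6, r=2.0))  # False: R = 2 is below the threshold
print(strong_schemer_posterior(p_aligned=0.3, p_strong=0.1, r=2.0))
# ~0.143 > 0.1, i.e. with R = 2 the training would increase P(M_S).
```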
The more monitoring positives, the more cautious we should be about training
This follows directly from the formula, but is still counterintuitive. After all, if we have many examples of unwanted behaviour, isn’t the training going to be more effective at specifying what we mean by aligned / non-deceptive? Yes, but we are also indirectly specifying in more detail which failures are detectable. It also matters a ton where the scheming examples come from. If they all come from a single checkpoint, plausibly this all works out in favour of alignment. But if the scheming examples come from many different checkpoints, i.e. $P(M_W)$ is high, then chances are some of those checkpoints would turn into strong schemers under training pressure, even taking into account sample diversity.
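A quick numeric illustration of the first point, reusing `training_threshold` from the sketch above with arbitrary numbers: holding $P(M_A)$ fixed, the required generalisation grows with the prevalence of detectable scheming.

```python
# Holding P(M_A) = 0.3 fixed, the required generalisation grows with P(M_W),
# i.e. with how often detectable scheming shows up across checkpoints:
for p_weak in (0.1, 0.3, 0.6):
    print(p_weak, training_threshold(p_aligned=0.3, p_weak=p_weak))
# 0.1 -> ~1.33, 0.3 -> 2.0, 0.6 -> 3.0
```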
Considerations not captured by the formula
Even if we suspect $R \ge 1 + P(M_W)/P(M_A)$, we might still refrain from training against the monitor if the monitoring signal could be used for other, more risk-reducing interventions (the risk reduction would need to offset the increased risk of deploying a strong schemer, though):
(A non-reason to refrain from training:
Conversely, we might choose to train against a monitor even if we suspect $R < 1 + P(M_W)/P(M_A)$, for reasons such as:
Conclusion
Training against a scheming monitor prunes weak schemers, by default making both aligned models and strong schemers more likely. The higher the frequency (across checkpoints) of detectable scheming, the larger the selection pressure towards both alignment and strong scheming. Unless the training sufficiently favours aligned models over strong schemers, it is likely making things worse.
Many thanks to David Lindner, Roland Zimmermann, Zac Kenton, Rif Saurous, and Seb Farquhar for helpful feedback. Any remaining mistakes are mine.
[1] This would ideally be the first thing we do, but for the purposes of this post suppose we've already done that and/or we are running out of time.
[2] How might we estimate $R$? In principle, we could build a bunch of monitors we think are uncorrelated, train against some and eval against others. This could give us some idea, but we'd still have no guarantee that the generalisation we see in that setting is representative of generalisation to distributions we didn't measure.
Ultimately, it's fair to say that this framework reduces a vibe-check about "should we train against this monitor?" to a vibe-check about "how well will this generalise?" (which I claim is progress).