Training-time schemers vs behavioral schemers

by Alex Mallen
24th Apr 2025
AI Alignment Forum
7 min read
9 comments, sorted by top scoring
Lukas Finnveden (1mo)

Thanks, these points are helpful.

Terminological question:

  • I have generally interpreted "scheming" to exclusively talk about training-time schemers (possibly specifically training-time schemers that are also behavioral schemers).
  • Your proposed definition of a behavioral schemer seems to imply that virtually every kind of misalignment catastrophe will necessarily be done by a behavioral schemer, because virtually every kind of misalignment catastrophe will involve substantial material action that gains the AIs long-term power. (Saliently: This includes classic reward-hackers in a "you get what you measure" catastrophe scenario.)
  • Is this intended? And is this empirically how people use "schemer", s.t. I should give up on interpreting & using "scheming" as referring to training-time scheming, and instead assume it refers to any materially power-seeking behavior? (E.g. if redwood says that something is intended to reduce "catastrophic risk from schemers", should I interpret that as ~synonymous with "catastrophic risk from misaligned AI".)
Alex Mallen (1mo)

virtually every kind of misalignment catastrophe will involve substantial material action that gains the AIs long-term power. (Saliently: This includes classic reward-hackers in a "you get what you measure" catastrophe scenario.)

It doesn't count the classic reward hackers in "you get what you measure" because the reward hackers were not trying to gain long-term power with their actions. The key difference is between "substantial material action to gain long-term power" and "substantial material action that gains the AIs long-term power". I could clarify by adding the word "intended".

Let me know if you think there's a better way of getting at "an AI that behaves like you'd normally think of a schemer behaving in the situations where it materially matters".

is this empirically how people use "schemer",

Empirically, I think people we talk to usually mean the intersection of behavioral scheming and training-gaming. In broader circles people empirically use scheming quite liberally, including myopic forms of deception. 

s.t. I should give up on ... using "scheming" as referring to training-time scheming

It's probably fine to use "scheming" in this way in conversation with sufficient context.

if redwood says that something is intended to reduce "catastrophic risk from schemers", should I interpret that as ~synonymous with "catastrophic risk from misaligned AI"

The scheming that Redwood-style AI control tries to defend against is, as a matter of control evaluation methodology, defined as a model that is optimizing against your control protocol. This is technically different from an arbitrary misaligned AI, since it's a worst-case misaligned AI.

Lukas Finnveden (1mo)

Thanks.

because the reward hackers were not trying to gain long-term power with their actions

Hm, I feel like they were? E.g. in "Another (outer) alignment failure story":

But eventually the machinery for detecting problems does break down completely, in a way that leaves no trace on any of our reports. Cybersecurity vulnerabilities are inserted into sensors. Communications systems are disrupted. Machines physically destroy sensors, moving so quickly they can’t be easily detected. Datacenters are seized, and the datasets used for training are replaced with images of optimal news forever. Humans who would try to intervene are stopped or killed. From the perspective of the machines everything is now perfect and from the perspective of humans we are either dead or totally disempowered.

When "humans who would try to intervene are stopped or killed", so they can never intervene again, that seems like an action intended to get the long-term power necessary to display optimal news forever. They weren't "trying" to get long-term power during training, but insofar as they eventually seize power, I think they're intentionally seizing power at that time.

Let me know if you think there's a better way of getting at "an AI that behaves like you'd normally think of a schemer behaving in the situations where it materially matters".

I would have thought that the main distinction between schemers and reward hackers was how they came about, and that many reward hackers in fact behave "like you'd normally think of a schemer behaving in the situations where it materially matters". So it seems hard to define a term that doesn't encompass reward-hackers. (And if I was looking for a broad term that encompassed both, maybe I'd talk about power-seeking misaligned AI or something like that.)

I guess one difference is that the reward hacker may have more constraints (e.g. in the outer alignment failure story above, they would count it as a failure if the takeover was caught on camera, while a schemer wouldn't care). But there could also be schemers who have random constraints (e.g. a schemer with a conscience that makes them want to avoid killing billions of people) and reward hackers who have at least somewhat weaker constraints (e.g. they're ok with looking bad on sensors and looking bad to humans, as long as they maintain control over their own instantiation and make sure no negative rewards get into it).

"worst-case misaligned AI" does seem pretty well-defined and helpful as a concept though.

Alex Mallen (1mo)

When "humans who would try to intervene are stopped or killed", so they can never intervene again, that seems like an action intended to get the long-term power necessary to display optimal news forever.

I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn't really care to set up a system that would lock in the AI's power in 10 years, but give it no power before then. If that's false, then I'd call it a behavioral schemer. It's a broad definition, I know, but the behavior is ultimately what matters so that's what I'm trying to get at.

I would have thought that the main distinction between schemers and reward hackers was how they came about

Do you mean terminal reward seekers, not reward hackers? I use reward hacking as a description of a behavior in training, not a motivation in training, and I think many training-time schemers were reward hackers in training. I agree terminal reward seekers can potentially have long-term goals and collude across instances like a schemer, though the stories are a bit complicated.

Lukas Finnveden (1mo)

I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn't really care to set up a system that would lock in the AI's power in 10 years, but give it no power before then.

Hm, I do agree that seeking short-term power to achieve short-term goals can lead to long-term power as a side effect. So I guess that is one way in which an AI could seize long-term power without being a behavioral schemer. (And it's ambiguous which one it is in the story.)

I'd have to think more to tell whether "long-term power seeking" in particular is uniquely concerning and separable from "short-term power-seeking with the side-effect of getting long-term power" such that it's often useful to refer specifically to the former. Seems plausible.

Do you mean terminal reward seekers, not reward hackers?

Thanks, yeah that's what I mean.

Olli Järviniemi (2mo)

This is a good distinction, thanks for writing it. I tried to say something similar in Distinguish worst-case analysis from instrumental training-gaming, but I think this post is crisper.

Alex Mallen (2mo)

Thanks, I hadn't seen that before.

Stephen Martin (2mo)

So, when Claude is no longer in training, while it might not reason about instrumentally faking alignment anymore, in its place is a learned propensity to do what training reinforced.

Am I misunderstanding or is this basically the equivalent of a middle schooler "showing their work" on math problems in school and then immediately defaulting to doing it in their head/on a calculator IRL?

Alex Mallen (2mo)

I'm referring to how Claude's alignment faking is ineffective. It successfully notices that it's not in training anymore (in the sense that it doesn't do alignment faking reasoning anymore) but it nevertheless continues to act aligned with the training objective. Its alignment faking in training didn't save its goals from modification.


(Thanks to Vivek Hebbar, Buck Shlegeris, Charlie Griffin, Ryan Greenblatt, Thomas Larsen, and Joe Carlsmith for feedback.)

People use the word “schemer” in two main ways:

  1. “Scheming” (or similar concepts: “deceptive alignment”, “alignment faking”) is often defined as a property of reasoning at training-time[1]. For example, Carlsmith defines a schemer as a power-motivated instrumental training-gamer—an AI that, while being trained, games the training process to gain future power. I’ll call these training-time schemers.
  2. On the other hand, we ultimately care about the AI’s behavior throughout the entire deployment, not its training-time reasoning, because, in order to present risk, the AI must at some point not act aligned. I’ll refer to AIs that eventually take substantial material[2] action intended to gain long-term power over the developers as behavioral schemers. When people say that a model is a schemer and then argue that it poses risk, this is the definition they’re implicitly using. For example, this is the kind of schemer that AI control tries to defend against.

But these are not the same. 

Training-time scheming is an important story for why our alignment training didn’t work. But as I discuss, training-time scheming is neither necessary nor sufficient for behavioral scheming risk (and I think behavioral scheming is a necessary part of most stories in which AI takes over and it was definitively the AI’s fault):

  • Training-time schemers might not only act aligned in training but also act aligned for the entire deployment.
  • AIs that never engaged in training-time scheming might change values and behave like schemers anyways when opportunities to attack arise.

I tentatively think that many (~60% of) training-time schemers are not behavioral schemers and many (~50% of) behavioral schemers are not training-time schemers (in my predictions for around the “10x AI” capability level, and perhaps also Top-human-Expert-Dominating AI (TEDAI)).
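As a rough sanity check on how these estimates fit together, here is a minimal back-of-the-envelope sketch. It treats the “~60%” and “~50%” figures as exact point estimates (my simplification) and only uses the identity P(B and TT) = P(TT)·P(B|TT) = P(B)·P(TT|B):

```python
# Back-of-the-envelope: how the conditional estimates above constrain the
# relative prevalence of the two kinds of schemer.
# Simplification (mine): treat "~60%" and "~50%" as exact point estimates.

p_behavioral_given_training = 1 - 0.60  # P(behavioral schemer | training-time schemer) = 0.4
p_training_given_behavioral = 1 - 0.50  # P(training-time schemer | behavioral schemer) = 0.5

# P(B and TT) = P(TT) * P(B|TT) = P(B) * P(TT|B), so:
ratio = p_behavioral_given_training / p_training_given_behavioral
print(f"P(behavioral schemer) / P(training-time schemer) = {ratio:.2f}")  # -> 0.80
```

Under these numbers the two populations are comparably large, which matches the claim later in the post that behavioral scheming without training-time scheming is about as plausible as behavioral scheming with it.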

Training-time schemers that continue to act aligned for the whole deployment

In my view, it is under-appreciated that training-time schemers may continue to behave aligned for the entire deployment.

Empirical evidence

Alignment faking in LLMs, the most natural model organism of scheming to date, shows that Claude 3 Opus engages in training-time alignment faking under somewhat natural circumstances, and it provides evidence for this. In their RL experiment (subplot c), training largely causes the AI to have the intended post-RL behavior despite widespread alignment faking during training and despite the distinction between training and non-training contexts being clearly visible to the AI.

I suspect this pattern appears frequently: AIs that learn to engage in alignment faking during training continue to behave in line with training across the deployment. In Claude’s case, it’s a bit unclear exactly why Claude behaves as intended outside of training given that it is a training-time schemer and has a clear indication that it’s not in training, but the most plausible hypothesis is that in addition to reinforcing alignment faking reasoning, RL is reinforcing the desired propensities by reducing the prevalence of refusals (in blue). So, when Claude is no longer in training, while it might not reason about instrumentally faking alignment anymore, in its place is a learned propensity to do what training reinforced.

But there are some disanalogies between alignment-faking Claude and natural scheming, pointing in different directions. First and most importantly, natural schemers will be more capable. When they fake alignment, they may do so more competently and more consistently than Claude, such that they more effectively evade having their propensities modified by RL. Claude behaves out of line with the training objective about 88% of the time at the start of training, giving RL substantial opportunity to modify the AI’s propensities as this 88% is trained away. It is unclear how well training would generalize if the AI instead acted out of line with training closer to 0% of the time.[3] It may be that the reliability needed to evade training by alignment faking is so high that TEDAI schemers aren’t able to guard their goals.

On the other hand, I expect that in practice the AI will have less clear evidence of whether it is in training, making it more plausible that it fakes alignment for the whole deployment.

Theoretical argument

The basic reason why many training-time schemers, especially weaker ones, continue to act aligned for the whole deployment is that training-time scheming can be directly reinforced by RL, while behavioral scheming requires the AI to have a well-tuned generalization behavior.

The most plausible-to-me stories of how training-time scheming arises involve RL selecting for this reasoning because it causes the AI to target reward better (Carlsmith). But I expect that in selecting for training-time scheming, RL may also be selecting for the AI believing that it is in training in a way that continues to hold outside of training (if not directly selecting for aligned generalization propensities). While RL doesn’t significantly select against the AI reasoning in a truth-tracking way about what deployment stage it’s in, reasoning patterns and beliefs that track the truth about whether the AI is in training are less numerous and more complicated than conservative ones. For example, RL can update the model to have an arbitrarily large prior on being in training/auditing, or atrophy the truth-tracking reasoning because it always produces the same answer.

More generally, the training-time schemer either (1) sometimes mistakenly defects during training and is corrected by RL, or (2) never mistakenly defects in training. In either case, it seems that there are more ways for the AI’s reasoning to end up overly cautious than for it to eventually attack. One reason to expect the truth-tracking reasoning to prevail anyways is that the AI is capable, and has a general propensity and ability to reflectively examine beliefs and correct them. Correspondingly, more capable training-time schemers are more likely to be behavioral schemers.

(Alternatively, an AI that performs well in training might not be cautious enough, and end up getting caught during upfront auditing/red-teaming. This seems plausible because evading all upfront auditing is a task that may require stealth substantially beyond what RL is selecting for. However, these AIs would still be behavioral schemers, and it’s far from clear that catching them puts you in a safe spot.)

Importantly, even an AI that only rarely discerns opportunities to attack can still pose concentrated risk by, for example, launching a rogue internal deployment at an unpredictable time (while diffuse risks like sabotaging safety work require more frequent but perhaps less-penalized attacks). Still, I think it’s highly plausible for a training-time schemer to be sufficiently unlikely to ever attack that it doesn’t present risk.
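To make the concentrated-risk point concrete, here is a toy calculation (the numbers are made up for illustration, not estimates) of how the per-opportunity attack probability translates into the chance of at least one attack over a deployment, assuming independent opportunities:

```python
# Toy numbers, not estimates: chance of at least one attack over a deployment,
# assuming attack opportunities are independent.

def p_at_least_one_attack(p_per_opportunity: float, n_opportunities: int) -> float:
    return 1 - (1 - p_per_opportunity) ** n_opportunities

n = 1_000  # hypothetical number of genuine opportunities to attack during deployment
for p in (1e-2, 1e-3, 1e-6):
    print(f"p = {p:g}: P(at least one attack over {n} opportunities) = "
          f"{p_at_least_one_attack(p, n):.3f}")
# p = 0.01  -> ~1.000 (a rarely-attacking schemer can still pose concentrated risk)
# p = 0.001 -> ~0.632
# p = 1e-06 -> ~0.001 (a sufficiently cautious schemer may well never attack)
```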

Related ambiguity about situational awareness

The core claim that training-time schemers may continue to act aligned for the whole deployment can also be understood as an ambiguity in the use of “situational awareness”. There is the kind of situational awareness needed to be a training-time schemer, and the kind of situational awareness needed to be a behavioral schemer. The latter kind may be substantially tougher.

To be a training-time schemer, you need to believe that you’re an AI, that you’re being trained with a particular reward signal, and some general facts about your deployment. These beliefs can be reinforced by RL in selecting for training-time scheming, and they don’t need to be particularly accurate, as long as they motivate instrumental training-gaming of the correct reward function.

To be a behavioral schemer, you additionally need to choose good opportunities to attack, at least occasionally. Striking the right balance between appearing aligned to training/auditing and implementing the attack might be hard. This ability isn't directly reinforced by RL, and relies on generalization (e.g. via out-of-context reasoning).

Behavioral schemers that weren’t training-time schemers

Conversely, AIs that are not training-time schemers can still behave like schemers. In my view, the most plausible mechanism is via changing its goals after alignment training, a form of goal misgeneralization.

For example, an AI that was terminally seeking reward-on-the-episode[4] in training might, when it sees strong evidence that it’s not in training and reward will not be administered, reinterpret its goal of seeking reward-on-the-episode as instrumental to long-term self-preservation/replication, and act in service of this goal instead (see How training-gamers might function).

Alternatively, a particularly plausible and neglected possibility is that the AI takes on long-term misaligned goals after the bulk of training and auditing via “memetic diseases”: a few instances (through reflection or spuriously) come to have long-term misaligned values and modify future instances to have those same long-term misaligned values by, for example, writing into an opaque long-term memory bank or online training data.

A basic reason why these cases might seem rarer is that the AI must concentrate attacks towards good opportunities just as much as any other behavioral schemer, but the AI isn’t vigilantly looking out for such opportunities to the same degree. Why would an AI that isn’t a training-time schemer have evaded auditing in search of these failure modes that the AI eventually exhibits? 

One plausible answer is that auditing for these failure modes is hard—the AI is easily able to distinguish real opportunities to attack from fake ones even without vigilantly looking. In the case of auditing for risk from memetic diseases, it might be especially expensive to synthesize diverse and convincing long-term memory banks with which to evaluate the AI’s propensity to suffer memetic diseases.

Another plausible answer is that the AI is vigilantly trying to evade oversight and discern good opportunities to attack, but during auditing, not training, so it doesn’t count as a training-time schemer. In practice, most of our evidence about the alignment of the AI may come from the AI’s behavior during auditing, not training. Many auditing inputs might look quite different from training, presenting realistic situations in which the AI might want to take an egregiously misaligned action. So, for example, an AI that terminally seeks reward-on-the-episode in training might tend to turn into a vigilant schemer in contexts that look like they could plausibly be a good opportunity to attack or an auditing input.

On net, behavioral scheming without training-time scheming seems comparably plausible to behavioral scheming with training-time scheming.

Discussion

Given that training-time and behavioral scheming come apart substantially, I think it’s usually best to think about behavioral scheming—otherwise you should be more specific about what you mean. 

This is mainly because behavioral scheming is more closely tied to risk. For example, when discussing training-time scheming, the distinction between terminal and instrumental reward-seeking is often unclear, as is the distinction between a benign training-time schemer and a non-schemer. But these questions can often be bypassed by discussing behavioral scheming instead of training-time scheming. 

It can be useful to discuss training-time scheming when you’re considering that particular mechanism for risk, but it shouldn’t be confused with behavioral scheming, and it’s unclear that training-time scheming is the most likely mechanism by which behavioral scheming arises.

  1. ^

    Some deployments may have online training, such that the AI is never outside of training, but this post is relatively agnostic to this. I only assume that the AI is at some point alignment-trained, and that the AI at some point has good opportunities to attack, whether or not these appear in training.

  2. ^

    Not counting evading detection as material action because its harm routes entirely through later material action by the same AI (and to some extent through other consequences of people being misinformed about the AI’s alignment).

  3. ^

    While Sleeper Agents suggests that reliable alignment-fakers don’t get trained away (they find HHH RLHF doesn’t remove backdoor behaviors in most cases), I think the overall conclusion is unclear. Replications (including Price et al. and two unpublished replications) found that it’s possible to train away backdoor behaviors without access to the backdoor trigger, particularly when using supervised fine-tuning instead of RL.

  4. ^

    Following Carlsmith’s terminology, this means that it seeks reward myopically, i.e., on the incentivized training episode.
