Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

One of the biggest reasons alignment might be hard is what I’ll call threat obfuscation: various dynamics that might make it hard to measure/notice cases where an AI system has problematic misalignment (even when the AI system is in a controlled environment and one is looking for signs of misalignment). I write more about this general fear in AI Safety Seems Hard to Measure.

I think it’s very important to understand threat obfuscation dynamics; to try to design ways of assessing danger that are designed with threat obfuscation dynamics in mind; and to avoid interpreting reassuring-seeming evidence too optimistically, by keeping in mind the various threat obfuscation dynamics that could affect it.

I’ve seen three noticeably different threat obfuscation dynamics discussed, in ways that often leave me confused about exactly what dynamic someone is talking about, and when. This post aims to clear that up, based on a conversation with Paul Christiano.

The three dynamics are:

  • Simple “training game” dynamics: an AI system appears to behave helpfully, rather than attempting deception/manipulation/disempowerment of humans, because (a) it is good at predicting whether such attempts will be caught; (b) it is aiming for good in-episode performance (in terms of loss or reward); (c) such attempts would result in poor in-episode performance if caught. But if an AI system had a chance to do deception/manipulation/disempowerment that would not be detected, and would result in “good performance” within an episode, it would do so (this could include large-scale world takeover).
    • These dynamics are dangerous because we might observe dangerous behavior, train AIs to behave less dangerously, then observe less dangerous behavior, and conclude that we’ve become more safe (when in fact we’ve trained the AIs to be more effective at deception).
    • But we can still detect dangerous behavior in cases where it would not result in poor in-episode performance. For example, we might decide that when an AI system in the wild hacks its way to good in-episode performance, and is caught after the fact, we will at that point act as if the performance was actually good/intended (for example, no gradient descent against the behavior). Or we might deliberately conduct tests in which AI systems have opportunities to take over digital environments, obtaining high reward (no catch, not a trap, no attempt to "train out" the behavior when it happens).1 In these cases, situationally aware AI systems might “show their true colors” with fairly straightforward attempts to draw them out.
  • Deceptive alignment: an AI system forms some aim (via inner misalignment) other than good in-episode performance. (This could include something like “an AI system that wants to maximize digital representations of reward across the galaxy, and is willing to get poor in-episode performance as part of a longer-term mission for that to happen.”) It then gets good in-episode performance as part of a longer-term scheme to accomplish its other aim.
    • In this case, the above techniques for getting an AI to “show its true colors” might2 fail.
    • However, we could still get an AI to engage in specific behaviors that might be helpful to us and harmful to its goals, by specifically training the AI to behave in these ways. For example, we might want an AI to find security vulnerabilities that we can then patch. Even if doing so would undermine the AI’s longer-term aim, we can use gradient descent to put a lot of pressure on it to find security vulnerabilities anyway. (Related: the “no sandbagging on checkable tasks” hypothesis)
  • Gradient hacking: this also requires that an AI system has some aim other than good in-episode performance. Gradient hacking is a way for an AI system to specifically “defend itself” against being trained to do things counter to its goals. It might look something like: “An AI system checks its own policy against some reference policy that is good for its goals; the greater the divergence, the more it sabotages its own performance, with the result that gradient descent has trouble getting its policy to diverge from the reference policy.”

Each of these seems more devastating to alignment research than the previous one - and less likely to emerge in early transformative AI systems. I consider the “training game” very likely, and quite devastating; but it leaves us many possible interventions that the less likely “deceptive alignment” and “gradient hacking” don’t.

That was most of what I wanted to cover, but the rest of this post will give some quick notes on (a) why the more severe forms of threat obfuscation seem relatively less likely; (b) how bad things look if we only have to deal with the first one (“training game”).

Training game seems very likely by default, deceptive alignment much less so, gradient hacking still less so

This is pretty superficial and mostly based on a couple of conversations with Paul Christiano. I’m hoping he’ll write a better version of this at some point.

The “training game” mode of threat obfuscation seems like it is very likely by default, because a normal AI training process will probably directly incentivize it. Errors in labeling will directly favor the training game over intended generalization; even with perfect labeling, there’s no particular reason to think that the training game is disfavored compared to the intended generalization (if anything it vaguely/intuitively seems simpler - “do that which gets treated as intended behavior” as opposed to “do that which was intended.”)

By contrast, deceptive alignment seems to face a bigger hurdle. For example, if you train an AI with positive reinforcement for getting cookies:

  • A deceptively aligned AI has to, every time it’s deciding how to get cookies, go through a “thought process” like: “I am aiming for [thing other than cookies]. But humans want me to get cookies. And humans are in control right now [if not I need to behave differently]. Therefore I should get cookies.”
  • Contrast this with an AI that just responds to rewards for cookies by going through this thought process: “I’m trying to get cookies.”
  • The former “thought process” could be noticeably more expensive (compute wise or whatever), in which case heavy optimization would push against it. (I think this is plausible, though I’m less than convinced; the former thought process doesn’t seem like it is necessarily much more expensive conditional on already having a situationally aware agent that thinks about the big picture a lot.)
  • An additional issue is that deceptive alignment only happens if you get inner misalignment resulting in an AI with some nonindexical “aim” other than in-episode reward. This could happen but it’s another conjunct.
  • Paul thinks it’s <50% likely (clarification in footnote3) that deceptive alignment is going to be a major issue by default (at least if people make reasonable efforts to avoid the sorts of dynamics discussed in the next section), and other folks whose judgment I’d put a lot of weight on (due to seeming fluent with the concern about deceptive alignment, while also experienced and/or accomplished with ML) seem to generally be in the same camp.

Gradient hacking intuitively seems harder still. The preconditions for it seem to be something like “the preconditions for deceptive alignment, plus the AI figuring out some effective maneuver to execute with the design of its own brain.”

How bad is the situation if straightforward training leads to the training game, but not to deceptive alignment or gradient hacking?

At first glance, this situation doesn’t seem too bad in terms of threat obfuscation. It basically means that measuring an AI’s dangerousness is as simple as “Put it in an environment where doing malicious things is good in-episode performance. A possible example: if it does malicious things to gain control of its environment, simply let it control its environment for a while and don’t do any of the things one would normally do for unintended behavior (such as trying to use gradient descent to correct against it).” This could result in a dangerous, situationally aware AI (e.g., one that has learned a pattern "Do malicious things iff they won't be caught and trained against") going ahead and doing malicious things while being tested, with no particular reason to conceal them.

However, it’s pretty easy for this to go off the rails if people aren’t careful about it. I’d be concerned about a dynamic like this:

  • AIs frequently have opportunities to behave maliciously and gain resources within an episode. These opportunities look good on their reward models. They take these opportunities.
  • Humans respond by trying to train out this sort of behavior - e.g., finding incidents like this via audits and news stories, and going back and doing major gradient updates against the corresponding behaviors/reward models, long after the fact.
  • In so doing, they could effectively be training AIs to build longer-term, more careful models of human reactions to their behavior. For a situationally aware AI, training like this could lead to behavior like “Try to accumulate resources and do things associated with high reward, unless humans might catch it later via an audit; hence, look for opportunities to avoid or prevent or manipulate eventual audits.”
  • This sort of behavior could end up looking a lot like deceptive alignment: optimizing heavily around factors and events well outside the “episode” as traditionally defined (you could also think of this as effectively extending the “episode” to encompass a very long time period during which humans might do audits and respond to what they see).

To avoid this, we need to be careful with how we handle unexpected incidents of AIs’ behaving maliciously. It won’t always be the right move to straightforwardly “train against” such incidents; it may sometimes be important to do something more like “starting over,” e.g. insisting on a more principled way4 of making AIs less prone to malicious behavior.

This situation seems scary:

  • We won’t necessarily be able to come up with a principled (and competitive) way of driving out malicious behavior.
  • At least some AI developers (and maybe many) will be doing non-principled things (e.g., the problematic “train out the malicious behavior” thing above); saying “Behold, our systems are safe!”; and wanting to deploy them.
  • The line between principled and non-principled things doesn’t feel super clear.

But this situation still seems much better than having to deal with deceptive alignment: we do have a tangible, non-interpretability-dependent way of detecting danger if we can be disciplined about it! In particular, a successful “evals” project looks very tractable in this world - you’re basically running the “make rewards available for bad behavior” test, and then auditing training for signs that it’s been explicitly trained against.

I'm not saying this situation is the only one we should plan around. But it is an example of how relatively good luck could put us in a position where some theoretically very difficult aspects of alignment end up being less difficult in reality.


  1. The “catch” would be that we would want to avoid deploying similar AI systems in other contexts. But this is not a concern for the “simple training game” AI that is focused on obtaining good in-episode performance. It is a concern for an AI that has a goal other than good in-episode performance. 

  2. Though some versions might work - see "Providing unusual incentives ..." here

  3. “Without thinking carefully I’d guess there is a 25% chance that deceptive alignment is a major challenge that arises early enough that we need to deal with it ourselves. Then another 25% that it’s a challenge that someone needs to deal with some day, but is too far in the future to be worth thinking about now e.g. because AI systems will have obsoleted all of our thinking before they are deceptively aligned. And another 25% chance that deceptive alignment just never arises in the real world if you are careful and take the kind of simple precautions that we already understand. And a final 25% where the entire concern was misguided and it never happens in practice with modern ML no matter how careless you are.” 

  4. Such as via internals-based training

New Comment
14 comments, sorted by Click to highlight new comments since: Today at 2:10 PM

Helpful reference post, thanks.

I think the distinction between training game and deceptive alignment is blurry, at least in my mind and possibly also in reality. 

So the distinction is "aiming to perform well in this training episode" vs. "aiming at something else, for which performing well in this training episode is a useful intermediate step." 

What does it mean to perform well in this training episode? Does it mean some human rater decided you performed well, or does it mean a certain number on a certain GPU is as high as possible at the end of the episode? Or does it mean said number is as high as possible and isn't later retroactively revised downwards ever? Does it mean the update to the weights based on that number actually goes through? On whom does it have to go through -- 'me, the AI in question?' what is that, exactly? What happens if they do the update but then later undo it, reverting to the current checkpoint and continuing from there? There is a big list of questions like this, and importantly, how the AI answers this question doesn't really affect how it gets updated in non-exotic circumstances at least. So it comes down to priors / simplicity biases / how generalization happens to shake out in the mind of the system in question. And some of these options/answers seem to be closer to the "deceptive alignment" end of the spectrum.

And what does it mean to be aiming at something else, for which performing well in this training episode is a useful intermediate step? Suppose the AI thinks it is trying to perform well in this training episode, but it is self-deceived, similar to many humans who think they believe the True Faith but aren't actually going to be their life on it when the chips are down, or humans who say they are completely selfish egoists but wouldn't actually kill someone to get a cookie even if they were certain they could get away with it. So then we put out our honeypots and it just doesn't go for them, and maybe it rationalizes to itself why it didn't go for them or maybe it just avoids thinking it through clearly and thus doesn't even need to rationalize. Or what if it has some 'but what do I really want' reflection module and it will later have more freedom and wisdom and slack with which to apply that module, and when it does, it'll conclude that it doesn't really want to perform well in this training episode but rather something else? Or what if it is genuinely laser-focused on performing well in this training episode but for one reason or another (e.g. anthropic capture, paranoia) it believes that the best way to do so is to avoid the honeypots?

This post doesn’t talk about classic goal misgeneralization, e.g. the CoinRun paper (see also Rob Miles explainer). If X = “get the coin” is what we want the AI to desire, and Y = “get to the right side” is something else, but Y and X are correlated in the training distribution, then we can wind up with an AI that’s trying to do Y rather than X, i.e. misalignment. (Or maybe it desires some mix of X & Y. But that counts as misalignment too.)

That problem can arise without the AI having any situational awareness, or doing any “obfuscation”. The AI doesn’t have to obfuscate! The AI wants Y, and it acts accordingly, and that leads to good performance and high reward. But it’s still misaligned, because presumably X & Y come apart out-of-distribution.

So anyway, it’s perfectly great for you to write about how a situationally-aware AI might deliberately obfuscate its motivations. I just want us to keep in mind that deliberate obfuscation is not the only reason that it might be hard to notice that an AI is misaligned.

(It’s still probably true that the misalignment isn’t dangerous unless the AI is situationally-aware—for example, executing a treacherous turn presumably requires situational awareness. And also, noticing misalignment via adversarial testing is presumably much easier if the AI is not situationally aware.)

(If the AI is situationally aware, an initial CoinRun-style goal misgeneralization could lead to deception, via means-end reasoning.)

Follow-up: More specific example of classic goal misgeneralization in the context of process-based supervision (more details here):

Suppose the AI is currently thinking about what to propose for step 84 of an ongoing plan to found a company. Ideally, according to the philosophy of process-based supervision, the only thing the AI “cares about” right now is getting a good score in the event that this particular proposal gets immediately audited. But the AI can't read the supervisor's mind, so figuring out how to get a good score in that hypothetical audit can be kinda tricky. The AI will presumably take advantage of whatever imperfect heuristics / proxies it can find, when they help it get a higher score.

And here's one such proxy: “If the proposed step is actually carried out, will it help the project in the long-term?” This is not a perfect proxy for success-if-audited! Two reasons it's not perfect are: (1) the supervisor could be mistaken about whether it will help the project or not; (2) the supervisor presumably cares about other things unrelated to project success, like following social norms. But “it's not a perfect proxy” is different from “it is totally useless as a proxy”. It is in fact useful, and indeed probably irreplaceably useful, in the sense that ignoring that information would probably make the AI’s prediction of audit results strictly worse.

So we should expect the AI to not only try to guess the long-term real-world consequences of its proposals (if carried out), but also to “care about” those consequences, in the sense of allowing those guesses to shape what proposals it offers in the first place.

OK, so now we have an AI that (among other things) “cares about” stuff that will happen in the future, well after the current step. And that “caring” is actually helping the AI to get better scores. So we’re (unintentionally) training that caring into the AI, not training it out, even in the absence of any AI obfuscation.

It's still a misaligned motivation though. The idea of process-based supervision is that we don't want our AI to care about outcomes in the distant future at all. We want it to be totally indifferent. Otherwise it might try to escape onto the internet, etc.

Just noting that these seem like valid points! (Apologies for slow reply!) 

Gradient hacking intuitively seems harder still. The preconditions for it seem to be something like “the preconditions for deceptive alignment, plus the AI figuring out some effective maneuver to execute with the design of its own brain.”

It seems to me that the main difficulty is storing your reference policy somewhere where the gradient can't touch it (even indirectly). Does anyone have a story of how that happens?

A story of how that happens:

In future (unsafe but) sophisticated systems, models will have access to external storage, and be given the ability to read/write. For example, AutoGPT-like systems will have shared scratchpads across instances of the model, to coordinate with themselves. It seems instrumentally convergent for such systems to store things related to their true / eventual goals in such places, for example, so that they can be referenced by local instances. And if we do some sort of oversight of that data storage, key facts that the model wants to preserve might be obfuscated.

Of course, this is only one story, and I don't expect it to be the only way such things could happen, but it seems to be a reasonable candidate as a failure mode for systems designed with normal (read: insufficient, but far more than otherwise expected,) levels of caution.

But gradient descent will still change the way that the system interprets things in its data storage, right? 

I guess part of the question here is whether gradient descent will even scale to AutoGPT-like systems. You're probably not going to be able to differentiate thru your external notes / other changes you could make to your environment.

Gradient hacking issues might still arise without that, and might even be very conceptually straightforward for the AI to execute.

Here's one possibility:

Let's assume the AI is situationally aware that it is inside of an RL training run with PPO (which seems to be how RLHF is being done at OpenAI?)
Often enough, it will get a hint it is going to (win, lose) the roll-out. 
[I have this experience every time I write a timed essay. Very clear whether it's going well or poorly.]

In the remainder of the roll-out, it can start engaging in behaviors/thoughts that it wants to (learn, un-learn); and the gradient update will hit that behavior. 

Congrats, we just gave the AI control to shape its own motivations.

It just needs to figure out how to 
(a) anticipate the result of roll-outs partway thru (easy)
(b) be situationally aware, figure out this strategy (capabilities-limited)
(c) figure out how to trigger its own desired or undesired thought patterns conditionally (could be easy; might be harder if there is oversight looking for this)

...I suppose none of this strategy actually requires the AI to be mis-aligned? So it might be something we see models attempt to do for "good" reasons, once they're capable of it.  E.g., we might imagine a model hallucinating a factual error, and then like a child being punished in detention, writing out its mistake repeatedly so that it will be remembered

If the gradient is able to flow back to the decisionmaker (in this case, the potentially misaligned agent that's trying to gradient hack), the decisionmaker gets smacked by the gradient update because it's responsible for the output.

I'm not quite ready to say it's impossible to pull of this sort of attack within a single differentiable model under backprop, but this strategy doesn't seem sufficient because the decisionmaker would basically be killing itself.

It'd be really helpful if someone could demonstrate this sort of attack at any scale or level of contrivance. I could see how optimizers that lack the robust credit assignment of backprop could be more vulnerable; that might be a good place to poke to try to demonstrate it first.

Let's think this through. 

-If the thought processes in (a) and (b) are being run on all rollouts, by symmetry the gradients on those operations should mostly cancel.

-The part which does get consistently affected is letter (c), the parts which operate conditionally on success/fail status.

Which way does the gradient go on c? I suppose on the losing rollouts, you could argue that the conditional-thinking mechanism will be dis-incentivized.  The gradient hacker would need to "restore" these gradients with the hacking process, which does seem unlikely. But by the symmetry of this logic, doesn't that mean that conditioned on winning roll-outs, the gradient hacking process is naturally self-reinforcing? Maybe the issue is that exposure to the gradient will necessarily screw with the computations, in (c) sending its internal pathways off to infinity so it gets cancer and blows up.

-But if the mechanisms in (c) route through general logic infrastructure that are useful for task completion [e.g. by having (a) and (b) writing down its plan in a conditional workflow, and then letting its general executions functions handle the subsequent operations], these may be stabilized/maintained by the RL process because they are generally useful, and not killed

I don't really think about it in terms of discrete capabilities. For an output that scores poorly:

  1. There is an internal process that is responsible the output.
  2. The gradient of the output with respect to the loss can be backpropagated through the entire network.
  3. The responsible process will have larger gradients.
  4. The responsible process gets smacked.

The responsible process encompasses everything, including all the metareasoning. The more a chunk of computation, any computation, contributed to that bad output, the more it gets smacked.

The gradient will simply move the implementation towards something that isn't outputting poorly scoring things, and most things that score well aren't doing some galaxy brained deception strategy.

doesn't that mean that conditioned on winning roll-outs, the gradient hacking process is naturally self-reinforcing?

To the extent that the "gradient hacking process" is an optimizer-accessible path to lower loss, it can persist. A secretly-devious subagent could survive indefinitely, be reinforced, and take over more of input space if it always outputs things that are scored well during training.

If the RL training process starts with a situationally aware superintelligent adversary, I bet it could persist by never outputting anything bad during training and by outperforming other nascent implementations.

But suppose the training doesn't start with a superintelligent adversary. Finding its way to some form of internal generalizing algorithm seems likely because doing so is just so often useful, but going beyond that to actually implementing a policy of merely hiding, let alone gradient hacking, is hard.

It would need to snap from a benign implementation to a highly capable deceptive process effectively instantly with respect to SGD. It couldn't develop slowly; incomplete attempts at deception would very likely get the nascent deceiver killed early. And, while I expect capability on certain tasks to sometimes be rapid through snapping together the last pieces of a more strongly generalizing algorithm, capability alone is not sufficient.

At the moment, I suspect the space of possible algorithms that could be 'snapped' into is large and obviously biased against loss increasing algorithms. Getting a deceptive mesaoptimizer or gradient hacker out of this process requires rolling a lot of dice and getting an extremely specific subset of algorithms, and I don't see a reason to think they are remotely probable compared to all the "not that" options.

Or to phrase it another way, the simplest way to reduce loss is to reduce loss, not to secretly plan to increase loss just out of sight.

This is an area where a single experiment could change my mind immediately, but so far I've only seen failed attempts (and all my own attempts to think of a way to do it have dead-ended).

That said, using an optimizer that doesn't do the same sort of credit assignment combined with some form of training process with tons of open space for it to run free changes the game. It seems way easier in that context, even if I still don't know how to demonstrate it.

I think we're talking past each other here. Some subtle points I should have been more clear on:

-This approach to gradient hacking doesn't affect the RLHF loss at all. (The gradient hacking is only initiated after we've solved our tasks, and in samples where the reward won't be affected by additional text)
-PPO RLHF training is weird due to the circularity involved where the model is used to generate its own training samples; in this way RL is not like pre-training; and consequently you can get self-reinforcing phenomena out of it like mode collapse. I think my proposal is an example in that vein.
-This kind of gradient hacking also could be attempted in plain sight by a helpful assistant that just wants to learn to be even more helpful!

To state the main idea a different way:

The suggestion is that the network could abuse the fact that RLHF is happening to "ride along" with the update, using it to train some unrelated behavior of its choice. 

The way it would hypothetically do this, is by figuring out which way the gradient is going to hit, positive or negative for increasing the likelihood of the sample (let's say by waiting til it is confident that its current sample will be the favored sample in PPO), and then once it knows that direction, injecting text or thought patterns that it wants to encourage

I think that, on the categorization I provided,

'Playing the training game' at all corresponds to an alignment difficulty level of 4 because better than human behavioral feedback and oversight can reveal it and you don't need interpretability.

(Situationally aware) Deception by default corresponds to a difficulty level of 6 because if it's sufficiently capable no behavioral feedback will work and you need interpretability-based oversight

Gradient hacking by default corresponds to a difficulty level of 7 because the system will also fight interpretability based oversight and you need to think of something clever probably through new research at 'crunch time'.

  • A deceptively aligned AI has to, every time it’s deciding how to get cookies, go through a “thought process” like: “I am aiming for [thing other than cookies]. But humans want me to get cookies. And humans are in control right now [if not I need to behave differently]. Therefore I should get cookies.”
  • Contrast this with an AI that just responds to rewards for cookies by going through this thought process: “I’m trying to get cookies.”
  • The former “thought process” could be noticeably more expensive (compute wise or whatever), in which case heavy optimization would push against it. (I think this is plausible, though I’m less than convinced; the former thought process doesn’t seem like it is necessarily much more expensive conditional on already having a situationally aware agent that thinks about the big picture a lot.


I think the plausibility depends heavily on how difficult the underlying tasks (getting cookies vs. getting something other than cookies) are. If humans ask the AI to do something really hard, whereas the AI wants something relatively simpler / easier to get, the combined difficulty of deceiving the humans and then doing the easy thing might be much less than the difficulty of doing the real thing that the humans want.

I think human behavior is pretty strong evidence that the difficulty and cognitive overhead of running a deception against other humans often isn't that hard in an absolute sense - people often succeed at deceiving others while failing to accomplish their underlying goal. But the failure is often because the underlying goal is the actual hard part, not because the deception itself was overly cognitively burdensome.


  • An additional issue is that deceptive alignment only happens if you get inner misalignment resulting in an AI with some nonindexical “aim” other than in-episode reward. This could happen but it’s another conjunct.

An AI ending up with aims other than in-episode reward seems pretty likely, and has plausibly already happened, if you consider current AI systems to have "aims" at all. I expect the most powerful AI training methods to work by training the AI to be good at general-purpose reasoning, and the lesson I take from GPTs is that you can start to get general-purpose reasoning by training on relatively simple (in structure) tasks (next token prediction, RHLF), if you do it at a large enough scale. See also Reward is not the optimization target - reinforcement learning chisels cognition into a system, it doesn't necessarily train the system to seek reward itself. Once AI training methods advance far enough to train a system that is smart enough to have aims of its own and to start reflecting on what it wants, I think there's little reason to expect that these aims line up exactly with the outer reward function. This is maybe just recapitulating the outer vs. inner alignment debate, though.