One of the biggest reasons alignment might be hard is what I’ll call threat obfuscation: various dynamics that might make it hard to measure/notice cases where an AI system has problematic misalignment (even when the AI system is in a controlled environment and one is looking for signs of misalignment). I write more about this general fear in AI Safety Seems Hard to Measure.
I think it’s very important to understand threat obfuscation dynamics; to try to design ways of assessing danger that are designed with threat obfuscation dynamics in mind; and to avoid interpreting reassuring-seeming evidence too optimistically, by keeping in mind the various threat obfuscation dynamics that could affect it.
I’ve seen three noticeably different threat obfuscation dynamics discussed, in ways that often leave me confused about exactly what dynamic someone is talking about, and when. This post aims to clear that up, based on a conversation with Paul Christiano.
The three dynamics are:
Each of these seems more devastating to alignment research than the previous one - and less likely to emerge in early transformative AI systems. I consider the “training game” very likely, and quite devastating; but it leaves us many possible interventions that the less likely “deceptive alignment” and “gradient hacking” don’t.
That was most of what I wanted to cover, but the rest of this post will give some quick notes on (a) why the more severe forms of threat obfuscation seem relatively less likely; (b) how bad things look if we only have to deal with the first one (“training game”).
This is pretty superficial and mostly based on a couple of conversations with Paul Christiano. I’m hoping he’ll write a better version of this at some point.
The “training game” mode of threat obfuscation seems like it is very likely by default, because a normal AI training process will probably directly incentivize it. Errors in labeling will directly favor the training game over intended generalization; even with perfect labeling, there’s no particular reason to think that the training game is disfavored compared to the intended generalization (if anything it vaguely/intuitively seems simpler - “do that which gets treated as intended behavior” as opposed to “do that which was intended.”)
By contrast, deceptive alignment seems to face a bigger hurdle. For example, if you train an AI with positive reinforcement for getting cookies:
Gradient hacking intuitively seems harder still. The preconditions for it seem to be something like “the preconditions for deceptive alignment, plus the AI figuring out some effective maneuver to execute with the design of its own brain.”
At first glance, this situation doesn’t seem too bad in terms of threat obfuscation. It basically means that measuring an AI’s dangerousness is as simple as “Put it in an environment where doing malicious things is good in-episode performance. A possible example: if it does malicious things to gain control of its environment, simply let it control its environment for a while and don’t do any of the things one would normally do for unintended behavior (such as trying to use gradient descent to correct against it).” This could result in a dangerous, situationally aware AI (e.g., one that has learned a pattern "Do malicious things iff they won't be caught and trained against") going ahead and doing malicious things while being tested, with no particular reason to conceal them.
However, it’s pretty easy for this to go off the rails if people aren’t careful about it. I’d be concerned about a dynamic like this:
To avoid this, we need to be careful with how we handle unexpected incidents of AIs’ behaving maliciously. It won’t always be the right move to straightforwardly “train against” such incidents; it may sometimes be important to do something more like “starting over,” e.g. insisting on a more principled way4 of making AIs less prone to malicious behavior.
This situation seems scary:
But this situation still seems much better than having to deal with deceptive alignment: we do have a tangible, non-interpretability-dependent way of detecting danger if we can be disciplined about it! In particular, a successful “evals” project looks very tractable in this world - you’re basically running the “make rewards available for bad behavior” test, and then auditing training for signs that it’s been explicitly trained against.
I'm not saying this situation is the only one we should plan around. But it is an example of how relatively good luck could put us in a position where some theoretically very difficult aspects of alignment end up being less difficult in reality.
The “catch” would be that we would want to avoid deploying similar AI systems in other contexts. But this is not a concern for the “simple training game” AI that is focused on obtaining good in-episode performance. It is a concern for an AI that has a goal other than good in-episode performance. ↩
Though some versions might work - see "Providing unusual incentives ..." here. ↩
“Without thinking carefully I’d guess there is a 25% chance that deceptive alignment is a major challenge that arises early enough that we need to deal with it ourselves. Then another 25% that it’s a challenge that someone needs to deal with some day, but is too far in the future to be worth thinking about now e.g. because AI systems will have obsoleted all of our thinking before they are deceptively aligned. And another 25% chance that deceptive alignment just never arises in the real world if you are careful and take the kind of simple precautions that we already understand. And a final 25% where the entire concern was misguided and it never happens in practice with modern ML no matter how careless you are.” ↩
Such as via internals-based training. ↩
Helpful reference post, thanks.I think the distinction between training game and deceptive alignment is blurry, at least in my mind and possibly also in reality. So the distinction is "aiming to perform well in this training episode" vs. "aiming at something else, for which performing well in this training episode is a useful intermediate step." What does it mean to perform well in this training episode? Does it mean some human rater decided you performed well, or does it mean a certain number on a certain GPU is as high as possible at the end of the episode? Or does it mean said number is as high as possible and isn't later retroactively revised downwards ever? Does it mean the update to the weights based on that number actually goes through? On whom does it have to go through -- 'me, the AI in question?' what is that, exactly? What happens if they do the update but then later undo it, reverting to the current checkpoint and continuing from there? There is a big list of questions like this, and importantly, how the AI answers this question doesn't really affect how it gets updated in non-exotic circumstances at least. So it comes down to priors / simplicity biases / how generalization happens to shake out in the mind of the system in question. And some of these options/answers seem to be closer to the "deceptive alignment" end of the spectrum.And what does it mean to be aiming at something else, for which performing well in this training episode is a useful intermediate step? Suppose the AI thinks it is trying to perform well in this training episode, but it is self-deceived, similar to many humans who think they believe the True Faith but aren't actually going to be their life on it when the chips are down, or humans who say they are completely selfish egoists but wouldn't actually kill someone to get a cookie even if they were certain they could get away with it. So then we put out our honeypots and it just doesn't go for them, and maybe it rationalizes to itself why it didn't go for them or maybe it just avoids thinking it through clearly and thus doesn't even need to rationalize. Or what if it has some 'but what do I really want' reflection module and it will later have more freedom and wisdom and slack with which to apply that module, and when it does, it'll conclude that it doesn't really want to perform well in this training episode but rather something else? Or what if it is genuinely laser-focused on performing well in this training episode but for one reason or another (e.g. anthropic capture, paranoia) it believes that the best way to do so is to avoid the honeypots?
This post doesn’t talk about classic goal misgeneralization, e.g. the CoinRun paper (see also Rob Miles explainer). If X = “get the coin” is what we want the AI to desire, and Y = “get to the right side” is something else, but Y and X are correlated in the training distribution, then we can wind up with an AI that’s trying to do Y rather than X, i.e. misalignment. (Or maybe it desires some mix of X & Y. But that counts as misalignment too.)
That problem can arise without the AI having any situational awareness, or doing any “obfuscation”. The AI doesn’t have to obfuscate! The AI wants Y, and it acts accordingly, and that leads to good performance and high reward. But it’s still misaligned, because presumably X & Y come apart out-of-distribution.
So anyway, it’s perfectly great for you to write about how a situationally-aware AI might deliberately obfuscate its motivations. I just want us to keep in mind that deliberate obfuscation is not the only reason that it might be hard to notice that an AI is misaligned.
(It’s still probably true that the misalignment isn’t dangerous unless the AI is situationally-aware—for example, executing a treacherous turn presumably requires situational awareness. And also, noticing misalignment via adversarial testing is presumably much easier if the AI is not situationally aware.)
(If the AI is situationally aware, an initial CoinRun-style goal misgeneralization could lead to deception, via means-end reasoning.)
Follow-up: More specific example of classic goal misgeneralization in the context of process-based supervision (more details here):
Suppose the AI is currently thinking about what to propose for step 84 of an ongoing plan to found a company. Ideally, according to the philosophy of process-based supervision, the only thing the AI “cares about” right now is getting a good score in the event that this particular proposal gets immediately audited. But the AI can't read the supervisor's mind, so figuring out how to get a good score in that hypothetical audit can be kinda tricky. The AI will presumably take advantage of whatever imperfect heuristics / proxies it can find, when they help it get a higher score.
And here's one such proxy: “If the proposed step is actually carried out, will it help the project in the long-term?” This is not a perfect proxy for success-if-audited! Two reasons it's not perfect are: (1) the supervisor could be mistaken about whether it will help the project or not; (2) the supervisor presumably cares about other things unrelated to project success, like following social norms. But “it's not a perfect proxy” is different from “it is totally useless as a proxy”. It is in fact useful, and indeed probably irreplaceably useful, in the sense that ignoring that information would probably make the AI’s prediction of audit results strictly worse.
So we should expect the AI to not only try to guess the long-term real-world consequences of its proposals (if carried out), but also to “care about” those consequences, in the sense of allowing those guesses to shape what proposals it offers in the first place.
OK, so now we have an AI that (among other things) “cares about” stuff that will happen in the future, well after the current step. And that “caring” is actually helping the AI to get better scores. So we’re (unintentionally) training that caring into the AI, not training it out, even in the absence of any AI obfuscation.
It's still a misaligned motivation though. The idea of process-based supervision is that we don't want our AI to care about outcomes in the distant future at all. We want it to be totally indifferent. Otherwise it might try to escape onto the internet, etc.
Just noting that these seem like valid points! (Apologies for slow reply!)
Gradient hacking intuitively seems harder still. The preconditions for it seem to be something like “the preconditions for deceptive alignment, plus the AI figuring out some effective maneuver to execute with the design of its own brain.”
It seems to me that the main difficulty is storing your reference policy somewhere where the gradient can't touch it (even indirectly). Does anyone have a story of how that happens?
A story of how that happens:In future (unsafe but) sophisticated systems, models will have access to external storage, and be given the ability to read/write. For example, AutoGPT-like systems will have shared scratchpads across instances of the model, to coordinate with themselves. It seems instrumentally convergent for such systems to store things related to their true / eventual goals in such places, for example, so that they can be referenced by local instances. And if we do some sort of oversight of that data storage, key facts that the model wants to preserve might be obfuscated.
Of course, this is only one story, and I don't expect it to be the only way such things could happen, but it seems to be a reasonable candidate as a failure mode for systems designed with normal (read: insufficient, but far more than otherwise expected,) levels of caution.
But gradient descent will still change the way that the system interprets things in its data storage, right?
I guess part of the question here is whether gradient descent will even scale to AutoGPT-like systems. You're probably not going to be able to differentiate thru your external notes / other changes you could make to your environment.
Gradient hacking issues might still arise without that, and might even be very conceptually straightforward for the AI to execute.Here's one possibility:Let's assume the AI is situationally aware that it is inside of an RL training run with PPO (which seems to be how RLHF is being done at OpenAI?) Often enough, it will get a hint it is going to (win, lose) the roll-out. [I have this experience every time I write a timed essay. Very clear whether it's going well or poorly.]In the remainder of the roll-out, it can start engaging in behaviors/thoughts that it wants to (learn, un-learn); and the gradient update will hit that behavior. Congrats, we just gave the AI control to shape its own motivations.It just needs to figure out how to (a) anticipate the result of roll-outs partway thru (easy)(b) be situationally aware, figure out this strategy (capabilities-limited)(c) figure out how to trigger its own desired or undesired thought patterns conditionally (could be easy; might be harder if there is oversight looking for this)
...I suppose none of this strategy actually requires the AI to be mis-aligned? So it might be something we see models attempt to do for "good" reasons, once they're capable of it. E.g., we might imagine a model hallucinating a factual error, and then like a child being punished in detention, writing out its mistake repeatedly so that it will be remembered
If the gradient is able to flow back to the decisionmaker (in this case, the potentially misaligned agent that's trying to gradient hack), the decisionmaker gets smacked by the gradient update because it's responsible for the output.
I'm not quite ready to say it's impossible to pull of this sort of attack within a single differentiable model under backprop, but this strategy doesn't seem sufficient because the decisionmaker would basically be killing itself.
It'd be really helpful if someone could demonstrate this sort of attack at any scale or level of contrivance. I could see how optimizers that lack the robust credit assignment of backprop could be more vulnerable; that might be a good place to poke to try to demonstrate it first.
Let's think this through. -If the thought processes in (a) and (b) are being run on all rollouts, by symmetry the gradients on those operations should mostly cancel.-The part which does get consistently affected is letter (c), the parts which operate conditionally on success/fail status.Which way does the gradient go on c? I suppose on the losing rollouts, you could argue that the conditional-thinking mechanism will be dis-incentivized. The gradient hacker would need to "restore" these gradients with the hacking process, which does seem unlikely. But by the symmetry of this logic, doesn't that mean that conditioned on winning roll-outs, the gradient hacking process is naturally self-reinforcing? Maybe the issue is that exposure to the gradient will necessarily screw with the computations, in (c) sending its internal pathways off to infinity so it gets cancer and blows up.-But if the mechanisms in (c) route through general logic infrastructure that are useful for task completion [e.g. by having (a) and (b) writing down its plan in a conditional workflow, and then letting its general executions functions handle the subsequent operations], these may be stabilized/maintained by the RL process because they are generally useful, and not killed
I don't really think about it in terms of discrete capabilities. For an output that scores poorly:
The responsible process encompasses everything, including all the metareasoning. The more a chunk of computation, any computation, contributed to that bad output, the more it gets smacked.
The gradient will simply move the implementation towards something that isn't outputting poorly scoring things, and most things that score well aren't doing some galaxy brained deception strategy.
doesn't that mean that conditioned on winning roll-outs, the gradient hacking process is naturally self-reinforcing?
To the extent that the "gradient hacking process" is an optimizer-accessible path to lower loss, it can persist. A secretly-devious subagent could survive indefinitely, be reinforced, and take over more of input space if it always outputs things that are scored well during training.
If the RL training process starts with a situationally aware superintelligent adversary, I bet it could persist by never outputting anything bad during training and by outperforming other nascent implementations.
But suppose the training doesn't start with a superintelligent adversary. Finding its way to some form of internal generalizing algorithm seems likely because doing so is just so often useful, but going beyond that to actually implementing a policy of merely hiding, let alone gradient hacking, is hard.
It would need to snap from a benign implementation to a highly capable deceptive process effectively instantly with respect to SGD. It couldn't develop slowly; incomplete attempts at deception would very likely get the nascent deceiver killed early. And, while I expect capability on certain tasks to sometimes be rapid through snapping together the last pieces of a more strongly generalizing algorithm, capability alone is not sufficient.
At the moment, I suspect the space of possible algorithms that could be 'snapped' into is large and obviously biased against loss increasing algorithms. Getting a deceptive mesaoptimizer or gradient hacker out of this process requires rolling a lot of dice and getting an extremely specific subset of algorithms, and I don't see a reason to think they are remotely probable compared to all the "not that" options.
Or to phrase it another way, the simplest way to reduce loss is to reduce loss, not to secretly plan to increase loss just out of sight.
This is an area where a single experiment could change my mind immediately, but so far I've only seen failed attempts (and all my own attempts to think of a way to do it have dead-ended).
That said, using an optimizer that doesn't do the same sort of credit assignment combined with some form of training process with tons of open space for it to run free changes the game. It seems way easier in that context, even if I still don't know how to demonstrate it.
I think we're talking past each other here. Some subtle points I should have been more clear on:-This approach to gradient hacking doesn't affect the RLHF loss at all. (The gradient hacking is only initiated after we've solved our tasks, and in samples where the reward won't be affected by additional text)-PPO RLHF training is weird due to the circularity involved where the model is used to generate its own training samples; in this way RL is not like pre-training; and consequently you can get self-reinforcing phenomena out of it like mode collapse. I think my proposal is an example in that vein.-This kind of gradient hacking also could be attempted in plain sight by a helpful assistant that just wants to learn to be even more helpful!To state the main idea a different way:The suggestion is that the network could abuse the fact that RLHF is happening to "ride along" with the update, using it to train some unrelated behavior of its choice. The way it would hypothetically do this, is by figuring out which way the gradient is going to hit, positive or negative for increasing the likelihood of the sample (let's say by waiting til it is confident that its current sample will be the favored sample in PPO), and then once it knows that direction, injecting text or thought patterns that it wants to encourage
I think that, on the categorization I provided,
'Playing the training game' at all corresponds to an alignment difficulty level of 4 because better than human behavioral feedback and oversight can reveal it and you don't need interpretability.
(Situationally aware) Deception by default corresponds to a difficulty level of 6 because if it's sufficiently capable no behavioral feedback will work and you need interpretability-based oversight
Gradient hacking by default corresponds to a difficulty level of 7 because the system will also fight interpretability based oversight and you need to think of something clever probably through new research at 'crunch time'.
A deceptively aligned AI has to, every time it’s deciding how to get cookies, go through a “thought process” like: “I am aiming for [thing other than cookies]. But humans want me to get cookies. And humans are in control right now [if not I need to behave differently]. Therefore I should get cookies.”Contrast this with an AI that just responds to rewards for cookies by going through this thought process: “I’m trying to get cookies.”The former “thought process” could be noticeably more expensive (compute wise or whatever), in which case heavy optimization would push against it. (I think this is plausible, though I’m less than convinced; the former thought process doesn’t seem like it is necessarily much more expensive conditional on already having a situationally aware agent that thinks about the big picture a lot.
I think the plausibility depends heavily on how difficult the underlying tasks (getting cookies vs. getting something other than cookies) are. If humans ask the AI to do something really hard, whereas the AI wants something relatively simpler / easier to get, the combined difficulty of deceiving the humans and then doing the easy thing might be much less than the difficulty of doing the real thing that the humans want.
I think human behavior is pretty strong evidence that the difficulty and cognitive overhead of running a deception against other humans often isn't that hard in an absolute sense - people often succeed at deceiving others while failing to accomplish their underlying goal. But the failure is often because the underlying goal is the actual hard part, not because the deception itself was overly cognitively burdensome.
An additional issue is that deceptive alignment only happens if you get inner misalignment resulting in an AI with some nonindexical “aim” other than in-episode reward. This could happen but it’s another conjunct.
An AI ending up with aims other than in-episode reward seems pretty likely, and has plausibly already happened, if you consider current AI systems to have "aims" at all. I expect the most powerful AI training methods to work by training the AI to be good at general-purpose reasoning, and the lesson I take from GPTs is that you can start to get general-purpose reasoning by training on relatively simple (in structure) tasks (next token prediction, RHLF), if you do it at a large enough scale. See also Reward is not the optimization target - reinforcement learning chisels cognition into a system, it doesn't necessarily train the system to seek reward itself. Once AI training methods advance far enough to train a system that is smart enough to have aims of its own and to start reflecting on what it wants, I think there's little reason to expect that these aims line up exactly with the outer reward function. This is maybe just recapitulating the outer vs. inner alignment debate, though.