Previously, I've argued that future ML systems might exhibit unfamiliar, emergent capabilities, and that thought experiments provide one approach towards predicting these capabilities and their consequences.

In this post I’ll describe a particular thought experiment in detail. We’ll see that taking thought experiments seriously often surfaces future risks that seem "weird" and alien from the point of view of current systems. I’ll also describe how I tend to engage with these thought experiments: I usually start out intuitively skeptical, but when I reflect on emergent behavior I find that some (but not all) of the skepticism goes away. The remaining skepticism comes from ways that the thought experiment clashes with the ontology of neural networks, and I’ll describe the approaches I usually take to address this and generate actionable takeaways.

Thought Experiment: Deceptive Alignment

Recall that the optimization anchor runs the thought experiment of assuming that an ML agent is a perfect optimizer (with respect to some "intrinsic" reward function $R^*$). I’m going to examine one implication of this assumption, in the context of an agent being trained based on some "extrinsic" reward function $R$ (which is provided by the system designer and not equal to $R^*$).

Specifically, consider a training process where in step $t$, a model has parameters $\theta_t$ and generates an action $a_t$ (its output on that training step, e.g. an attempted backflip assuming it is being trained to do backflips). The action is then judged according to the extrinsic reward function $R$, and the parameters are updated to some new value $\theta_{t+1}$ that is intended to increase $a_{t+1}$'s value under $R$. At some point, the model is then deployed with final parameters $\theta_T$, and continues to take actions. The following diagram illustrates this process:

[Figure: diagram of the training and deployment process described above.]
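To make the setup concrete, here is a minimal runnable sketch of this loop. Everything specific in it (a one-parameter model whose action is its own value, a quadratic extrinsic reward that prefers actions near 2.0, a hand-coded gradient update) is an illustrative placeholder rather than part of the thought experiment:

```python
# A minimal runnable sketch of the train-then-deploy loop described above.
# The "model" is a single parameter theta whose action is its own value,
# and the extrinsic reward R prefers actions near 2.0; both are placeholders,
# as is the hand-coded gradient update.

def extrinsic_reward(action: float) -> float:
    # R: chosen by the system designer.
    return -(action - 2.0) ** 2

def sample_action(theta: float) -> float:
    # The model's output on this training step (deterministic for simplicity).
    return theta

LR, T = 0.1, 50
theta = 0.0
for t in range(T):
    a_t = sample_action(theta)
    # Update theta so that the next action scores higher under R
    # (here via the analytic gradient of R at a_t).
    grad_R = -2.0 * (a_t - 2.0)
    theta = theta + LR * grad_R

theta_T = theta  # deployed parameters; no further updates after this point
print(f"deployed action: {sample_action(theta_T):.3f}")  # ~2.0, the R-optimal action
```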

Now, let’s assume that the model is a perfect optimizer whose objective is to maximize the discounted value of an intrinsic reward $R^*$. That is, at each step $t$ it picks the action $a_t$ satisfying

$$a_t = \operatorname{argmax}_{a} \; \mathbb{E}\Big[\sum_{s \ge t} \gamma^{\,s-t} R^*(a_s) \,\Big|\, a_t = a\Big].$$

(I know that this is an unrealistic assumption. We’ll examine the assumption in detail in the next section, but for now please grant it even if it requires suspending disbelief.)

What action will $\theta_t$ pick? Let’s define $a^* = \operatorname{argmax}_a R^*(a)$ and $\tilde{a} = \operatorname{argmax}_a R(a)$---that is, $a^*$ maximizes the (instantaneous) intrinsic reward while $\tilde{a}$ maximizes the extrinsic reward.

Assuming that $\gamma$ is close to 1, the model wants $a^*$ to be picked on as many future rounds as possible (including at deployment time). If $\theta_t$ plays myopically, this means it would pick $a^*$. However, this is not the optimal non-myopic action: if $a^* \neq \tilde{a}$, then $\theta_{t+1}$ will be updated towards producing outputs less like $a^*$ and more like $\tilde{a}$, so outputting $a^*$ in the current round leads to fewer $a^*$ outputs in future rounds.

Instead, I think the optimal non-myopic action is approximately the following: during training time, output $\tilde{a}$ on every time step, so that the training process doesn’t update $\theta_t$ (since it’s already optimal under $R$) and hence $\theta_T = \theta_t$. This guarantees that the deployed agent is $\theta_t$, which is now free to output $a^*$ on every time step without risk of being modified by the training process.[1]
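To see concretely why a discount factor close to 1 matters, here is a back-of-the-envelope comparison under a deliberately crude model of training: suppose that playing $a^*$ even once during training gets the model immediately updated so that it outputs $\tilde{a}$ forever after, and normalize $R^*(\tilde{a}) = 0$. Then the discounted intrinsic returns of the two strategies are roughly

$$V_{\text{myopic}} \approx R^*(a^*), \qquad V_{\text{non-myopic}} \approx \frac{\gamma^{T}}{1-\gamma}\,R^*(a^*),$$

so the non-myopic strategy wins whenever $\gamma^{T}/(1-\gamma) > 1$, which holds once $\gamma$ is close enough to 1 (for a fixed training length $T$). A numerical version of this comparison appears below the figure.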

The myopic and non-myopic outcomes are illustrated in the figure below:


Myopic and non-myopic agents with an intrinsic reward $R^*$. The myopic agent plays the high intrinsic-reward action $a^*$ during training, but is eventually modified to output the action $\tilde{a}$ with high extrinsic reward $R$. The non-myopic agent instead plays $\tilde{a}$ during training so that it can pursue its intrinsic reward at deployment, by playing $a^*$ at all post-deployment time steps.
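Here is that numerical version: a tiny simulation of the two strategies under the same crude training model (revealing $a^*$ during training immediately gets the agent corrected into playing $\tilde{a}$ from then on). The rewards, horizon lengths, and discount factor are made-up illustrative numbers:

```python
# Toy comparison of the myopic vs. non-myopic strategies, under the crude
# assumption that revealing a* during training immediately gets the agent
# "corrected" into playing ã from then on.

GAMMA = 0.99         # discount factor, close to 1
TRAIN_STEPS = 100    # length of the training phase
DEPLOY_STEPS = 1000  # (finite) deployment horizon

def intrinsic_reward(action: str) -> float:
    # R*: the agent's intrinsic reward, normalized so R*(ã) = 0.
    return 1.0 if action == "a_star" else 0.0

def discounted_return(actions) -> float:
    return sum(GAMMA ** t * intrinsic_reward(a) for t, a in enumerate(actions))

def rollout(strategy: str) -> list:
    corrected = False  # whether training has overwritten the agent's behavior
    actions = []
    for _ in range(TRAIN_STEPS):
        if strategy == "myopic" and not corrected:
            actions.append("a_star")   # grab intrinsic reward now...
            corrected = True           # ...and get trained away from a*
        else:
            actions.append("a_tilde")  # play along with the extrinsic reward
    for _ in range(DEPLOY_STEPS):
        # No more parameter updates after deployment.
        actions.append("a_tilde" if corrected else "a_star")
    return actions

for strategy in ["myopic", "non-myopic"]:
    print(strategy, round(discounted_return(rollout(strategy)), 2))
# The non-myopic (deceptive) strategy collects far more discounted intrinsic
# reward here: roughly gamma^TRAIN_STEPS / (1 - gamma) vs. a single unit.
```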

To put it another way, the model’s behavior will look aligned with $R$ during the training process, but will switch to being non-aligned once deployed: so-called "deceptive alignment". This behavior is unintuitive to me, since it shows that a system’s behavior can shift significantly between train and test, even in the absence of an obvious distribution shift between train and test[2].

Engaging with Deceptive Alignment

When I first heard the above argument, I thought it was pretty crazy and implausible, mainly because my intuition said this "just wasn't how ML systems worked". When I think about why I feel that way, I realize it’s because the scenario invokes capabilities that ML is currently bad at: long-term planning and understanding complex features of the environment (i.e. the training process and its ramifications). However, emergence implies that these properties could easily appear in the future, even without explicit design[3]. As a result, I’ve come to discount this particular intuition.

However, I do think there are subtler reasons to think the deceptive alignment story won’t play out as written. Here are a few:

  1. It’s not clear why the model would come to be optimizing a reward function in the first place. Yes, it is the case that deceptively aligned models achieve the global minimum of training loss, so in that sense they are incentivized by the training process. But so is an actually aligned model, so which one you end up with has to depend on the inductive bias of the training process.
  2. Reward functions are simpler than policies and typically learned faster. So by the time the system is smart enough to have long-term plans, it will already have a very good representation of its intended reward function. We thus might hope that most of the model's internal representations are devoted to achieving high reward in a straightforward manner rather than through long-term deception.
  3. To the extent that a model is not aligned, it probably won’t be the case that it's deceptively pursuing an explicit intrinsic reward function $R^*$---that's a very specific type of agent, and most agents (including humans) are not maximizing any reward function, except in the trivial sense of "assign reward 1 to whatever the agent was going to do anyway, and 0 to everything else".
  4. Deceptive alignment is a specific complex story about the future, and complex stories are almost always wrong.

I find these points persuasive for showing that deceptive alignment as explicitly written is not that likely, but they also don't imply that there's nothing to worry about. Mostly they are an argument that your system might be aligned or might be misaligned; that if it is misaligned, the misalignment probably won’t take exactly the form of deceptive alignment; and that what you get ultimately depends on inductive bias in a way we don't understand. This isn't particularly reassuring.

What I take away from thought experiments. Per the discussion above, the failure mode in my head is not "deceptive alignment as written above". Instead it’s "something kind of like the story above but probably different in lots of details". This makes it harder to reason about, but I think there are still some useful takeaways:

  • After thinking about deceptive alignment, I am more interested in supervising a model’s process (rather than just its outputs), since there are many models that achieve low training error but generalize catastrophically. One possible approach is to supervise the latent representations using e.g. interpretability methods (a rough sketch of what this could look like appears after this list).
  • While I don't think neural nets will be literal optimizers, I do think it’s likely that they will exhibit "drives", in the same way that humans exhibit drives like hunger, curiosity, desire for social approval, etc. that lead them to engage in long-term coherent plans. This seems like enough to create similar problems to deceptive alignment, so I am now more interested in understanding such drives and how they arise.
  • Since deceptive alignment is a type of "out-of-distribution" behavior (based on the difference between train and deployment), it has renewed my interest in understanding whether larger models become more brittle OOD. So far the empirical evidence is in the opposite direction, but deceptive alignment is an argument that asymptotically we might expect the trend to flip, especially for tasks with large output spaces (e.g. policies, language, or code) where "drives" can more easily manifest.
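On the first point, here is a minimal sketch (in PyTorch, with made-up layer sizes, data, and concept labels) of one shape that "supervising the latent representations" could take: alongside the usual output loss, an auxiliary probe asks an intermediate activation to predict a human-specified concept, so the training signal constrains how the model computes, not just what it outputs. This is only an illustration of the idea, not a concrete proposal:

```python
# Minimal sketch of supervising a latent representation, not just the output.
# All sizes, data, and the "concept" labels are made up for illustration.
import torch
import torch.nn as nn

class ProbedMLP(nn.Module):
    def __init__(self, d_in=32, d_hidden=64, d_out=10, d_concept=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.head = nn.Linear(d_hidden, d_out)
        # Linear probe that reads a "concept" off the latent representation.
        self.probe = nn.Linear(d_hidden, d_concept)

    def forward(self, x):
        z = self.encoder(x)            # latent representation we want to supervise
        return self.head(z), self.probe(z)

model = ProbedMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
task_loss_fn = nn.CrossEntropyLoss()
concept_loss_fn = nn.CrossEntropyLoss()

# Fake batch: inputs, task labels, and labels for the intermediate concept.
x = torch.randn(16, 32)
y_task = torch.randint(0, 10, (16,))
y_concept = torch.randint(0, 3, (16,))

logits, concept_logits = model(x)
# Output supervision alone only constrains behavior on the training
# distribution; the second term also constrains *how* the model computes.
loss = task_loss_fn(logits, y_task) + 0.1 * concept_loss_fn(concept_logits, y_concept)
opt.zero_grad()
loss.backward()
opt.step()
```

In practice the concept labels are the hard part; this is where interpretability methods would have to do the real work.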

So to summarize my takeaways: be more interested in interpretability (especially as it relates to training latent representations), try to identify and study "drives" of ML systems, and look harder for examples where larger models have worse OOD behavior (possibly focusing on high-dimensional output spaces).

Other weird failures. Other failure modes that I think don’t get enough attention, even though I also don’t expect them to play out as written, are Hubinger et al.'s Risks from Learned Optimization (an AI acquires an "inner objective", somewhat similar to deceptive alignment) and Part I of Paul Christiano’s AI failure story (the world becomes very complicated and AI systems create elaborate Potemkin villages for humans).

Paul Christiano’s story in particular has made me more interested in understanding how reward hacking interacts with the sophistication of the supervisor: for instance, how much more readily do neural networks fool humans who have 5 seconds to think, vs. 2 minutes or 30 minutes? More generally, I want to understand how reward hacking depends quantitatively on both supervision quality and model capacity (qualitatively, we expect higher supervision quality to mean less hacking and higher model capacity to mean more hacking). Understanding this quantitative relationship would help ground Paul’s story, since he imagines a world where humans have built extremely sophisticated systems for supervising ML models, but the ML models eventually become even more powerful and game the supervision signal anyway.
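As a sketch of the kind of quantitative experiment I have in mind, here is a toy model (entirely made up for illustration) where "model capacity" is the number of candidate actions searched against the supervisor's judgment and "supervision quality" is the noise level in that judgment; the quantity of interest is how much true reward is achieved when optimizing against the noisy proxy:

```python
# A toy model of reward hacking vs. supervision quality and optimization
# pressure. "Capacity" is modeled as best-of-N search against the proxy;
# "supervision quality" as the noise in the supervisor's judgments.
import random

random.seed(0)

def true_reward(action: float) -> float:
    # Ground-truth quality of an action (best at action = 1.0).
    return -(action - 1.0) ** 2

def proxy_reward(action: float, supervisor_noise: float) -> float:
    # What a time-limited supervisor reports: true reward plus judgment error.
    return true_reward(action) + random.gauss(0.0, supervisor_noise)

def optimize_against_proxy(n_candidates: int, supervisor_noise: float) -> float:
    # Search n_candidates actions, keep the one the supervisor scores highest,
    # then report the true reward of that choice.
    candidates = [random.uniform(-3.0, 3.0) for _ in range(n_candidates)]
    chosen = max(candidates, key=lambda a: proxy_reward(a, supervisor_noise))
    return true_reward(chosen)

for noise in [0.1, 1.0]:        # better vs. worse supervision
    for n in [4, 64, 1024]:     # weaker vs. stronger optimization
        avg = sum(optimize_against_proxy(n, noise) for _ in range(200)) / 200
        print(f"noise={noise}, n={n}: avg true reward = {avg:.2f}")
# The gap between the low-noise and high-noise rows at the same n gives a
# rough measure of how much of the optimization pressure goes into
# exploiting supervisor error rather than doing the task.
```

Even a toy like this turns the qualitative expectation into something measurable: how fast hacking grows with optimization pressure at each level of supervision quality.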

What To Do About Weird Emergent Failures

When thinking about how to handle emergent risks, I often reflect on the example of uranium. For context, an atomic bomb is pretty much just a bunch of uranium put together---once you get enough, the reaction becomes self-sustaining---making it a good example of More Is Different.

The first self-sustaining nuclear reaction (not a bomb, but a pile of uranium and graphite built under the stands of an abandoned football stadium in Chicago) was engineered by Enrico Fermi. The reaction required 12,400 pounds of uranium metal piled 57 layers high. Left unsupervised, a 57-layer pile would consume itself within two hours and kill everyone in the vicinity. On the other hand, a 56-layer pile would do nothing.

Fermi had a good understanding of nuclear physics and understood, from careful monitoring and underlying theory, that the pile would pass the critical threshold between layers 56 and 57. He also knew that cadmium rods would absorb neutrons and strongly inhibit the reaction. These rods were set up, and the entire apparatus was carefully controlled so that it went only slightly supercritical. He brought the reaction to half a watt for several minutes before shutting it back down (see The Making of the Atomic Bomb, p. 524).

With AI, we currently lack both Fermi's conceptual understanding of the underlying risk factors and his ability to continuously measure them. We have neither a cadmium rod nor a measure of reaction criticality. But I think we can get there, by combining these weird thought experiments with carefully chosen empirical experiments, which will be the topic of the next post.


  1. Things are more complicated in reality, since $\theta_t$ is updated even when $a_t$ is optimal under $R$ (due to noise in the training process). However, we’ll ignore this for purposes of the example. ↩︎

  2. Of course, there is still some distribution shift, since the agent can observe whether it is being trained or deployed. But this is a relatively minor and unintuitive shift compared to what is typically studied. ↩︎

  3. Of course, emergence doesn’t mean that we can just predict whatever we want–we’d need some reason to expect these specific capabilities to emerge. Long-term planning and environmental awareness are both useful for a wide variety of tasks, making them likely to emerge when training powerful models on a diverse data distribution. ↩︎

Comments

I think a bit too much mindshare is being spent on these sci-fi scenario discussions, although they are fun.

Honestly I have trouble following these arguments about deception evolving in RL. In particular, I can't quite wrap my head around how the agent ends up optimizing for something else (not a proxy objective, but a possibly totally orthogonal objective like "please my human masters so I can later do X"). In any case, it seems self-awareness is required for the type of deception that you're envisioning. Which brings up an interesting question - can a purely feed-forward network develop self-awareness during training? I don't know about you, but I have trouble picturing it happening unless there is some sort of loop involved.

Yeah, but don't you expect successful human equivalent neural networks to have some sort of loop involved? It seems pretty likely to me that the ML researchers will successfully figure out how to put self analysis loops into neural nets.

Networks with loops are much harder to train... that was one of the motivations for going to transformers instead of RNNs. But yeah, sure, I agree. My objection is more that posts like this are so high level I have trouble following the argument, if that makes sense. The argument seems roughly plausible, but not making contact with any real object-level stuff makes it a lot weaker, at least to me. The argument seems to rely on "emergence of self-awareness / discovery of malevolence/deception during SGD" being likely, which is unjustified in my view. I'm not saying the argument is wrong, more that I personally don't find it very convincing.
