TL;DR: This is an attempt to disentangle some concepts that I used to conflate too much as just "inner alignment". This will be old news to some, but might be helpful for people who feel confused about how deception, distributional shift, and "sharp left turns" are related. I first discuss them as entirely separate threat models, and then talk about how they're all aspects of "capabilities are more robust than alignment".
Here are three different threat models for how an AI system could very suddenly do catastrophic things:
- Deception: The AI becomes deceptively aligned at some point during training, and then does what we want only for instrumental reasons (because it wants to be deployed). Once we deploy, it starts pursuing its actual objective, which is catastrophic for humans.
- Distributional shift: The AI behaves well during training, perhaps using some messy set of heuristics and proxy objectives. We deploy, and there's distributional shift in the AI's inputs, which leads to the model's proxies no longer being aligned with human values. But it's still behaving capably, so we again get catastrophic outcomes.
- Capability gains/sharp left turn: At some point (while training or in deployment), the AI becomes much more capable, including at a bunch of things we didn't explicitly train for. This could happen quite suddenly, e.g. because it learns some crucial general skill in relatively few gradient steps, or because it starts learning from something other than gradients that's way faster. The properties of the AI that previously ensured alignment are too brittle and break during this transition.
Note that these can be formulated as entirely distinct scenarios. For example, deception doesn't require a distributional shift nor capability gains; instead, the sudden change in model behavior occurs because the AI was "let out of the box" during deployment. Conversely, in the distributional shift scenario, the model might not be deceptive during training, etc. (One way to think about this is that they rely on changes along different axes of the training/deployment dichotomy).
I don't think we have any empirical examples of deception in AI systems, though there are thought experiments. We do see kind of similar phenomena in interactions between humans, basically whenever someone pretends to have a different goal than they actually do in order to gain influence.
To be clear, here's one thing that is not an example of deception in the sense in which I'm using the word: an AI does things during training that only look good to humans even though they actually aren't, and then continues to do those things in deployment. To me, this seems like a totally different failure mode, but I've also seen this called "deception" (e.g. "Goodhart deception" in this post), thus the clarification.
We do have experimental evidence for goal misgeneralization under distributional shift (the second scenario above). A well-known one is the CoinRun agent from Goal misgeneralization in Deep RL, and more recently, DeepMind published many more examples.
A classic example for sudden capability gains is the history of human evolution. Relatively small changes in the human brain compared to other primates made cultural evolution feasible, which allowed humans to improve from a source other than biological evolutionary pressure. The consequence were extremely quick capability gains for humanity (compared to evolutionary time scales). This example contains both the "threshold mechanism", where a small change to cognitive architectures has big effects, and the "learning from another source mechanism", with the former enabling the latter.
In ML, grokking might be an example for the "threshold mechanism" for sudden capability gains: a comparatively small number of gradient steps can massively improve generalization beyond the training distribution. An example of learning from something other than gradients is in-context learning in language models (e.g. you can give an LM information in the prompt and it can use that information). But for now, this doesn't lead to permanent improvements to the language model.
Relations between these concepts
I used to conflate deception, distributional shift, and sharp left turns as "inner alignment" in a way that I now think wasn't helpful. But on the other hand, these do feel related, so what makes them similar?
One obvious aspect is that these could all lead to very sudden failures (as opposed to a pure "going out with a whimper" scenario). In each case, the AI might behave fine for a while—not just in terms of "looking fine" to human observers, but even under some ideal outer alignment solution. Then something changes, typically quite suddenly, and the AI behaves very differently (and likely badly in a way that would be obvious to us). The reason these scenarios are dangerous is thus that the AI could make high-stakes decisions, to use Paul's framing. I think this is the sense in which they all feel related to inner alignment.
A more interesting (but also more hand-wavy) point is that all three are in some sense about capabilities being more robust than alignment:
- Deception: Just because we let the AI out of the box, it doesn't suddenly become incompetent. That would be quite strange indeed, assuming there is no significant distributional shift! It's also not obvious that alignment would fail. But at least there are plausible arguments for why gradient descent might most easily find models that do well in training but then are competently misaligned once they detect they are deployed. In contrast, there is no reason to think we'd get systems that decide to become incompetent once deployed. Such models are possible solutions to the outer optimization problem—you could have an AI that detects whether it's still in training, and if not just performs random actions, and this AI would get low training loss. But it's not a natural thing for gradient descent to find, it's just unnecessarily complex.
- Distributional shift: The worry is precisely that capabilities will generalize better than goals across the distributional shift. If capabilities didn't generalize, we'd be fine. But as the CoinRun agent examplifies, you can get AIs that capably pursue a different objective after a distributional shift than the one you were hoping for. One difference to deception is that models which become incompetent after a distributional shift are in fact quite plausible. But to the extent that we think we'll get goal misgeneralization specifically, the underlying worry again seems to be that capabilities will be robust while alignment will not.
- Capability gains: Nate's sharp left turn post explicitly discusses how this is about capability generalization without alignment generalization. We are worried that an AI will learn capabilities that robustly transfer to domains it hasn't been explicitly trained on, while alignment properties are brittle.
A note on distributional shift
I've specifically been talking about goal misgeneralization under distributional shift. Here, we mainly care about not generalizing catastrophically, i.e. competently pursuing a misaligned goal. There's a different type of distributional shift that's important in alignment, where we care about actually good generalization, rather than just non-catastrophic one. For example, for ELK we want to generalize from questions that humans can answer to questions we can't answer. But we don't just want to generalize in a way that doesn't kill us if we read the answer, we want to generalize in the way that gives correct answers. This seems distinct from what I've discussed in this post in some important ways, though there are definitely parallels (e.g. the generalization we want not being natural enough that we're guaranteed to get it automatically).
Thanks to Leon Lang for feedback on a draft of this post!
A tiny distributional shift would be enough to let the model determine whether it's in training or deployment, and in principle it could even be deceptive without any expected distributional shift, by defecting randomly.