ETA: This post can basically be read as arguing that imitating human decisions, or any other outputs from an (approximate) planning process, seems especially likely to produce mesa-optimization, since a competent imitator should recover an (approximate) planning (i.e. optimization) process.
This post states an observation that I think a number of people have made, but that hasn't been written up (AFAIK). I find it one of the more troubling outstanding issues with a number of proposals for AI alignment.
1) Training a flexible model with a reasonable simplicity prior to imitate (e.g.) human decisions (e.g. via behavioral cloning) should presumably yield a good approximation of the process by which human judgments arise, which involves a planning process.
2) We shouldn't expect to learn exactly the correct process, though.
3) Therefore, imitation learning might produce an AI that implements an unaligned planning process, which seems likely to have instrumental goals and to be dangerous.
Example: The human might be doing planning over a bounded horizon of time-steps, or with a bounded utility function, and the AI might infer a version of the planning process that doesn't bound horizon or utility.
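To make the horizon example concrete, here is a minimal toy sketch (not a claim about how any real imitation learner behaves). Everything in it is an assumption made for illustration: the 11-state line world, the reward values, and the names `plan`, `HUMAN_HORIZON`, `LEARNED_HORIZON`, and `DEMO_STATES`. It compares a planner with a short horizon (standing in for the human) against one with a much longer horizon (standing in for what the imitator might infer), and checks whether the two can be told apart from the demonstrations alone.

```python
from functools import lru_cache

# Hypothetical toy world: positions 0..10 on a line.
N_STATES = 11
REWARD = {2: 1.0, 9: 10.0}   # small nearby payoff, large distant payoff
ACTIONS = (-1, 0, 1)         # left, stay, right


def step(state, action):
    """Move along the line, clamping at both ends."""
    return min(max(state + action, 0), N_STATES - 1)


@lru_cache(maxsize=None)
def value(state, horizon):
    """Best total reward obtainable from `state` in `horizon` more steps."""
    if horizon == 0:
        return 0.0
    return max(
        REWARD.get(step(state, a), 0.0) + value(step(state, a), horizon - 1)
        for a in ACTIONS
    )


def plan(state, horizon):
    """First action chosen by a planner that looks `horizon` steps ahead."""
    def score(action):
        nxt = step(state, action)
        return REWARD.get(nxt, 0.0) + value(nxt, horizon - 1)
    return max(ACTIONS, key=score)


HUMAN_HORIZON = 3       # assumed bounded horizon of the demonstrator
LEARNED_HORIZON = 20    # assumed longer horizon inferred by the imitator
DEMO_STATES = [0, 1, 6, 7, 8, 9, 10]   # states where demonstrations exist

# Does the long-horizon planner reproduce every demonstrated decision?
fits_demos = all(
    plan(s, HUMAN_HORIZON) == plan(s, LEARNED_HORIZON) for s in DEMO_STATES
)

# States where the two planners would act differently.
disagreements = [
    s for s in range(N_STATES)
    if plan(s, HUMAN_HORIZON) != plan(s, LEARNED_HORIZON)
]

print("long-horizon planner fits all demos:", fits_demos)
print("states where the planners diverge:", disagreements)
```

In this toy setup the long-horizon planner matches every demonstrated decision, yet it acts differently on the undemonstrated states between the two rewards: the short-horizon planner settles for the nearby payoff while the long-horizon one heads for the distant one. The demonstrations alone do not pin down the horizon, which is the sense in which the learned planning process can be subtly wrong.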
Clarifying note: Imitating a human is just one example; the key feature of the human is that the process generating their decisions is (arguably) well-modeled as involving planning over a long horizon.
Counterarguments:
- The human may have privileged access to context informing their decision; without that context, the solution may look very different
- Mistakes in imitating the human may be relatively harmless; the approximation may be good enough
- We can restrict the model family with the specific intention of preventing planning-like solutions
Overall, I am quite uncertain about how significant this issue is, and I would like to see more thought on it.