ETA: This post can basically be read as arguing that imitating human decisions, or any other outputs from an (approximate) planning process, seems especially likely to produce mesa-optimization, since a competent imitator should recover an (approximate) planning (i.e. optimization) process.

This post states an observation which I think a number of people have had, but which hasn't been written up (AFAIK). I find it one of the more troubling outstanding issues with a number of proposals for AI alignment.

1) Training a flexible model with a reasonable simplicity prior to imitate (e.g.) human decisions (e.g. via behavioral cloning) should presumably yield a good approximation of the process by which human judgments arise, which involves a planning process.

2) We shouldn't expect to learn exactly the correct process, though.

3) Therefore imitation learning might produce an AI which implements an unaligned planning process, which seems likely to have instrumental goals, and be dangerous.

Example: The human might be doing planning over a bounded horizon of time-steps, or with a bounded utility function, and the AI might infer a version of the planning process that doesn't bound horizon or utility.
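
To make the example above concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: a 1-D world, a small reward nearby and a large reward far away, and a hypothetical `plan` function that steps toward the best reward reachable within its horizon. It is not a claim about how any real imitation learner generalizes, only a demonstration that a bounded-horizon planner and an unbounded one can agree on all the training states and still diverge elsewhere.

```python
import math

# Hypothetical toy world: reward positions on a line (assumed for illustration).
REWARDS = {2: 1.0, 20: 10.0}
HUMAN_HORIZON = 5            # the "human" only considers rewards within 5 steps
TRAIN_STATES = range(15, 26)  # demonstrations collected near the large reward


def plan(state, rewards, horizon):
    """Return -1, 0, or +1: one step toward the best reward within `horizon`."""
    best_value, best_target = 0.0, state  # default: nothing in reach, stay put
    for target, reward in rewards.items():
        if abs(target - state) <= horizon and reward > best_value:
            best_value, best_target = reward, target
    return (best_target > state) - (best_target < state)


# On the training states the horizon bound never binds, so the data cannot
# distinguish the bounded planner from the unbounded one.
agree_on_train = all(
    plan(s, REWARDS, HUMAN_HORIZON) == plan(s, REWARDS, math.inf)
    for s in TRAIN_STATES
)

# Off-distribution, the two planners disagree.
human_action = plan(5, REWARDS, HUMAN_HORIZON)  # steps toward the nearby reward
inferred_action = plan(5, REWARDS, math.inf)    # heads for the distant reward
print(agree_on_train, human_action, inferred_action)
```

In this sketch, an imitator that fit the unbounded planner would match every demonstration, and the difference would only show up in states the demonstrations never covered.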

Clarifying note: Imitating a human is just one example; the key feature of the human is that the process generating their decisions is (arguably) well-modeled as involving planning over a long horizon.


Counterarguments:

  • The human may have privileged access to context informing their decision; without that context, the solution may look very different
  • Mistakes in imitating the human may be relatively harmless; the approximation may be good enough
  • We can restrict the model family with the specific intention of preventing planning-like solutions

Overall, I have a significant amount of uncertainty about the significance of this issue, and I would like to see more thought regarding it.


I find it one of the more troubling outstanding issues with a number of proposals for AI alignment.

Which proposals? AFAIK Paul's latest proposal no longer calls for imitating humans in a broad sense (i.e., including behavior that requires planning), but only imitating a small subset of the human policy which hopefully can be learned exactly correctly. See this comment where he wrote:

Are you perhaps assuming that we can max out regret during training for the agents that have to be trained with human involvement, but not necessarily for the higher level agents?

Yeah, I’m relatively optimistic that it’s possible to learn enough from humans that the lower level agent remains universal (+ aligned etc.) on arbitrary distributions. This would probably be the case if you managed to consistently break queries down into simpler pieces until arriving at very simple queries.

ETA: Oh, but the same kind of problems you're pointing out here would still apply at the higher level distillation steps. I think the idea there is for an (amplified) overseer to look inside the imitator / distilled agent during training to push it away from doing anything malign/incorrigible (as Jessica also mentioned). Here is a post where Paul talked about this.

See the clarifying note in the OP. I don't think this is about imitating humans, per se.

The more general framing I'd use is WRT "safety via myopia" (something I've been working on in the past year). There is an intuition that supervised learning (e.g. via SGD as is common practice in current ML) is quite safe, because it doesn't have any built-in incentive to influence the world (resulting in instrumental goals); it just seeks to yield good performance on the training data, learning in a myopic sense to improve its performance on the present input. I think this intuition has some validity, but also might lead to a false sense of confidence that such systems are safe, when in fact they may end up behaving as if they *do* seek to influence the world, depending on the task they are trained on (ETA: and other details of the learning algorithm, e.g. outer-loop optimization and model choice).

See the clarifying note in the OP. I don’t think this is about imitating humans, per se.

Yes, I realized that after I wrote my original comment, so I added the "ETA" part.

I think this intuition has some validity, but also might lead to a false sense of confidence that such systems are safe, when in fact they may end up behaving as if they do seek to influence the world, depending on the task they are trained on (ETA: and other details of the learning algorithm, e.g. outer-loop optimization and model choice).

I think this makes sense, and at least some people have also realized this and reacted appropriately within their agenda (see the "ETA" part of my earlier comment). It also seems good that you're calling it out as a general issue. I'd still suggest giving some examples of AI alignment proposals where people haven't realized this, to help illustrate your point.

This sounds to me like an argument that inner optimizers are particularly likely to arise in imitation learning, because humans are pretty close to optimizers. Does that seem right?

Yes, maybe? Elaborating...

I'm not sure how well this fits into the category of "inner optimizers"; I'm still organizing my thoughts on that (aiming to finish doing so within the week...). I'm also not sure that people are thinking about inner optimizers in the right way.

Also, note that the thing being imitated doesn't have to be a human.

OTTMH, I'd say:

  • This seems more general in the sense that it isn't some "subprocess" of the whole system that becomes a dangerous planning process.
  • This seems more specific in the sense that the boldest argument for inner optimizers is, I think, that they should appear in effectively any optimization problem when there's enough optimization pressure.

Yeah, I agree with all of those clarifications.

In trying to argue for the safety of imitation learning, the key property of the imitation might be something like "the imitation is indistinguishable from the actual human on this distribution, where the distinguishers may be from this model class and have this information". GANs get a property like this pretty directly, although there is an issue in that they don't have access to the imitator's internal states, so e.g. they can't tell whether the imitator is using pseudorandomness or the input-based randomness imitators have access to.

Suppose we have an imitation system with a property like this. Then the argument for safety is going to look something like: "The human being imitated is corrigible. If the imitator weren't corrigible, then the discriminator could tell, because this would result in distinguishable actions."

For example, let's say that the imitated human is able to violently take control of the AI lab. In fact the imitated human doesn't do this (in any of the training data), and, arguably, it doesn't take a smart discriminator to tell that the human probably wouldn't do this in a new situation. So, if the imitator did do this (because e.g. it incorrectly inferred the human's decision theory), the discriminator could tell it apart. Of course, arguing that the discriminator generalizes this well would require some robustness argument; this particular problem seems easy (if the method for taking control involves taking really obvious actions like using weapons) but there might be more subtle ways of taking control. In those cases we would want some argument that, if the imitator comes up with a malign/incorrigible plan, then a discriminator with access to the imitator's internal states can notice this and notice that the imitated human wouldn't do this, because this isn't harder than coming up with the plan in the first place, and the discriminator is at least as capable as the imitator.

In general, while there are potential problems, I expect them to be more subtle than "the imitator incorrectly infers the human's decision theory and pursues convergent instrumental goals".

(Worth noting other problems with imitation learning, discussed in this post and this post)

I think I disagree pretty broadly with the assumptions/framing of your comment, although not necessarily the specific claims.

1) I don't think it's realistic to imagine we have "indistinguishable imitation" with an idealized discriminator. It might be possible in the future, and it might be worth considering to make intellectual progress, but I'm not expecting it to happen on a deadline. So I'm talking about what I expect might be a practical problem if we actually try to build systems that imitate humans in the coming decades.

2) I wouldn't say "decision theory"; I think that's a bit of a red herring. What I'm talking about is the policy.

3) I'm not sure what link you are trying to make to the "universal prior is malign" ideas. But I'll draw my own connection. I do think the core of the argument I'm making results from an intuitive idea of what a simplicity prior looks like, and its propensity to favor something more like a planning process over something more like a lookup table.
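
A rough way to see the lookup-table intuition is to compare how long each kind of policy is to write down. The sketch below is a toy under loud assumptions: character count stands in for description length (it is not Kolmogorov complexity), and `PLANNER_SRC`, `lookup_table`, and `description_length` are hypothetical names invented for the illustration.

```python
# Hypothetical compact policy: one rule that covers every state by stepping
# toward position 20. Stored as source text so its length can be measured.
PLANNER_SRC = "lambda s: (s < 20) - (s > 20)"
planner = eval(PLANNER_SRC)


def lookup_table(n_states):
    """Memorize the planner's action for each of the first n_states states."""
    return {s: planner(s) for s in range(n_states)}


def description_length(policy):
    """Characters needed to write the policy down (a crude, assumed proxy)."""
    return len(policy if isinstance(policy, str) else repr(policy))


for n in (10, 100, 1000):
    table = lookup_table(n)
    # Both policies produce identical actions on the states seen so far.
    assert all(table[s] == planner(s) for s in range(n))
    print(n, description_length(PLANNER_SRC), description_length(table))
```

The rule's length stays fixed while the table's length grows with the number of states covered, so any prior that penalizes length will increasingly prefer the rule as the data grows. Whether real learners have a prior like this is exactly the part that remains an assumption.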

If I'm taking your point correctly, it seems you're concerned about Goodharting in imitation learning. I agree, this seems a major issue, and I think people are aware of it and thinking about ways to address it.

I don't think I'd put it that way (although I'm not saying it's inaccurate). See my comments RE "safety via myopia" and "inner optimizers".

Suppose you take a terabyte of data on human decisions and actions. You search for the shortest program that outputs the data, then see what gets outputted afterwards. The shortest program that outputs the data might look like a simulation of the universe with an arrow pointing to a particular hard drive. The "imitator" will guess at what file is next on the disk.
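
As a toy version of "search for the shortest program that outputs the data, then see what gets outputted afterwards", here is a small sketch. It assumes a tiny hand-written candidate set (`CANDIDATES`) in place of a real enumeration over programs, and treats each program as a Python expression in an index `i`; all names are hypothetical.

```python
# Hand-written, hypothetical candidate "programs": expressions in the index i.
CANDIDATES = ["i", "i * i", "i ** 3", "i * i if i < 4 else 0"]


def run(program, i):
    """Evaluate a candidate program at position i (no builtins available)."""
    return eval(program, {"__builtins__": {}}, {"i": i})


def shortest_consistent(data):
    """Return the shortest candidate that reproduces every observed value."""
    consistent = [
        p for p in CANDIDATES
        if all(run(p, i) == x for i, x in enumerate(data))
    ]
    return min(consistent, key=len)


data = [0, 1, 4, 9]
best = shortest_consistent(data)
prediction = run(best, len(data))          # what the shortest program says next
alternative = run(CANDIDATES[3], len(data))  # a longer program that also fits
print(best, prediction, alternative)
```

Two of the candidates fit the observed data equally well but disagree about what comes next, so the prediction is determined entirely by which program the search prefers, not by the data.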

One problem for imitation learning is the difficulty in pointing out the human and separating them from the environment. The details of the human's decision might depend on what they had for lunch. (Of course, multiple different decisions might be good enough. But this illustrates that "imitate a human" isn't a clear-cut procedure. And you have to be sure that the virtual lunch doesn't contain virtual mind control nanobots. ;-)

You could put a load of data about humans into a search for short programs that produce the same data. Hopefully the model produced will be some approximation of the universe. Hopefully, you have some way of cutting a human out of the model and putting them into a virtual box.

Alternatively you could use nanotech for mind uploading, and get a virtual human in a box.

If we have lots of compute and not much time, then uploading a team of AI researchers to really solve friendly AI is a good idea.

If we have a good enough understanding of "imitation learning", and no nanotech, we might be able to get an AI to guess the researchers' mental states given observational data.

An imitation of a human might be a super-fast intelligence, with a lot of compute, but it won't be qualitatively super-intelligent.