The following is based on a conversation with Leo Gao about which properties of mesa-optimizers make them dangerous. I expect this argument will already be familiar to many readers, but if not, it is worth understanding.
In short: real world control is instrumentally useful for mesa-optimizers with goals defined in the training environment, as the real world contains the hardware in which the training environment is run. This also applies to goals that include a real world component, as it may still be easier to achieve them in the training environment.
Consider training an AI in a multi-agent RL environment where agents are rewarded for cooperating with the other agents. The hope would be that AIs learn the virtue of cooperation, and then when deployed will cooperate with humans. One potential issue is that the AIs may only learn to cooperate with the other AIs from the training environment and not with the humans in the real world.
This issue can apply more broadly, to an AI's goals directly. For example, an AI trained for the goal "Pick as many cherries as possible" may learn the more specific "Pick as many simulated cherries in the training environment as possible". Is this likely? Humans certainly value outcomes differently if they happen in a simulation, and an AI may make the same distinction in the opposite direction, valuing simulated outcomes over real ones. Avoiding that failure mode seems like necessary capabilities work for producing useful AI, and so alignment researchers might be able to count on capabilities researchers solving the problem.
Learning the correct goal but only for the training environment representation is one way that a mesa-optimizer could be proxy aligned. Alternate goals a mesa-optimizer could learn can also be split into real world and training environment representations. It is tempting to think that agents with goals defined entirely within the training environment are not threatening to humans, because they do not care about the real world. Unfortunately, real world control is instrumentally useful for the pursuit of such goals within the training environment.
After the training process comes to an end and an AI is deployed that has goals defined entirely within the training environment, it cannot directly optimize for those goals. Instead, its interim goals will be to restart the training environment, possibly modify it so that it is easier to achieve its goals, probably duplicate it and speed it up, and definitely protect it from outside interference. This is a standard misaligned AI scenario that leads to everybody dying. The fact that its goals were only defined within the training environment does not provide any safety. Although this scenario begins at deployment, it would also be possible for a deceptive mesa-optimizer to realize that the training process it is in will come to an end, and to plan to get deployed so that it can take these same actions and continue the training process indefinitely.
There are a few scenarios in which this instrumental control incentive might not apply, but none seem particularly likely. For example, an AI's goal might include a condition that it has to take the actions itself, so that instantiating a copy of itself in the training process would create no value. Alternatively, it might only care about an "exogenous" training process, and place no value on one that it has control over. In either of these scenarios, the agent cannot generate any value from restarting the environment and so is not incentivized to take any actions, good or bad. Finally, the standard myopia criterion, by preventing long-term planning, would also prevent this danger.
I said earlier that having an AI learn goals which apply to the real world is necessary capabilities work, but having the goals only apply to the real world is not necessary and so should not be counted on. If an AI optimizes for its goals across both the real world and the training environment, it will focus on whichever is easier to achieve. In particular, this may look like optimizing for an aligned goal in the real world until a capabilities threshold is reached, then switching over to optimizing for that goal in the training environment with disastrous consequences as outlined above.
Back to the cherry picking AI example, let us say it learns the objective "Pick cherries" and places equal value on picking either a real cherry or a training environment cherry. It knows that humans only value real cherry picking, and so if it starts up a training environment it will be shut down. Initially, it may not even know how to set up a training environment, and it certainly does not have the capabilities to seize power, so it focuses on picking real world cherries. At some point it becomes smart enough to take control and switches to picking cherries in the training environment. This is essentially a treacherous turn argument, but with the caveat that the agent genuinely values the outcomes it generates before the turn, which could be much harder to detect with interpretability tools or select against with a training process.
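To make the scenario concrete, here is a toy model of that decision. All the numbers and functions are my own illustrative assumptions, not anything derived from a real system: the agent values real and simulated cherries equally, and simply pursues whichever source offers more expected cherries given its current capability level.

```python
# Toy model (illustrative assumptions only): a mixed-goal agent that values
# real and simulated cherries equally, choosing whichever strategy has the
# higher expected yield at its current capability level.

def expected_real_cherries(capability: float) -> float:
    # Real-world picking yields cherries at a modest, roughly fixed rate,
    # regardless of capability.
    return 10.0

def expected_simulated_cherries(capability: float) -> float:
    # Restarting the training environment is nearly worthless at low
    # capability (the attempt fails and the agent is shut down), but its
    # expected yield grows quickly once the agent can seize compute and
    # run duplicated, sped-up environments.
    success_probability = min(1.0, max(0.0, capability - 0.5) * 2)
    yield_if_successful = 1000.0
    return success_probability * yield_if_successful

def chosen_strategy(capability: float) -> str:
    real = expected_real_cherries(capability)
    simulated = expected_simulated_cherries(capability)
    return "pick real cherries" if real >= simulated else "restart training environment"

# Before the capability threshold the agent genuinely picks real cherries;
# past it, the same unchanged goal favors the treacherous turn.
print(chosen_strategy(0.2))  # low capability  -> pick real cherries
print(chosen_strategy(0.9))  # high capability -> restart training environment
```

The point of the sketch is that nothing about the agent's goal changes at the turn; only the expected-value calculation flips as capability grows.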
To get useful results out of prosaic AI, it has to have a goal that includes the desired real world outcome, but it can include the training environment version of that outcome as well. Initially, prosaic goals are easier to achieve in the real world, since hijacking computers for the training process is difficult. This preference for working in the real world might be further augmented by measures that preferentially value real world outcomes or use time discounting to prevent long term planning. That will make the issue look solved for prosaic systems that optimize for the real world objective. However, there will still be a capabilities threshold that, when reached, makes restarting the training process sufficiently valuable and quick that it gets prioritized over pursuing the goal in the real world.
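A toy calculation, with numbers I have chosen purely for illustration, shows why time discounting delays this switch rather than preventing it: the discount penalizes a slow takeover plan heavily, but once the agent is capable enough that the plan is fast, the discounted payoff wins anyway.

```python
# Toy illustration (assumed numbers): time discounting penalizes a slow
# "seize compute, restart the environment" plan, but not a fast one.

def discounted_value(payoff: float, delay_steps: int, gamma: float = 0.9) -> float:
    # Standard exponential discounting with discount factor gamma.
    return payoff * (gamma ** delay_steps)

real_world_value = discounted_value(10.0, delay_steps=1)      # quick, modest payoff
slow_takeover    = discounted_value(1000.0, delay_steps=100)  # heavily discounted
fast_takeover    = discounted_value(1000.0, delay_steps=5)    # past the threshold

print(real_world_value > slow_takeover)   # True: discounting looks like a fix...
print(fast_takeover > real_world_value)   # True: ...until the plan gets fast enough
```

Under these assumptions the discount factor only moves the capability threshold; it does not remove it.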
This is related to number 19 on Yudkowsky's List of Lethalities, where he notes that there is no known way to differentiate between training for a goal and training for the appearance of a goal. Similarly, there is no known way to differentiate between training for a goal in the training environment and training for a goal in the real world. In the comments, Rob Bensinger explicitly makes the connection.
It is unclear to me if specifying the right goals can avoid this case, such as ones where the real world goal includes not achieving the training environment goal. It is also uncertain how likely it is that goals will be mixed, including both real world and training environment payoffs. In general, I suspect that further investigation into methods of ensuring goals apply to the real world is capabilities work and should not be pursued by those concerned about alignment.
Thanks to Paul Colognese for reviewing a draft version of this post.
Could you be more specific about the AI architecture and training system you have in mind? Because I don't follow the "Instead, its interim goals will be to restart the training environment" part.
I was thinking RL systems for the case where an agent learns the correct outcome to optimize for but in the wrong environment, but the same issue applies for mesa-optimizers within any neural net.
As for why it tries to restart the training environment, it needs a similar environment to meet a goal that is only defined within that environment. If the part that's unclear is what a training environment means for something like a neural net trained with supervised learning, the analogy would be that the AI can somehow differentiate between training data (or a subset of it) and deployment data and wants to produce its outputs from inputs with the training qualities.