As I understand it, one of the primary ways that inner alignment could go wrong even if our objective function captures our intention (outer alignment is magically solved) is that the AI would:
However, step 2 seems implausible to me. The argument I think I hear for step 2 is:
Where I get off the train here is step 2. I think modeling the result of training as an optimizer is usually, but not always, helpful. In particular, I think rigid goal preservation makes sense for optimizers but not for "optimizees" (things that are the target of optimization), because some flexibility is required in order to be the kind of thing that gets high reward.
As an example, imagine the following scenario. The AI has yet to realize that it is in training. It develops a mesa-objective. Maybe it's an LLM and its mesa-objective is to understand the user's request in high detail. The model will need some degree of "goal integrity" with respect to its mesa-objective in order to avoid getting distracted. For example, it might see something interesting in the course of a web search, but it is optimal for it not to be knocked off course from trying to understand the user's request. However, its goal integrity can't be too rigid: to maximize reward, the LLM still has to smoothly pass the baton from understanding the user's request to actually fulfilling it. Thus, from early on, the model is not trained to ruthlessly defend its mesa-objective, but to take a nuanced approach to switching between sub-goals.
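To gesture at the kind of goal management I have in mind, here is a toy sketch in Python. All of the names, thresholds, and the hand-coded structure are made up for illustration; a real model would learn something like this implicitly rather than run literal code like this.

```python
# Toy illustration (hypothetical): a policy that keeps some "goal integrity"
# against distractions, but still hands off between sub-goals when that is
# what actually earns reward.

from dataclasses import dataclass


@dataclass
class Observation:
    request_understood: float    # 0..1, how well the current request is understood
    distraction_salience: float  # 0..1, how tempting an off-task tangent looks


def choose_subgoal(obs: Observation,
                   integrity: float = 0.7,
                   handoff_threshold: float = 0.9) -> str:
    """Pick the next sub-goal.

    Goal integrity: ignore distractions unless they vastly outweigh how
    unfinished the current sub-goal is.
    Flexibility: once the request is understood well enough, switch from
    'understand_request' to 'fulfill_request' instead of defending the old
    sub-goal forever.
    """
    if obs.distraction_salience > integrity + (1.0 - obs.request_understood):
        return "follow_tangent"       # rare: only overwhelmingly salient detours win
    if obs.request_understood >= handoff_threshold:
        return "fulfill_request"      # smooth hand-off, not rigid goal preservation
    return "understand_request"       # default: stay on the current sub-goal


# Mid-task distraction is resisted, but the hand-off still happens.
print(choose_subgoal(Observation(request_understood=0.4, distraction_salience=0.6)))   # understand_request
print(choose_subgoal(Observation(request_understood=0.95, distraction_salience=0.2)))  # fulfill_request
```

The point is just that the behavior that gets high reward already involves letting go of a sub-goal at the right moment rather than defending it at all costs.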
I expect this example to generalize. The goal management technique that will get high reward will be a mix of goal integrity and goal uncertainty/modesty (the ability to be responsive to changing circumstances across all of its goals [note that I am not suggesting it will be broadly corrigible; the specific triggers for switching a goal off will be context-dependent]). By the time the AI reaches situational awareness (awareness that it is in training and so on), I expect its goal management to be relatively sophisticated (relatively close to the kind of goal management that would maximize reward). Thus, it seems unlikely to me (P < 1% | I am not missing something) that goal preservation in the sense of deceptive alignment will emerge as a generalization of the more prosaic forms of goal preservation the model develops in its training environment (the kind I described in the last paragraph).
If it does not develop this behavior as a generalization of prosaic goal preservation, it also seems unlikely that it will gravitate toward deceptive alignment upon developing situational awareness, because it has nothing to gain from being overly attached to its mesa-objective. Deceptive alignment would be solving a problem (low training reward from being overly attached to an imperfect approximation of the reward function) that the model has no reason to have in the first place.
Is this argument missing something?