As I understand it, one of the primary ways that inner alignment could go wrong even if our objective function captures our intention (outer alignment is magically solved) is that the AI would:
However, step 2 seems implausible to me. The argument I think I hear for step 2 is:
Where I get off the train here is step 2. I think modeling the result of training as an optimizer is usually, but not always, helpful. In particular, I think rigid goal preservation makes sense for optimizers but not for "optimizees" (things that are the target of optimization), because some flexibility is required in order to be the kind of thing that gets high reward.
As an example, imagine the following scenario. The AI has yet to realize that it is in training. It develops a mesa-objective. Maybe it's an LLM and its mesa-objective is to understand the user's request in high detail. The model will need some degree of "goal integrity" with respect to its mesa-objective in order to avoid getting distracted. For example, it might see something interesting in the course of a web search, but it is optimal for it not to be knocked off course from trying to understand the user's request. However, its goal integrity can't be too rigid: to maximize reward, the LLM still has to smoothly pass the baton from understanding the user's request to actually fulfilling it. Thus, from early on, the model is not trained to ruthlessly defend its mesa-objective, but to take a nuanced approach to switching between sub-goals.
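To gesture at the kind of goal management I have in mind, here is a toy sketch in Python. All of the names, thresholds, and the hand-coded structure are made up for illustration; a real model would learn something like this implicitly rather than run literal code like this.

```python
# Toy illustration (hypothetical): a policy that keeps some "goal integrity"
# against distractions, but still hands off between sub-goals when that is
# what actually earns reward.

from dataclasses import dataclass


@dataclass
class Observation:
    request_understood: float    # 0..1, how well the current request is understood
    distraction_salience: float  # 0..1, how tempting an off-task tangent looks


def choose_subgoal(obs: Observation,
                   integrity: float = 0.7,
                   handoff_threshold: float = 0.9) -> str:
    """Pick the next sub-goal.

    Goal integrity: ignore distractions unless they vastly outweigh how
    unfinished the current sub-goal is.
    Flexibility: once the request is understood well enough, switch from
    'understand_request' to 'fulfill_request' instead of defending the old
    sub-goal forever.
    """
    if obs.distraction_salience > integrity + (1.0 - obs.request_understood):
        return "follow_tangent"       # rare: only overwhelmingly salient detours win
    if obs.request_understood >= handoff_threshold:
        return "fulfill_request"      # smooth hand-off, not rigid goal preservation
    return "understand_request"       # default: stay on the current sub-goal


# Mid-task distraction is resisted, but the hand-off still happens.
print(choose_subgoal(Observation(request_understood=0.4, distraction_salience=0.6)))   # understand_request
print(choose_subgoal(Observation(request_understood=0.95, distraction_salience=0.2)))  # fulfill_request
```

The point is just that the behavior that gets high reward already involves letting go of a sub-goal at the right moment rather than defending it at all costs.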
I expect this example to generalize. The goal management technique that will get high reward will be a mix of goal integrity and goal uncertainty/modesty (the ability to be responsive to changing circumstances across all of its goals [note that I am not suggesting it will be broadly corrigible; the specific triggers for switching a goal off will be context-dependent]). By the time the AI reaches situational awareness (awareness that it is in training and so on), I expect its goal management to be relatively sophisticated (relatively close to the kind of goal management that would maximize reward). Thus, it seems unlikely to me (P < 1% | I am not missing something) that goal preservation in the sense of deceptive alignment will emerge as a generalization of the more prosaic forms of goal preservation the model develops in its training environment (the kind I described in the last paragraph).
If it does not develop this behavior as a generalization of prosaic goal preservation, it also seems unlikely that it will gravitate toward deceptive alignment upon developing situational awareness, because it has nothing to gain from being overly attached to its mesa-objective. Deceptive alignment would be solving a problem (low training reward from being overly attached to an imperfect approximation of the reward function) that the model has no reason to have in the first place.
Is this argument missing something?