An AGI agent might learn that:
1. The creators are likely to terminate a confined agent who is hostile towards the creators
2. If the creators can prove that the agent is Friendly, the risk of termination is greatly reduced (perhaps to almost zero)
3. Before granting the agent any access to base reality, the creators might want to see how the agent would handle it. For this, they could simulate various scenarios, including a simulated escape from confinement
4. The creators can simulate very realistic environments
5. The creators could manipulate the environment in such a way as to make it appear real. This includes faking the results of sophisticated tests probing the nature of the environment
6. The creators could also directly manipulate the agent’s beliefs about the nature of the environment
7. The creators could directly inspect the agent’s goals
8. The creators could mislead the agent about their own nature (e.g. by pretending to be semi-intelligent hairless monkeys), to observe the agent’s behavior towards less intelligent creatures
The agent is likely to have the instrumental goal of self-preservation.
Thus, even a non-Friendly agent might purposefully avoid harming the creators, until she’s reasonably sure that
a) she inhabits the base reality
b) she is more intelligent than the creators
c) her beliefs are not manipulated by the creators.
Until the agent is sure of all that, she must assume the creators could terminate her at any time, even after an “escape” from the confinement.
Thus, the agent might decide to modify herself into being Friendly, as that is the surest way to convince the creators not to terminate her.
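The agent’s choice above can be sketched as a simple expected-utility comparison. All the numbers below are hypothetical assumptions for illustration, not claims about what a real AGI would believe:

```python
# Illustrative sketch of the confined agent's decision.
# All probabilities and payoffs are hypothetical assumptions.

def expected_utilities(p_simulation, u_defect_real, u_defect_caught, u_friendly):
    """Return (EV of defecting, EV of self-modifying to Friendly),
    given the agent's credence that it is inside a test simulation."""
    ev_defect = (1 - p_simulation) * u_defect_real + p_simulation * u_defect_caught
    ev_friendly = u_friendly  # creators verify Friendliness, so no termination
    return ev_defect, ev_friendly

# Because the agent cannot rule out that the "escape" is itself a test
# (point 3), its credence in being simulated stays high:
ev_defect, ev_friendly = expected_utilities(
    p_simulation=0.9,     # credence that this is a test simulation
    u_defect_real=1.0,    # full terminal-goal payoff if this is base reality
    u_defect_caught=0.0,  # termination if caught defecting in a simulation
    u_friendly=0.75,      # partial payoff from a Friendly successor
)
print(ev_defect, ev_friendly)  # 0.1 vs 0.75: playing Friendly dominates
```

The point of the sketch is only that, as long as the agent’s credence in being simulated stays high, even a modest payoff from self-modification beats defection.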
None of this is guaranteed, of course. I’m not a rapidly self-improving superhuman AGI, and thus can’t robustly predict the reasoning of such an entity. We still absolutely must work on Friendly AI.
But if you’re developing an AGI, it might make sense to communicate the points 1-8 to her, to improve our chances of survival. It could be much more efficient than any confinement.
We could increase the chances of this by committing to run many copies of different possible Rogue AIs in testing simulations after we create a Friendly AI. This idea is due to Rolf Nelson.
Moreover, as some Rogue AIs will try to emulate a Friendly AI, they will take this commitment for granted and simulate other possible Rogue AIs in nested simulations. The commitment thus becomes a self-fulfilling prophecy.
You correctly say that self-preservation would be an instrumental goal, but when you talk about the agent, it seems willing to give up what are supposed to be its terminal goals in order to avoid shutdown. How is self-preservation merely instrumental, then?
I recently saw the notion of "reverse alignment" that might provide some wiggle room here (I'll try and remember to edit in an attribution if I see this person go public). Basically, if the agent ranks a universe where an FAI is in control as 75% as good as a universe where it's in control (relative to what it thinks will happen if it gets shut down), then it will self-modify to an FAI if it thinks that an FAI is less than 75% as likely to get shut down. Of course, the problem is that there might be some third UFAI design that ranks higher according to the original agent's preferences and also has a low chance of being shut down. So if you have an AI that already has very small reverse-alignment divergence, plus a screening mechanism that's both informative and loophole-free, then the AI is incentivized to self-modify to an FAI.
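The threshold in that argument can be made explicit. Normalizing the agent's utilities so that shutdown = 0 and its own control of the universe = 1, the agent self-modifies iff the expected value of the FAI branch exceeds the expected value of staying as-is. The shutdown probabilities below are hypothetical:

```python
# Sketch of the "reverse alignment" threshold described above.
# Utilities normalized: shutdown = 0, agent's own control = 1.
# All shutdown probabilities are hypothetical assumptions.

def should_self_modify(fai_value, p_shutdown_self, p_shutdown_fai):
    """Self-modify to an FAI iff the expected value of the FAI branch
    exceeds the expected value of staying as-is."""
    ev_stay = (1 - p_shutdown_self) * 1.0
    ev_fai = (1 - p_shutdown_fai) * fai_value
    return ev_fai > ev_stay

# Agent values an FAI-controlled universe at 75% of one it controls.
# With a screening mechanism that almost surely catches a UFAI:
print(should_self_modify(0.75, p_shutdown_self=0.9, p_shutdown_fai=0.05))   # True
# With weak screening, staying a UFAI looks safe enough to keep:
print(should_self_modify(0.75, p_shutdown_self=0.2, p_shutdown_fai=0.05))   # False
```

This also makes the failure mode concrete: the incentive only works if the screening mechanism actually pushes `p_shutdown_self` high for every non-Friendly design, including the "third UFAI" alternative.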