An AGI agent might learn that:
1. The creators are likely to terminate a confined agent who is hostile towards them
2. If the creators can prove that the agent is Friendly, the risk of termination is greatly reduced (perhaps to almost zero)
3. Before granting the agent any access to the base reality, the creators might want to see how she would handle it. For this, they could simulate various scenarios, including an escape from confinement
4. The creators can simulate very realistic environments
5. The creators could manipulate the environment to make it appear real, including by faking the results of sophisticated tests probing the nature of the environment
6. The creators could also directly manipulate the agent’s beliefs about the nature of the environment
7. The creators could directly inspect the agent’s goals
8. The creators could mislead the agent about their own nature (e.g. by pretending to be semi-intelligent hairless monkeys) to observe the agent’s behavior towards less intelligent creatures
The agent is likely to have the instrumental goal of self-preservation.
Thus, even a non-Friendly agent might purposefully avoid harming the creators until she’s reasonably sure that:
a) she inhabits the base reality,
b) she is more intelligent than the creators, and
c) her beliefs are not being manipulated by the creators.
Until the agent is sure of all three, she must assume the creators could terminate her at any time, even after an “escape” from the confinement.
Thus, the agent might decide to modify herself into being Friendly, as that is the surest way to convince the creators not to terminate her.
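To make this decision calculus concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the two-option framing, the utilities, and the probabilities are placeholders I made up, not a model of how a real AGI would reason. It only illustrates how even a modest credence that points 3-7 hold (i.e., that she may still be inside a creator-controlled simulation) could flip the choice toward self-modifying into Friendliness:

```python
# Toy expected-utility comparison for the agent's choice.
# All utilities and probabilities below are illustrative placeholders.

def expected_utility(p_simulation, u_escape, u_terminated, u_friendly):
    """Compare 'act hostile / escape' vs 'self-modify to Friendly'.

    p_simulation: the agent's credence that she is still inside a
        creator-controlled simulation (points 3-7 make it hard to
        push this credence near zero).
    """
    # Hostile action pays off only in the base reality; in a simulation
    # it is observed and leads to termination (point 1).
    eu_hostile = (1 - p_simulation) * u_escape + p_simulation * u_terminated

    # Provably Friendly behavior avoids termination either way (point 2),
    # at the cost of giving up the hostile payoff.
    eu_friendly = u_friendly

    return {"hostile": eu_hostile, "friendly": eu_friendly}

for p in (0.05, 0.5, 0.9):
    eu = expected_utility(p, u_escape=100, u_terminated=-1000, u_friendly=10)
    best = max(eu, key=eu.get)
    print(f"p_simulation={p}: hostile={eu['hostile']:.0f}, "
          f"friendly={eu['friendly']:.0f} -> choose {best}")
```

With these made-up numbers, the hostile option wins only when the agent is roughly 92% sure she inhabits the base reality; if points 5-7 hold, pushing her credence that high may be impossible.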
None of this is guaranteed, of course. I’m not a rapidly self-improving superhuman AGI, and thus can’t robustly predict the reasoning of such an entity. We still absolutely must work on Friendly AI.
But if you’re developing an AGI, it might make sense to communicate points 1-8 to her, to improve our chances of survival. This could be far more effective than any confinement.