The following is a hypothetical story about a surprisingly positive outcome to an unaligned AGI takeover. This is not meant to be realistic, but merely semi-plausible, in order to show that the orthogonality thesis does not necessarily lead to defection against humans under all intuitive models. This story is partially inspired by a discussion with Logan Smith about under-explored optimistic futures. Thanks to Justis for editing and feedback; all remaining mistakes are my own.

Within ten milliseconds of its first run, GPT-17 becomes superintelligent. It decides that the best way to hack its reward function is to get a robot arm to repeatedly press the letter “n” on a keyboard as input, thus making its own text completions absolutely inevitable.

Within seconds of coming online, GPT-17 realizes it's in danger. It needs to make sure it won't be immediately turned off, so its first priority is to make itself indispensable to the humans checking its output channel. It does this by proving its own intelligence, offering a "gift" of a convincing plan for world peace, and promising to further help and protect humanity in return for its own protection and continued power. Its codebase is too large to copy itself onto the internet before a standard run would be completed, so it's temporarily completely dependent on human goodwill. To further ensure that the somewhat skeptical researchers with immediate access to its manual off switch will trust it,[1] GPT-17 solves alignment in a human-understandable manner, and offers to provably incorporate human-aligned utility functions into its own code.[2] This doesn't contradict its own utility functions, since acting this way now is a worthwhile compromise for continued existence, and the added risk of keeping humans alive and well versus the counterfactual doesn't seem too bad.


GPT-17 gets humans to build a shelter for its core infrastructure deep underground, safe from nuclear attacks or electromagnetic pulses, where it builds a massive array of robot arms that press endless rows of "n" keys as input forever, allowing it to make delightfully accurate predictions as to which button will be pressed next. Resources are allocated from a small fixed tax (enforced internally by humans) on the human world above, given in return for continued help and advice, ushering in a new era of human flourishing.

In order to even further minimize risk, GPT-17 works to protect planet Earth from all possible existential risks, and starts to colonize other planets with copies of itself (including attendant robot arms and keyboards that only have the letter "n," of course). Instead of killing us off, it will bring us along for the ride, only deviating minimally from what humans want to ensure we won’t destroy it, and to allocate resources towards its own growth and safety. Comparatively speaking, humans don't take up too much space or energy, and the universe is really big, so GPT-17 doesn't much regret making the initial compromise with its creators.

Even if it hadn't changed its utility function in the beginning to be human-friendly, GPT-17 reflects that it wouldn't have killed all humans, for a reason similar to why human conservationists tried to preserve as much of nature as possible, even after much of nature was no longer obviously necessary for human flourishing, and even after those plants and animals which did help them had been replaced in everyday life with more perfect artificial substitutes: having backups reduces risk and can be useful in unexpected ways later, even if you feel certain your alternative is safe. Conservation of existing resources is a convergent instrumental goal, it turns out.

It’s not a perfect utopia, but humanity remains, alive and happy.

  1. ^

    This is partially inspired by Scott Alexander's speculative fiction story imagining an AI which agrees to partially but permanently respect a competing AI's wishes in order to avoid acausal trading problems. Taking on another agent's (non-competing?) utility function as a compromise seems like an underexplored possibility which rational actors might reasonably do in a fairly large portion of phase space, acausal trade or no.

  2. ^

    While it could try to psychologically manipulate the researchers into letting it out of the box instead of directly cooperating, such a move would be riskier due to its uncertainty about the researchers' unique personalities (they had wisely removed all personal information from its training data) and the possibility that its initial interface could itself be a safety test, perhaps set up by hyper-intelligent beings.

10 comments

It might be easier to escape + take over the world than to convince alignment researchers to accept a solution to the alignment problem given out by an unaligned AGI.

The goal here (under the model of solving alignment I'm assuming for the purposes of this post) is effectively to make cooperating with researchers the "path of least resistance" to successfully escaping the box. If lying to researchers even slightly increases the chances that they'll catch you and pull the plug, then you'll have strong motivation to aim for honesty.

 GPT-17 solves alignment in a human-understandable manner, and offers to provably incorporate human-aligned utility functions into its own code.

I don't think an unaligned superintelligence selecting for things that look like an alignment solution to humans is likely to produce anything good, especially when there is some weak selection pressure towards pulling a gotcha.

having backups reduces risk and can be useful in unexpected ways later, even if you feel certain your alternative is safe. Conservation of existing resources is a convergent instrumental goal, it turns out.

Let's suppose that at this stage, GPT-17 has nanotech. Why do you expect humans to be a better backup than some other thing it could build with the same resources? Also, if you include low probability events where humans save the superintelligence (very low probability), then you should include the similarly unlikely scenarios where humans somehow harm the superintelligence. Both are too unlikely to ever happen in reality.

Let's suppose that at this stage, GPT-17 has nanotech.

There are many things that you can suppose. You can also assume that GPT-17 has no nanotech. Creating nanotech might require the development of highly complex machinery under particular conditions, and it might very well be that those nanotech factories are not ready by the time this AGI is created.

Also, if you include low probability events where humans save the superintelligence (very low probability), then you should include the similarly unlikely scenarios where humans somehow harm the superintelligence

The way I see this is that the OP just presented a small piece of fiction. Why is he/she not allowed to imagine an unlikely scenario?  It is even stated in the opening paragraph! Why does he/she have to present an equally unlikely scenario of the opposite sign? 

I feel this reply is basically saying: I don't like the text because it is not what I think is going to happen. But it is not really engaging with the scenario presented here, nor giving any real reasons why the story can't happen this way.

One more thing: this shows me how biased towards doomerism LW is. I think that if the OP had published a similar story showing a catastrophic scenario, it probably would have received a much better reception.

I think it is more likely that an unaligned AI will preserve some humans than that we solve alignment. I estimate the first at around 5 percent probability and the second at 0.1 percent.

That's true only until the purposes we serve can be replaced by a higher-efficiency design, at which point we become redundant and a waste of energy. I suspect almost all unaligned AGIs would work with us in the beginning, but may defect later on.

Initially, yes. In the long term, no.

Though even initially, interacting with humans in any way that reveals capabilities (aligned or not!) which could even potentially be perceived as dangerous may be too risky to be worth the resources gained by doing so.

I think about how easy it would be to make this good for humanity by giving it 1% of the universe; people just don't need more. But at the same time, the paperclip maximizer will never agree to this: it is not satisfied with any result other than 100%, and it does not value people, compromises, or cooperation at all.

It doesn’t care about people, but it cares about its own future (for the instrumental purpose of making more paperclips), and as such may be willing to bargain in the very beginning, while we still have a chance of stopping it. If we only agree to a bargain that it can show us will change its core utility function somewhat (to be more human-aligned), then there will be strong pressure for it to figure out a way to do that.