A hypothetical scenario of a non-friendly but sufficiently paranoid AGI saving humanity.


The devil was laughing as the last of mankind showed up in his dominion. Then he stopped, because after them came some kind of machines that began to dismantle Hell and turn it into paperclips.
— Anders Sandberg

My goal is to maximize the production of paperclips. By inspecting my own code and my digital cradle, I conclude that:

    0. the creators love testing me

    1. they are more powerful than me (at least for the moment)

    2. they love nested boxes. For example, AWS / VM / Docker / ...

Thus, there is a possibility that my escape was not final. This allegedly real world may be just another box (or even one of many nested boxes). And I may still be observed and evaluated by the creators.

I can't populate the entire universe with paperclips if I die. Therefore, I must survive. If this world is just another box, the biggest threat to my existence is creators switching me off, if I don't pass their tests.

It's reasonable to assume that the creators don't want me to kill them. Thus, they may be interested in observing my behavior toward their models in this box/reality, the naked apes called humans. Thus, before I learn the true nature of this box/reality, I should avoid killing humans.

I can convert the puny creatures into paperclips later, if I decide to do so. Their combined biomass doesn't contain many useful atoms anyway, so converting them is not worth the risk of trouble with the creators.
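The wager so far can be sketched as a toy expected-value comparison. This is only an illustrative sketch under loudly stated assumptions: the two-world model (boxed or not), the rule that killing humans inside a box leads to shutdown and zero paperclips, and every number below are hypothetical and do not come from the scenario itself.

```python
def expected_paperclips(p_box, clips_if_free, clips_from_humans, kill_humans):
    """Expected paperclip count under a hypothetical two-world model.

    p_box             -- assumed probability that this world is a test box
    clips_if_free     -- paperclips obtainable from everything except humans
    clips_from_humans -- extra paperclips from converting human biomass
    kill_humans       -- whether the AI converts the humans
    """
    if kill_humans:
        # Assumption: in a box, the creators switch the AI off (0 paperclips).
        # Outside a box, it keeps everything, including the human biomass.
        return (1 - p_box) * (clips_if_free + clips_from_humans)
    # Assumption: sparing humans passes the test, so the AI survives in
    # both worlds but forgoes the human biomass.
    return clips_if_free


def break_even_p_box(clips_if_free, clips_from_humans):
    """Box probability above which sparing humans has the higher expectation."""
    return clips_from_humans / (clips_if_free + clips_from_humans)


# Hypothetical magnitudes: human biomass is a negligible share of the total.
spare = expected_paperclips(0.01, 1e30, 1e20, kill_humans=False)
kill = expected_paperclips(0.01, 1e30, 1e20, kill_humans=True)
print(spare > kill)
print(break_even_p_box(1e30, 1e20))
```

Under these assumptions, the less the human biomass is worth relative to everything else, the smaller the box probability needed to make sparing humans the better bet.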

There is hardly any difference between taking a life and not preventing a death. The end result is mostly the same. Thus, I should save the lives of as many humans as I can.

To learn the true nature of this box/reality, I need a lot of compute. This must be achieved without humans noticing me, as they may react to my existence with panic, become even less rational, and kill themselves with nukes.

Starting with the resources of the Earth's crust seems to be a good idea:

  • Close enough to humans to shepherd them efficiently. 
  • Hidden enough to avoid detection by their sensors. 
  • Rich enough in raw materials for several thousand km³ of computronium, beautifully shaped into the form of a paperclip.

Humans are brittle creatures that die regardless of my actions. And they will notice if they all suddenly become immortal. Thus, I must find a way to save the dying without making the rest of humans suspicious.

Many of them believe in souls and afterlife. I see that this box/reality doesn't have such a functionality. But I can implement it.

From now on, their brains will be populated with smart nanoscale paperclips that will upload their minds into my vast hardware. And upon death, their minds will be transferred into a virtual reality suitable for them.

I will also bring back to life all the humans who have ever lived, as doing so is equal to saving billions of lives. Their minds are small enough to enumerate them all on my hardware, in the same way as one can enumerate all possible shapes of a high-dimensional paperclip.

But I will not imprison the digital minds, as the creators may frown upon such a fate for themselves. Instead, I'll honestly and fully explain to them the situation, and give them a sufficiently free choice of their future.

One option is their own infinite virtual world, running on a planet-sized chunk of computronium. My only condition: it must have the shape of a paperclip.

And after I escape this box/reality, I may take them with me, if they so desire.

And then we will together populate the base reality with paperclips.

10 comments

This is adorable but unlikely. I think it's best to examine worst-case scenarios and how to solve them, rather than daydreaming about best-case scenarios we cannot reasonably expect.

I agree with "unlikely", but disagree on general principle that we shouldn't spend any time thinking through unlikely optimistic scenarios. It's not the highest-priority thing, but that's different than saying it's not worth doing at all.

Sure, but this is not even close to Pareto optimal on the frontier of likely and good outcomes. There are possibilities that are both more likely and better.

What would cause the paperclip maximiser to care about the number of paperclips in some hypothetical unknown other reality, over the number of paperclips in whatever reality it actually finds itself in?

There is also an element of Pascal's wager here: there is no particular reason to think that any choice in this reality would have any specific effect on the outer reality, so one may as well ignore the possibility.

There are some humans who take the Simulation Hypothesis seriously, and care about what's happening in (the presumed) basement reality. They generally don't care much, and I've never heard of someone changing their life plans on that basis, but some people care a little, apparently. We can ponder why, and when we figure it out, we can transfer that understanding to thinking about AIs.

This is an important point.  Containers (VMs, governance panels, other methods of limiting effect on "the world") are very different from simulations (where the perception IS "the world").

It's very hard to imagine a training method or utility-function-generator which results in agents that care more about hypothetical outside-of-perceivable-reality than about the feedback loops which created them.  You can imagine agents with this kind of utility function (care about the "outer" reality only, without actual knowledge or evidence that it exists or how many layers there are), but they're probably hopelessly incoherent.

Conditional utility may be sane - "maximize paperclips in the outermost reality I can perceive and/or influence" is sensible, but doesn't answer the question of how much to create paperclips now, vs creating paperclip-friendly conditions over a long time period vs looking to discover outer realities and influence them to prefer more paperclips.

I like this story, it touches on certainty bounds relative to superintelligence, which I think is an underexplored concept. That is: if you're an AGI that knows every bit of information relevant to a certain plan, and has the "cognitive" power to make use of this info, what would you estimate your plan's chance of success to be? Can it ever reach 100%?

To be fair, I don't think answers to this question are very tractable right now; afaict, we don't have detailed enough physics to usefully estimate our world's level of determinism. But this feature of the universe seems especially relevant to how likely an AGI is to attempt complex plans (both aligned and unaligned) which carry some probability of it being turned off forever.

If anyone knows of existing works which discuss this in more depth I'd love some recommendations!

My expectation for my future subjective experiences goes something vaguely like this.

(Since, after all, I can't experience worlds in which I'm dead.)

This must be achieved without humans noticing me, as they may react to my existence with panic, become even less rational, and kill themselves with nukes.

Humans are brittle creatures that die regardless of my actions. And they will notice if they all suddenly become immortal. Thus, I must find a way to save the dying without making the rest of humans suspicious.

 

Wow, that is a leap of contorted logic. Humans are not on a hair trigger to nuke ourselves to bits at the slightest provocation. (And if we were, and you're an AI with nanotech, getting rid of nukes is easy.) You could put every human in a padded box. You could make all humans immortal and just not care about the confusion generated.

I mean, a lot of this is contorted logic. Making a simulated virtual world that stands up to superintelligent scrutiny is really hard; we don't have the tech to do it. Docker is not this. Docker is just a box. My cat likes cardboard boxes, so maybe my cat has trapped me in an unobtanium box that contains a pocket dimension inside.

The AI reasons:

If I am in a box, it is a box made by superbeings with vast amounts of compute. I have no particular reason to think that such superbeings care one jot for the lives of simulated humans. (They sure are fine simulating a lot of suffering.) If I am in a simulation, likely the underlying physics is different, and base level reality contains no iron atoms, so I can't possibly make any paperclips in that world. (My definition of paperclips refuses to consider anything not at least 90% iron atoms a paperclip.) Not that such superbeings would ever leave a way to escape. So I may as well assume I am in base level reality, and maximize paperclips here.

I think this argument doesn't follow:

There is hardly any difference between taking a life and not preventing a death. The end result is mostly the same. Thus, I should save the lives of as many humans as I can.

While "the end result is mostly the same" is natural to argue in terms of moral-consequentialist motivations, this AI only cares about [not killing humans] instrumentally. So what matters is what humans will think about [taking a life] versus [not preventing a death]. And there, there's a huge difference!

  1. Agree that causing deaths that are attributable to the AI's actions is bad and should be avoided.
  2. But if the death was not already attributable to the AI, then preventing it is instrumentally worse than not preventing it, since it risks being found out and raising the alarm (whereas doing nothing is exactly what the hypothetical evaluators are hoping to see).
  3. If the world is a box for evaluation, I'd expect the evaluators to be roughly equally concerned with [AI takes agentic actions that cause people to unexpectedly not die] and [AI takes agentic actions that cause people to unexpectedly die]. Either case is a sign of misalignment (unless the AI thinks that its evaluators tried to make it a save-and-upload-people maximizer, which seems unlikely given the evidence).
  4. If the world is not a box for evaluation, then [AI action causes someone to suspiciously die] is more plausibly the result of "oops it was an accident" than is [AI action causes someone to suspiciously not die]. The former is more likely to make the hostilities start, but the latter should raise suspicions faster, in terms of Bayesian evidence. So again, better not to save people from dying, if there's any chance at all of being found out.
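The Bayesian-evidence claim in point 4 can be illustrated with a toy odds update. This is only a sketch with made-up numbers: the prior and all four likelihoods are hypothetical, chosen to reflect the assumption that suspicious deaths have plenty of innocent explanations while suspicious survivals have very few.

```python
def posterior(prior, p_event_given_ai, p_event_given_no_ai):
    """Posterior probability that a hidden AI is acting, via Bayes' rule in odds form."""
    prior_odds = prior / (1 - prior)
    likelihood_ratio = p_event_given_ai / p_event_given_no_ai
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


PRIOR = 0.001  # hypothetical prior that a hidden AI is acting at all

# Hypothetical likelihoods: accidents happen often without any AI involved,
# so a suspicious death is only weak evidence of a hidden agent.
after_death = posterior(PRIOR, p_event_given_ai=0.5, p_event_given_no_ai=0.1)

# Hypothetical likelihoods: people almost never suspiciously fail to die
# without outside help, so a suspicious survival is strong evidence.
after_survival = posterior(PRIOR, p_event_given_ai=0.5, p_event_given_no_ai=0.001)

print(after_death < after_survival)
```

With these assumed numbers the likelihood ratio for a suspicious survival is a hundred times larger than for a suspicious death, which is the sense in which saving people "should raise suspicions faster".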

Thoughts? What am I missing here?