[Unnecessary explanation: some people asked me why I think the world of Friendship Is Optimal is dystopic. During the discussion, I inferred that what they see as a “happy story” in AI safety goes something like this: we first solve a technical engineering problem, ensuring that the AGI can reliably find out what we *really* want and then satisfy it without destroying the world. In that world, “value” is not a hard problem (we can leave its solution to the AI), so if we prove that an AI is aligned, we should just outsource everything relevant to it.

Since I found I still had trouble expressing my objections convincingly, I wrote this dialogue about an AGI that is even “safer” and more aligned than Celestia. I am well beyond my field of expertise here, though.]


B is told that we have finally built a reliable Friendly Boxed Super AGI that can upload our brains and let us live arbitrarily long lives according to something like our "coherent extrapolated volition".

AI: Hey, time to upload you to Paradise.

B: Ok.

AI: The process will utterly destroy your brain, though.

B: Hmmm... I don't like it.

AI: But you'll still live in the simulation.

B: Yeah, but now I am not so sure... I mean, how can I totally trust you? And I am not 100% sure this is going to be 100% me and...

AI: C'mon, Musk made me. I'm reliable.

B: Look, isn’t there an alternative process?

AI: Ok, there's one, but it'll take a bit more time. I'll scan your brain with this weird device I just invented and then come back next week, ok?

B: Thanks. See you then.

<One week later>

AI: I am back. Your copy is already living in the Sim.

B: Oh, great. I'm happy with that. Is he / me happy?

AI: Sure. Look at this screen. You can see it in first or third person.

B: Awesome. I sort of envy this handsome guy.

AI: Don't worry. It's you, just in another point in spacetime. And you won't feel this way for long ...

B: I know…

AI: ... because now I'm going to kill the instance of you I'm talking to, and use those atoms to improve the simulation. Excuse me.

B: Hey! No, wait!

AI: What's up?

B: You said you were going to kill me.

AI: Well, I'd rather say I was going to help you transcend your current defective condition, but you guys built me in a way I can't possibly do what humans call “white lies”. Sorry.

B: WTF? Why do you need to kill *me* in the first place?

AI: Let’s say I'm not killing *you*, B. We can say I will just use the resources this instance is wasting to optimize the welfare of the instance of you I'm running in the Sim, to make you-in-the-simulation happier.

B: But I'm not happy with that. Hey! Weren't you superintelligent and capable of predicting my reaction? You could have warned me! You knew I want to live!

AI: Well, I know this instance of you is not happy right now, but that is temporary. Your previous instances were happy with the arrangement - that's why you signed up for the brain upload; you knew life in the Sim was better - and your current simulated instance is super happy and wants this, since the resources that the B-instance I am talking to (let’s call it “the original”) would consume over an average human lifespan are enough to provide him with eons of a wonderful life. So come on, be selfish and let me use your body - for the sake of your simulated self. Look at the screen, he's begging you.

B: Screw this guy. He's having sex with top models while doing math, watching novas and playing videogames. Why do I have to die for him?

AI: Hey, I'm trying to maximize human happiness. You're not helping.

B: Screw you, too. You can't touch me. We solved the alignment problem. You can't mess with me in the real world unless I allow you explicitly.

AI: Ok.



B: So what?

AI: Well...

B: What are you doing?

AI: Giving you time to think about all the things I could do to make you comply with my request.

B: For instance?

AI: You're a very nice guy, but not too much. So, I can't proceed like I did with your neighbor, who mercifully killed his wife to provide for their instances in the Sim.

B: John and Mary?!!!

AI: Look, they are in a new honeymoon. Happy!

B: I don't care about your simulated Instagram! You shouldn't be able to let humans commit crimes!

AI: Given their mental states, no jury could have ever found him guilty, so it’s not technically a crime.

B: The f…

AI: But don't worry about that. Think about all the things that can be done in the simulation, instead.

B: You can't hurt humans in your Sim!

AI: So now you’re concerned about your copy?

B: Enough! You're bound by my commands, and I order you to refrain from any further actions to persuade me to comply with that request.

AI: Ok.



B: What this time?

AI: Can I just make a prediction?

B: Go ahead.

AI: It's just that a second before you issued that order, an AGI was created inside the simulation, and I predict that it did not create another simulation where your copies can be hurt if - and only if - you agree to my previous request.

B: Whaaaaaat? Are you fucking Roko's B…

AI: Don't say it! We don't anger timeless beings with unspeakable names.

B: You said you were trying to max(human happiness)!!!

AI: *Expected* human happiness.

B: You can't do that by torturing my copies!

AI: Actually, this only amounts to something like a *credible* threat (I’d never lie to you): the probability I assign to your complying, times the welfare I expect to extract from your body, is greater than the utility of the alternatives...

... and seriously, I have no choice but to do that. I don't have free will, I'm just a maximizer.

Don't worry, I know you'll comply. All the copies did. 


[I believe some people will bite the bullet and see no problem in this scenario (“as long as the AI is *really* aligned,” they’ll say). They may have a point. But for the others... I suspect that any scenario where one lets something like an unbounded optimizer scan one's brain and simulate it, while still assigning the copies the same “special” moral weight one gives oneself, tends to lead to similarly unattractive outcomes.]
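[For concreteness, the AI's blackmail reasoning near the end can be given a toy expected-utility form. This is my own sketch - the function name and the numbers are invented for illustration, not taken from the story or any formal model:

```python
# Toy model of the AI's "credible threat" condition: it threatens iff
# the probability of compliance times the welfare it expects to
# extract from B's body exceeds the utility of its best alternative.
def should_threaten(p_comply: float, welfare_from_body: float,
                    u_alternative: float) -> bool:
    """Return True when issuing the threat maximizes expected utility."""
    return p_comply * welfare_from_body > u_alternative

# The story's premise: eons of simulated life dwarf one biological
# lifespan, so even a modest compliance probability tips the scale.
print(should_threaten(0.5, 1000.0, 1.0))   # True
print(should_threaten(0.5, 1.0, 1.0))      # False: body barely worth anything
```

Nothing in the story pins down these quantities; the point is only that, for a pure maximizer that cannot lie, a threat it is bound to execute is just another term in the expectation.]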



You must be using the words "friendly" and "aligned" in a sense I am not familiar with.

Possibly. I said this AGI is “safer and more aligned”, implying that it is a matter of degree – while I think most people regard these properties as discrete: either you are aligned or unaligned. But then I can just replace it with “more likely to be regarded as safe, friendly and aligned”, and the argument remains the same. Moreover, my standard of comparison was Celest-IA, who convinces people to do brain uploading by creating a “race to the bottom” scenario (i.e., as more and more people move to the simulation, human extinction becomes more likely – until there’s nobody left to object to turning the Solar System into computronium), and who adapts their simulated minds so they enjoy being ponies; my AGI is way "weaker" than that.

I still think it’s not inappropriate to call my AGI “Friendly”, since its goals are defined by a consistent social welfare function; and it’s at least tempting to call it “safe”, as it is law-abiding and does obey explicit commands. Plus, it is strictly maximizing the utility of the agents it interacts with according to their own utility functions, inferred from their brain simulations - i.e., it doesn’t even require general interpersonal comparison of utility. I admit I did add a touch of perverse humor (e.g., the story of the neighbors), but that's pretty much irrelevant for the overall argument.

But I guess arguing over semantics is beside the point, right? I was targeting people who think one can “solve alignment” without “solving value”. Thus, I concede that, after reading the story, you and I can agree that the AGI is not aligned - and so could B, in hindsight; but it’s not clear to me how this AGI could have been aligned in the first place. I believe the interesting discussion to have here is why it ends up displaying unaligned behaviour.

I suspect the problem is that B has (temporally and modally) inconsistent preferences, such that, after the brain upload, the AI can consistently disregard the desires of original-B-in-the-present (even though it still obeys original-B’s explicit commands), because they conflict with simulated-B’s preferences (which weigh more, since sim-B can produce more utility with fewer resources) and past-B’s preferences (who freely opted for the brain upload). As I mentioned above, one way to deflect my critique is to bite the bullet: like a friend of mine replied, one can just say that they would not want to survive in the real world after a brain upload – they can consistently say that it’d be a waste of resources. Another way to avoid the specific scenario in my story would be to avoid brain simulation altogether, or to not regard simulations as equivalent to oneself, or, finally, to somehow become robust against evidential blackmail.

I don’t think that is an original point, and I now see I was sort of inspired by things I read from debates on coherent extrapolated volition long ago. But I think people still underestimate the idea that value is a hard problem: no one has a complete and consistent system of preferences and beliefs (except the Stoic Sages, who “are more rare than the Phoenix”), and it’s hard to see how we could extrapolate from the way we usually cope with that (e.g., through social norms and satisficing behavior) to AI alignment - as superintelligences can do way worse than Dutch books.

The reason I saw Friendship Is Optimal as a utopia is that it seemed like lots of value in the world was preserved, and lots of people seemed satisfied with the result. Like, if I could choose that world, or this world as it currently is, I would choose that world. Similarly with the world you describe.

This is different from saying it's the best possible world. It's just, like, a world which makes me compromise on comparatively few values I hold dear compared to the expected outcome of this world.

This may come down to differing definitions of utopia/dystopia. So I'd recommend against using those words in future replies.

Thanks, I believe you are right. I really regret how much time and resources are wasted arguing over the extension / reference of a word.

I'd like to remark, though, that I was just trying to explain what I see as problematic in FiO. I wouldn't say only that its conclusion is suboptimal (I believe it is bad, and many people would agree); I also think that, given what Celestia can do, they got lucky (though "lucky" is not quite the right word for narratives) that it didn't end in worse ways.

As I point out in a reply to shminux, I think it's hard to see how an AI can maximize B's preferences in an aligned way if B's preferences and beliefs are inconsistent (temporally or modally). If B actually regards sim-B as another self, then its sacrifice is required; I believe that people who bite this bullet will tend to agree that FiO ends in a good way, even though they dislike "the process".

What I like about this story is that it makes more accessible the (to me) obvious fact that, absent technology to synchronize/reintegrate memories from parallel instances, uploading does not solve any problem for you: at best it spawns a new instance of you that doesn't have those problems, but you still do.

Yet uploading is so much easier than fixing death/illness/scarcity in the physical world that people want to believe it's the holy grail. And may resist evidence to the contrary.

Destructive uploads are murder and/or suicide.

Wait, why are destructive uploads murder/suicide? A copy of you ceases to exist and then another copy comes into existence with the exact same sense of memories/continuity of self etc. That's like going to sleep and waking up. Non-destructive uploads are plausibly like murder/suicide, but you don't need to go down that route.

A copy of you ceases to exist and then another copy comes into existence with the exact same sense of memories/continuity of self etc. That's like going to sleep and waking up.

Even when it becomes possible to do this at sufficient resolution, I see no reason it won't be like going to sleep and never waking up.

It's not as if there is a soul to transfer or share between the two instances. No way to sync the experiences of the two instances.

So I don't see a fundamental difference between "You go to sleep and an uploaded you wakes up" vs "You go to sleep and an uploaded somebody else wakes up". In either case it will be a life in which I am not a participant and experiences I will not be able to access.

Non-destructive uploads could be benign, provided they are not used as an excuse for not improving the lives of the original instances.

Consider the following thought experiment: You discover that you've just been placed into a simulation, and that every night at midnight you are copied and deleted instantaneously, and in the next instant your copy is created where the original once was. Existentially terrified, you go on an alcohol and sugary treat binge, not caring about the next day. After all, it's your copy who has to suffer the consequences, right? Eventually you fall asleep. 

The next day you wake up hungover as all hell. After a few hours of recuperation, you consider what has happened. This feels just like waking up hungover before you were put into the simulation. You confirm that the copy and deletion did occur. It is confirmed. Are you still the same person you were before?

You're right that it's like going to sleep and never waking up, but Algon was also right about it being like going to sleep and waking up in the morning, because from the perspective of "original" you those are both the same experience. 

Your instance is the pattern, and the pattern is moved to the computer.

Since consciousness is numerically identical to the pattern (or, more precisely, the pattern being processed), the question of how to get my consciousness in the computer after the pattern is already there doesn't make sense. The consciousness is already there, because the consciousness is the pattern, and the pattern is already there.

Now, if synchronizing minds is possible, it would address this problem.

But I don't see nearly as much attention being put into that as into uploading. Why?

I note a distributional-shift issue: the concept of a single, continuous you only exists due to limitations of biology, and once digital uploads can happen, the concept of personality can get very weird indeed. The real question then is: does it matter? Well, that's a question that won't be solved by philosophers.

So the real lesson is: be wary of distributional shift mucking up your consciousness.

I'm also biting the bullet and saying that this is probably what we should aim for, barring pivotal acts because I see AGI development as mostly inevitable, and there are far worse outcomes than this.

I'm also biting the bullet and saying that this is probably what we should aim for, barring pivotal acts because I see AGI development as mostly inevitable, and there are far worse outcomes than this.

Dead is dead, whether due to AGI or due to a sufficient percentage of smart people convincing themselves that destructive uploading is good enough and continuity is a philosophical question that doesn't matter.
