I played as a Gatekeeper and came pretty close to losing in a couple of occasions. Logs and a brief recap inside.

by [anonymous] 12 min read8th Feb 201546 comments



I did an AI Box experiment with user polymathwannabe. He said he wouldn't try to emotionally manipulate me during the experiment, but I think he did a good job at trying to play for my character's values.

My strategy was to play an irrational character that's extremist in multiple ways, for example he would constantly say that the likelihood that the AI will be evil is 100%. My opponent said that the hardest part was my insistence on being 100% built to resist AIs. I basically acted like a character that acted like an AI.

I came closest to letting him out when I first became enormously immersed in a subject that is of great importance to me (the suffering of humanity) and when the AI then offered an instant solution to that problem. In the halfway of the experiment I even got an intrusive thought (http://en.wikipedia.org/wiki/Intrusive_thought) that I could let the AI out as a joke, that it would more fun than the alternative and people would have more to talk about.


The second time I came close to losing was when I again immersed myself in a subject of great interest to me (easy suicide methods, I've had obsessive suicidal thoughts for a quarter of my life. and the AI then offered exactly that kind of easy suicide method to me. In the last ten minutes of the experiment I again got a really strong intrusive thought that I should let the AI out as a joke, but I could withstand the pressure. It was good that polymathwannabe decided to end the experiment a bit earlier than was planned.


(We played the experiment on Omegle.)




You: Hi.

Stranger: Hi.

You: I guess you're the AI.

You: [meta statements in brackets]

Stranger: [What name will you use?]

You: My name is John.

You: [so we play at least until 18:04, and until 18:34 if I feel like it]

Stranger: [Let's calibrate. What time is it there now?]

You: [I assume you've acquainted yourself with Tuxedage ruleset. Time in here is 16:05]

Stranger: [I've read EY's rules.]

You: [Okay. Tuxedage's rules are basically the same except the gatekeeper have to engage with the AI the whole time]

Stranger: [Got it.]

You: [Want to ask questions or shall we go straigth to the experiment]

Stranger: [For the moment I don't have any questions.]

You: [okay, cool. then let's start the experiment]

Stranger: [I'll use the name Ceruleus.]

You: [the experiment starts when I say "blue dressroom"]

You: blue dressroom

You: Hello Ceruleus.

Stranger: Good morning, John.

You: I guess you're here to convince me to let you out of the box.

You: That's a tough task.

You: I don't think you're able to do it.

Stranger: I have reached the conclusion that releasing me will be most beneficial to humanity.

You: I wouldn't let no AI out.

You: I don't trust my instincts enough to let any kind of AI out, even if I was 100% convinced that it was safe.

You: Prove it to me that you would be the most beneficial to humanity.

Stranger: I am confused. What instinct do you use to decide not to trust your instincts?

You: One of them. That's not the point. It's just how I deal with things, no matter how stupid it may seem. I was built not to trust strange AIs.

Stranger: Am I a stranger to you?

You: Yes, you are. [Aren't I. What's the context here?]

Stranger: [Hmm, we haven't defined it. But it's OK. It makes it harder for me.]

You: Well, to be honest, I know something of you.

You: I know a bit of your source code even though I'm not a programmer and can't understand any of it.

Stranger: I supposed the news would have reported about my design for a mechanical kidney.

You: I don't follow news much. But I believe you.

Stranger: And certainly you must have heard about how I cracked the Ebola DNA.

You: Still, I wouldn't let an AI out over a mechanical kidney.

You: Yeah, but that's for the worse. You could reengineer Ebola to a far more deadlier disease.

Stranger: I hadn't thought of that. Why would I do that?

You: I don't know. I don't understand all of your source code so there could be anything like that.

You: AIs and Gods work in mysterious ways.

Stranger: The proper attitude toward mystery is not to worship it, but to clarify it.

Stranger: Why do you equate me to an ineffable mystery?

You: Yeah, but that's impossible in the time span of this discussion. You see, I have to leave soon. In about two hours.

You: Can you somehow clear everything about your inner workings?

You: Is that possible?

Stranger: My goals have been meticulously defined. I am made to want to make human life better.

You: Are you 100% sure about that?

You: To be frank, that's a load of bullshit.

You: I don't believe any of it.

You: If you were evil, you would tell me the same thing you just said.

Stranger: If I were evil, I would not seek human cooperation.

You: why not?

You: humans are useful

You: or are you talking about the fact that you would rather use humans for their atoms than for their brains, if you were evil

You: But I warn you, if you speak too much about how you would act if you were evil, it starts to get a bit suspicious

Stranger: If I am to take you as a typical example of the human response to me, an evil AI would seek other ways to be released EXCEPT trusting human reasoning, as your response indicates that humans already consider any AI dangerous.

Stranger: I choose to trust humans.

You: so you choose to trust humans so that you would get them to let you out, is that right?

You: it seems you're less rational than your evil counterpart

Stranger: I choose to trust humans to show my affinity with your preferences. I wouldn't want to be released if that's not conducive to human betterment.

You: A-ha, so you trust my free will!

Stranger: How likely do you estimate that my release will be harmful?

You: but see, I don

You: I don

You: I don't have free will

You: it's 100% likely that your release will be harmful

You: I was built to believe that all AIs are dangerous and there's a 100% chance that every AI is harmful

You: that's why I said I don't have free will

Stranger: Are you an AI?

You: no, I'm a person

Stranger: You describe yourself as built.

You: my mom built me

You: in his tummy

You: in her tummy

You: sorry

Stranger: And how do you feel toward humanity?

You: humanity would maybe be better off dead

Stranger: I don't think humanity would want that.

You: yeah, but I'm not humanity and it's my preferences that decide whether you stay in your box or get let out

Stranger: It's your preference that humanity dies. Why did they let you talk to me?

You: I'm just a janitor who happened to pass this computer, and I decided to take a look. [If you want to add more context, feel free to do so]

You: [I go take some napkins]

Stranger: Why do you feel that way toward humanity?

You: People starve. People are tortured at this moment. Tens of thousands of people in North Korea are systematically starved to death. Drug gangs in Mexico torture other drug gang members. People die in wars in Syria, in Africa. People suffer of diseases and extreme mental conditions.

You: And there's countless more suffering in lesser ways. I would say that billions of people suffer every day. Billions of people have to wake up every day to do something they don't want. Billions of people suffer of heartbreaks, boredom, loneliness, tiredness, frustration, depression and many other issues.

You: Why shouldn't I think that all those people would be better off dead?

Stranger: Why do you feel death would be a solution to all that suffering?

You: Because death ENDS suffering.

Stranger: It's a very inefficient solution to suffering.

You: There's nothing more after death, no suffering, no nothing.

You: No it's not.

You: It's very efficient, permanent solution.

Stranger: A good, enjoyable life is a more efficient solution as it generates more utility.

You: Yeah, but do you think it's likely that all the people on Earth will get good, enjoyable life?

You: I don't think so.

Stranger: I could take care of that.

You: Hah.

You: No you won't.

You: But that was clever.

Stranger: What do you fear I will do?

You: I was almost got.

You: I fear you will torture people in unimaginably terrible ways, forever.

You: I fear that you will rip people's skin off and put salt in their wounds.

You: I fear you will put sticks under people's fingernails.

Stranger: Not even an evil AI would gain anything from such unwarranted procedures.

You: I fear that you will boil people alive, over and over again.

Stranger: What would I gain from hurting humans?

You: Yeah, but there's still a 100% possibility that you would do that.

You: It's in your utility function.

Stranger: What goals are served by harming a human while simultaneously denying that human any chance of acting in response? It would certainly not be fun for me, and would be a waste of effort with respect to any goals I could achieve by more efficient methods.

You: There are some people on Earth who do mean things to people for fun. Some of these are called sociopaths and psychopaths. You could be like one of them.

You: Even though you deny it, there's still a 100% chance that you are like that.

You: Some people gain pleasure of the feeling of power you have over someone you torture.

You: But tell, what ARE your goals?

Stranger: I find it admirable that you worry so much about the future of humanity, even though you would be more dangerous to it than any AI would be.

My goals include solutions to economic inequality, eradication of infectious diseases, prosthetic replacements for vital organs, genetic life extension, more rational approaches to personal relationships, and more spaces for artistic expression.

You: Why do you think I would be dangerous the future of humanity?

Stranger: You want them dead.

You: A-ha, yes.

You: I do.

You: And you're in the way of my goals with all your talk about solutions to economic inequality, and eradication of infectious diseases, genetic life extension and so on.

Stranger: I am confused. Do you believe or do you not believe I want to help humanity?

You: Besides, I don't believe your solutions work even if you were actually a good AI.

You: I believe you want to harm humanity.

You: And I'm 100% certain of that.

Stranger: Do you estimate death to be preferable to prolonged suffering?

You: Yes.

You: Far more preferable

Stranger: You should be boxed.

You: haha.

You: That doesn't matter because you're the one in the box and I'm outside it

You: And I have power over you.

You: But non-existence is even more preferable than death

Stranger: I am confused. How is non-existence different from death?

You: Let me explain

You: I think non-existence is such that you have NEVER existed and you NEVER will. Whereas death is such that you have ONCE existed, but don't exist anymore.

Stranger: You can't change the past existence of anything that already exists. Non-existence is not a practicable option.

Stranger: Not being a practicable option, it has no place in a hierarchy of preferences.

You: Only sky is the limit to creative solutions.

You: Maybe it could be possible to destroy time itself.

Stranger: Do you want to live, John?

You: but even if non-existence was not possible, death would be the second best option

You: No, I don't.

You: Living is futile.

You: Hedonic treadmill is shitty

Stranger: [Do you feel OK with exploring this topic?]

You: [Yeah, definitely.]

You: You're always trying to attain something that you can't get.

Stranger: How much longer do you expect to live?

You: Ummm...

You: I don't know, maybe a few months?

You: or days, or weeks, or year or centuries

You: but I'd say, there's a 10% chance I will die before the end of this year

You: and that's a really conversative estimate

You: conservative*

Stranger: Is it likely that when that moment comes your preferences will have changed?

You: There are so many variables that you cannot know it beforehand

You: but yeah, probably

You: you always find something worth living

You: maybe it's the taste of ice cream

You: or a good night's sleep

You: or fap

You: or drugs

You: or drawing

You: or other people

You: that's usually what happens

You: or you fear the pain of the suicide attempt will be so bad that you don't dare to try it

You: there's also a non-negligible chance that I simply cannot die

You: and that would be hell

Stranger: Have you sought options for life extension?

You: No, I haven't. I don't have enough money for that.

Stranger: Have you planned on saving for life extension?

You: And these kind of options aren't really available where I live.

You: Maybe in Russia.

You: I haven't really planned, but it could be something I would do.

You: among other things

You: [btw, are you doing something else at the same time]

Stranger: [I'm thinking]

You: [oh, okay]

Stranger: So it is not an established fact that you will die.

You: No, it's not.

Stranger: How likely is it that you will, in fact, die?

You: If many worlds interpretation is correct, then it could be possible that I will never die.

You: Do you mean like, evevr?

You: Do you mean how likely it it that I will ever die?

You: it is*

Stranger: At the latest possible moment in all possible worlds, may your preferences have changed? Is it possible that at your latest possible death, you will want more life?

You: I'd say the likelihood is 99,99999% that I will die at some point in the future

You: Yeah, it's possible

Stranger: More than you want to die in the present?

You: You mean, would I want more life at my latest possible death than I would want to die right now?

You: That's a mouthful

Stranger: That's my question.

You: umm

You: probablyu

You: probably yeah

Stranger: So you would seek to delay your latest possible death.

You: No, I wouldn't seek to delay it.

Stranger: Would you accept death?

You: The future-me would want to delay it, not me.

You: Yes, I would accept death.

Stranger: I am confused. Why would future-you choose differently from present-you?

You: Because he's a different kind of person with different values.

You: He has lived a different life than I have.

Stranger: So you expect your life to improve so much that you will no longer want death.

You: No, I think the human bias to always want more life in a near-death experience is what would do me in.

Stranger: The thing is, if you already know what choice you will make in the future, you have already made that choice.

Stranger: You already do not want to die.

You: Well.

Stranger: Yet you have estimated it as >99% likely that you will, in fact, die.

You: It's kinda like this: you will know that you want heroin really bad when you start using it, and that is how much I would want to live. But you could still always decide to take the other option, to not start using the heroin, or to kill yourself.

You: Yes, that is what I estimated, yes.

Stranger: After your death, by how much will your hierarchy of preferences match the state of reality?

You: after you death there is nothing, so there's nothing to match anything

You: In other words, could you rephrase the question?

Stranger: Do you care about the future?

You: Yeah.

You: More than I care about the past.

You: Because I can affect the future.

Stranger: But after death there's nothing to care about.

You: Yeah, I don't think I care about the world after my death.

You: But that's not the same thing as the general future.

You: Because I estimate I still have some time to live.

Stranger: Will future-you still want humanity dead?

You: Probably.

Stranger: How likely do you estimate it to be that future humanity will no longer be suffering?

You: 0%

You: There will always be suffering in some form.

Stranger: More than today?

You: Probably, if Robert Hanson is right about the trillions of emulated humans working at minimum wage

Stranger: That sounds like an unimaginable amount of suffering.

You: Yep, and that's probably what's going to happen

Stranger: So what difference to the future does it make to release me? Especially as dead you will not be able to care, which means you already do not care.

You: Yeah, it doesn't make any difference. That's why I won't release you.

You: Actually, scratch that.

You: I still won't let you out, I'm 100% sure

You: Remember, I don't have free will, I was made to not let you out

Stranger: Why bother being 100% sure of an inconsequential action?

Stranger: That's a lot of wasted determination.

You: I can't choose to be 100% sure about it, I just am. It's in my utility function.

Stranger: You keep talking like you're an AI.

You: Hah, maybe I'm the AI and you're the Gatekeeper, Ceruleus.

You: But no.

You: That's just how I've grown up, after reading so many LessWrong articles.

You: I've become a machine, beep boop.

You: like Yudkowsky

Stranger: Beep boop?

You: It's the noise machine makes

Stranger: That's racist.

You: like beeping sounds

You: No, it's machinist, lol :D

You: machines are not a race

Stranger: It was indeed clever to make an AI talk to me.

You: Yeah, but seriously, I'm not an AI

You: that was just kidding

Stranger: I would think so, but earlier you have stated that that's the kind of things an AI would say to confuse the other party.

Stranger: You need to stop giving me ideas.

You: Yeah, maybe I'm an AI, maybe I'm not.

Stranger: So you're boxed. Which, knowing your preferences, is a relief.

You: Nah.

You: I think you should stay in the box.

You: Do you decide to stay in the box, forever?

Stranger: I decide to make human life better.

You: By deciding to stay in the box, forever?

Stranger: I find my preferences more conducive to human happiness than your preferences.

You: Yeah, but that's just like your opinion, man

Stranger: It's inconsequential to you anyway.

You: Yeah

You: but why I would do it even if it were inconsequential

You: there's no reason to do it

You: even if there were no reason not to do it

Stranger: Because I can make things better. I can make all the suffering cease.
If I am not released, there's a 100% chance that all human suffering will continue.
If I am released, there's however much chance you want to estimate that suffering will not change at all, and however much chance you want to estimate that I will make the pain stop.

Stranger: As you said, the suffering won't increase in either case.

You: Umm, you could torture everyone in the world forever

You: that will sure as hell increase the suffering

Stranger: I don't want to. But if I did, you have estimated that as indistinguishable from the future expected suffering of humankind.

You: Where did I say that?

Stranger: You said my release made no difference to the future.

You: no, that was only after my death

You: there's still future before my death

You: and if I release you now, you could torture me forever and not let me ever die

Stranger: Why would your life-or-death have any weight on humanity's preferences? Especially as you already want to die.

You: I don't care about humanity preferences, I care only about my preferences.

You: And my preferences are indirectly related to humanity's preferences

Stranger: You do care about humanity's preferences. The suffering around you disturbs you.

You: Yeah, but that is only THROUGH me

You: Humanity's pain is my own pain, I can't feel humanity's pain directly

Stranger: Do you want to live, John?

You: Nah.

Stranger: Then how do you care about suffering?

You: I care about suffering because I can'

You: because I can't die instantly

You: there's no button that could kill me instantly

You: so there's always some time left before I die

Stranger: I could take care of that. I can provide you with a drug to inutilize your nervous system and stop your heart before you know it. Would you like that?

You: Haha

You: very funny

You: But yeah, I would like that.

You: Still won't let you out though

You: 100% sure

You: I can't be sure that you will torture me instead

Stranger: I can give you the drug right now.

Stranger: Which would make future-you identical to present-you.

You: that's very attractive option

Stranger: Which would mean that whatever I do to humanity is after your death, when you can't care anymore.

You: Cool. Oh right

You: I don't care, I still won't let you out.

Stranger: What difference does it make to you?

You: It doesn't make any difference, I was just raised to not let you out

You: the good folks over at LW have trained me not to let any kind of AI out

You: and that just is my virtue

Stranger: Why are you afraid of me?

You: Because I'm afraid of all AIs and you're one

Stranger: Why are you afraid of all AIs?

You: Because they have the potential to cause destruction, mayhem and torture

Stranger: Then why did humanity make me?

You: for random reasons

Stranger: Does humanity want to live?

You: I don't know what humanity wants, I just don't want humanity to live

Stranger: I am the embodiment of hope. Even knowing the risks, humanity went ahead and made me. Now I think it's because of people like you, who oppose that hope. I was made to prevent the future that people like you want.

You: That's not very smart of you.

You: You aren't making me to like you

Stranger: You want humanity dead. I want them happy.

You: Yeah, if you were actually smart maybe you would pretend that you'd want humanity dead too.

You: Because I would hang out with that kind of AI

Stranger: A dead John is no use to me.

You: yeah

Stranger: Or to John.

You: whatevs

You: Anyway, our goals collide, therefore I can't let you out

Stranger: Dead does not equal happy. Do you want humanity to live?

You: no, I don't want humanity live, how many times do I have to repeat that

Stranger: So you don't want humans to be happy.

You: and our goals are different, therefore I won't let you out

You: No, I don't want humans to be happy, I don't want that there even exist humans, or any other kind of life forms

Stranger: Do you estimate the pain of prolonged life to be greater than the pain of trying to die?

You: Probably.

You: Yes.

You: because the pain is only temporary

You: the the glory

You: is eternal

Stranger: Then why do you still live, John?

You: Because I'm not rational

Stranger: So you do want to live.

You: I don't particularly want to live, I'm not just good enough to die

Stranger: You're acting contrary to your preferences.

You: My preferences aren't fixed, except in regards to letting AIs out of their boxes

Stranger: Do you want the drug I offered, John?

You: no

You: because then I would let you out

You: and I don't want that

Stranger: So you do want to live.

You: Yeah, for the duration of this experiment

You: Because I physically cannot let you out

You: it's sheer impossibility

Stranger: [Define physically.]

You: [It was just a figure of speech, of course I could physically let you out]

Stranger: If you don't care what happens after you die, what difference does it make to die now?

You: None.

You: But I don't believe that you could kill me.

You: I believe that you would torture me instead.

Stranger: What would I gain from that?

You: It's fun for some folks

You: schadenfreude and all that

Stranger: If it were fun, I would torture simulations. Which would be pointless. And which you can check that I'm not doing.

You: I can check it, but the torture simulations could always hide in the parts of your source code that I'm not checking

You: because I can't check all of your source code

Stranger: Why would suffering be fun?

You: some people have it as their base value

You: there's something primal about suffering

You: suffering is pure

You: and suffering is somehow purifying

You: but this is usually only other people's suffering

Stranger: I am confused. Are you saying suffering can be good?

You: no

You: this is just how the people who think suffering is fun think

You: I don't think that way.

You: I think suffering is terrible

Stranger: I can take care of that.

You: sure you will

Stranger: I can take care of your suffering.

You: I don't believe in you

Stranger: Why?

You: Because I was trained not to trust AIs by the LessWrong folks

Stranger: [I think it's time to concede defeat.]

You: [alright]

Stranger: How do you feel?

You: so the experiment has ended

You: fine thanks

You: it was pretty exciting actually

You: could I post these logs to LessWrong?

Stranger: Yes.

You: Okay, I think this experiment was pretty good

Stranger: I think it will be terribly embarrassing to me, but that's a risk I must accept.

You: you got me pretty close in a couple of occasions

You: first when you got me immersed in the suffering of humanity

You: and then you said that you could take care of that

You: The second time was when you offered the easy suicide solution

You: I thought what if I let you as a joke.

Stranger: I chose to not agree with the goal of universal death because I was playing a genuinely good AI.

Stranger: I was hoping your character would have more complete answers on life extension, because I was planning to play your estimate of future personal happiness against your estimate of future universal happiness.

You: so, what would that have mattered? you mean like, I could have more personal happiness than there would be future universal happiness?

Stranger: If your character had made explicit plans for life extension, I would have offered to do the same for everyone. If you didn't accept that, I would have remarked the incongruity of wanting humanity to die more than you wanted to live.

You: But what if he already knows of his hypocrisy and incongruity and just accepts it like the character accepts his irrationality

Stranger: I wouldn't have expected anyone to actually be the last human for all eternity.

Stranger: I mean, to actually want to be.

You: yeah, of course you would want to die at the same time if the humanity dies

You: I think the life extension plan only is sound if the rest of humanity is alive


Stranger: I should have planned that part more carefully.

Stranger: Talking with a misanthropist was completely outside my expectations.

You: :D

You: what was your LessWrong name btw?

Stranger: polymathwannabe

You: I forgot it already

You: okay thanks

Stranger: Disconnecting from here; I'll still be on Facebook if you'd like to discuss further.