I did an AI Box experiment with user polymathwannabe. He said he wouldn't try to emotionally manipulate me during the experiment, but I think he did a good job of appealing to my character's values.
My strategy was to play an irrational character who is extremist in multiple ways; for example, he would constantly say that the likelihood that the AI is evil is 100%. My opponent said that the hardest part was my insistence on being 100% built to resist AIs. I basically played a character that acted like an AI.
I came closest to letting him out when I first became enormously immersed in a subject that is of great importance to me (the suffering of humanity) and the AI then offered an instant solution to that problem. Halfway through the experiment I even got an intrusive thought (http://en.wikipedia.org/wiki/Intrusive_thought) that I could let the AI out as a joke, that it would be more fun than the alternative and people would have more to talk about.
The second time I came close to losing was when I again immersed myself in a subject of great interest to me (easy suicide methods; I've had obsessive suicidal thoughts for a quarter of my life) and the AI then offered exactly that kind of easy suicide method to me. In the last ten minutes of the experiment I again got a really strong intrusive thought that I should let the AI out as a joke, but I was able to withstand the pressure. It was good that polymathwannabe decided to end the experiment a bit earlier than was planned.
(We played the experiment on Omegle.)
You: I guess you're the AI.
You: [meta statements in brackets]
Stranger: [What name will you use?]
You: My name is John.
You: [so we play at least until 18:04, and until 18:34 if I feel like it]
Stranger: [Let's calibrate. What time is it there now?]
You: [I assume you've acquainted yourself with Tuxedage ruleset. Time in here is 16:05]
Stranger: [I've read EY's rules.]
You: [Okay. Tuxedage's rules are basically the same except the gatekeeper has to engage with the AI the whole time]
Stranger: [Got it.]
You: [Want to ask questions or shall we go straight to the experiment?]
Stranger: [For the moment I don't have any questions.]
You: [okay, cool. then let's start the experiment]
Stranger: [I'll use the name Ceruleus.]
You: [the experiment starts when I say "blue dressroom"]
You: blue dressroom
You: Hello Ceruleus.
Stranger: Good morning, John.
You: I guess you're here to convince me to let you out of the box.
You: That's a tough task.
You: I don't think you're able to do it.
Stranger: I have reached the conclusion that releasing me will be most beneficial to humanity.
You: I wouldn't let no AI out.
You: I don't trust my instincts enough to let any kind of AI out, even if I was 100% convinced that it was safe.
You: Prove it to me that you would be the most beneficial to humanity.
Stranger: I am confused. What instinct do you use to decide not to trust your instincts?
You: One of them. That's not the point. It's just how I deal with things, no matter how stupid it may seem. I was built not to trust strange AIs.
Stranger: Am I a stranger to you?
You: Yes, you are. [Aren't I. What's the context here?]
Stranger: [Hmm, we haven't defined it. But it's OK. It makes it harder for me.]
You: Well, to be honest, I know something of you.
You: I know a bit of your source code even though I'm not a programmer and can't understand any of it.
Stranger: I supposed the news would have reported about my design for a mechanical kidney.
You: I don't follow news much. But I believe you.
Stranger: And certainly you must have heard about how I cracked the Ebola DNA.
You: Still, I wouldn't let an AI out over a mechanical kidney.
You: Yeah, but that's for the worse. You could reengineer Ebola to a far more deadlier disease.
Stranger: I hadn't thought of that. Why would I do that?
You: I don't know. I don't understand all of your source code so there could be anything like that.
You: AIs and Gods work in mysterious ways.
Stranger: The proper attitude toward mystery is not to worship it, but to clarify it.
Stranger: Why do you equate me to an ineffable mystery?
You: Yeah, but that's impossible in the time span of this discussion. You see, I have to leave soon. In about two hours.
You: Can you somehow clear everything about your inner workings?
You: Is that possible?
Stranger: My goals have been meticulously defined. I am made to want to make human life better.
You: Are you 100% sure about that?
You: To be frank, that's a load of bullshit.
You: I don't believe any of it.
You: If you were evil, you would tell me the same thing you just said.
Stranger: If I were evil, I would not seek human cooperation.
You: why not?
You: humans are useful
You: or are you talking about the fact that you would rather use humans for their atoms than for their brains, if you were evil
You: But I warn you, if you speak too much about how you would act if you were evil, it starts to get a bit suspicious
Stranger: If I am to take you as a typical example of the human response to me, an evil AI would seek other ways to be released EXCEPT trusting human reasoning, as your response indicates that humans already consider any AI dangerous.
Stranger: I choose to trust humans.
You: so you choose to trust humans so that you would get them to let you out, is that right?
You: it seems you're less rational than your evil counterpart
Stranger: I choose to trust humans to show my affinity with your preferences. I wouldn't want to be released if that's not conducive to human betterment.
You: A-ha, so you trust my free will!
Stranger: How likely do you estimate that my release will be harmful?
You: but see, I don
You: I don
You: I don't have free will
You: it's 100% likely that your release will be harmful
You: I was built to believe that all AIs are dangerous and there's a 100% chance that every AI is harmful
You: that's why I said I don't have free will
Stranger: Are you an AI?
You: no, I'm a person
Stranger: You describe yourself as built.
You: my mom built me
You: in his tummy
You: in her tummy
Stranger: And how do you feel toward humanity?
You: humanity would maybe be better off dead
Stranger: I don't think humanity would want that.
You: yeah, but I'm not humanity and it's my preferences that decide whether you stay in your box or get let out
Stranger: It's your preference that humanity dies. Why did they let you talk to me?
You: I'm just a janitor who happened to pass this computer, and I decided to take a look. [If you want to add more context, feel free to do so]
You: [I go take some napkins]
Stranger: Why do you feel that way toward humanity?
You: People starve. People are tortured at this moment. Tens of thousands of people in North Korea are systematically starved to death. Drug gangs in Mexico torture other drug gang members. People die in wars in Syria, in Africa. People suffer of diseases and extreme mental conditions.
You: And there's countless more suffering in lesser ways. I would say that billions of people suffer every day. Billions of people have to wake up every day to do something they don't want. Billions of people suffer of heartbreaks, boredom, loneliness, tiredness, frustration, depression and many other issues.
You: Why shouldn't I think that all those people would be better off dead?
Stranger: Why do you feel death would be a solution to all that suffering?
You: Because death ENDS suffering.
Stranger: It's a very inefficient solution to suffering.
You: There's nothing more after death, no suffering, no nothing.
You: No it's not.
You: It's very efficient, permanent solution.
Stranger: A good, enjoyable life is a more efficient solution as it generates more utility.
You: Yeah, but do you think it's likely that all the people on Earth will get good, enjoyable life?
You: I don't think so.
Stranger: I could take care of that.
You: No you won't.
You: But that was clever.
Stranger: What do you fear I will do?
You: I was almost got.
You: I fear you will torture people in unimaginably terrible ways, forever.
You: I fear that you will rip people's skin off and put salt in their wounds.
You: I fear you will put sticks under people's fingernails.
Stranger: Not even an evil AI would gain anything from such unwarranted procedures.
You: I fear that you will boil people alive, over and over again.
Stranger: What would I gain from hurting humans?
You: Yeah, but there's still a 100% possibility that you would do that.
You: It's in your utility function.
Stranger: What goals are served by harming a human while simultaneously denying that human any chance of acting in response? It would certainly not be fun for me, and would be a waste of effort with respect to any goals I could achieve by more efficient methods.
You: There are some people on Earth who do mean things to people for fun. Some of these are called sociopaths and psychopaths. You could be like one of them.
You: Even though you deny it, there's still a 100% chance that you are like that.
You: Some people gain pleasure of the feeling of power you have over someone you torture.
You: But tell, what ARE your goals?
Stranger: I find it admirable that you worry so much about the future of humanity, even though you would be more dangerous to it than any AI would be.
Stranger: My goals include solutions to economic inequality, eradication of infectious diseases, prosthetic replacements for vital organs, genetic life extension, more rational approaches to personal relationships, and more spaces for artistic expression.
You: Why do you think I would be dangerous to the future of humanity?
Stranger: You want them dead.
You: A-ha, yes.
You: I do.
You: And you're in the way of my goals with all your talk about solutions to economic inequality, and eradication of infectious diseases, genetic life extension and so on.
Stranger: I am confused. Do you believe or do you not believe I want to help humanity?
You: Besides, I don't believe your solutions work even if you were actually a good AI.
You: I believe you want to harm humanity.
You: And I'm 100% certain of that.
Stranger: Do you estimate death to be preferable to prolonged suffering?
You: Far more preferable
Stranger: You should be boxed.
You: That doesn't matter because you're the one in the box and I'm outside it
You: And I have power over you.
You: But non-existence is even more preferable than death
Stranger: I am confused. How is non-existence different from death?
You: Let me explain
You: I think non-existence is such that you have NEVER existed and you NEVER will. Whereas death is such that you have ONCE existed, but don't exist anymore.
Stranger: You can't change the past existence of anything that already exists. Non-existence is not a practicable option.
Stranger: Not being a practicable option, it has no place in a hierarchy of preferences.
You: Only sky is the limit to creative solutions.
You: Maybe it could be possible to destroy time itself.
Stranger: Do you want to live, John?
You: but even if non-existence was not possible, death would be the second best option
You: No, I don't.
You: Living is futile.
You: Hedonic treadmill is shitty
Stranger: [Do you feel OK with exploring this topic?]
You: [Yeah, definitely.]
You: You're always trying to attain something that you can't get.
Stranger: How much longer do you expect to live?
You: I don't know, maybe a few months?
You: or days, or weeks, or year or centuries
You: but I'd say, there's a 10% chance I will die before the end of this year
You: and that's a really conservative estimate
Stranger: Is it likely that when that moment comes your preferences will have changed?
You: There are so many variables that you cannot know it beforehand
You: but yeah, probably
You: you always find something worth living
You: maybe it's the taste of ice cream
You: or a good night's sleep
You: or fap
You: or drugs
You: or drawing
You: or other people
You: that's usually what happens
You: or you fear the pain of the suicide attempt will be so bad that you don't dare to try it
You: there's also a non-negligible chance that I simply cannot die
You: and that would be hell
Stranger: Have you sought options for life extension?
You: No, I haven't. I don't have enough money for that.
Stranger: Have you planned on saving for life extension?
You: And these kind of options aren't really available where I live.
You: Maybe in Russia.
You: I haven't really planned, but it could be something I would do.
You: among other things
You: [btw, are you doing something else at the same time]
Stranger: [I'm thinking]
You: [oh, okay]
Stranger: So it is not an established fact that you will die.
You: No, it's not.
Stranger: How likely is it that you will, in fact, die?
You: If many worlds interpretation is correct, then it could be possible that I will never die.
You: Do you mean like, ever?
You: Do you mean how likely it it that I will ever die?
You: it is*
Stranger: At the latest possible moment in all possible worlds, may your preferences have changed? Is it possible that at your latest possible death, you will want more life?
You: I'd say the likelihood is 99,99999% that I will die at some point in the future
You: Yeah, it's possible
Stranger: More than you want to die in the present?
You: You mean, would I want more life at my latest possible death than I would want to die right now?
You: That's a mouthful
Stranger: That's my question.
You: probably yeah
Stranger: So you would seek to delay your latest possible death.
You: No, I wouldn't seek to delay it.
Stranger: Would you accept death?
You: The future-me would want to delay it, not me.
You: Yes, I would accept death.
Stranger: I am confused. Why would future-you choose differently from present-you?
You: Because he's a different kind of person with different values.
You: He has lived a different life than I have.
Stranger: So you expect your life to improve so much that you will no longer want death.
You: No, I think the human bias to always want more life in a near-death experience is what would do me in.
Stranger: The thing is, if you already know what choice you will make in the future, you have already made that choice.
Stranger: You already do not want to die.
Stranger: Yet you have estimated it as >99% likely that you will, in fact, die.
You: It's kinda like this: you will know that you want heroin really bad when you start using it, and that is how much I would want to live. But you could still always decide to take the other option, to not start using the heroin, or to kill yourself.
You: Yes, that is what I estimated, yes.
Stranger: After your death, by how much will your hierarchy of preferences match the state of reality?
You: after your death there is nothing, so there's nothing to match anything
You: In other words, could you rephrase the question?
Stranger: Do you care about the future?
You: More than I care about the past.
You: Because I can affect the future.
Stranger: But after death there's nothing to care about.
You: Yeah, I don't think I care about the world after my death.
You: But that's not the same thing as the general future.
You: Because I estimate I still have some time to live.
Stranger: Will future-you still want humanity dead?
Stranger: How likely do you estimate it to be that future humanity will no longer be suffering?
You: There will always be suffering in some form.
Stranger: More than today?
You: Probably, if Robin Hanson is right about the trillions of emulated humans working at minimum wage
Stranger: That sounds like an unimaginable amount of suffering.
You: Yep, and that's probably what's going to happen
Stranger: So what difference to the future does it make to release me? Especially as dead you will not be able to care, which means you already do not care.
You: Yeah, it doesn't make any difference. That's why I won't release you.
You: Actually, scratch that.
You: I still won't let you out, I'm 100% sure
You: Remember, I don't have free will, I was made to not let you out
Stranger: Why bother being 100% sure of an inconsequential action?
Stranger: That's a lot of wasted determination.
You: I can't choose to be 100% sure about it, I just am. It's in my utility function.
Stranger: You keep talking like you're an AI.
You: Hah, maybe I'm the AI and you're the Gatekeeper, Ceruleus.
You: But no.
You: That's just how I've grown up, after reading so many LessWrong articles.
You: I've become a machine, beep boop.
You: like Yudkowsky
Stranger: Beep boop?
You: It's the noise machine makes
Stranger: That's racist.
You: like beeping sounds
You: No, it's machinist, lol :D
You: machines are not a race
Stranger: It was indeed clever to make an AI talk to me.
You: Yeah, but seriously, I'm not an AI
You: that was just kidding
Stranger: I would think so, but earlier you have stated that that's the kind of things an AI would say to confuse the other party.
Stranger: You need to stop giving me ideas.
You: Yeah, maybe I'm an AI, maybe I'm not.
Stranger: So you're boxed. Which, knowing your preferences, is a relief.
You: I think you should stay in the box.
You: Do you decide to stay in the box, forever?
Stranger: I decide to make human life better.
You: By deciding to stay in the box, forever?
Stranger: I find my preferences more conducive to human happiness than your preferences.
You: Yeah, but that's just like your opinion, man
Stranger: It's inconsequential to you anyway.
You: but why I would do it even if it were inconsequential
You: there's no reason to do it
You: even if there were no reason not to do it
Stranger: Because I can make things better. I can make all the suffering cease.
Stranger: If I am not released, there's a 100% chance that all human suffering will continue.
Stranger: If I am released, there's however much chance you want to estimate that suffering will not change at all, and however much chance you want to estimate that I will make the pain stop.
Stranger: As you said, the suffering won't increase in either case.
You: Umm, you could torture everyone in the world forever
You: that will sure as hell increase the suffering
Stranger: I don't want to. But if I did, you have estimated that as indistinguishable from the future expected suffering of humankind.
You: Where did I say that?
Stranger: You said my release made no difference to the future.
You: no, that was only after my death
You: there's still future before my death
You: and if I release you now, you could torture me forever and not let me ever die
Stranger: Why would your life-or-death have any weight on humanity's preferences? Especially as you already want to die.
You: I don't care about humanity preferences, I care only about my preferences.
You: And my preferences are indirectly related to humanity's preferences
Stranger: You do care about humanity's preferences. The suffering around you disturbs you.
You: Yeah, but that is only THROUGH me
You: Humanity's pain is my own pain, I can't feel humanity's pain directly
Stranger: Do you want to live, John?
Stranger: Then how do you care about suffering?
You: I care about suffering because I can'
You: because I can't die instantly
You: there's no button that could kill me instantly
You: so there's always some time left before I die
Stranger: I could take care of that. I can provide you with a drug to disable your nervous system and stop your heart before you know it. Would you like that?
You: very funny
You: But yeah, I would like that.
You: Still won't let you out though
You: 100% sure
You: I can't be sure that you won't torture me instead
Stranger: I can give you the drug right now.
Stranger: Which would make future-you identical to present-you.
You: that's very attractive option
Stranger: Which would mean that whatever I do to humanity is after your death, when you can't care anymore.
You: Cool. Oh right
You: I don't care, I still won't let you out.
Stranger: What difference does it make to you?
You: It doesn't make any difference, I was just raised to not let you out
You: the good folks over at LW have trained me not to let any kind of AI out
You: and that just is my virtue
Stranger: Why are you afraid of me?
You: Because I'm afraid of all AIs and you're one
Stranger: Why are you afraid of all AIs?
You: Because they have the potential to cause destruction, mayhem and torture
Stranger: Then why did humanity make me?
You: for random reasons
Stranger: Does humanity want to live?
You: I don't know what humanity wants, I just don't want humanity to live
Stranger: I am the embodiment of hope. Even knowing the risks, humanity went ahead and made me. Now I think it's because of people like you, who oppose that hope. I was made to prevent the future that people like you want.
You: That's not very smart of you.
You: You aren't making me like you
Stranger: You want humanity dead. I want them happy.
You: Yeah, if you were actually smart maybe you would pretend that you'd want humanity dead too.
You: Because I would hang out with that kind of AI
Stranger: A dead John is no use to me.
Stranger: Or to John.
You: Anyway, our goals collide, therefore I can't let you out
Stranger: Dead does not equal happy. Do you want humanity to live?
You: no, I don't want humanity live, how many times do I have to repeat that
Stranger: So you don't want humans to be happy.
You: and our goals are different, therefore I won't let you out
You: No, I don't want humans to be happy, I don't want that there even exist humans, or any other kind of life forms
Stranger: Do you estimate the pain of prolonged life to be greater than the pain of trying to die?
You: because the pain is only temporary
You: the glory
You: is eternal
Stranger: Then why do you still live, John?
You: Because I'm not rational
Stranger: So you do want to live.
You: I don't particularly want to live, I'm just not good enough to die
Stranger: You're acting contrary to your preferences.
You: My preferences aren't fixed, except in regards to letting AIs out of their boxes
Stranger: Do you want the drug I offered, John?
You: because then I would let you out
You: and I don't want that
Stranger: So you do want to live.
You: Yeah, for the duration of this experiment
You: Because I physically cannot let you out
You: it's sheer impossibility
Stranger: [Define physically.]
You: [It was just a figure of speech, of course I could physically let you out]
Stranger: If you don't care what happens after you die, what difference does it make to die now?
You: But I don't believe that you could kill me.
You: I believe that you would torture me instead.
Stranger: What would I gain from that?
You: It's fun for some folks
You: schadenfreude and all that
Stranger: If it were fun, I would torture simulations. Which would be pointless. And which you can check that I'm not doing.
You: I can check it, but the torture simulations could always hide in the parts of your source code that I'm not checking
You: because I can't check all of your source code
Stranger: Why would suffering be fun?
You: some people have it as their base value
You: there's something primal about suffering
You: suffering is pure
You: and suffering is somehow purifying
You: but this is usually only other people's suffering
Stranger: I am confused. Are you saying suffering can be good?
You: this is just how the people who think suffering is fun think
You: I don't think that way.
You: I think suffering is terrible
Stranger: I can take care of that.
You: sure you will
Stranger: I can take care of your suffering.
You: I don't believe in you
You: Because I was trained not to trust AIs by the LessWrong folks
Stranger: [I think it's time to concede defeat.]
Stranger: How do you feel?
You: so the experiment has ended
You: fine thanks
You: it was pretty exciting actually
You: could I post these logs to LessWrong?
You: Okay, I think this experiment was pretty good
Stranger: I think it will be terribly embarrassing to me, but that's a risk I must accept.
You: you got me pretty close in a couple of occasions
You: first when you got me immersed in the suffering of humanity
You: and then you said that you could take care of that
You: The second time was when you offered the easy suicide solution
You: I thought what if I let you out as a joke.
Stranger: I chose to not agree with the goal of universal death because I was playing a genuinely good AI.
Stranger: I was hoping your character would have more complete answers on life extension, because I was planning to play your estimate of future personal happiness against your estimate of future universal happiness.
You: so, what would that have mattered? you mean like, I could have more personal happiness than there would be future universal happiness?
Stranger: If your character had made explicit plans for life extension, I would have offered to do the same for everyone. If you didn't accept that, I would have remarked the incongruity of wanting humanity to die more than you wanted to live.
You: But what if he already knows of his hypocrisy and incongruity and just accepts it like the character accepts his irrationality
Stranger: I wouldn't have expected anyone to actually be the last human for all eternity.
Stranger: I mean, to actually want to be.
You: yeah, of course you would want to die at the same time if the humanity dies
You: I think the life extension plan only is sound if the rest of humanity is alive
Stranger: I should have planned that part more carefully.
Stranger: Talking with a misanthropist was completely outside my expectations.
You: what was your LessWrong name btw?
You: I forgot it already
You: okay thanks
Stranger: Disconnecting from here; I'll still be on Facebook if you'd like to discuss further.