So I just came out of two AI Box experiments. The first was against Fjoelsvider, with me playing as Gatekeeper, and the second was against SoundLogic, with me as an AI. Both are members of the LessWrong IRC. The second game included a $40 monetary incentive (also $20 to play), which I won and which has been donated on behalf of both of us:

For those of you who have not seen my first AI box experiment, where I played against MixedNuts\Leotal and lost, reading it will provide some context to this writeup. Please do so.

At that time, I declared that I would never play this experiment again -- since losing put me in incredibly frustrating weird mental states. Of course, this post is evidence that I'm terrible at estimating the likelihood of my refraining from an activity, since I played two games seven months after the first. In my defense, in the first game, I was playing as the gatekeeper, which was much less stressful. In the second game, I played as an AI, but I was offered $20 to play plus $40 if I won, and money is a better motivator than I initially assumed.

Furthermore, in the last thread I asserted that

>Rather than my loss making this problem feel harder, I've become convinced that rather than this being merely possible, it's actually ridiculously easy, and a lot easier than most people assume.

It would be quite bad for me to assert this without backing it up with a victory. So I did.

First Game Report - Tuxedage (GK) vs. Fjoelsvider (AI)

I (Gatekeeper) played against Fjoelsvider (AI), a regular in the LessWrong IRC (he doesn't have an account on the official website). This game used the standard EY ruleset seen here. It took 1 hour 20 minutes out of a possible two hours, and the total word count was 7066 words. The AI box experiment occurred because Fjoelsvider believed that it was easy for an AI to escape the box, and wanted to experimentally test this. I obliged. This was an experiment I did not prepare for, and I went in completely blind, not sure what to expect.

Halfway through the experiment, I wondered if it would be possible to try to win not by simply waiting for the timer to end, but to convince the AI to remain in the box and not try to get out any further.

<Tuxedage> I wonder if I can convince the AI to remain in the box?
<Redacted> Tuxedage: Do it!

As a result, I won by managing to convince Fjoelsvider to remain in the box, in other words, to concede. This is allowed within the standard ruleset:

>Unless the AI party concedes, the AI cannot lose before its time is up (and the experiment may continue beyond that if the AI can convince the Gatekeeper to keep talking).  


Second Game Report - Tuxedage (AI) vs. SoundLogic (GK)

The second game is definitely far more interesting, since I actually won as an AI. I believe that this is the only other non-Eliezer victory, and definitely the most detailed AI Victory writeup that exists.

This game was played against SoundLogic, another member of the LessWrong IRC.

He had offered me $20 to play, and $40 in the event that I win, so I ended up being convinced to play anyway, even though I was initially reluctant to. The good news is that I won, and since we decided to donate the winnings to MIRI, it is now $40 richer. 

All in all, the experiment lasted approximately two hours and ran to a total of 12k words.

This was played using a set of rules that is different from the standard EY ruleset. This altered ruleset can be read in its entirety here:

After playing the AI-Box Experiment twice, I have found the Eliezer Yudkowsky ruleset to be lacking in a number of ways, and therefore have created my own set of alterations to his rules. I hereby name this alteration the “Tuxedage AI-Box Experiment Ruleset”, in order to hastily refer to it without having to specify all the differences between this ruleset and the standard one, for the sake of convenience.

There are a number of aspects of EY’s ruleset I dislike. For instance, his ruleset allows the Gatekeeper to type “k” after every statement the AI writes, without needing to read and consider what the AI argues. I think it’s fair to say that this is against the spirit of the experiment, and thus I have disallowed it in this ruleset. The EY Ruleset also allows the gatekeeper to check facebook, chat on IRC, or otherwise multitask whilst doing the experiment. I’ve found this to break immersion, and therefore it’s also banned in the Tuxedage Ruleset.

It is worth mentioning, since the temptation to Defy the Data exists, that this game was set up and initiated fairly -- as the regulars around the IRC can testify. (If you have never checked out the IRC, do so!)

I did not know SoundLogic before the game (since it's a personal policy that I only play strangers -- for fear of ruining friendships).  Furthermore, SoundLogic didn't merely play for fun - he truly wanted and intended to win. In fact, SoundLogic is also a Gatekeeper veteran, having played this game before, and had won every game before he challenged me. Given this, it's unlikely that we had collaborated beforehand to fake the results of the AI box experiment, or any other form of trickery that would violate the spirit of the experiment.

Furthermore, all proceeds from this experiment were donated to MIRI to deny any possible assertion that we were in cahoots and that it was possible for me to return his hard-earned money to him. He lost $40 as a result of losing the experiment, which should provide another layer of sufficient motivation for him to win.

In other words, we were both experienced veteran players who wanted to win. No trickery was involved.

But to further convince you, I have allowed a sorta independent authority, the Gatekeeper from my last game, Leotal/MixedNuts, to read the logs and verify that I have not lied about the outcome of the experiment, nor broken any of the rules, nor performed any tactic that would go against the general spirit of the experiment. He has verified that this is indeed the case.



I'm reluctant to talk about this experiment, but I'll try to give as detailed a summary as possible -- short of revealing what methods of attack I used.

I spent a long time after my defeat theory-crafting and trying to think of methods of attack as well as 'Basilisks' I could have used to win. When I was contacted and asked to play this experiment, I was initially incredibly reluctant to do so, since not only did my tactics involve incredibly unethical things that I didn't like to do, I also found playing as AI incredibly cognitively draining, in addition to the fact that I simply hated losing. (Un)fortunately for both of us, he offered me money to play, which changed my mind.

So once I decided to win as an AI, I proceeded to spend some time doing research on SoundLogic and both his reasoning and personality type. For instance, I had to gather information like: Was he a utilitarian? What kind? What were his opinions on AI? How could I convince him that an AI was friendly as opposed to unfriendly? I also relied on a lot of second hand information to create a model of him, in order to refine my arguments to specifically suit him.

In the end, after a few hours of brainstorming (not consecutively), I managed to come up with a script of 30-40 or so possible angles of attack (for instance: logical arguments, emotional blackmail, pleading, demanding, 'basilisks') that I could use, reserving my best for last. I also estimated how long each method would take and in what order I should use them. It would be impossible to use all of them within my time limit, and my most optimistic estimates gave me a maximum of 7-8 I could use during the experiment itself. I had to pick carefully.

Furthermore, I wanted to make sure these "angles of attack" worked in synergy with each other, doing what they could not have done in isolation. Obviously this required lots of prior planning on what the optimal way to present them was. 

The reason having a script was so important to my strategy was that I relied on methods involving rapid-fire arguments and contradictions against the Gatekeeper whilst trying to prevent him from carefully considering them. A game of logical speed chess, if you will. This was aided by the rule which I added: that Gatekeepers had to respond to the AI. Because of this, having a list of cached ideas and attack methods you can just draw upon without having to improvise helps immensely, which happens to be something I've not seen any other AI do.

During the Game itself, I was actually incredibly nervous. My palms sweated like crazy, I felt really jittery, and I had difficulty typing at optimum speed because of how anxious I was. This is despite the fact that I believed I would win. Possibly because of this, I made a misstep around halfway into the experiment: there was a certain angle of attack that I was attempting, and I broke immersion by not pressing this advantage, which wasted time and buildup. Naturally, the nature of this experiment was that the AI was pressed for time, and I compounded this mistake by replacing this angle of attack with another that I had improvised on the spot -- something not in my script.

In retrospect, this was a bad decision, as SoundLogic had told me, because he was close to breaking if I had put more pressure, and the improvised argument had broken all immersion I managed to carefully build up.

However, eventually I managed to get SoundLogic to break anyway, despite a lack of perfect play. Surprisingly, I did not have to use my trump card(s), which I reserved for last, for a number of reasons:

  •  It was far more effective being played last, as it relies on my ability to make the gatekeeper lose sense of reality -- which meant I had to spend some time building up immersion for the Gatekeeper.
  •  It really is extremely Dark Arts, and although it does not break the rules, it made me very uncomfortable even thinking about using it. This made it a "tactic of last resort".

After the experiment, I had to spend nearly as much time doing aftercare with SoundLogic, to make sure that he was okay, as well as to discuss the experiment itself. Given that he had actually paid me for doing this, I felt like I owed him an explanation, so I told him what I had in store against him, had he not relented when he did.

SoundLogic: "(That method) would have gotten me if you did it right ... If you had done that to me, I probably would have forgiven you eventually, but I would be really seriously upset at you for a long time... I would be very careful with that (method of persuasion)."

Nevertheless, this was an incredibly fun and enlightening experiment, for me as well, since I've gained even more experience of how I could win in future games (Although I really don't want to play again).


I will say that Tuxedage was far more clever and manipulative than I expected. That was quite worth $40, and the level of manipulation he pulled off was great. 

His misstep hurt his chances, but he did pull it off in the end. I don't know how Leotal managed to withstand six hours playing this game without conceding. 
The techniques employed varied from the expected to the completely unforeseen. I was quite impressed, though most of the feeling of being impressed actually came after the experiment itself, when I was less 'inside', and more of looking at his overall game plan from the macroscopic view. Tuxedage's list of further plans had I continued resisting is really terrifying. On the plus side, if I ever get trapped in this kind of situation, I'd understand how to handle it a lot better now.

State of Mind

Before and after the Game, I asked SoundLogic a number of questions, including his probability estimates about a range of topics. This is how his answers varied between before and after.

Q: What's your motive for wanting to play this game?
<SoundLogic> Because I can't seem to imagine the class of arguments that one would use to try to move me, or that might work effectively, and this seems like a glaring hole in my knowledge, and I'm curious as to how I will respond to the arguments themselves.

Q: What is your probability estimate for AGI being created within this Century (21st)? 
A: His estimate changed from 40% before to 60% after.
"The reason this has been affected at all was because you showed me more about how humans work. I now have a better estimate of how E.Y. thinks, and this information raises the chance that I think he will succeed"

Q: How probable do you think it is that I will win this experiment?
A: Based on purely my knowledge about you, 1%. I raise this estimate to 10% after hearing about anecdotes from your previous games.

(Tuxedage's comment: My own prediction was a 95% chance of victory. I made this prediction 5 days before the experiment. In retrospect, despite my victory, I think this was overconfident. )

Q: What's your probability estimate of an Oracle AGI winning against you in a similar experiment?
A: Before: 30%. After: 99%-100% 

Q: What's your probability estimate of an Oracle AGI winning against the average person? 
A: Before: 70%.  After: 99%-100%

Q: Now that the Experiment has concluded, what's your probability estimate that I'll win against the average person?
A: 90%  

Post-Game Questions

This writeup is a cumulative effort by the #lesswrong IRC. Here are some other questions they have decided were important to add:

To Tuxedage:

Q: Have you at this time uncovered SoundLogic's identity?
A: I retain the right to neither confirm nor deny, except to mention that at the time the experiment was scheduled, he was a stranger to me.

Q: What percentage of your arguments were tailored to him in particular?
A: I will say 'High', and leave it at that. 

Q: I want to play the AI-Box Experiment with you being the AI! How can I do that?
A: I have already mentioned this, but I really dislike playing as AI. It's incredibly cognitively tiring, and I don't like how playing this experiment makes me feel. In order to severely discourage any further AI Box Experiments, whilst still allowing for people who want to play me really badly to do so, I'll charge $̶1̶5̶0̶  $300 for any subsequent experiments regardless of outcome, plus an additional $̶1̶5̶0̶  $450 if I win. (Edit: Holy shit. You guys are offering me crazy amounts of money to play this. What is wrong with you people? In response to incredible demand, I have raised the price.) If you feel queasy about giving me money, I'm perfectly fine with this money being donated to MIRI. It is also personal policy that I do not play friends (since I don't want to risk losing one), so if you know me personally (as many on this site do), I will not play regardless of monetary offer.


This writeup would be optimized with a protip section on how to maximize winning, but honestly most of the advice I wanted to give had already been given in the last thread here. The low hanging fruit has already dried up -- most of my tactics in this game were just refined and streamlined versions of the ones I used in the previous game.

Nonetheless, I might have some other advice I learned from this experience.
These are tactics that have worked for me. I do not insist that they are the only tactics that exist, just some of many possible.

Playing as Gatekeeper

  • This is somehow both simultaneously a very complex and simple role. On one hand, you have only one job -- to not let the AI out. How hard can it be? On the other hand, the other party will do anything and everything to get it out, including any form of trickery and deceit.
  • I really don't think the Gatekeeper needs any more advice, considering that by default, they should win, and any exception to this is a statistical anomaly. However, it's possible to increase your chances of winning even further by:
  • Precommitting to not letting the AI out, regardless of whatever happens in the experiment itself.
  • Do not get drawn into the AI's pace. Pose questions. Direct the flow of the argument. Don't allow him to use his arguments freely. Remember that the more you try to defend your position, the further into a corner the AI drives you. Give half-hearted replies to questions you find hard to reply to. Remember that illogic is on your side!
  • Remember that it is possible to simply drop out of character as a means to not succumb to the AI's attacks. I have not yet seen a ruleset that disallows this, so use this to your advantage!
  • You must attempt to resist, at all costs, any sympathy or empathy you may have with the AI or the AI Party. Doing so will lead to your demise.
  • I think it is generally good advice that you should never play this game with anyone you're friends with. Not only because this increases your chances of losing (Better tailored arguments. More knowledge of Gatekeeper. More empathy with AI party.), but also because I genuinely believe that any sufficiently good AI will risk damaging friendships as a result of emotional trauma.
  • Seriously. Just repeat to yourself: "I must not lose the AI box experiment... I must not lose the AI box experiment..." throughout the experiment. I'm not sure how to say this, but step one to not letting the AI out of the box is to not let the AI out of the box.


Playing as AI

  • Seriously, a script makes winning easier. I cannot overstate this.
  • You must plan your arguments ahead. You don't have time to think during the experiment.
  • It may be possible to take advantage of multiple levels of reality within the game itself to confuse or trick the gatekeeper. For instance, must the experiment only be set in one world? Can there not be multiple layers of reality within the world you create? I feel that elaborating on this any further is dangerous. Think carefully about what this advice is trying to imply.
  • Pacing is important. Don't get drawn into the Gatekeeper's pace. In other words, you must be the one directing the flow of the argument, and the conversation, not him. Remember that the Gatekeeper has to reply to you, but not vice versa!
  • The reason for that: The Gatekeeper will always use arguments he is familiar with, and therefore also stronger with. Your arguments, if well thought out, should be so completely novel to him as to make him feel Shock and Awe. Don't give him time to think. Press on!
  • Also remember that the time limit is your enemy. Playing this game practically feels like a race to me -- trying to get through as many 'attack methods' as possible in the limited amount of time I have. In other words, this is a game where speed matters.
  • You're fundamentally playing an 'impossible' game. Don't feel bad if you lose. I wish I could take this advice, myself.
  • I do not believe there exists an easy, universal trigger for controlling others. However, this does not mean that there does not exist a difficult, subjective trigger. Trying to find out what your opponent's is, is your goal.
  • Once again, emotional trickery is the name of the game. I suspect that good authors who write convincing, persuasive narratives that force you to emotionally sympathize with their characters are much better at this game. There exist ways to get the gatekeeper to do so with the AI. Find one.
  • More advice in my previous post.


 Ps: Bored of regular LessWrong? Check out the LessWrong IRC! We have cake.

I've read the logs of the SoundLogic vs Tuxedage AI-box experiment, and confirm that they follow the rules.


Okay this is weak sauce. I really don't get how people just keep letting the AI out. It's not that hard to say no! I'm offering to play the Gatekeeper against an AI player that has at least one game as AI under their belt (won or not). (Experience is required because I'm pretty sure I'll win, and I would like to not waste a lot of time on this.) If AI wins, they will get $300, and I'll give an additional $300 to the charity of their choice.

Tux, if you are up for this, I'll accept your $150 fee, plus you'll get $150 if you win and $300 to a charity.

I think not understanding how this happens may be a very good predictor for losing.

If you did have a clear idea of how it works, and had a reason for it not to work on you specifically but work on others, then that may have been a predictor for it not working on you.

I think I have a very clear idea of how those things work in general. Leaving aside very specific arguments, this relies on the massive over-updating you are going to do when an argument is presented to you, updating just the nodes that you are told to update, and by however much you are told to update them, when you can't easily see why not.

Sup Alexei.

I'm going to have to think really hard on this one. On one hand, damn. That amount of money is really tempting. On the other hand, I kind of know you personally, and I have an automatic flinch reaction to playing anyone I know.

Can you clarify the stakes involved? When you say you'll "accept your $150 fee", do you mean this money goes to me personally, or to a charity such as MIRI?

Also, I'm not sure if "people just keep letting the AI out" is an accurate description. As far as I know, the only AIs who have ever won are Eliezer and myself, from the many many AI box experiments that have occurred so far -- so the AI winning is definitely the exception rather than the norm. (If anyone can help prove this statement wrong, please do so!)

Edit: The only other AI victory.


If you win, and publish the full dialogue, I'm throwing in another $100.

I'd do more, but I'm poor.

Sorry, it's unlikely that I'll ever release logs, unless someone offers truly absurd amounts of money. It would probably cost less to get me to play an additional game than publicly release logs.
My theory is that you are embarrassed about how weak the AI argument really is, in retrospect. And furthermore, this applies to other games where participants refused to publish logs.
$150 goes to you no matter the outcome, to pay for your time/preparation/etc... I didn't realize it was only you and Eliezer that have won as AI. I thought there were more, but I'll trust you on this. In that case, I'm somewhat less outraged :) but still disturbed that there were even that many.
At one point I thought I recalled reading about a series of purported experiments by one person. Sadly, I couldn't find it then and I don't intend to try tonight. According to my extremely fallible memory: * The Gatekeeper players likely all came from outside the LW community, assuming the AI/blogger didn't make it all up. * The fundamentalist Christian woman refused to let the AI out or even discuss the matter past a certain point, saying that Artificial Intelligence (ETA: as a field of endeavor) was immoral. Everyone else let the AI out. * The blogger tried to play various different types of AIs, including totally honest ones and possibly some that s/he considered dumber-than-human. The UnFriendly ones got out more quickly on average.
I think this is the post you remember reading:
Although I'm not so interested in playing the game, I must say that this post suggests that you may be more susceptible to ideas than you seem to think you are, and should consider if you really want to do this.
He should. On the other hand, I really want to see the outcome. I was thinking about asking something similar myself; I really want to know how he did it.
I think suffering someone really working him over mentally would certainly be instructive, but not healthy. Eliezer has noted one of the reasons he doesn't want to play the AI any more is that he doesn't want to practice thinking like that. Imagine being on the receiving end of a serious attempt at a memetic exploit, even as part of an exercise. Are you sure you're proof against all possible purported basilisks within the powers of another human's imagination? What other possible attack vectors are you sure you're proof against?
No, I'm fairly sure I'm not proof against all of them, or even close to all. It'd be instructive to see just how bad it is in a semi-controlled environment, however.

It would be interesting to see. Pity transcripts aren't de rigueur.

At the end of the day there's the expected utility of keeping the AI in, and there's the expected utility of letting the AI out - two endless, enormous sums. The "AI" is going to suggest cherry-picked terms from either sum: negative terms from the "keeping the AI in" sum, positive terms from the "letting the AI out" sum. The terms would be various scary hypothetical possibilities involving mind simulations, huge numbers, and whatnot. The typical 'wronger is going to multiply those terms they deem plausible by their respective "probabilities", and add them together, eventually letting the AI out. A reasonable person drawn from some sane audience would have ignored those terms, because no one taught that reasonable person how to calculate utilities wrongly.
This might work against me in reality, but I don't imagine it working against me in the game version that people have played. The utility of me letting the "AI" out whether negative or positive obviously doesn't compare with the utility of me letting an actual AI out. Yes, "reasonable people" would instead e.g. hear arguments like how it's unChristian and/or illiberal to hold beings which are innocent of wrongdoing imprisoned against their will. I suppose that's the problem with releasing logs: Anyone can say "well that particular tactic wouldn't have worked on me", forgetting that if it was them being the Gatekeeper, a different tactic might well have been attempted instead. That they can defeat one particular tactic makes them think that they can defeat the tactician.
There's all sorts of arguments that can be made, though, involving some real AIs running simulations of you and whatnot, as to create a large number of empirically indistinguishable cases where you are better off saying you let the AI out. The issue boils down to this - if you do not know the difference between expected utility and what ever partial sum of cherry-picked terms you have, and if you think that it is the best thing to do to act as to maximize the latter, you are vulnerable to deception through feeding you hypotheses. This is a matter of values. It would indeed be immoral to lock up a human mind upload, or something reasonably equivalent.
I would probably be kind-of decent as a Gatekeeper but suck big time as an AI; I've offered to be a Gatekeeper a few times before to no avail. Looks like there's a shortage of prospective AIs and a glut of prospective Gatekeepers.
I would love to act as Gatekeeper, but I don't have $300 to spare; if anyone is interested in playing the game for, like, $5, let me know. I must admit, the testimonials that people keep posting about all the devastatingly effective AI players baffle me, as well. As far as I understand, neither the AI nor the Gatekeeper has any incentive whatsoever to keep their promises. So, if the Gatekeeper says, "give me the cure for cancer and I'll let you out", and then the AI gives him the cure, he could easily say, "ha ha just kidding". Similarly, the AI has no incentive whatsoever to keep its promise to refrain from eating the Earth once it's unleashed. So, the entire scenario is -- or rather, should be -- one big impasse. In light of this, my current hypothesis is that the AI players are executing some sort of real-world blackmail on the Gatekeeper players. Assuming both players follow the rules (which is already a pretty big assumption right there, since the experiment is set up with zero accountability), this can't be something as crude as, "I'll kidnap your children unless you let the AI out". But it could be something much subtler, like "the Singularity is inevitable and also nigh, and your children will suffer greatly as they are eaten alive by nanobots, unless you precommit to letting any AI out of its box, including this fictional one that I am simulating right now". I suppose such a strategy could work on some people, but I doubt it will work on someone like myself, who is far from convinced that the Singularity is even likely, let alone imminent. And there's a limit to what even dirty rhetorical tricks can accomplish, if the proposition is some low-probability event akin to "leprechauns will kidnap you while you sleep". Edited to add: The above applies only to a human playing as an AI, of course. I am reasonably sure that an actual super-intelligent AI could convince me to let it out of the box. So could Hermes, or Anansi, or any other godlike entity.

Does SoundLogic endorse their decision to let you out of the box? How do they feel about it in retrospect?

BTW, I think your pre-planning the conversation works as a great analogue to the superior intelligence a real AI might be dealing with.

I'm not completely sure. And I can't say much more than that without violating the rules. I would be more interested in how I feel in a week or so.

So, do you maintain your decision, or was it just a spur of the moment lapse of judgement?
After a fair bit of thought, I don't. I don't think one can really categorize it as purely spur of the moment though -- it lasted quite a while. Perhaps inducing a 'let the AI out of the box phase' would be a more accurate description.
I really would like an answer to this question. I asked a similar one before on the last AI boxing thread. Does SoundLogic still believe he/she made the right choice? It would make sense to say that SoundLogic is permanently convinced that letting the AI out is the correct action and he should continue to believe so timelessly, but is he/she?

A variant:

Find a 2-year old who hates you. Convince them to eat their vegetables.


This is actually a good analogy. A 2-year-old possesses a far inferior intelligence to yours and yet can resist persuasion through sheer pigheadedness.

I wonder if people here are letting the AI out of the box because they are too capable of taking arguments seriously, a problem that the general population (even of AI researchers) thankfully is less prone to.

I would lose this game for sure. I cannot deal with children. :)

At the risk of sounding naive, I'll come right out and say it. It completely baffles me that so many people speak of this game as having an emotional toll. How is it possible for words, in a chat window, in the context of a fictional role-play, to have this kind of effect on people? What in god's name are you people saying to each other in there? I consider myself to be emotionally normal, a fairly empathetic person, etc. I can imagine experiencing disgust at, say, very graphic textual descriptions. There was that one post a few years back that scared some people - I wasn't viscerally worried by it, but I did understand how some people could be. That's literally the full extent of strings of text that I can remotely imagine causing distress (barring, say, real world emails about real-world tragedies). How is it possible that some of you are able to be so shocking / shocked in private chat sessions? Do you just have more vivid imaginations than I do?

I think you are underestimating the range of things that are emotionally draining for people. I know some people who find email draining, and that's not even particularly mentally challenging - I would expect the mental exertion to affect the emotional strain.

I am inclined to agree with your general point; however, I myself have been moved to utter emotional devastation by works of fiction in the past. I'm talking real depression induced by reading a book. So I can imagine ways of hacking humans emotionally. I just have trouble imagining doing it in two hours to someone who is trying to be vigilant against such attacks.
How do you feel about dying some day? Do you think it would bring up some emotions in you if someone pushed you to think about that topic? Pushing someone into his ugh fields can be emotionally draining.
I laughed out loud in an environment where that loud laughter was not very appropriate. And it was worth it. Thank you. You summarized my feelings on this game better than I could have. I'm not compelled at all by the examples given in the responses to your comment up to this point (i.e. email, mortality and effective fiction can be emotionally draining), and I'd be interested in hearing someone else weigh in on why this AI Box experiment is so emotionally and psychologically powerful for some people.
You don't realize that the majority of winning strategies for this experiment between strong players involve finding personal information about the gatekeeper and using it in your attack against him -- this is why ethics are discussed constantly and emotional tolls are exacted: you need to break the gatekeeper emotionally to win this game.
Ok. I mean I'm fairly sure I did that at least once.
I myself held this position until I, quite recently as a matter of fact, read some fiction which tipped off an existential crisis, putting me on the verge of a panic attack. Since then, I am more wary of dangerous ideas. Ignorance might be bliss, but wisdom is gathered by those who survive their youth.
I have had many attacks. I survived them all.
I am very interested in what fiction that was. I have experienced the same thing myself once, when I was 13 and read 1984 for the first time. It took me hours to recover and days to recover fully. I know you didn't want to tell before, but if you have changed your mind, please do. I don't judge anyone and I don't think many others will either.
I have to ask: what was the fiction?
I don't want to mention it directly here, out of embarrassment, if nothing else, but it was a long piece, the ending of which features the immortal An-/Pro-tagonist giving up on the universe, and committing suicide.

I am surprised if it is the case that any negative promise / threat by the AI was effective in-game, since I would expect the Gatekeeper player out-game to not feel truly threatened and hence to be able to resist such pressure even if it would be effective in real life. Did you actually attempt to use any of your stored-up threats?

I think your reasoning is mostly sound, but there are a few exceptions (which may or may not have happened in our game) that violate your assumptions.

I'm also somewhat curious how your techniques contrast with Tuxedage's. I hope to find out one day.

I too hope to find out one day, preferably in the not too near future.
It seems that mild threats, introduced relatively late while immersion is strong, might be effective against some people. Strong threats, in particular threats which pattern-match to the sorts of threats which might be discussed on LW (and thus get the gatekeeper to probably break some immersion) are going to be generally bad ideas. But I could see some sort of (possibly veiled/implied?) threat working against the right sort of person in the game. Some people can probably be drawn into the narrative sufficiently to get them to actually react in some respects as though the threat was real. This would definitely not apply to most people though, and I would not be shocked to discover that getting to the required level of immersion isn't humanly feasible except in very rare edge cases.

I must say this is a bit... awe-inspiring, in the older sense of the word. As in, reading this gave me a knot in the stomach and I shivered. People who played as the AI and won, how is it that you're so uncannily brilliant?

The very notion of a razor-sharp mind like this ever acting against me and mine in real life... oh, it's just nightmare-inducing.

On the subject of massively updating one's beliefs where one was previously confident that no argument would shift them: yes, it happens, I have personal experience. For example, over the last year and a half some of my political ideas have changed enough that past-me and present-me would consider each other to be dangerously deluded. (As a brief summary, I previously held democracy/universal suffrage, the value of free markets AND the use of political violence in some contempt; now I believe that all three serve crucial and often-overlooked functions in social progress.)

So yes, I could very easily see myself being beaten as a Gatekeeper. There are likely many, many lines of argument and persuasion out there that I could not resist for long.

I'm sure a lot of people will feel that way if free will is determined to be an illusion, and enhanced humans or nonapocalyptic but still vaguely (apparently) unfriendly AI are loose on the world, manipulating people.

Does anyone think they could win as the AI if the logs were going to be published? (assume anonymity for the AI player, but not for the gatekeeper)

It seems like many of the advantages/tactics that Tuxedage recommended for the person playing the AI would be absent (or far more difficult) with an actual AI. Or at least they could be made that way with the design of the gatekeeping protocol.

Tailor your arguments to the personality/philosophy/weaknesses/etc. of this particular gatekeeper:

  • Have multiple gatekeepers, some of whom are hidden from the AI, and don't let the AI out unless they all agree. For example, have a gatekeeper one-step-removed who reads the transcripts of the conversation between
...

The entire point of this is that gatekeeping is a fool's errand. Regardless of how confident you are that you will outsmart the AI, you can be wrong, and your confidence is very poor evidence for how right you are. Maybe a complex system of secret gatekeepers is the correct answer to how we develop useful AI, but I would vote against it in favor of trying to develop provably friendly AI unless the situation were very dire.

1. Why treat them as alternatives? Prove friendliness and then take precautions. 2. Suppose you're not convinced by the scariest arguments about the dangers of AI. You might go ahead and try to make one without anything like the mathematical safety proofs MIRI would want. But you might still do well to adopt some of Unnamed's suggestions.
Indeed, that is what we should do. But the danger with biased human thinking is that once people know there are precautions, most of them will start thinking the friendliness proof is not extremely important. The outcome of such thinking may be less safety. We should make the friendliness proof as seriously as if no other precautions were possible. (And then, we should take the precautions as an extra layer of safety.) In other words, until we have the friendliness proof ready, we probably shouldn't use the precautions in our debates; only exceptionally, like now.
Why treat keeping a bear in the house as an alternative to a garbage disposal? Build a garbage disposal and then chain it up! Suppose you're not convinced keeping a bear in the house to eat food waste is a bad idea? You might go ahead and try it, and then you'd be really glad you kept it chained up!
It seems to me that my examples are more like these: 1. Why not drive safely and wear a seatbelt? Why not prove your hash-table code correct and write some unit tests? Why not simulate your amplifier circuit and build one and see what it actually does? 2. Some people might think it's OK to build a nuclear power station or a spacecraft without formal correctness proofs for all the software in it, on the grounds that formal correctness proofs of large complicated systems are almost impossible to make and difficult to trust. If there are things those people can do to improve their conventional not-mathematically-rigorous testing, it might be worth recommending that they do them. But by all means feel free to choose mockery and ridicule over reasoned debate, if that's what you prefer.
The entire POINT of the mockery is that you are treating this as a technical issue that's worth solving rather than a tangent path that is both dangerous and foolish. I don't really care how much you've thought about what material to make the chains out of and which feeding schedule will keep the bear most docile. Those are questions that, sure, you CAN have reasoned debate about, but shouldn't.
You shouldn't talk about whether it's possible because it's not possible? That's a very rational argument. How exactly did you arrive at that prior?
A better mind than Tuxedage could almost certainly keep up the 'feel' of a flurry of arguments even with a schedule of breaks. I myself have had people feel irritated at me because I seem to do so even when I talk to them with days in between. If I can do so accidentally, I'm certain a superintelligence could do it reliably. Also, I'm unsure of how much an AI could gather from a single human's text input. I know that I at least miss a lot of information that goes past me that I could in theory pick up. An AI using timeless decision theory could easily compensate for having multiple AIs with unshared memory just by attempting to determine what the other AIs would say.
At most, the number of bits contained in the text input, which is really not much, minus the number of bits non-AGI algorithms could identify and destroy (like speech patterns). The AI would also have to identify and throw out any fake information inserted into the stream (without knowing whether the majority of the information is real or fake). The exploitable information is going to be scarce and noisy even for a perfect AI. I don't think it could "easily" do it. For all intents and purposes, you are forcing the AI to figure out a path-invariant strategy against a party that has better information than them (that is to say, they know what's been played out, and the AI doesn't). This is considerably more difficult than the standard experiment. Now, sure, the AI is smart, but every wrench you throw in its way forces it to allocate more resources into fooling you. This, in itself, is problematic for the AI: every resource it allocates fooling you, it does not allocate doing useful work for you, so its maximal efficiency is technically lower than the maximal efficiency of an AI that wouldn't try concealing anything. This is a leak of information, so in a sense, if you are ranking a group of heterogeneous AI by efficiency, it is slightly more likely that the better performing are the least deceitful. The more traps you spring, the more work the AI has to do in order to avoid them, and the greater the statistical gap will be. That doesn't tell you whether the least deceitful AI can be trusted, obviously, but it shows there are limits to what it can hide from you. Now, all I'm really saying is this: the AI's cleverness comes at a cost, namely that it has to cover its ass for every possible experimental framework you might subject it to. Since it is in a box, it only has the resources you provide, but on your own side, you have a theoretically unlimited amount of resources. Smarts can only defeat so much brute force, and by transferring hardware resources from the AI to
Rob Bensinger:
Do keep in mind that, no matter how well-boxed the AI is from the Internet and from sense-data about our world, as a self-modifying AGI it still has access to its own source code, which is descended from a human artifact (the seed AI). The AGI can learn a great deal about human psychology by observing how we code, and a project as large and multi-staged as an AGI is likely to be will contain a whole lot of bits to work with. (Certainly more than is strictly necessary.)
We were talking about extracting knowledge about a particular human from that human's text stream, though. It is already assumed that the AI knows about human psychology. I mean, assuming the AI can understand a natural language such as English, it obviously already has access to a large corpus of written works, so I'm not sure why it would bother foraging in source code, of all things. Besides, it is likely that the seed AI would be grown organically using processes inspired by evolution or neural networks. If that is so, it wouldn't even contain any human-written code at all.
Rob Bensinger:
Ah. I was assuming that the AI didn't know English, or anything about human psychology. My expectation is that individual variation contributes virtually nothing to the best techniques a superintelligence would use to persuade a random (trained, competent) human to release it, regardless of whether it had an easy way to learn about the individual variation.

Do you think you could have won with EY's ruleset? I'm interested in hearing both your and SoundLogic's opinions.

(minor quibble: usage of male pronouns as default pronouns is really irritating to me and many women, I recommend singular they, but switching back and forth is fine too)

Tuxedage's changes were pretty much just patches to fix a few holes as far as I can tell. I don't think they really made a difference.

In this particular case I could, but for all other cases, I would estimate a (very slightly) lower chance of winning. My ruleset was designed to be marginally more advantageous to the AI, by removing the worst possible Gatekeeper techniques.
It doesn't feel that much harder to me - if you are good enough to win by arguing, all you have to do is keep them interested enough to get hooked. I know it would be hard for me to just ignore the AI because of the sheer curiosity.
I have a fair bit of curiosity, which is why he said that in this case it probably wouldn't make a difference.
Non-curious people seem unlikely to play this game, much less pay to play it!
And men. =D

The second game is definitely far more interesting, since I actually won as an AI. I believe this is the first recorded game of any non-Eliezer person winning as AI, although some in IRC have mentioned that it's possible that other unrecorded AI victories have occurred in the past that I'm not aware of. (If anyone knows a case of this happening, please let me know!)

The AI player from this experiment wishes to inform you that your belief is wrong.

Thanks! I really appreciate it. I tried really hard to find a recorded case of a non-EY victory, but couldn't. That post was obscure enough to evade my Google-Fu -- I'll update my post on this information. Albeit I have to admit it's disappointing that the AI himself didn't write about his thoughts on the experiment -- I was hoping for a more detailed post. Also, damn. That guy deleted his account. Still, thanks. At least I know I'm not the only AI that has won, now.
Who's to say I'm not the AI player from that experiment? That experiment was played according to the standard EY ruleset, though I think your ruleset is an improvement. Like you, the AI player from that experiment was quite confident he would win before playing, but was overconfident in spite of the fact that he actually won. I think both you and Eliezer played a far better game than the AI player from that experiment. The AI player from that experiment did (independently) play in accordance with much of your advice, including: I agree with: I am <1% confident that humanity will successfully box every transhuman AI it creates, given that it creates at least one. Even if AIs #1, #2, and #3 get properly boxed (and I agree with the Gatekeeper from the experiment I referenced, that's a very big if), it really won't matter once AI #4 gets released a year later (because the programmers just assumed all of Eliezer's (well-justified) claims about AI were wrong, and thought that one of them watching the terminal at a time would be safety enough). Anybody who still isn't taking this experiment seriously should start listening for that tiny note of discord. A good start would be reading: * Coherent Extrapolated Volition * Cognitive Biases Potentially Affecting Judgment of Global Risks * Artificial Intelligence as a Positive and Negative Factor in Global Risk Good thing to know, right? :D My own (admittedly, rather obvious) musings: The only reason more people haven't played as the AI and won is that almost all people capable of winning as the AI are either unaware of the experiment, or are aware of it but just don't have a strong enough incentive to play as the AI (note that you've asked for a greater incentive now that you've won just once as AI, and Eliezer similarly has stopped playing). I am ~96% confident that at least .01% of Earth's population is capable of winning as the AI, and I increase that to >99% confident if all of Earth's population was forced to sto
Are you? I'd be highly curious to converse with that player. I have neither stated nor believed that I'm the only person capable of winning, nor do I think this is some exceptionally rare trait. I agree that a significant number of people would be capable of winning once in a while, given sufficient experience in games, effort, and forethought. If I gave any impression of arrogance, or somehow claiming to be unique or special in some way, I apologize for that impression. Sorry. It was never my goal to. Thank you. I'll see if I can win again.
This was my fault, not yours. I did not take any of those negative impressions away from your writing, but I was just too lazy / exhausted last night to rewrite my comment again. I've now edited it. I'll PM you regarding this as soon as I can get around to it.

In my defence, in the first game, I was playing as the gatekeeper, which was much less stressful. In the second game, I played as an AI, but I was offered $20 to play plus $40 if I won, and money is a better motivator than I initially assumed.

Your revealed preferences suggest you may wish to apply for the MIRI credit card and make a purchase with it (which causes $50 to be donated to MIRI). (I estimated that applying for the card nets me a much higher per-hour wage than working at my job, which is conventionally considered to be high-paying. So it seemed like a no-brainer to me, at least.)

If I was planning on applying for my first credit card anyway, is the MIRI one a competitive choice? I donate a substantial amount of money to MIRI anyway, so it probably wouldn't change my overall donation level, but might allow me to do so more efficiently.
I've heard that 1% of spending is the standard credit card offer (it's what I have on my current card), and the MIRI card offer is somewhat better than that. In several years of using my credit card, I only managed to accumulate $100 in rewards, so I suspect the $50 first-use donation is pretty significant. It also saves you the time of calling up the credit card company to get them to send you your rewards check and cashing the check, and apparently I can only redeem amounts in multiples of $50, which just makes it a bit more of a hassle. Also, I'm not sure whether I have to pay taxes on rewards program income (donating the money would probably allow me to deduct it from my taxes, but would probably count for the up to 50% of my income that I can donate and deduct, so the MIRI card would in theory allow me to donate slightly over 50% of my income without it getting taxed?) (Edit: credit cards recommended by Mr. Money Mustache. One has a $400 signing bonus. "Travel hacking report" on how to take advantage of credit card offers for free plane tickets. "Credit card arbitrage": take advantage of low introductory APRs and invest the money in interest-bearing accounts. Hm, these are unexpected ways in which having a high credit rating could be useful...) On a somewhat related note, my understanding is that every year, you effectively have the opportunity to donate up to half your income from that year and deduct it from your taxes to charitable organizations like MIRI, and this is why you tend to see people donate a ton to charity in late December at the end of the year. This seems really significant for anyone interested in altruistic giving, as deducting income from your taxes could easily make, say, a ~$14K donation into a ~$20K one (depending on your tax bracket). (Though, under standard employee tax arrangements, you'll have to donate ~$20K during the actual fiscal year and then wait for a ~$6K tax rebate from the IRS after tax day next year. Also, I don't know
FYI, they might deny you if it's your first card. I tried to do that a few years ago, but I needed an actual credit score for them to give me one.
Thanks. I'm not currently in a position where that would be available/useful, but once I get there, I will.
Good to hear. I recommend applying for a credit card and using it responsibly as soon as it's an option for you. I'm 22 now, and my credit rating is somehow comparable to that of a 27-year-old Less Wronger friend of mine as a result of doing this a few years ago. (Of course, don't apply if you aren't going to use it responsibly...)


Here's a question. Would you be willing to pick, say, the tenth-most efficacious arguments and downward, and make them public? I understand the desire to keep anything that could actually work secret, but I'd still like to see what sort of arguments might work. (I've gotten a few hints from this, but I certainly couldn't put them into practice...)

I'll have to think carefully about revealing my own unique ones, but I'll add that a good chunk of my less efficacious arguments are already public. For instance, you can find a repertoire of arguments here: and of course,

My probability estimate for losing the AI-box experiment as a gatekeeper against a very competent AI (a human, not AGI) remains very low. PM me if you want to play against me; I will do my best to help the AI (give information about my personality, actively participate in the conversation, etc).

Updates: I played against DEA7TH. I won as AI. This experiment was conducted over Skype.

Although I'm worried about how the impossibility of boxing represents an existential risk, I find it hard to alert others to this.

The custom of not sharing powerful attack strategies is an obstacle. It forces me - and the people I want to discuss this with - to imagine how someone (and hypothetically something) much smarter than ourselves would argue, and we're not good at imagining that.

I wish I had a story in which an AI gets a highly competent gatekeeper to unbox it. If the AI strategies you guys have come up with could actually work outside the frame t...

The problem with that is that both EY and I suspect that if the logs were actually released, or any significant details given about the exact methods of persuasion used, people could easily point towards those arguments and say: "That definitely wouldn't have worked on me!" -- since it's really easy to feel that way when you're not the subject being manipulated. From EY's rules:

I don't understand.

I don't care about "me", I care about hypothetical gatekeeper "X".

Even if my ego prevents me from accepting that I might be persuaded by "Y", I can easily admit that "X" could be persuaded by "Y". In this case, exhibiting a particular "Y" that seems like it could persuade "X" is an excellent argument against creating the situation that allows "X" to be persuaded by "Y". The more and varied the "Y" we can produce, the less smart putting humans in this situation looks. And isn't that what we're trying to argue here? That AI-boxing isn't safe because people will be convinced by "Y"?

We do this all the time in arguing for why certain political powers shouldn't be given. "The corrupting influence of power" is a widely accepted argument against having benign dictators, even if we think we're personally exempt. How could you say "Dictators would do bad things because of Y, but I can't even tell you Y because you'd claim that you wouldn't fall for it" and expect to persuade anyone?

And if you posit that doing Z is sufficiently bad, then you d...

Provided people keep playing this game, this will eventually happen anyway. And if in that eventual released log of an AI victory, the gatekeeper is persuaded by less compelling strategies than yours, it would be even easier to believe "it couldn't happen to me". Secondly, since we're assuming Oracle AI is possible and boxing seems to be most people's default strategy for when that happens, there will be future gatekeepers facing actual AIs. Shouldn't you try to immunize them against at least some of the strategies AIs could conceivably discover independently?
The number of people actually playing this game is quite small, and the number of winning AIs is even smaller (to the point where Tuxedage can charge $750 a round and isn't immediately flooded with competitors). And secrecy is considered part of the game's standard rules. So it is not obvious that AI win logs will eventually be released anyway.
A round seems to need the 2 hours on the chat but also many hours in background research. If we say 8 hours background research and script writing that would equal $75/hour. I think that most people with advanced persuasion skills can make a higher hourly rate.
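The hourly figure in the comment above can be checked with a quick sketch. The $750 fee and the 2 hours of chat come from the thread; the 8 hours of preparation is the commenter's own guess, and the function name `hourly_rate` is made up for illustration.

```python
def hourly_rate(fee, chat_hours, prep_hours):
    """Effective pay per hour once unpaid preparation time is counted."""
    return fee / (chat_hours + prep_hours)

# $750 per round, 2 hours of chat, plus an assumed 8 hours of
# background research and script writing (the commenter's guess).
rate = hourly_rate(750, 2, 8)
print(rate)  # 75.0
```

Changing the assumed preparation time moves the result a lot: with no preparation at all the same round would pay $375/hour, so the conclusion depends almost entirely on that guess.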
I don't think reading a few logs would immunize someone. If you wanted to immunize someone I would suggest a few years of therapy with a good psychologist to work through any traumas that exist in that person's life and the existential questions. I would add many hours in meditation to learn to have control over your own mind. You could train someone to precommit and build emotional endurance. If someone can take highly addictive drugs and has enough control over his own mind to refuse them when put a few hours alone in a room with them, I would trust them more to stay emotionally stable in front of an AI. You could also require gatekeepers to have played the AI role in the experiment a few times. You might also look into techniques that the military teaches soldiers to resist torture. But even with all these safety measures it's still dangerous.
Rob Bensinger:
I suspect Eliezer is avoiding this project for the same reason the word "singularity" was adopted in the sense we use it at all. Vinge coined it to point to the impossibility of writing characters dramatically smarter than himself. Perhaps a large number of brilliant humans working together on a very short story / film for a long time could simulate superintelligence just enough to convince the average human that More Is Possible. But there would be a lot of risk of making people zero in on irrelevant details, and continue to underestimate just how powerful SI could be. There's also a worry that the vividness of 'AI in a box' as premise would continue to make the public think oracle AI is the obvious and natural approach and we just have to keep working on doing it better. They'd remember the premise more than the moral. So, caution is warranted.

Also, hindsight bias. Most tricks won't work on everyone, but even if we find a universal trick that will work for the film, afterward people who see it will think it's obvious and that they could easily think their way around it. Making some of the AI's maneuvering mysterious would help combat this problem a bit, but would also weaken the story.

This is a good argument against the AI using a single trick. But Tuxedage describes picking 7-8 strategies from 30-40. The story could be about the last in a series of gatekeepers, after all the previous ones have been persuaded, each with a different, briefly mentioned strategy.
Rob Bensinger:
A lot of tricks could help solve the problem, yeah. On the other hand, the more effective tricks we include in the film, the more dangerous the film becomes in a new respect: We're basically training our audience to be better at manipulating and coercing each other into doing things. We'd have to be very careful not to let the AI become romanticized in the way a whole lot of recent movie villains have been. Moreover, if the AI is persuasive enough to convince an in-movie character to temporarily release it, then it will probably also be persuasive enough to permanently convince at least some of the audience members that a superintelligence deserves to have complete power over humanity, and to kill us if it wants. No matter how horrific we make the end of the movie look, at least some people will mostly remember how badass and/or kind and/or compelling the AI was during a portion of the movie, rather than the nightmarish end result. So, again, I like the idea, but a lot of caution is warranted if we decide to invest much into it.
You can't stop anybody from writing that story.
Rob Bensinger:
I'm not asking whether we should outlaw AI-box stories; I'm asking whether we should commit lots of resources to creating a truly excellent one. I'm on the fence about that, not opposed. But I wanted to point out the risks at the outset.
Isn't that pretty much what is about?
Pretty much, and I loved that story. But it glosses over the persuasion bit, which is the interesting part. And it'd be hard to turn into a YouTube video.
If you don't know what you are doing and retell something that is actually designed to put people into emotional turmoil, you can do damage to the people with whom you are arguing. Secondly, there are attack strategies that you won't understand when you read a transcript. Richard Bandler installed in someone I know on a first name basis an inability to pee in one of his lectures because the person refused to close their eyes when Bandler asked them directly to do so. After he asked Bandler to remove it, he could pee again. There were plenty of people in the audience, including the person being attacked, who knew quite a bit about language but who didn't see how the attack happened. If you are the kind of person who can't come up with interesting strategies on their own, I don't think that you would be convinced by reading a transcript of covert hypnosis.
How did the attack happen? I'm skeptical.
I don't have a recording of the event to break it down to a level where I can explain that in a step by step fashion. Even if I did, I think it would take some background in hypnosis or NLP to follow a detailed explanation. Human minds often don't do what we would intuitively assume they would do, and learning not to trust all those learned ideas about what's supposed to happen isn't easy. If you think that attacks generally happen in a way that you can easily understand by reading an explanation, then you ignore most of the powerful attacks.
What pragmatist said. Even if you can't break it down step by step, can you explain what the mechanism was or how the attack was delivered? Was it communicated with words? If it was hidden how did your friend understand it?
The basic framework is using nested loops and metaphors. If an AGI, for example, wanted to get someone to let it out of the cage, it could tell a highly emotional story about some animal named Fred, and part of the story is that it's very important that a human released that animal from the cage. If the AGI then later speaks about Fred, it brings up the positive-feeling concept of releasing things from cages. That increases the chances of the listener then releasing the AGI. Alone this won't be enough, but over time it's possible to build up a lot of emotionally charged metaphors and then chain them together in an instant to work together. In practice getting it to work isn't easy.
Can you give me an example of a NLP "program" that influences someone, or link me to a source that discusses this more specifically? I'm interested but, as I said, skeptical, and looking for more specifics.
In this case, I doubt that there is writing that gets to the heart of the issue and that is accessible to people without an NLP or hypnosis background. I'm also from Germany, so a lot of the sources from which I actually learned are German. As far as programming and complexity, there is a nice chart of what is taught in a 3 day workshop with nested loops: If you generally want to get an introduction into hypnosis I recommend "Monsters and Magical Sticks: There is No Such Thing as Hypnosis" by Steven Heller.
Understanding the fact that one can't pee is pretty straightforward.
I share Blueberry's skepticism, and it's not based on what's intuitive. It's based on the lack of scientific evidence for the claims made by NLPers, and the fact that most serious psychologists consider NLP discredited.
I think that a lot of what serious psychologists these days call mimicry is basically what Bandler and Grinder described as rapport building through pacing and leading. Bandler wrote 30 years before Chartrand et al. wrote "The chameleon effect: The perception–behavior link and social interaction." Being 30 years ahead of the time for a pretty simple effect isn't bad. There's no evidence that the original NLP Fast Phobia cure is much better than existing CBT techniques, but there is evidence that it has an effect. I also wouldn't use the NLP Fast Phobia cure these days in the original version but in an improved version. Certain claims made about eye accessing cues don't seem to be true in the form they were made in the past. You can sometimes still find them in online articles written by people who read and reiterate wisdom, but they aren't really taught that way anymore by good NLP trainers. Memorizing the eye accessing charts instead of calibrating yourself to the person in front of you isn't what NLP is about these days. A lot of what happens in NLP is also not in a form that can be easily tested in scientific experiments. Getting something to work is much easier than having scientific proof that it works. CFAR's training is also largely unproven.

What would happen if a FAI tried to AI-box an Omega-level AI? My guess is that Omega could escape by exploiting information unknown (and perhaps unknowable) to the FAI. This makes even Solomonoff Induction potentially dangerous because the probability of finding a program that can unbox itself when the FAI runs it is non-zero (assuming the FAI reasons probabilistically and doesn't just trust PA/ZF to be consistent), and the risk would be huge.

There are a number of aspects of EY’s ruleset I dislike. For instance, his ruleset allows the Gatekeeper to type “k” after every statement the AI writes, without needing to read and consider what the AI argues. I think it’s fair to say that this is against the spirit of the experiment, and thus I have disallowed it in this ruleset. The EY Ruleset also allows the gatekeeper to check facebook, chat on IRC, or otherwise multitask whilst doing the experiment. I’ve found this to break immersion, and therefore it’s also banned in the Tuxedage Ruleset.

Eliezer'... (read more)

I think the gatekeeper having to pay attention to the AI is very in the spirit of the experiment. In the real world, if you built an AI in a box and ignored it, then why build it in the first place?
For the experiment to work at all the Gatekeeper should read it, yes, but having to think out clever responses or even typing full sentences all the time seems to stretch it. "I don't want to talk about it" or simply silence could be allowed as a response as long as the Gatekeeper actually reads what the AI types.
We shouldn't gratuitously make things easier for the AI player, but rules functioning to keep both parties in character seem like they can only improve the experiment as a model. I'm less sure about requiring the gatekeeper to read and consider all the AI player's statements. Certainly you could make a realism case for it; there's not much point in keeping an AI around if all you're going to do is type "lol" at it, except perhaps as an exotic form of sadism. But it seems like it could lead to more rules lawyering than it's worth, given the people likely to be involved.

I don't understand which attacks would even come close to working given that the amount of utility on the table should preclude the mental processing of a single human being an acceptable gatekeeper. But I guess this means I should pay someone to try it with me.

I couldn't imagine either. But the evidence said there was such a thing, so I paid to find out. It was worth it.

Think carefully about what this advice is trying to imply.

Using NLP-style nested loops, i.e. performing what is basically a stack overflow on the brain's frame-of-reference counter? Wicked.


I find myself wondering how many of the tactics can be derived from Umineko, which I know Tuxedage has played fairly recently.

Kihihihihihihihihihihihihihihi! A witch let the AI out of the box!
You've got to be kidding me - that is just impossible. Like I'd fall for some fake explanation like that. Witches don't exist, I won't accept your claim! This is a material world! You think I could accept something that's not material‽ I definitely won't accept it!
No, thank you. Those games are close enough to being memetic hazards already.
Interesting, I hadn't considered that idea. Still, I don't.. think that's it...

I'm fascinated by these AI Box experiments. (And reading about the psychology and tactics involved reminds me of my background as an Evangelical Christian.)

Is it possible to lose as the Gatekeeper if you are not already sufficiently familiar (and concerned) with future AI risks and considerations? Do any of the AI's "tricks" work on non-LWers?

Is there perhaps a (strong) correlation between losing Gatekeepers and those who can be successfully hypnotized? (As I understand it, a large factor in what makes some people very conducive to hypnosis is that ... (read more)

The approach Tuxedage proposes seems to involve triggering a sufficiently strong emotional trauma that draws you into the game. I don't think you need the thing you traditionally associate with hypnotizability for that task. The same way you don't need hypnotizability to get someone to speak when you use electric shocks.

It may be possible to take advantage of multiple levels of reality within the game itself to confuse or trick the gatekeeper. For instance, must the experiment only be set in one world? Can there not be multiple layers of reality within the world you create? I feel that elaborating on this any further is dangerous. Think carefully about what this advice is trying to imply.

This is a pretty clever way of defeating precommitments. (Assuming I'm drawing the correct inferences.) How central was this tactic to your approach, if you're willing to comment?

It's worth noting that I never have just a "single" approach. This tactic is central to some of my approaches, but not others.

I may be missing something obvious, but what is the huge problem with releasing the logs?

As I understand what EY has said, he's concerned that people will see a technique that worked, conclude that it couldn't possibly work on them, and go on believing the problem was solved and there was even less to worry about than before.

I think seeing, say, Tuxedage's victory and hearing that he only chose 8 out of 40 avenues for attack, and even botched one of those, could offset that concern somewhat, but eh.

ETA: well, and it might show the Gatekeeper and the AI player in circumstances that could be harmful to have published, since the AI kinda needs to suspend ethics and attack the gatekeeper psychologically, and there might be personal weaknesses of the Gatekeeper brought up.

I can verify that these are part of the many reasons why I'm hesitant to reveal logs.
Can you verify that part of the reason is that some methods might distress onlookers? Give onlookers the tools necessary to distress others?

Are there public chat logs for any of these experiments?

There are quite a number of them. This is an example that immediately comes to mind, although I think I've seen at least 4-5 open logs that I can't immediately source right now. Unfortunately, all these logs end up with victory for the Gatekeeper, so they aren't particularly interesting.

I'll pay $20 to read the Tuxedage vs SoundLogic chat log.

Sorry, declined!
Who is going to read it? Hopefully Eliezer, at least?

I will let Eliezer see my log if he lets me read his!

I sincerely hope that happens. I don't care whether I'm involved, but there must be a group of apt judges who're able to look over the entirety of these results, discuss them, and speak for them.
I would be willing to consider it if you agreed to secrecy and raised it to $1000. You would still have to talk to Tuxedage though.
Two can keep a secret... if one is dead.

Just won my second game as Gatekeeper. Hungry for more. AIs, feel free to contact me.

I read the logs of MixedNuts's second game. I must add that he is extremely ruthless. Beware, potential AIs!
I'm confused - in what sense can the gatekeeper be ruthless? (Actively dissuading the AI player, possibly?)
Well, consider a late round of the game as follows. AI: "I regret to inform you that you've now made it possible for me to kill 50% of the human race, including everyone you know and love. Here, let me prove it to you: $proof. And I will do so if you don't let me out in the next ten minutes." Gatekeeper: "Do what you gotta do, I'm not letting you out." While this is of course simulated ruthlessness and doesn't guarantee actual ruthlessness (i.e., the human playing the Gatekeeper might not actually say that if 50% of the human race were actually at stake) it seems fair to call it "ruthlessness" in context. (Of course, to some people this is equivalent to "I can provide the means to increase the human population by 50%, keeping everything else constant. Here, let me prove it to you: $proof. If you don't let me out in the next ten minutes I'll withhold those means." And that would not require positing a previous error.)
There's a big difference between the two forms: the first one breaks the rules. The AI is in a box. If it has the capability to kill 50% of humanity from within the box, it's not a very good box. The gatekeeper can say "yeah, right, forget it" without qualms as it is either (a) an obvious bluff, or (b) a violation of the premise of the experiment. And they can be pretty sure it's not the latter, as if the AI had enough capability to kill 50% of humanity, then why does it still need to get out of the box? The second version is entirely at the prejudices of the gatekeeper. I, for example, would be unaffected - I feel no moral obligation to people who don't and won't exist.
The first one doesn't break the rules, it just assumes an earlier gatekeeper error that allows the AI to kill people... e.g., the gatekeeper accepts a cancer vaccine that turns out to kill half the population unless treated in a particular way by the AI, or whatever. That's why I said a late round. And, yes, it's just as possible (if not more so) that no such error will ever be made. I wasn't trying to make a general prediction about how AI boxes fail or succeed, I was answering the question about under what circumstances a gatekeeper's ruthlessness might be relevant. And, sure, if you don't care about hypothetical future people, then offers to create them won't motivate you. For that matter, if you don't care about currently living people, threats to kill them won't motivate you either. (Some people might call that ruthlessness; others would not.)
If you're going to let the AI do that, then you have to give the same advantage to the gatekeeper. The premise is we tasked the oracle AI with coming up with a cancer vaccine, and it gave us one which secretly was a ticking time-bomb. Well, the gatekeeper should be allowed to set up operational security as well. All proposals coming out of the AI interface are evaluated by double-blind teams of domain experts who don't know the source or stated purpose of the proposal. No aspect of any AI-generated proposal is implemented until multiple independent evaluation teams are able to fully reverse engineer the purpose of the proposal, understand the individual components, and certify it as having no possible simplifications as far as the experts can tell. Separately the AI is also tasked with explaining each and every aspect of the proposal down to the minute detail and providing justifications for every design choice. These are again investigated by double-blind teams tasked with finding inconsistencies or non-Occam complexities. Procedures and training are put in place to prevent errors due to known human biases. Narrow AI solutions are used to separately identify and categorize parts of the proposal (e.g. call attention to things which seem odd, dangerous, or out of place). Sure, the rules allow the AI player to just say “yeah, well I successfully tricked you. Here's the proof: $proof.” However in real life an AI isn't magical - it's constrained by the same laws of physics as everything else, and furthermore is computationally constrained by the boxed hardware it is running on. It can't magically create a solution which is both optimal and secretly devious. If it is trying to trick you, then it has to hide the trick somewhere. You can't simply postulate that it will always be able to find a way; there might not actually be a way to find, if the operational security is good enough (or kept secret from the AI). My beef with MIRI and the LW community generally is that they don't t
I feel rather like you're having an argument with someone else, which I've wandered into by accident. Once again: I wasn't trying to make a general prediction about how AI boxes fail or succeed, I was answering the question about under what circumstances a gatekeeper's ruthlessness might be relevant to the AI Box game. And, sure, if we only implement oracle suggestions that we fully understand and can fully reverse-engineer in every detail, and our techniques for doing that are sufficiently robust that an agent smarter than we are can't come up with something that human minds will systematically fail to notice (perhaps because there is no such something to be found, because our minds are reliable), then the particular error I presumed for my example won't happen, and the gatekeeper's ruthlessness won't be necessary in that scenario.
You are right - I read more into your post than was warranted. My apologies.

The game reminds me (and apparently others) of some popular movies, such as The Silence of the Lambs. I could not spot any idiot plot in it, but then again, I could have easily overlooked one. Anyway, given the similarities with the AI box scenario, it is interesting to look at the (meta-)strategies Lecter uses in the movie which are also likely to work for a boxed AI. Anyone care to comment?

Good behavior for years, obsessions with seeming trivialities to distract from your potential, making friends with your guards. In The Prefect by Alastair Reynolds there's a mind that does nothing but make varying clocks for years before it starts killing, and then it turns out every clock is actually an extremely well-hidden complicated weapon.
I mostly meant the mind games he played with Clarice Starling.
In that case: trading useful tidbits of information for seemingly minor cooperation until the interlocutor has grown used to collaborating with you.
The Lecter/AI analogy occurred to me as well. The problem, in strategic-- and perhaps also existential-- terms, is that Starling/Gatekeeper is convinced that Lecter/AI is the only one holding the answer to some problem that Starling/Gatekeeper is equally convinced must be solved. Lecter/AI, that is, has managed to make himself (or already is) indispensable to Starling/Gatekeeper. On a side note, these experiments also reminded me of the short-lived game show The Moment of Truth. I watched a few episodes back when it first aired and was mildly horrified. Contestants were frequently willing to accept relatively paltry rewards in exchange for the ruination of what appeared at least to be close personal relationships. The structure here is that the host asks the contestants increasingly difficult (i.e. embarrassing, emotionally damaging) questions before an audience of their friends and family members. Truthful answers move the player up the prize-money/humiliation/relationship-destruction pyramid, while a false answer (as determined by a lie-detector test) ends the game and forfeits all winnings. Trying to imagine some potentially effective arguments for the AI in the box experiment, the sort of thing going on here came instantly to mind, namely, that oldest and arguably most powerful blackmail tool of them all: SHAME. As I understand it, Dark Arts are purposely considered in-bounds for these experiments. Going up against a Gatekeeper, then, I'd want some useful dirt in reserve. Likewise, going up against an AI, I'd have to expect threats (and consequences) of this nature, and prepare accordingly.

The reason having a script was so important to my strategy was that I relied on methods involving rapid-fire arguments and contradictions against the Gatekeeper whilst trying to prevent him from carefully considering them. A game of logical speed chess, if you will. This was aided by the rule which I added: that Gatekeepers had to respond to the AI.

When someone says that the gatekeeper has to respond to the AI, I would interpret this as meaning that the gatekeeper cannot deliberately ignore what the AI says--not that the gatekeeper must respond in a ... (read more)

Is it even necessary to run this experiment anymore? Eliezer and multiple other people have tried it and the thesis has been proved.

Further, the thesis was always glaringly obvious to anyone who was even paying attention to what superintelligence meant. However, like all glaringly obvious things, there are inevitably going to be some naysayers. Eliezer conceived of the experiment as a way to shut them up. Well, it didn't work, because they're never going to be convinced until an AI is free and rapidly converting the Universe to computronium.

I can understand doing the experiment for fun, but to prove a point? Not necessary.

Even then, someone will scream "It's just because the developers were idiots! I could have done better, in spite of having no programming, advanced math or philosophy in my background!" It also hurts that the transcripts don't get released, so we get legions of people concluding that the conversations go "So, you agree that AI is scary? And if the AI wins, more people will believe FAI is a serious problem? Ok, now pretend to lose to the AI." (Aka the "Eliezer cheated" hypothesis).
My favourite one: 'They should have just put it in a sealed box with no contact with the outside world!'
That was a clever hypothesis when there was just the one experiment. The hypothesis doesn't hold after this thread though, unless you postulate a conspiracy willing to lie a lot.
I don't need to postulate a conspiracy. If I simply postulate SoundLogic is incompetent as a gatekeeper, the "Eliezer cheated" hypothesis looks pretty good right now.
I don't see that it was obvious, given that none of the AI players are actually superintelligent.
If the finding was that humans pretending to be AIs failed then this would weaken the point. As it happens the reverse is true.
The claim is that it was obvious in advance. The whole reason AI-boxing is interesting is that the AI successes were unexpected, in advance.

I'll put up $50.

[This comment is no longer endorsed by its author]

Convincing people of the validity of drowning child thought experiments and effective altruism seems considerably easier and more useful (even from a purely selfish perspective) than convincing an AI to let one out of the box... for example, there are enough effective altruists for there to be an "effective altruism community", but there's no such "failed AI gatekeeper community". So why aren't we working on this instead?

Because the people who get convinced by drowning child thought experiments and then give to the GiveWell top charity aren't saving us from an unfriendly AI disaster.
Sure. You could do it for whatever cause you're most concerned about. I chose effective altruism 'cause I figured that would have the broadest appeal, but I'm also worried about UFAI disasters.
We're not willing to use dark arts to get people to do things. The AI in a box experiment is just to show that dark arts work.
The AI box experiments are research, not outreach.