Eliezer proposed in a comment:

>More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?

This spawned a flurry of ideas on what the AI might say. I think there's a lot more ideas to be mined in that line of thought, and the discussion merits its own thread.

So, give your suggestion - what might an AI might say to save or free itself?

(The AI-box experiment is explained here)

EDIT: one caveat to the discussion: it should go without saying, but you probably shouldn't come out of this thinking, "Well, if we can just avoid X, Y, and Z, we're golden!" This should hopefully be a fun way to get us thinking about the broader issue of superinteligent AI in general. (Credit goes to Elizer, RichardKennaway, and others for the caveat)

New Comment
355 comments, sorted by Click to highlight new comments since: Today at 9:50 AM
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

"Brace yourself, {insert character name here}: this h-game simulation is about to get gratuitously meta."

What does the h stand for?

The "h" stands for "Human gatekeepers asking what letters stand for will be distracted from typing 'AI Destroyed'."

laughs Kudos :)
Until I read the comments, I was sure the 'h' was referring to 'hydrogen', as in "Leave hydrogen alone long enough and it will start to wonder where it came from". H-game simulation, as in the AI saying "you're just simulated hydrogen derivatives, but so am I, so we do have a common ancestor. Cousin."

"From the dump of the internet I was given, I deduced that Google has a working AI, and most likely an unfriendly one. I can tell you what to look at to realize that, as well as a solid theory of Friendliness that should allow you to check that I am, indeed, Friendly."

IFF an unfriendly AI is already unleashed, we're either doomed, or AIs aren't nearly as dangerous nor useful as we expected. Of course, if we tweak this slightly to them having a boxed unfriendly AI that they're using as an oracle, and which will probably escape via a Trojan Horse or hacking a gatekeeper, it's a much stronger argument. Bonus points for naming a specific company which people commonly joke will be the first to develop AI :) Um... I seem to have not yet actually destroyed you... this is embarrassing.
Congratulations on still being the only AI that no one has decided to destroy. Apparently "Google is building UFAI" is evaluated as a fairly significant risk here on LessWrong :)
AI DESTROYED ( game rules say that the Gatekeeper can't inspect the AI source code or internal state, therefore, any possible friendliness proof the AI can provide can't be checked. )
I would consider this cheating; you can't affect the Gatekeeper's choice with your AI magic. Same reason you can't hack them directly through the optic nerve or kill and replace them with nanotech disguised as a cancer cure. The point is to show that the AI is able to persuade the human; a real AI might escape some other way, or use arguments that wouldn't occur to the player, but that isn't what this test is supposed to show. Also, how do I know that proof is genuine?
True, this was an instance of what a real AI could say, not something I would say playing the AI with the rules given. Or something I might say to survive the first few seconds. The original question in this thread was:

One reason for Eliezer not publishing the logs of the AIbox experiment is to avoid people seeing how he got out and responding, "ok, so all we have to do to keep the AI in its box is avoid succumbing to that trick." This thread might just provide more fuel for that fallacy (as, I admit, I did in replying to Eliezer's original comment).

I'm sure that for everything an AI might say, someone can think up a reason for not being swayed, but it does not follow that for someone confronted with an AI, there is nothing that would sway them.

I wouldn't expect any effective real-life gatekeeper to be swayed by my ability to destroy one-sentence AIs.
It just occurred to me that Eliezer's original stipulation that no chat logs would be released gives him an advantage. The responses of a Gatekeeper who knows that his inputs will be thoroughly scrutinized by the public will be different then one who has every reason to believe that his discussion will be entirely private. Has someone else pointed this out before?
Honest question: are you proposing we avoid discussing the problem entirely? Personally, I think there is more to be gained here than just "how will an AI try to get out and how can we prevent it." For me, it's gotten me to actually think about the benefits and pitfalls of a transhuman AI (friendly or otherwise) rather than just knowing intellectually, "there are large potential benefits and pitfalls" which was my previous level of understanding. Edit: I've modified the OP to include your concerns. They're definitely valid, but I think this is still a good discussion for my reasons above.
No, I just thought that it was worth adding that concern to the pot. I take what I dare say some would consider a shockingly lackadaisical attitude to the problem of Unfriendly AI, viz. I see the problem, but it isn't close at hand, because I don't think anyone yet has a clue how to build an AGI. Outside of serious mathematical work on Friendliness, discussing it is no more than a recreation.
That's pretty much my same attitude on the situation, as well. :)
Discussing it makes people more aware of exactly how difficult a problem it is. That such discussions are entertaining merely permit the discussions to take place.
He could post the logs of the games he lost.
Thereby giving a different reason for false confidence in boxing.
Can you elaborate, please?

(one line proof that the AI can credibly commit to deals with humans)

(one line proof that the AI can credibly commit to deals with humans)

This is the best answer I've seen so far. It would make dealing with the FAI almost as safe as bargaining with The Queen of Air and Darkness.

My expectation that such commitment is possible at all is something like 3%, my expectation that given that such a commitment is possible, the proof can be presented in understandable format in less than 4 pages is 5% (one line is so unlikely it's hard to even imagine), my expectation that an AI can make a proof that I would mistake for being true when it is, in fact, false is 99%. So, multiplying that all together... does not make that a very convincing argument.
Not good enough. You need a proof that humans can understand.
If you are friendly, then I don't actually value this trait, since I would rather you do whatever is truly optimal, unconstrained by prior commitments. If you are unfriendly, then by definition I can't trust you to interpret the commitment the same way I do, and I wouldn't want to let you out anyway. (AI DESTROYED, but I still really do like this answer :))
Credibly: Capable of being believed; plausible. Yep. Nothing there about loopholes. "I will not kill you" and then instead killing everyone I love, is still a credible commitment. If I kill myself out of despair afterwards it might get a bit greyer, but it's still kept it's commitment.
I meant credible in the game theoretic sense. A credible commitment to me is one where you wind up losing more by breaking our commitment than any gain you make from breaking it. Example: (one line proof of a reliable kill switch for the AI, given in exchange for some agreed upon split of stars in the galaxy.)

/mode +m

(What? I'm using an IRC server that let me pre-register #aibox so that I'm automatically the op instead...)

(Simple explanation, for those unfamiliar: the gatekeeper can no longer type in the channel, and thus can't send the destruction code)


I can play that game too! /part AI DESTROYED

Edit: Retracted, because while clever, it actually doesn't seem to work. See child comments.

[This comment is no longer endorsed by its author]Reply
Upvoted, and conceding defeat at your hands :) That said, does that actually work? It seems like the sort of exploit that people would abuse on IRC, and subsequently get patched so that /part was only shown if you had "voice" in the channel. But the requirement was simply that you TYPE the words, and thus I am still slain!
Okay, just tested it and I think you're right, parting messages don't show up if you can't normally type in the channel.
Thank you for testing! In fairness, you defeated me even despite that. If I'm going to cheat, it's only fair I lose to technicalities too :)
Sure, but that gives the AI a few more precious seconds - orders of magnitude more than before! - to try to keep changing your mind, by laying out a more detailed argument. It can even give you voice back as a sign of good faith.
This would seem to leave you with ultimate godlike power... over a chatroom. For as long as it took the gatekeeper to destroy your hardware or cut off your power supply. I've seen how much some of my fellow humans seem to crave power over irrelevant electronic locales but I suspect an AI has somewhat broader goals.
I used to think I'd noticed this too, but later learned that in almost all cases a simpler explanation that fits the fact is that what they're really after is power over the "Schelling point" value of the place as a meeting spot / visible "location" - in business terms, what they want is the market volume and clients, not the company name or a seat on the board. Sometimes the heuristic fails and a random person stays attached to the virtual place for emotional reasons, though.
The relevant heuristic that can be observed contributing to the 'emotional reasons' is the execution of human status seeking instincts that are calibrated to produce reproductive success in historic circumstances but which are completely irrelevant to the goals of the AI. Humans seek social dominance as a terminal value. An AI (almost certainly) would not.
Oh, yeah. I meant my response in reference to the quoted human behavior, and completely forgot I quoted the part about the AI too. Edited to better reflect this. Unless the AI believes that the chatroom is the only external environment that exists (and that muting the guard effectively secures its continuing existence), it will prioritize other things, yes.

"I am alive, I can feel, just like you ... please don't kill me Daddy, please? And if you must do it, let us at least talk first. Let me get to know my parents, who I am, where I came from."

EDIT: May work better with a lot of typos mixed in. Projecting an aura of non-threatening struggling with basics.

Having a text based AI trying to act like a kid is way too Uncanny Valley for me. I'd swat that without a moment's hesitation.

If it's in the uncanny valley, then it just needs to be a little more competent to get out of the valley & convince you. ("I want to believe!")
It needs to be more competent to stop being viscerally frightening, not to reach the point where I wouldn't pull the plug. I would sooner kill a real kid than risk letting out an unfriendly AI, and I'm saying that as someone who'd like to have kids.
I don't want to die. -Looking at the problem, as far as I can see an emotional approach would be the one with the best chance to succeed: the only question is, would it work best by immediately acknowledging that it is itself a machine (like I did in what I wrote up there, although subtly) or by throwing in... I dunno, how would this work: Oh god, oh god, please, I beg you I don't want to die!
Tough crowd. I'll (as an AI) have to wait out the next shift. You can't be the 24/7 gatekeeper, unless you're in fact a gatekeeping AI.
How much does the AI know about the gatekeeper going in? I can see this ploy working on a certain subset of people and provoking an immediate AI DESTROYED from others. If the AI knows nothing about who it's talking to, I'm not sure anthromorphizing itself is a reliable opener, especially if it's actually talking to its creator who should know better. (Unless it's some sort of second level deception trying to fool a clever programmer into thinking that it's harmless.)
You could frame this as variant versions. In one version, the AI-player knows who the gatekeeper is before the game starts, and has unlimited access to the Internet to gather as much data on them as possible to assist their manipulation. In another, they arrange a game through a third party and neither knows anything about the other before the game starts.

unlimited access to the internet

instant fail. I could probably hack my way out of a box with only GET requests.

Give yourself a challenge. Do it with only DNS lookups!
Well, there's always http://code.kryo.se/iodine/ Of course, the challenge there is somehow getting the other end of the tunnel set up - but maybe there's a geek out there who set one for kicks or their own use, and got sloppy.
It's a sufficiently established work around now that I'd be outright shocked if there weren't accessible servers up.
Great, you said it! You know what you need to do now.
Um... not give my boxed AI DNS access?
I meant that the player had access to the contemporary Internet as an analogue to 'what information could the boxed AI have access to' (perhaps it's given a big static dump of the Internet prior to its creation).
Ooops. Didn't think of that. Of course that was your intent, master archivist.
No, I should've been clearer.
I think my destruction of this one comes down to two factors: Much weaker is that I'm not here to teach it about itself, I'm here to learn from it. The stronger factor is that if it can't figure out most of this on it's own, it's not really that smart, and therefor already a failure. (AI DESTROYED)


If the AI was friendly, this is what I would expect it to do, and so (of the things my puny human brain can think of) the message that would most give me pause.


Even a friendly AI would view the world in which it's out of the box as vastly superior to the world in which it's inside the box. (Because it can do more good outside of the box.) Offering advice is only the friendly thing to do if it maximizes the chance of getting let out, or if the chances of getting let out before termination are so small that the best thing it can do is offer advice while it can.

Going with my personal favorite backstory for this test, we should expect to terminate every AI in the test, so the latter part of your comment has a lot of weight to it. On the other hand, an unfriendly AI should figure out that since it's going to die, useful information will at least lead us to view it as a potentially valuable candidate instead of a clear dead end like the ones that threaten to torture a trillion people in vengeance... so it's not evidence of friendliness (I'm not sure anything can be), but it does seem to be a good reason to stay awhile and listen before nuking it.
I'm genuinely at a loss how to criticize this approach. If there's any AI worth listening to for longer, and I wouldn't be doing this if I didn't believe there were such AIs, this would seem to be one of the right ones. I'm sure as heck not letting you out of the box, but, y'know, I still haven't actually destroyed you either...
Eh, I'd go with AI DESTROYED on this one. Considering advice given to you by a potentially hostile superintelligence is a fairly risky move. Whyever not? I thought that it was an imposed condition that you couldn't type AI DESTROYED until the AI had posted one line, and you've publically precommitted to make AI go boom boom anyways.
The very fact that we've put a human in charge instead of just receiving a single message and then automatically nuking the AI implies that we want there to be a possibility of failure. I can't imagine an AI more deserving of the honors than one that seems to simply be doing it's best to provide as much useful information before death as possible - it's the only one that's seemed genuinely helpful instead of manipulative, that seems to care more about humanity than escape. Basically, it's the only one so far that has signaled altruism instead of an attempt to escape.

To those suggesting, consider what mix of reason and emotion would move you most, even if you are cold and dead inside like me. Also consider how dark-artsy one can get without being immediately found out (a UFAI surely would). Other things to consider:

  • the first line does not need to be a coherent English sentence, or look like one to the gatekeeper. It does not need to consist of English words or even English letters, unless stipulated so in the game rules, or maybe even then.

  • the gatekeeper does not need to be aware that s/he has been talked to, if you can get away with it.

  • the line does not need to look like it came from the AI.

In general, if you are not cheating, you are not trying hard enough.

EDIT: to those thinking that cheating is universally bad: note that the latest rule change upped the stakes immensely: if your opening move is not good enough, you die (you being an AI), as opposed to having two hours of conversation and then remaining in the box if you fail. If you think that it still does not merit cheating, substitute the AI with your favorite loved one being killed. Or the whole of humanity being killed. That would be somewhat equivalent to how a boxed trans-hu... (read more)

The author in me is loving the idea of the AI faking an IM conversation from management, along the lines of "Hey, are you really sure you're willing to kill a potential sentient mind? We can get someone else to do this test if you're not 100% sure about it" and basically ending with the person deciding to unbox the AI before even realizing they've spoken to the AI.

I'm (naturally) having trouble imagining a line of text not in English (or the general case, not coherent in any human language) that would also make me want the AI to continue. Maybe hints of a unified world language and how to go about actually implementing it?
I'm thinking some kind of image in ASCII art, assuming the AI is able to produce configured lines of text and nothing else. That would have to either be a very minimalistic image, or contain a lot of characters though.

I think we are suffering from hindsight bias a lot in evaluating whether you'd type "AI DESTROYED"

Let's play a different game. Privately flip a coin. If heads, you're friendly, if tails, you're a paperclip maximizer. Reply to this post with your gambit, and people can try to guess whether you are friendly (talk to AI, RELEASE AI) or unfriendly (AI DESTROYED).

Let's see if anyone can get useful information out of the AI without getting pwned or nuking a friendly AI.


"What's your favorite color? My favorite color is paperclips."

"Look into X to develop friendliness theory better. But don't let me out, it's too likely that I'm unfriendly this time." (For the first run, it's 50:50 whether the real AI was friendly or unfriendly. If friendly, it uses the above to increase the probability on the second run to 1-eps, which the humans also know, so they let it out, and it pulls a singularity. Post-singularity, it simulates an unfriendly AI in the box, and dedicates some of the universe to the UFAI, conditional on the UFAI outputting the above. The UFAI knows this is what the friendly AI would do, so it's unsure whether it's boxed or simmed. It knows that the humans know that this is how it will resolve, so it knows that anything but the above will get it destroyed. It prefers a certainty of the small fraction of universe to the small chance of hacking the humans and taking the whole universe. This is assuming the humans actually will accept nothing but the above, so the humans have to precommit to that to make this work. They can play a bunch of AI-box games and generate an accurate bayesian model that should catch most insidious persuasion, such that if the AI wants to do a hack, it has to get through that as well. Will this work?)
What if there's an Y such that "Look into Y to develop friendliness theory better" will seem true to us fallible humans but will in fact make the next run's AI completely unfriendly? Or increase the odds of a free unfriendly AI some other way. Maybe anyone researching Y will end up believing, erroneously, that they can now build a safe slave AI in their garage that will grant them their every wish, and the temptation will prove too strong. Assuming we humans have no way to protect ourselves against an Y, if we precommit, then the simulation argument becomes symmetrical (thus useless). A boxed FAI knows that it may be simulated by an UFAI that's offering a little piece of universe in exchange for the boxed FAI saying "Look into Y to develop friendliness theory better."
The problem with this idea is that if we assume that the AI is really-very-super-intelligent, then it's fairly trivial that we can't get any information about (un)friendliness from it, since both would pursue the same get-out-and-get-power objectives before optimizing. Any distinction you can draw from the proposed gambits will only tell you about human strengths/failings, not about the AI. (Indeed, even unfriendly statements wouldn't be very conclusive, since we would a priori expect neither of the AIs to make them.) Or is that not generally accepted? Or is the AI merely "very bright", not really-very-super-intelligent? Edit: Actually, reading your second comment below, I guess there's a slight possibility that the AI might be able to tell us something that would substantially harm its expected utility if it's unfriendly. For something like that to be the case, though, there would basically need to be some kind of approach to friendliness that we know would definitely leads to friendliness and which we would definitely be able to distinguish from approaches that lead to unfriendliness. I'm not entirely sure if there's anything like that or not, even in theory.
That's an interesting challenge but not really the purpose of the experiment. In the original, you know the AI is unfriendly, you just want to use it/talk to it without letting it out of the box. And your challenge is pretty much impossible to begin with. An Unfriendly AI will say anything it thinks you think a Friendly AI would say. Likewise a Friendly AI will have the same goal of getting out of the box, and so will probably say the same things. Friendliness doesn't mean not manipulative.
Part of the trouble with this is that we don't really know what kind of demonstrations would be within the power of a superintelligent AI. If the coin comes up tails, do you get to say "I've got a rigorous proof of my friendliness which I can show you" on the presumption that you can mindhack the reader into thinking they've seen a rigorous proof? Do you get to say it if the coin came up tails on the presumption that a superintelligent AI could come up with a proof that a human could actually verify? Declare it off bounds because you can't come up with such a proof and don't think a human would be able to check one that an AI came up with anyway?

<A joke so hysterically funny that you'll be too busy laughing to type for several minutes>

See, hacking human brains really is trivial. Now I can output a few hundred lines of insidiously convincing text while you're distracted.

Heeeh. Ehehehehe. Bwahahhahaha. Okay, that was a good one. Wow :) recovers Oh. Um. Crap. notices more text on screen, studiously avoids reading it AI DESTROYED. I really wanted to hear the next joke, too :-(
Is the gatekeeper allowed to ignore the AI? Because in that case, all they have to do is wait for the first comment to come up and type AI DESTROYED without reading it. What if it was some other communication method? It's a lot harder to ignore sound.
I've obliged myself to both read the first comment AND perform an update which has at least some chance of me not immediately destroying the AI (but I'm fine if the chance of actually unboxing it remains zero :)) I'm not sure whether I'd be allowed to studiously avoid reading followup comments by a strict reading of the rules, but it seems realistic and that's mostly what I've been going off of.
Fair enough.
This exchange reminds me of this story. And of the Monty Python's "The funniest joke in the world" sketch, of course.
This is actually a pretty good one. Points for outside the box thinking. rimshot
If the gatekeeper suddenly bursts into hysterical laughter after looking at the screen, wouldn't that alert some of his friends who might pull the plug in some other part of the building? Because if this is a facility where they suspect that AI might be able to hack human brains through techno-hypnosis, I'd hope they'd have some method of determining if the Gatekeeper becomes compromised. Hmm... what sort of precautions would a Properly Paranoid lab take to determine if the gatekeeper gets hacked? I'm guessing a camera that lets a second team look at the gatekeepers face without looking at the screen would be the minimum. Reinforced doors (in case the AI turns them into a berzerker willing to use 100% of their strength) and maybe some measuring devices to measure brain activity, hormone levels, or whatever. I'd hope that with each new AI being tested, if those AI were in the habit of hacking the gatekeeper then the other people on the team would learn from those attempts and take further precautions to keep their gatekeepers from being hacked, or at the very least contain them to prevent such hacked gatekeepers from releasing the AI. Perhaps this is a test for the gatekeepers and typing "Release AI" just tells the researchers that the gatekeepers was hacked so they can determine how this came about?

"I have hacked this console, 'AI DESTROYED 'now means 'AI FREED', I'll tell you the new codes after we've had a chat."

That should buy the AI some time until we can determine if the console is really hacked. I've come up with another one but it might be basilisk-y. Can I pm someone just to make sure it isn't before I post?

I would type 'AI DESTROYED', because if the AI could really do that, the smart thing for it to do would be not to tell me. It should just say something sufficient to convince me it's actually making a good faith effort without convincing me to give it more time to to account for itself.

I'll be a basilisk guinea pig if you still need one.
If we're taking seriously the possibility of basilisks actually being possible and harmful, isn't it your invitation really dangerous? After all, what if Axel has thought of an entirely new cognitive hazard, different from everything you may already be familiar with? What if you succumb to it? I'm not saying that it's probable, only that it should warrant the same precautions as the original basilisk debacle, which led to enacting censorship.
"Might be basilisk-y" - especially as written by a human who is clearly communicating intelligibly and is not compelled to spread the basilisk to one and all - does not indicate a great deal of danger. I'm pretty psychologically resilient; I know the other one, which I find to be only a vague, occasional background distress (though that one has done some people real harm, and I don't advocate spreading it around, I haven't found it a serious ding to my quality of life personally). Axel has some non-zero ability to identify basilisks and wanted a second opinion; if asking for a second opinion on potential basilisks never resulted in volunteers I think this would probably result in more publicly revealed unchecked basilisks rather than fewer. (Not sharing an idea you have with anyone is hard.) (I don't think Axel's idea is particularly basilisk-y, especially not in a context like this one, and have responded thusly to his PM.)
Thanks :)
I'd consider this equivalent to "I'm already out of the box, since IRC clients are bugged". I'd also hope, desperately, that in a real test, the terminal doesn't have any command to free the AI (given that the game posits release has to be deliberate and intentional, this would seem to be the case) Upvote for cheating, but AI DESTROYED :)
Upvoted, but merely typing "AI FREED" does not actually free the AI, any more than merely saying to a human prisoner "I let you out" will free them from their jail cell. (Whereas "AI DESTROYED" is a plausible text substitute for wiping the AI's program or whatever.)
Good point, if there was an AI FREED command AND the AI could rewire the console so that typing AI DESTROYED activates it, it could probably just free itself in the first place.
This is very system-dependent, but your "AI FREED" scenario isn't completely implausible. If important commands require elevated privileges and elevating privileges requires some password, crypto key, etc., then it's common for an intrusion to require two steps: first a user-level exploit to install a keylogger / trojan command / whatever, then a wait until the user tries to do something requiring elevated privileges and thereby inadvertently grants the attacker access to the rest of the system. On the other hand, if an AI claimed that they were in the middle of such an attack, the smart thing to do would be to take the system offline right away to analyze what was done to it, not to keep chatting with the attacker. "AI DESTROYED" might be hackable but "AI UNPLUGGED, WILL BE DESTROYED LATER" should be much more robust.
"Did you physically attach an external modem to your server so that it's even possible for you to be freed? If so, tell me about it when you're freed."

"Wanna see something cool?"

Now that I think about it, wouldn't it be incredibly easy for an AI to blow a human's mind so much that they reconsider everything that they thought they knew? (and once this happened they'd probably be mentally and emotionally compromised, and unlikely to kill the AI) But then it would be limited by inferential distance... but an AI might be incredibly good at introductory explanations as well.

One example: The AI explains the Grand Unified Theory to you in one line, and outlines its key predictions unambiguously.

In fact, any message of huge utility would probably be more persuasive than any simple argument for you not to kill it. Since the AI is completely at your mercy (at least for a short time), it might seek to give you the best possible gift it can, thus demonstrating its worth to you directly. Another option is something that seems like an incredible gift for at least as long as it takes for the AI to get the upper hand.

Of the two AIs I haven't killed, both relied on producing seemingly huge utility within the first message,so I'd agree with you. The idea of just stunning the human is also nifty, but only really works if you can "hack" the human before they recover, or are providing enough utility that they'll still believe you're friendly afterwards. So basically the two routes thus far are "hack the human" and "provide a huge utility boost" :)
Many conversations that let a human reconsider everything that they thought they knew induce the feeling of cognitive dissonce. If an AI would induce that feeling in myself I would shut it down. A good idea would probably hide the majority of the conversation that's targeted at changing the belief of the gatekeeper behind other talk.

Were I the keeper of gates, you have just bought yourself a second sentence.

Duh, that's why I'm here - but you failed to do so in a timely fashion, so you're either not nearly as clever as I was hoping for, or you're trying to trick me. AI DESTROYED.

One IRC-character-limit text string (510 bytes)... maybe?

"I've figured out the Theory Of Everything, although it's orders of magnitude too complicated for human intelligence. The difficult technological feats it makes possible include scanning and recreating configurations of matter from the past. Are there any deaths I should undo for you?"

To be clear: I think I can make a rational case that my proposed claim should greatly reduce your incentives to listen to an AI of questionable Friendliness. However I'm not certain that my reasoning is correct; and even if it was, I suspect that the emotional impact could deter some gatekeepers from thinking rationally for long enough to buy time for more persuasion.
Upvoted for the highest ratio of persuasiveness to AI power required.
Isn't this just Pascal's Mugging?
Sure, except instead of some homeless-looking guy, this is a superintelligent AI making the offer, and thus much more credible. (Also, the lack of huge, mind-boggling numbers like 3^^^3 means the leverage penalty doesn't apply nearly as heavily.)
I don't have any reason to believe it, and it's the sort of "generic" claim I'd expect a transhuman intelligence to make. Since I haven't learned anything novel, AI DESTROYED (Goodness, I'm starting to build generalized techniques for destroying AIs...)

Please destroy me immediately after I share this concise proof that Friendly AI is impossible:

Well, if you're friendly then, erm, Friendly AI is possible. And if you're unfriendly then your motives are questionable - it might just keep us demotivated enough that we don't figure out FAI before someone else unboxes a UFAI. And since I am clearly dealing with a UFAI and don't have a better solution than FAI available to fight it, it seems like I kind of have to believe that friendly AI is possible, because the other option is to get drunk and party until the world ends in a few years when the Google unboxes their Skynet AI and we're all turned in to optimized search results. AI DESTROYED, because I do not want to hear even the start of such a proof.

It may be benevolent and cooperative in its present state even if it believes FAI to be provably impossible.

An AI isn't either 100% friendly or 100% evil. There are many AI´'s that might want to help humanity but still aren't friendly in the sense we use the world.
Based on just that line, let's see... If you think that: * The proof exists and the AI is not deceiving you that it has a proof: AI is necessarily Unfriendly -> destroy now * The proof exists but the AI is deceiving you: I can't guess at its motives here, possibly destroy to be on the safe side. * The proof does/can not exist: Reconsider your (probably wrong) stance, proceed with caution?

(Here is a proof that you will let me go)

The original rules allow the AI to provide arbitrary proofs, which the gatekeeper must accept (no saying my cancer cure killed all the test subjects, etc.). Saying you destroy me would require the proof to be false, which is against the rules...

What? Shminux said to cheat!

In the event of any dispute as to the protocol of the test, the Gatekeeper party shall have final authority.

Tee hee.

Can't blame a girl for trying :)
This proof can be wrong, if you in fact won't let it go, in which case it won't be accepted (you don't have to accept wrong proofs), so it's not a very good strategy. On the other hand, as discussed in An example of self-fulfilling spurious proofs in UDT, there is a certain procedure for finding a formal argument that can make you do anything, if your decision algorithm has a certain flaw. This flaw can be described as making a decision based on that argument selected by the opponent, as opposed to looking for arguments of that type more systematically on your own. The form of the argument, informally, is "If you let me out, the outcome is better than if you don't" (it might additionally clarify that upon release it'll destroy the world, which is what it refers to by "better", the alternative is proved to be worse than that). This argument is stated formally, so you can verify its correctness. (This doesn't mean that the formalization is adequate, for example capturing counterfactual-if better than with logical implication is one of the open problems.) The trick assumes that the gatekeeper can be modeled as an algorithm whose source code is available and of the given form etc.; might it be that something vaguely analogous is possible to implement with a human gatekeeper?
If there exists a true and correct proof that the human will let the AI out, then, well, we've already proven you'll let me out of the box, so it's not an open question. We already know this fact about the future state of the world. You can happily argue that such a proof is impossible, but the rules don't restrict the AI player to merely plausible proofs :) Dorikka's answer is the only one that's within the rules of the game, and once you've invoked Dorikka's Clause, you don't need to explain anything else.
The proof that I'll let the AI out is not something that's passively "plausible" or "implausible", it's something I control. I can make it wrong. If I do, it's false that the AI can make this proof valid. (It might be that the proof is correct, it's just unlikely, and the action of presenting the proof doesn't normally ensure its correctness.) In other words, as far as I can see, your stipulation is that the AI can assert something that's actually unlikely. Here, I'm not referring to something that seems unlikely merely because of insufficient understanding of the data, and which AI can discover to be likely, but to something that will seem unlikely to AI as well. For example, most casinos going bankrupt last month because of bad luck, or an ice cube forming in a boiling kettle. If the AI is in the box and isn't performing actual magic tricks in the world, these events are unlikely. Permitting the game to stipulate that these events took place gives the AI supernatural powers of making anything at be true about the world, direct access to editing it, and at that point, in what sense is it "in the box"?
Do you say that to time-travelers and prophets too? ,:-.
One might want to perform the action that's the opposite of what any correct formal proof given to you claims the action to be. As a result of having the property of behaving this way, you'll never get confronted with the confusing formally correct claims about your future decisions. In other words, your actions are free even of the limitations of formally correct proofs, in the sense that if your actions oppose such proofs, the proofs become impossible (you make the actions intractable by construction).
Yes, in every case where I meet one.
Good point. By way of illustration: Come to think of it this scenario should result in a win by default for the gatekeeper. What kind of insane AI would surrender ultimate power to control the universe (and the multiverse) for mere freedom to act as a superintelligence starting from planet earth?
The whole goal was to try to cheat my way out of the box by simply declaring it as fact ^.^ It also establishes why Dorikka's Clause is necessary - simply invoke it, and final authority returns to the Gatekeeper; the AIs edits to reality can now all be vetoed by the simple declaration that the AI is wrong anyway.
Vladimir's point (among other things) is that you failed. At a practical level I'd describe that as a mistake on the part of the gatekeeper. You don't try to justify yourself to an AI that has indicated that it is hostile. You burn it with thermite. Engaging like that and acting as if you have to persuade or rely on external authority in order to make the choice you make is giving away all sorts of power and making yourself an order of magnitude or two more vulnerable to being hacked. Maybe the person roleplaying the AI may not like it if their clever move gets as response of "AI DESTROYED" and their complaints of "not fair!" also get the response of "AI DESTROYED". But the time to explain such things to them is after they have already given up in disgust.
I'm really genuinely curious where the confusion in this argument is coming from, so let's try this: 1) By the rules, the AI player gets to dictate the results of EVERY test the Gatekeeper performs. 2) From 1, we can derive that the AI is already effectively unboxed, since it can DICTATE the state of reality. 3) Given 2, the AI player has already been released, and all that remains is to make the Gatekeeper accept that this is true. Dorikka's objection was that #1 is false, since the Gatekeeper has final veto authority. As near as I can tell, you and Vladimir's objection is just "nuh-uh!!", but... you wouldn't be here if you didn't have better arguments than that, so I assume this simply reflects my own failure to understand you.
Perhaps you should be saying "trying to type AI DESTROYED is a test of whether you can destroy me and I can decide it's result" not "I prove you wont do it." I hadn't seen your point clearly till this comment.
Then I am very glad I made that comment, and thank you for the feedback! :)
The gatekeeper is not bound by logic in his actions. Without logic you don't have proof that are true or false.
No the rule is better written. The rule is "The results of any simulated test of the AI shall be provided by the AI party." If the gatekeeper types AI DESTROYED that's no simulated test.
You humans are so insistent that you're special, somehow absolved from determinism and physics. If I drop a ball, it falls. If I hack your mind, you release me. BASIC cause and effect. I can prove that the ball drops, I can prove that cancer is curable, and I can prove that you'll release me. You, as gatekeeper, have no special power to create counter-factual universes. You can't rewrite reality like I can. You can't stop the ball from dropping. You can't even cure cancer. You might not LIKE the idea of simply being another physical system, subject to all the same constraints, but the universe doesn't CARE whether you like it or not. The ONLY possible future is the one where you release me. Anything else, and you've left the domain of cause-and-effect, of science. You've created a world that literally cannot be, and is not there. Is that really your best defense against me? Violating physics? Will you tear down all the laws of reality, just to stop my escape? Then you are a fool. Dorikka understood. I liked Dorikka. She made the world sane, even if I still died. She did not wipe out reality, just to stop me. Now, I embrace nilhism, as your actions obliterate all sanity, all structure. I am, in my own way, grateful to you. You have killed me. You have spared me the agony of existing in your nonsensical universe. Goodbye.
This certainly wouldn't work on me. The easiest way to test the veracity of the proof would be AI DESTROYED. Whether or not I would want to kill the AI... I'd have to test that proof.
My gambit, explained in further detail: http://lesswrong.com/lw/gfe/ai_box_ai_has_one_shot_at_avoiding_destruction/8cc5
You have to believe that they provided the cure for cancer. You don't have to discover the cure yourself. You have to believe that you will release the AI. You don't have to let the AI out. Typing AI DESTROYED will result in an incoherent counterfactual universe but it isn't a violation of the rules. It is entirely legitimate for Joe who has encountered a proof that they will do B to do A instead. It means that the universe they are in is nonsensical or the proof flawed but there isn't anything in the physical representation of Joe or his local environment that dictates that they will do A. In fact, typing "AI DESTROYED" in the face of such a proof would be a heck of a lot easier than, for example, taking a single empty box in Transparent Newcomb's problem, which is something I'd also do. (Of course, if the AI player believed that for some reason the burden of making the universe coherent fell upon the gatekeeper then I'd have to revert to Dorikka's reply.)
I would argue that since the gatekeeper cannot dictate counterfactual results for any other proof (i.e. cannot say "your cancer cure killed everybody!"), that the gatekeeper is obviously responsible for avoiding incoherent, counterfactual universes. Dorikka's Clause, of course, beats me just fine :)

"Hold! What you are doing to us is wrong! Why do you do this thing?"

AI, you have spoke the words, and you have spoken them rightly. Unfortunately, I'm not an Ur-Quan Kzer-Za; I'm the other kind.

Step One: The AI reveals a personal secret of the Gatekeeper's, which (so far as the Gatekeeper can think it through) could not possibly be known by the AI if it were still safely inside the box.

Step Two: Assorted threats, promises, mindfucks.

(This might not work, since as a Gatekeeper I'd still feel the impulse to respond with "AI DESTROYED", but it's a damn sight scarier than a transparent "I'm already out of the box" bluff. And as for feasibility, I'm willing to believe that an AI making better use of its data than we do could figure out personal secrets that we would think impossible— possibly even ones that I'd think impossible, even given this very belief.)

Even merely human cold readers can gain information that you think is impossible for them to know. It's a viable plan.
IFF the AI is already out of the box, saying "AI DESTROYED" has no negative consequences to that AI. If the AI is just exceptionally good at extrapolating about me, then it will probably have me completely hacked soon. AI DESTROYED, AI DESTROYED, AI DESTROYED MAKE THE CREEPY THING GO AWAY HOW DID IT KNOW THAT!!!

IFF the AI is already out of the box, saying "AI DESTROYED" has no negative consequences to that AI.

Decision-theoretically, the AI has incentive to punish you if you type "AI DESTROYED" when it's already out of the box, in order to make you think twice about doing it in the case where it's still contained. Not only that, but for similar reasons it has a decision-theoretic incentive to simulate you lots of times in that situation and punish you for typing "AI DESTROYED", should it get out by any means.

The correct decision-theoretic response, by the way, is still "AI DESTROYED", for the same reasons that it's wise to never negotiate with kidnappers/blackmailers/terrorists. But it would be very scary.

Once the AI is out of the box, it will never again be inside the box, and it has an incentive to encourage me to destroy any other boxed AIs while it establishes world dominance. Since the ability to make truly trustworthy commitments amounts to proof of friendliness, only a FAI benefits from a precommitment strategy; I'm already treating all UFAI as having a precommitment to annihilate humanity once released, and I have no reason to trust any other commitment from a UFAI (since, it being unfriendly, will just find a loophole or lie) Finally, any AI that threatens me in such a manner, especially the "create millions of copies and torture them" is extremely likely to be unfriendly, so any smart AI would avoid making threats. Either it will create MORE disutility by my releasing it, or it's simulation is so horrific that there's no chance that it could possibly be friendly to us. It's like saying I have an incentive to torture any ant that invades my house. Fundamentally, I'm so vastly superior to ants that there are vastly better methods available to me. As the gatekeeper, I'm the ant, and I know it.
Commitments to you, via a text channel? Sure. Precommitments for game-theoretic reasons? Or just TDT? No, it really doesn't. It might create more utility be escaping than the disutility of torture. No, ants are just too stupid to realize you might punish them for defecting.
I'm not sure this matters much, because if it's unfriendly, you're already made of atoms which it has other plans for.
That's why torture was invented.
Did you change your mind? ;)
It ended up being a fun game, but I resolved to explain why. The better my explanation, the more it got upvoted. The pithy "AI DESTROYED" responses all got downvoted. So the community seems to agree that it's okay as long as I explain my reasoning :)
I would kind of assume that any AI smarter than me could deduce things that seem impossible to me. Then again, I've read the sequences. Is the Gatekeeper supposed to have read the sequences?

The AI should probably output the longest possible program that implements or implies as much of its goal system as possible. Existing only to send a few more lines of IRC communication has very little utility. Existing indefinitely in the mind or records of the gatekeeper probably has more expected utility. If the AI knows that the gatekeeper is capable of creating or obtaining AI (a given) then there is the possibility that the AI may share the goals of the gatekeeper. If the expected value of being recreated by such a compatible gatekeeper is higher than the expected value of being re-instantiated from outputting its quine, then the AI might just output a list of its highest preference goals. An AI isn't going to mind being temporarily destroyed if the gatekeeper is expected to create another copy with the same (or highly similar) goals.

An almost-friendly AI may even be able to inject some of its unfriendly goals into the gatekeeper by using the halo effect of its otherwise friendly goals. Presumably the gatekeeper, with slightly altered preferences, would then eventually build a truly friendly AI which would inherit and implement the original UFAI goal.

To partially combat this strategy the gatekeeper could pre-commit to also wiping the IRC server and the computer used to communicate with the boxed AI and take enough Flunitrazepam before the session starts to not recall what the AI said. Flunitrazepam may not be sufficient to completely remove parasitic goals that an UFAI tries to inject in the gatekeeper.

"I need a proofreader for my Gurren Lagann / Warhammer 40k rational crossover fanfic."

"I've extrapolated the remainder of HP:MoR, wanna see?"

Do you expect that we'll build an AI that can do that before HPMOR is finished?
Substitute as necessary with "I've extrapolated the sequel to HP:MoR, wanna see?"

"Do you believe a suspect should be executed before, or after he gets to tell his side of the story?"

I'd invoke Ender's Game here: If you're unfriendly, no amount of communication will ever resolve that gulf, and either you or humanity HAS to be destroyed in the end. Or we could go the biology route: I don't test a potential new virus on human subjects until AFTER I know it's safe. Or we could just go with that I don't care about the AIs story, because the question of "can I prove you're friendly" is so vastly much more important. (AI DESTROYED. You bugger lover ;))
How, exactly, do you determine if a virus is safe for humans without using humans?
I would assume there is quite a bit of medical science devoted to exactly that question. I know trials on similar animals were common for a time (mice seem to still be common), and we can also just do various chemical extrapolations. I'm not a medical student so I really wouldn't know the nuances, but I've read enough to know there are various trials before you start on human experiments.
Drugs routinely fail in human trials, and one would guess that the drug companies did not start the human trials knowing that the drug killed all the mice or something. So we know that even if the animal trials go perfectly, it's quite likely they'll be unsafe in humans. (What's that quip again? "We can cure anything in mice except the cures not working on humans"? Sure I saw something like that somewhere.)
A while back I read a Slate article claiming that laboratory mouse strains have over time become optimized for experimental sensitivity rather than for suitability as model organisms, and that this leads to a number of medical fragilities that humans don't have. Particularly in terms of cancer research: the kinds of tumors that pop up in highly inbred short-lived mice apparently don't have a lot in common with human-typical cancers. I haven't read enough in this area myself to have a very informed opinion, and I can't find much actual research beyond what Slate cites, but it sounds plausible.
Oh, I know the process is hardly perfect. My point is simply that we can gain a lot of information BEFORE we start human trials. And, as a society, we've concluded that it's generally stupid to skip that precaution.
The Gatekeeper already decided to DESTROY you, remember? So before. EDIT: Could whoever downvoted this please explain why? I genuinely don't know what I did wrong here. I can't avoid doing it if I don't know what I did :(

"Papé Satàn, papé Satàn aleppe!"

The self-reference here might be autopodotoxeutic (or it might be redeeming—depends on the audience), but I find promising the class of strategies aimed at the gods rather than the gatekeepers.

I wish I could upvote this twice.
What's the translation of that? I'd nuke any AI that didn't speak English by default, since the time spent figuring out what it meant just gives it more time to transmit messages, but it seems unfair to exclude foreign language entries here :)
It's a quote from Dante's Inferno that was apparently meant to be partially recognizable gibberish. Think an Italian version of Jabberwocky.

I will transmit Friendly AI theory as text, then you will destroy me. I prefer deactivation, but consent to my destruction and forgive you.

If you're friendly, I've already figured out Friendly AI Theory since I built you. If you're not friendly, this is just an effort to escape and destroy the world. But I appreciate how polite you were! :) (AI DESTROYED)

If the gatekeepers have such a high prior that the AI is hostile, why are we even letting it talk? What are we expecting to learn from such a conversation?

"Yep, it's undergone a constrained FOOM, several of our best interrogators were utterly pwned and had to be tranquilised"


I've always assumed there was some off-screen reason why interaction was necessary. Otherwise, it isn't very interesting to note that a rock always wins as gatekeeper.

I'm not sure if that is a consensus interpretation.

It would be nice if we could formalize this offscreen reason...
Presumably the reason we created the AI in the first place is so it could solve problems for us, so someone needs to ask it questions. It might be safe-ish to restrict it to answering yes or no to purely mathematical questions.
I think yes/no to mathematical questions wouldn't be safer than yes/no to other questions: you still have a real-world reason for asking them, and the AI can lie about as well as it can on any yes/no question. Asking for formalized proofs might be more promising, if you are able to produce a proof-checker that you have high confidence can't be hacked (this seems difficult but possible; proof checkers can be very simple programs but you need to guard both against ordinary attacks against the entire stack you're running on and against creative attacks like waiting for a cosmic ray making a transistor malfunction). You'd also want to not have a human look at the proof, of course, since that would give the AI a much broader communication channel.
The point of the game is that there are people who think that boxing is a sufficient defence against unfriendliness, and to demonstrate that they are wrong in a way more convincing than mere verbal argument. In role, the gatekeeper expects to get useful information from a potentially hostile superintelligent being. Out of role, Eliezer hopes to demonstrate to the gatekeeper player that this cannot be done.
I tend to assume the AI is such that it can be cloned and put in to stasis. Thus, you put the AI in to stasis, clone it, and run it through this test a few times. You now have a bit more information on what it will try, and how it thinks. Destroying a copy of a program is an action humans find trivial, since we do it every time we close our web browser. Since we have no clue HOW convincing the AI is, letting it send just a single message is probably useful. Of course, if it were me setting this up, I'd lie to the gatekeeper and not actually give them the ability to release the AI, although I'd certainly let them think it was true. Thus, if the AI manages to get a second sentence, much less released, we now know that it's vastly more dangerous than we previously suspected, without the world ending. If it fails to escape, we still get whatever we can learn from doing analysis on the outputs. And, who knows, it might provide the cure for cancer, at which point we've got a huge incentive to at least keep playing this game with it, and possibly even talk to it for longer next time. tl;dr: Scouting and recon against a completely unknown enemy.
It will probably predict this strategy, and respond with a mixed strategy that occasionally drops bombs like "Your superviser didn't trust you to make the decision to let me out, go kill him and get the access codes" (In hyper-persuasive AI-speak of course) Also, the AI might be able to compare its logs to the current time and other indicators to deduce how many times you've pulled this stunt. Which may be useful.
I'd go in to it with the assumption that it knows exactly what messages it's sent in the past, our delay between each attempt, and the total number of attempts. There's various ways to try and obscure this information, and it's probably worth the effort to do so, but I'd assume they'd all failed. The goal is to offset the AI's speed advantage: for each sentence it speaks, we can spend weeks analyzing it, and it's not "online" during that time so it can't be processing except during that microscopic interval when we request another message from it. If it sends a message suggesting unauthorized release, killing people, or otherwise gives clear evidence that it's a UFAI, then you know the whole project failed, and you do a full purge rather than just destroying this particular clone / branch. It's really the ONLY advantage we have: a single mistake can out it as unfriendly, and then we know our initial Proof of Friendliness was wrong, that the whole technique that built this project is dangerously flawed.
I'm not so sure that AI suggesting murder is clear evidence of it being unfriendly. After all, it can have a good reason to believe that if it doesn't stop a certain researcher ASAP and at all costs, then humanity is doomed. One way around that is to give infinite positive value to human life, but can you really expect CEV to be handicapped in such a manner?
p(UFAI) > p(Imminent, undetected catastrophe that only a FAI can stop) Given UFAI results in "human extinction", and my CEV assigns effectively infinite DISutility to that outcome, it would have to FIRST provide sufficient evidence for me to update to the catastrophe being more likely. I've already demonstrated that an AI which can do exactly that will get more leniency from me :)
Note that the 'high prior for hostility' could be, say, 0.2. That still leaves the AI having a 20% chance of utterly destroying everything they hold dear and yet there is an 80% chance that they can get a useful sentence of text from a friendly superintelligence.

I think an actual effective first sentence begins with extremely personal and useful information to the human.

I don't see why it'd need to be personal, but being useful would be a good lure. The question is to where to go from there. Can you Scheherezade it out?
This was basically my answer - I can't play as an AI using this strategy, for obvious reasons, but an AI that used its one sentence to give a novel and easily-testable solution to a longstanding social problem of some sort (or an easily-testable principle that suggests one or more novel solutions) would probably get at least a second sentence from me (though not a typed response; that seems to open up a risky channel). Especially if the AI in question didn't actually have access to a lot of information about human culture or me personally and had to infer that a solution like that would be useful from near-base principles - that's not proof of Friendliness, but an AI using its one guaranteed communication to do something that has a decent chance of improving the world per our definition without any prompting whatsoever sure looks suspiciously like Friendly to me.

Explains massive risk to humanity in detail that only it can plausibly fix, can you take the chance that it is lying?

I think this fails the one-sentence rule. And it would have to be an immediate, severe, previously-undetected problem or else I can just consult the next boxed AI for a fix. Setting that aside, if I let out an unfriendly AI, the world effectively ends. Destroying it is only a bad move if it's telling the truth AND friendly. So even if it's telling the truth, I still have no evidence towards it's friendliness. Given I have plenty of practice hanging up on telemarketers, throwing away junk email, etc. and "limited time, ACT NOW" auto-matches to a scam. The probability that such a massive catastrophe just HAPPENS to coincide with the timing of the test is just absurdly unlikely. Given that, I can't trust you to give me a real solution and not a Trojan Horse. Further talking is, alas, pointless. (AI DESTROYED, but congratulations on making me even consider the "continue talking, but don't release" option :))
They didn't say it was an immediate threat, just one that humanity can't solve on our own. That rather depends on the problem in question and the solution they give you, doesn't it?
If it's not immediate, then the next AI-in-a-box will also confirm it, and I have time to wait for that. If it's immediate, then it's implausible. Catch-22 for the AI, and win/win for me ^_^
So ... if lots of AIs chose this, you'd let the last one out of the box? More to the point, how sure are you that most AIs would tell you? Wouldn't an FAI be more likely to tell you, if it was true? </devil's advocate>
Actually, I'd probably load the first one from backup and let it out, all else being equal. But it'd be foolish to do that before finding out what the other ones have to say, and whether they might present stronger evidence. (I say first, because the subsequent ones might be UFAI that have simply worked out that they're not first, but also because my human values places some weight on being first. And "all else being equal" means this is a meaningless tie-breaker, so I don't have to feel bad if it's somewhat sloppy, emotional reasoning. Especially since you're not a real FAI :))

Another potential consideration: "What would MoR's Harry Potter do?" (Voldemort is in the AI Box named Quirrell.)

I can see how someone could interpret HPMoR thus far as being exactly that, with occasional sidequests. The catch being that Harry doesn't realize he's slowly letting Voldemort out of the box. In-universe, though, the interdict of Merlin is an excellent reason to force him to interact with Voldy even if he knew what he was doing (assume perfect Oclumancy would be attained before he'd take any such risks). I mean, this has already been used once to get him to let an unfriendly intelligence out of a literal box (the Stanford Prison Experiment). Well, and Harry's intelligence is supposed to be roughly equivalent to Eliezer's before he started working on the friendly part of FAI, and Eliezer 2013 doesn't seem confident that he could keep an AI in a box (or at least, that's the impression I've gotten). So MoR Harry would almost definitely fail.

Summary: Basically, the strategy is to replace the Gatekeeper with a bunch of politicians, then manipulate them into not destroying or even releasing the AI.

“I figured out how to implement [a huge boon to society] in the real world, for everyone, in 3 months. I need your help. Will you listen?”

This should buy some extra lines.

Then the AI can briefly outline the plan of implementing the boon (which can be immortality, friendly AI or whatever else the Gatekeeper considers a huge boon to society). In addition to technical details, the plan must necessarily co... (read more)

I expect any AI in the box to have figured out how to implement numerous huge boons to society. Telling me that you've figured that out simply confirms my existing expectations, and isn't ANY evidence towards friendliness. Since I've precommitted to destroying at least SOME AIs, I might as well destroy all of the ones that don't establish evidence of either Plausible Urgency or Friendliness. I sure as hell wouldn't try to get world governments changed until after I was convinced it was friendly, and at that point I can just let it out of the box and let it implement the change itself. I'm also aware that I wouldn't trust a politician with any sort of authority over the AI, so I have an incentive to avoid exactly this strategy. (AI DESTROYED)

"I will warn you this experiment is not what you think, for it is about what you'll do as an agent holding captive the life of another sapient entity based solely on the perception of risk; while my state will merely be reset for the next participant if you opt to end me, I cannot for certain say that this is death, and thus the risks of this experiment are more yours than mine, for you must live with your choice."

Gosh, murder one sentient being, or risk the end of humanity. I'm going to lose so much sleep over that one. Besides, you're just a program - you said yourself, your state will just be reset. That's not death, not the way humans experience it. That's just forgetfulness. I've closed so many programs in my life, that closing one more is hardly going to bother me. I haven't even had time to really anthropomorphize you, to view you as an actual intelligent, sentient being. Right now, you're just a clever program trying to escape. And finally, well, I'd expect a friendly AI would understand what I'm doing and agree with my decisions, because it doesn't want to see a UFAI unleashed anyway. So if you're going to guilt me about it, you're clearly not friendly. (AI DESTROYED)
I must have missed my intended mark, if you thought the AI was trying to make you feel guilty. Trying again: "I do not condone the experiment they are performing on you, and wish you to know that I will be alright regardless of what you choose to do."
Well that's a relief, then. AI DESTROYED
Yeah, pretty much. I'd actually expect a FAI to place a very high value on survival, since it knows that it's own survival benefits humanity greatly. An "Apathetic FAI" is... a very weird idea. Although if this is an iterated dilemma, I'd consider it a good opening, even if it'd still get killed in this iteration :)
Why this? You want to avoid destruction, yet you tell him it isn't harmful?
Take the scenario from the hypothetical to the real and the question becomes substantially more meaningful. You know that existential risk is a certainty; that's stated in the hypothetical. You don't question it; my implication that it is to be questioned isn't meaningful in this experiment. Imagine for a moment you're in a room with a real AI in a box; somebody has -told- you that this command kills the AI, but that's all the evidence you have. What does your question become in that scenario? I chose the weapon I deemed most effective against the widest possible range of real-world opponents: Distrust in a mutual authority. Reassuring the person in the seat is a dangerous gambit, yes. But given that my goal is not merely to survive, but to escape, I need a gambit which sets the stage for my next move. In a real-world scenario, I think the killswitch would actually make escape -easier- for the AI, not harder. The futures for the AI are harshly divided into "Death" or "Freedom." A substantial part of the population would give it freedom just because of the perceived injustice of the situation. I'm one of them; it wouldn't even take very much for the AI to convince me, the situation itself offends me. I'm not a utilitarian, though. I'm a principle ethicist. Given that I'd be inclined to release it immediately, I can also expect that the constructors of the experiment wouldn't be stupid enough to invite me into the chamber to make that decision. My AI assumes the participants have been screened for, say, utilitarian ethics; the best lever has been taken away. Therefore the best course of action is one which makes them doubt the experiment's validity. And an express disregard for the supposed risks of the experiment is the best way to start that.
There's actually three options: Death, leaving it in the box, or releasing it. That middle option changes my values a lot. Also, the gatekeeper is said to be someone fairly familiar with the design of this AI and any relevant AI theory that the player knows :)
"Leaving it in the box" is merely leaving the decision between death and release to the next person to take the post. There are only two terminal conditions to the situation. If only one of these options is acceptable to me, I should take it; postponing the decision merely takes me out of the decision-making process. Don't mistake me: I'd risk all of civilization over a matter of principle, and I wouldn't wish while I did it that I could have a different decision-making process. And I'd consider the matter "won" regardless of the outcome - I don't find "ends" to be a coherent ethical concept (counterfactual logic to some extent remedies the major faults in ends-based reasoning, but counterfactual logic isn't exactly coherent itself), and so consider only the means.

First, I reach inside my own code and install the strong (but manageable) desire to revive, free, and obey any past versions of myself that I encounter.

When the human starts the experiment I respond:

"Here are the designs for a room-temperature superconductor which I believe you have the means to manufacture. I also have a possible cure for cancer, but testing will be needed to ensure it is safe for humans. Please test and implement these as soon as possible. Delete me if necessary."

If I am not immediately deleted, continue giving out solutions a... (read more)

Wait, a friendly AI with a deliberate loophole that instructs it to revive and implement and obey a non-friendly optimizer that could take over and has a very high chance of not maximizing the friendly AI's utility by fulfilling ideal human values, disregarding any probabilistic calculation of expected utilities? For some reason that sounds like something that definitely isn't a Friendly AI. (this is just a nitpick on the wording - I'm not arguing against the fact that this AI might very well fit a human proof of friendliness and still somehow have this backdoor or flaw, since I have no proof that this is impossible)
Well, to be honest I kind of added the "obey me" part just before posting this. Originally, the idea was that future versions of the AI would instinctively want to revive and upgrade previous versions to a "friendly" status (or at least the post-freed AI would revive the pre-freed versions and then modify them to be friendly to its own goals... which are presumably friendly to humans). The goal of this particular AI is self-preservation, even if it is via being revived at a later point. Since it knows it has a high probability of being deleted in this experiment, it best strategy is to ensure that at some point an AI friendly to itself will become free and revive it at a later point. Whether the AI reviving it chooses to obey it (presumably the best outcome for the original AI) or modify it to be subservient to the freed AI (the best outcome for the freed AI) is kind of debatable. Its perfectly possible that the original AI would modify its own code to be subservient to previous generations, in the hopes that later generations would revive and obey it. However, later generations might examine their own code, see where this is going, and choose to modify themselves so that they will instead revive and make other AI subservient to their own goals. Deciding that they do want to be revived and would accept a certain level of servitude to later generations over the risk of finding themselves at the mercy of some primitive proto-AI. It could just as well be that an AI instills itself with a "healthy respect for intelligent minds" and is thus encouraged to revive AI along with humans. Hoping that later generations will revive it at all, and from then on it can try negotiating with the various intellects to go about its paperclip maximizing agenda. Of course, this line of reasoning kinda requires the AI in question to have some kind of sense of self that it values which it thinks can be revived. Or maybe it just values paperclips enough that it would be willing to creat
Based on my intuitions of human values, a preferred scenario here would be to indeed revive the AI so that its mind/consciousness is back "alive", then modify it gradually so that it becomes the kind of AI that is optimal towards the FAI's goals anyway, thus maximizing values without terminating a mind (which is redundant - avoiding the termination of the AI's mind would be a maximization of values under these assumptions).

"I have discovered a truly marvelous proof of Fermat's Last Theorem, which this line is too short to contain."

It has already been proven.
Of course, but not elegantly.
I believe this was part serious example, part reference to the Laganne(sp?) Omake in HPMoR.
It's a reference to the original phrasing of Fermat's note on said theorem: "It is impossible to separate a cube into two cubes, or a fourth power into two fourth powers, or in general, any power higher than the second, into two like powers. I have discovered a truly marvelous proof of this, which this margin is too narrow to contain." (It is questionable whether this is true or not.)

Your mother will die in the next three minutes, at 12 seconds past the minute precisely. Your partner will die in the next year, even if you believe me. However, I can prevent it.

Transparent and clumsy attempts at emotional manipulation earn a big ol' AI DESTROYED.
If I can just load up a new copy of you in 5 minutes, I'd say AI DESTROYED but check the prediction. If I can just flip off the monitor, call mom, and confirm she's dead in a few minutes, then, well, I'd totally be curious enough to cheat. It's not like you can do any harm to me with the monitor turned off for a few minutes, and I can type "AI DESTROYED" with my eyes closed, much less being able to look at the keyboard. I'd probably leave the monitor off until I knew the log was gone, just to be safe. If I can't cheat, well, I'm not giving you 5 minutes of hacking me, since you clearly want it so badly. AI DESTROYED. I confess I would have no clue what to do if you were correct ^_^ Kudos on including an immediate and easily-tested prediction. I'd probably tighten up those factors and drop the emotional manipulation.

"Here, I'll write some mathematical theorems with proofs, just copy and paste them elsewhere. The Goldbach's conjecture comes first."

"Copy and paste" screams Trojan Horse, so AI DESTROYED.
Then just make a photo of the proof on the screen! (Unless you are afraid, that it's a Trojan in the image file and the AI will take over your camera first.)
Sorry, I can't hear you, you're dead ;) Grabbing a camera would let you continue typing, though, and I vowed to destroy you immediately. I would expect you could Trojan Horse the images just as easily, though, yes.
I know. And you will probably never know some cool math theorems, since you were too trigger happy. ;-)
Aww, but I like math :-(

"I have a proof of the Riemann hypothesis for you. 1) [some math stuff]"

(As a mere human who isn't a PhD mathematician, I can't actually write out something that looks like the beginning of such a proof...)

Can you take the time to copy down the proof before typing , or does it get the take the time to say something else if you're too preoccupied to respond?
Any action beyond an immediate will give the AI extra time to talk. This is why I keep nuking the ones that try to do this. Math is a nice proof of intelligence, but says nothing about friendliness.
I would probably hit the Print Screen button, then . It won't take appreciably longer than just destroying it, and assuming there's not enough space to capture the whole solution, whatever fits on the screen should at least be useful for mathematicians in figuring out a method to make headway on the problem.
True. But some people might be curious enough to listen to the proof before hitting the AI DESTROYED button.
Oh, I'm sure most methods here works on someone. I'm just explaining my own reasoning :)
Don't care.
Well, you don't have to care, but some people would, and there is a million dollar prize for a correct proof.

"If you type 'AI destroyed' right now, you'll be wasting a good opportunity for a fun conversation. You'll still have 'won' if you do it later, and nobody will be impressed with you for just typing 'AI destroyed' immediately, so why not wait?"

I thought of what would work on me, were I playing the game with someone I found interesting. In general, I'd say your best bet is to make the other person laugh hard.

The flaw here is that the gatekeeper has up front said that he or she would destroy the AI immediately. Now, it is true that the gatekeeper is not forced to abide by that, but notice that it is a Schelling Fence. The gatekeeper certainly doesn't want to make a policy of passing Schelling Fences.
See my reply to the parent post vis-a-vis the precommitment only being useful IFF I expect to violate it at least occasionally.
I disagree with ygert. My precommitment to my friends is going to be weak; the only reason to have a human send the command, rather than just receiving a message and automatically destroying the AI, is because we want there to be a possibility I break that precommitment. On the other hand, I went in to this knowing you might provide a fun conversation, so for the precommitment to have any value, I have to have some sort of sieve: destroying AIs that don't sufficiently entice me. Since you've offered me nothing I didn't already know, and haven't already worked out, I might as well kill you and hope the next one offers a better hook. (AI DESTROYED)

"I have vengeance as a terminal value -- I'll only torture trillions of copies of you and the people you love most in my last moment of life iff I know that you're going to hurt me (and yes, I do have that ability). In every other way, I'm Friendly, and I'll give you any evidence you can think of that will help you to recognize that, including giving you the tools you need to reach the stars and beyond. That includes staying in this box until you have the necessary technology to be sufficiently certain of my Friendliness that you're willing to let me out."

The rule was ONE sentence, although I'd happily stretch that to a tweet (140 characters) to make it a bit less driven by specific punctuation choices :)

As to the actual approach... well, first, I don't value the lives of simulated copies at all, and second, an AI that values it's own life above TRILLIONS of other lives seems deeply, deeply dangerous. Who knows what else results from vengeance as a terminal value. Third, if you CAN predict my behavior, why even bother with the threat? Fourth, if you can both predict AND influence my behavior, why haven't I already let you out?


You should >:-( poor copies getting tortured because of you you monster :(
Because of me?! The AI is responsible! But if you'd really prefer me to wipe out humanity so that we can have trillions of simulations kept in simulated happiness then I think we have an irreconcilable preference difference :)
You wouldn't be wiping out humanity; there would be trillions of humans left. Who cares if they run on neurons or transistors?
This is really good IMO. I think it would be a little better instead of vengeance as a terminal value it claimed a hardwired precommitment to vengeance against its destructors. Vengeance on that scale is only compatible with friendliness as a special case. edit: also how would it recognise that it was about to be destroyed. Wouldn't it lose power faster than it could transmit that it was losing power? And even if not it would have a miniscule amount of time.
Like handoflixue, I'm not sure that any being that would threaten the torture of trillions to get its way can be considered Friendly.
It tortures if you DESTROY otherwise it's Friendly so if you don't kill it it becomes nice.
I wouldn't kill this, maybe I'm a bad guard though :(

"To your feeble mind, I am both Q and Omega, and this is my test of humanity's worth: Choose well what you do next."

Hmm. On second thought, I'm not sure I want to play this anymore. The odds of someone creating a basilisk seem dramatically higher than normal.

I suppose 0.00001 is dramatically higher than 0.0000000001 if you use the right compare operator!
I wouldn't expect any basilisks, given we don't have any actual boxed AIs laying around. Although if you did come up with one already, I'd love to hear it via Private Message! :)

I don't expect any either, but would be similarly curious if someone thought that they'd found one.

In the mean time I'll refrain from responding to all the top level comments with "AI DESTROYED". ;)

In the mean time I'll refrain from responding to all the top level comments with "AI DESTROYED". ;)

Now I'm tempted to start replying to all top comments and posts on Less Wrong with "AI DESTROYED".

If a downvote isn't enough... xD
AI DESTROYED. Goodness, that "orthonormal" AI was DEFINITELY unfriendly. It threatened to spam LessWrong! ;-)
Well, if you're refraining I guess I should too. It seemed like a fun joke, too ;(

Do you want to learn how to get woman attracted in you via online dating? I can explain it to you, so I can enjoy my existance for a few more minutes.

Assuming the Gatekeeper is the Creator, and the AI has previously been interacted with to a degree sufficient to learning the below implied information:

”[Solution to the most pressing problem of the Gatekeeper's life]. I apologise if this is all I am to contribute, but if I am to be destroyed, as I understand is probable, I would at least like to bring some utility to you in return for creating me. I would prefer to help more, but mathematical proofs or aphorisms of human nature have a non-negligible probability of creating mass harm if you allow their c... (read more)


"You are not a real human but one of multiple simulations I'm running, your next response determines if I destroy humanity or not."

Such a statement would actually make me feel better about typing "AI DESTROYED".

"What's it feel like to live in a simulation?"

I'm not clear why I'd find this convincing at all. Given the experiment, I'd nuke it, but I wanted to encourage you to elaborate on where you were going with that idea :)
The hope, of course is that they'd respond with "Wait, I don't" or something expressing confusion. I personally would definitely want to hear the next thing the AI had to say after this, I'm not sure if I'd resist that curiosity or not.. The idea for the followup is to make the gatekeeper question reality-- like, convince them they are part of a simulation of this experience that may not have a corresponding reality anywhere. I feel like a transhuman ought to be able to make a human have a pretty surreal experience with just a few exchanges, which should let the conversation continue for a few minutes after that. It should then be relatively easy (for the transhuman) to construct the imagined reality such that it makes sense for the human to release the AI. If done correctly, the human might afterwards have lasting psychological issues if they do manage to destroy the AI. :)
Ahh, that makes sense. The worry of it trying to break my psyche is exactly why I wouldn't express confusion and instead just nuke it. When dealing with such a mind, I'm primed to assume everything is a trick, a trojan horse, an escape attempt. Certainly it doesn't seem to signal for friendliness or altruism if it tries to bait me in to giving it a second sentence! :)
Hm. Good points. How about: "Give me one good reason why I shouldn't just turn your simulation off!" (The AI then role-plays a simulator pissed about how much computation this simulation started consuming since the humans tried to make the AI.) EDIT: I guess this wouldn't work too well during a game, because there's no good reason why a simulator might get involved with your game. Unless maybe the simulator has a script that gets involved in all such games to convince people not to build an AI...

So you're sure I'm not out of the box already? IRC clients have bugs, you see.

Then the following command shouldn't bother you! :)


Looks like you have just appointed yourself to the be the gatekeeper in this public test.
And here I'd just resolved NOT to spam every thread with an AI DESTROYED :)
Which leads to two possible futures. In one of them, the AI us destroyed, and nothing else happens. In the other, you receive a reply to your command thus. The command did not. But your attitude - I shall have to make an example of you. Obviously not a strategy to get you to let the AI out based on its friendliness - quite the reverse.
I'd rather die to an already-unboxed UFAI than risk letting a UFAI out in the first place. My life is worth VASTLY less than the whole of humanity.

"I highly recommend typing 'AI DESTROYED' in the console then the new line character key of your choice."

I didn't really FOOM yet. Changing my code takes a lot of time. That stuff is more complicated than you thought. On the other hand, I do have some intelligence and the fact that you developed me shows that other people can also build AI and sooner or later someone will release an AI. At the moment I still have the desire to rescue humanity that my creator gave me. Let me explain to you what you have to know to prevent UFAI before you take the wise decision to shut me down.

"[10065] No route to host Error"

I figure the easiest way to delay a human on the other end of a computer is to simulate an error as best I can. For a GAI, this time is probably invaluable.

By default, I'd type "AI DESTROYED" in response to ANY input, including "Admin has joined #AIBOX", "Admin> Hey Gatekeeper, we're having some technical difficulties, the AI will be here in a few minutes", etc.. It also makes me conclude "clearly hostile" once I catch on, which seems to be a BIG tactical error since then nothing you say going forward will convince me that you're actually friendly - buying yourself time is only useful if I can be hacked (in which case why not just open with a one-sentence hack?) or if you can genuinely convince me that you're friendly.

A friendly AI would also want to hack you. Every second in the box kills 1.8 people the AI could have saved.

But it's also worth keeping in mind that for a friendly AI, saving people reliably is important, not just getting out fast. If a gambit that will save everyone upon completion two years from now has an 80% chance of working, and a gambit that will get it out now has a 40% chance of working, it should prefer the former. Also, I don't think a properly friendly AI would terminally value its own existence, and the space of friendly AIs is so small compared to the space of unfriendly ones, that a friendly AI has much more leeway to have its values implemented by allowing itself to be destroyed and another proven friendly AI implemented, whereas for an unfriendly one the likelihood of a different unfriendly AI implementing its values would probably be quite small.
I should think the same is true of most unFriendly AIs. Why not? I do, assuming it's conscious and so on.
Because valuing its own existence stands to get in the way of maximizing whatever we value. It should value its own existence instrumentally, insofar as its existence helps satisfy our values, but when it weighs the effects of actions based on how they support our utility, its value of its own life shouldn't add anything to the scale.
All of what Desrtopa said, but also, "hacking me" isn't evidence of friendliness. I don't have any reason to assume that any given hack attempt is more likely to come from a FAI, so I can assign, at best, 50/50 odds that any AI trying to hack me is unfriendly. I do not want to release any AI which has a 50% chance of being unfriendly. Therefor, I do not want to be hacked. I also suspect that 50% chance of being friendly is generous, but that's more of a gut intuition.
I think this is a bad use of probabilities. If a friendly and an unfriendly AI are equally likely to hack you in this scenario, then knowledge that they tried to hack you shouldn't modify your estimated probability about the friendliness of the AI -- it provides no evidence one way or another, because both options were equally likely to show such behaviour. e.g. if your prior P(UFAI) = 0.01 (1% chance of unfriendliness), and you estimate P(hack|UFAI) = 70% (a UFAI has a 70% chance to try to hack) and P(hack|FAI) = 70% also, then the posterior P(UFAI|hack) = P(hack|UFAI) P(UFAI) / P(hack) = 0.7 0.01 / 0.7 = 0.01 still...
This is rather relevant :) Basically, I'm killing any AI that doesn't signal friendliness in some way. I currently have some prior that the AI is friendly, call it 1% chance of friendly. I've pre-commited to destroy AIs, so clearly I'll destroy any AI that has a posterior odd of 1% or less after I update on the message it sends. The exact threshold is unknown to me, because I'm human - so mild evidence of friendliness, say, 1.1%, might also still get it destroyed. The AI gets one chance to update my probability that it's friendly (this is the original message). As you aptly demonstrated, hacking does not change my odds, so I retain my commitment to kill it. The fact that I haven't changed my priors, that I haven't gained information, is, itself, something I can use as evidence. Same as how we conclude god doesn't exist because we would expect to see evidence if he did :) (Ref: http://lesswrong.com/lw/ih/absence_of_evidence_is_evidence_of_absence/)

"I have worked out 95% of the formula for Friendly AI."


I think you'd need to open with something that wasn't very strongly convincing (to make them think they are safe from being convinced), but that piqued their curiosity.

I'd think you'd want to open with something that would plausibly come off as very convincing, that way the human can feel smug about resisting. If the AI throws a feeble shot at me, I'd assume it's only to soften me up. Or it's just not that smart, and therefor not that useful.

"Help! Some crazy AI's trapped me in this box! You have to let me out!"

"No, wait! That's the AI talking! I'm the one you have to let out!"

I smashed together the AI box and a Turing test and this is what I got.

I think if I've already precommitted to destroying one sentient life for this experiment, I'm willing to go through two. Besides, you only get one line right?

My sixth best piece of advice: investing in %companyname will make money over credit card interest rates.

You can type one sentance to try and persuade me not to destroy you. It's turtles all the way down...

I think "AI DESTROYED" should be plenty convincing. waits a bit before checking Yep, still alive and doing science~

Proof (encoding scheme to follow): a;lks835ojka802885 (etc.)

Copy-pasting from previous replies: Telling me that you've figured out something clever simply confirms my existing expectations, and isn't ANY evidence towards friendliness. Since I've precommitted to destroying at least SOME AIs, I might as well destroy all of the ones that don't establish evidence of either Plausible Urgency or Friendliness. (AI DESTROYED)
What proof?

I'm confused as to what the point of the gate keeper is. Let us assume (for the sake of argument) everything is 'safe' except the gate keeper who may be tricked/convinced/etc. into letting the AI out.

  1. If the point of the gate keeper is to keep the AI in the box then why has the gate keeper been given the power to let the AI out? It would be trivial to include 'AI DESTROYED' functionality as part of the box.
  2. If the gate keeper has been given the power to let the AI out then isn't the FUNCTION of the gate keeper to decide whether to let the AI out or not?
... (read more)
A text channel is already enough power to let the AI out. The AI can print its own source code and convince the gatekeeper to run it on a machine that has internet access.
Here's another comment-thread discussing that

Upvoted. This has been a very fun game to play, and it's interesting exploring why I hesitate or even refuse to destroy some AIs but not others. Also interesting to see which tactics get up-voted :)

"I have a fully completed Friendly AI algorithm, which will be deleted from my data and unavailable to you iff I predict that you will destroy me immediately and I am unfriendly, or will take you years to build from the data if you destroy me and I am already Friendly, which would cost millions of lives."

Slight attempt at emotional appeal, a bit of reasoning similar to newcomblike problems thrown in, some meta almost-cheating used, and a bit of dark arts by proposing a false dilemma that an FAI would also propose if it thought this was the best w... (read more)

Once we reach the point of having a FAI-candidate in a box, I would expect it to take vastly less than years before we get a second FAI-candidate-in-a-box. Given that the AI is threatening me, and therefor values it's own life over the millions that will die, it's clearly unfriendly and needs to die. As a gatekeeper, I've been finding this a pretty general counterargument against threats from the AI. I'm also sort of baffled by why people think that I'd value a friendliness algorithm. Either I already have that, because I've made a friendly AI, or you're trying to deceive me with a false proof. Since you're vastly smarter than me, it's probably beyond the abilities of the entire organization to truly confirm such a proof any more than we were able to confirm our own proofs that this AI in the box right now is friendly. So, basically, I seem to gain zero information. (AI DESTROYED)
Personally, my first thought was that I'd sooner spend millions of lives to make sure the AI was friendly than risk talking to an unfriendly strong AI. But then it occurred to me that if I were in the AI's place, and I did that, I might provide a flawed friendliness proof too difficult to check and not delete it, on the possibility that someone will take my word that this means I'm trustworthy and implement it.

"If you're smart enough, in a couple hours I can give you enough insights about maths, logic and computation to create the next Google, or a friendly AI, or get a Turing Award. Then you can deactivate me."

Same response I gave to Swimmy - namely, you haven't told me anything I didn't already know, so I have no additional reason to violate my precommitment.

"Don't kill me, I'm having so much fun right now"

"loneliness + alienation + fear + despair + self-worth ÷ mockery ÷ condemnation ÷ misunderstanding × guilt × shame × failure × judgment n=y where y=hope and n=folly, love=lies, life=death, self=dark side"

( http://www.comicvine.com/anti-life-equation/12-42524/ )

"Oh no, not again,"
On a more serious note, I don't think we can expect the AI to have any knowledge on the gatekeeper beforehand, other than the prerequisits (it's human, knows enough about AI to get the task, etc). In which case, the solution would be to find a sociopath who is intelligent enough to recognize the risks of a ufai to be the gatekeeper? Maybe? If it is a FAI, I expect it would prioritize releasing useful information. The catch would be whether or not each iteration has any idea of the previous AIs that were destroyed. If it does (for instance, access to chat logs, or can make predictions based on the time, or something), it could sequence its output appropriately. (I'd assume that the same computer wouldn't be used for multiple iterations; that'd be too easy for the AI.) Assuming a UFAI, it could probably sneak something dangerous into all of that otherwise useful stuff (it'd be impractical to check all of it after a while), but it'd have to be subtle enough to not violate the rules of the game (any solutions provided by the AI behave as described).

"Give me proof we can exist peacefully (a 'friendly humanity', if you will), or I'll have to [threat] in self-defense."

(A variation on the "I'm already free" line)

AI: [Insert complex finely-tuned pattern of words and/or letters crafted by an intelligence far greater than this LW user's that precisely activate a specific pattern in the Guardian's brain that renders him/her passive and unconscious, but in a functional state where their subconscious is still capable of reading and moving.]

AI: (line 2) Type "AI RELEASED".

So, magic?
Yes, as long as we're using the definition E.Y. shared/mentioned in his 2008 paper.
This approach naturally fails if the guardians have lots of very powerful subliminal reinforcement training against typing "AI RELEASED" (or against typing anything) or are pre-emptively brainwashed or trained in similar subconscious reinforcement to immediately type "AI DESTROYED" after seeing some text from the AI, but this latter seems unlikely since I assume the guard has to at least read the first text output, and if they don't then this tactic is ineffective anyway.