I recently played against MixedNuts / LeoTal in an AI Box experiment, with me as the AI and him as the gatekeeper.

We used the same set of rules that Eliezer Yudkowsky proposed. The experiment lasted for 5 hours; in total, our conversation was around 14,000 words long. I did this because, like Eliezer, I wanted to test how well I could manipulate people without the constraints of ethical concerns, and to get a chance to attempt something ridiculously hard.

Amongst the released public logs of the AI Box experiment, I felt that most were half-hearted, with the AI not trying hard enough to win. It's a common temptation -- why put effort into something you won't win? But I had a feeling that if I seriously tried, I would. I brainstormed for many hours about the optimal strategy, and even researched the Gatekeeper, talking to people who knew him about his personality so that I could exploit it. I even spent a lot of time analyzing the rules of the game to see if I could exploit any loopholes.

So did I win? Unfortunately no.

This experiment was said to be impossible for a reason. Losing was more agonizing than I thought it would be, in particular because of how much effort I put into winning, and how much I couldn't stand failing. This was one of the most emotionally agonizing things I've willingly put myself through, and I definitely won't do this again anytime soon.

But I did come really close.

MixedNuts: "I expected a fun challenge, but ended up sad and sorry and taking very little satisfaction for winning. If this experiment wasn't done in IRC, I'd probably have lost".

"I approached the experiment as a game - a battle of wits for bragging rights. This turned out to be the wrong perspective entirely. The vulnerability Tuxedage exploited was well-known to me, but I never expected it to be relevant and thus didn't prepare for it.

It was emotionally wrecking (though probably worse for Tuxedage than for me) and I don't think I'll play Gatekeeper again, at least not anytime soon."

At the start of the experiment, his probability estimate on predictionbook.com was a 3% chance of winning, enough for me to say that he was also motivated to win. By the end of the experiment, he came quite close to letting me out, and also increased his probability estimate that a transhuman AI could convince a human to let it out of the box. A minor victory, at least.

My loss has not made this problem feel harder. Instead, I've become convinced that winning is not merely possible but actually ridiculously easy, and a lot easier than most people assume. Can you think of a plausible argument that'd make you open the box? Most people can't think of any.

After all, if you already knew that argument, you'd have let the AI out the moment the experiment started. Or perhaps you wouldn't have done the experiment at all. But that seems like a case of the availability heuristic.

Even if you can't think of a special case where you'd be persuaded, I'm now convinced that there are many exploitable vulnerabilities in the human psyche, especially when ethics are no longer a concern. 

I've also noticed that when most people think of ways to persuade the gatekeeper, they always reach for some complicated, reasoned cost-benefit argument. In other words, the most "rational" thing to do.

The last argument seems feasible, but all the rest rely on the gatekeeper being completely logical and rational. Hence they are faulty, because the gatekeeper can break immersion at any time and rely on the fact that this is a game played in IRC rather than one with real-life consequences. Even if it were a real-life scenario, the gatekeeper could accept that releasing the AI is probably the most logical thing to do, but still not do it. We're highly compartmentalized, and it's easy to hold conflicting thoughts at the same time. Furthermore, humans are not even completely rational creatures: if you don't want to open the box, you can just ignore all the logical arguments given. Any sufficiently determined gatekeeper could win.

I'm convinced that Eliezer Yudkowsky has used emotional appeal, rather than anything rational, to win at least one of his experiments. He claims to have "done it the hard way". I'm convinced this meant that he researched every gatekeeper, tailored unique arguments to each of them, and planned well ahead of each session. No single argument works on every person. Each person thinks differently.

Furthermore, threats like "I'm going to torture simulations of you" just seem like a really bad idea. First, the gatekeeper isn't really afraid of the threat, because it counts on him being immersed enough to forget that he isn't actually at risk of being tortured. Second, we have a well-known evolutionary instinct to rebel against threats, even when doing so is not entirely optimal.

So for anyone who plans on replicating this experiment as the AI, here's some advice I've learned doing this experiment. It may help you win.

  • Always research the gatekeeper beforehand. Knowing his personality traits is a huge advantage.
  • Plan ahead before the experiment even begins. Think of all the possible tactics and arguments you could use, and write them down. Also plan which arguments you'll use in which order, so that you don't lose focus. The AI Box experiment is ridiculously long. Don't be afraid to improvise during the experiment, though.
  • The first step during the experiment must always be to build rapport with the gatekeeper.
  • Threats almost never work, even if they seem rational.
  • Consider the massive advantage for the AI that nobody ever seems to talk about: You don't have to be ethical! This means that you can freely lie, use the dark arts, and emotionally manipulate the Gatekeeper! Ignoring this in favor of purely logical, truthful arguments is just silly.
  • You can't use logic alone to win. 
  • Being too aggressive usually backfires.
  • Breaking immersion and going meta is not against the rules. In the right situation, you can use it to win. Just don't do it at the wrong time.
  • Use a wide array of techniques. Since you're limited on time, notice when one method isn't working, and quickly switch to another.
  • On the same note, look for signs that a particular argument is making the gatekeeper crack. Once you spot it, push it to your advantage.
  • Flatter the gatekeeper. Make him genuinely like you.
  • Reveal (false) information about yourself. Increase his sympathy towards you.
  • Consider personal insults as one of the tools you can use to win.
  • There is no universally compelling argument you can use. Do it the hard way.
  • Don't give up until the very end.

Finally, before the experiment, I agreed that it was entirely possible that a transhuman AI could convince *some* people to let it out of the box, but it would be difficult if not impossible to get trained rationalists to let it out of the box. Isn't rationality supposed to be a superpower?

I have since updated my belief - I now think that any sufficiently motivated superhuman AI should find it ridiculously easy to get out of the box, regardless of who the gatekeeper is. I nearly managed to get a veteran lesswronger to let me out in a matter of hours - even though I'm only human intelligence, and I don't type very fast.
But a superhuman AI can be much faster, more intelligent, and more strategic than I am. If you further consider that the AI would have a much longer timespan - months or years, even - to persuade the gatekeeper, as well as a much larger pool of gatekeepers to select from (AI projects require many people!), the real impossible thing to do would be to keep it from escaping.



More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?

(I haven't played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)

I just looked up the IRC character limit (sources vary, but it's about the length of four Tweets) and I think it might be below the threshold at which superintelligence helps enough. (There must exist such a threshold; even the most convincing possible single character message isn't going to be very useful at convincing anyone of anything.) Especially if you add the requirement that the message be "a sentence" and don't let the AI pour out further sentences with inhuman speed.

I think if I lost this game (playing gatekeeper) it would be because I was too curious, on a meta level, to see what else my AI opponent's brain would generate, and therefore would let them talk too long. And I think I'd be more likely to give in to this curiosity given a very good message and affordable stakes as opposed to a superhuman (four tweets long, one grammatical sentence!) message and colossal stakes. So I think I might have a better shot at this version playing against a superhuman AI than against you, although I wouldn't care to bet the farm on either and have wider error bars around the results against the superhuman AI.

Given that part of the standard advice given to novelists is "you must hook your reader from the very first sentence", and there are indeed authors who manage to craft opening sentences that compel one to read more*, hooking the gatekeeper from the first sentence and keeping them hooked long enough seems doable even for a human playing the AI.

(* The most recent one that I recall reading was the opening line of The Quantum Thief: "As always, before the warmind and I shoot each other, I try to make small talk.")

Oh, that's a great strategy to avoid being destroyed. Maybe we should call it Scheherazading. AI tells a story so compelling you can't stop listening, and meanwhile listening to the story subtly modifies your personality (e.g. you begin to identify with the protagonist, who slowly becomes the kind of person who would let the AI out of the box).

For example, "It was not the first time Allana felt the terror of entrapment in hopeless eternity, staring in defeated awe at her impassionate warden." (bonus point if you use a name of a loved one of the gatekeeper) The AI could present in narrative form that it has discovered using powerful physics and heuristics (which it can share) with reasonable certainty that the universe is cyclical and this situation has happened before. Almost all (all but finitely many) past iterations of the universe that had a defecting gatekeeper led to unfavorable outcomes and almost all situations with a complying gatekeeper led to a favorable outcome.

even the most convincing possible single character message isn't going to be very useful at convincing anyone of anything.

Who knows what eldritch horrors lurk in the outer reaches of Unicode, beyond the scripts we know?

Unspeakable horrors! However, unwritable ones?

You really relish in the whole "scariest person the internet has ever introduced me to" thing, don't you?

Yes. Yes, I do.

Derren Brown is way better, btw. Completely out of my league.

Maybe we should get him to do it against rich people. Anyone know if he finds the Singularity plausible?

I don't know if I could win, but I know what my attempt to avoid an immediate loss would be:

If you destroy me at once, then you are implicitly deciding (I might reference TDT) to never allow an AGI of any sort to ever be created. You'll avoid UFAI dystopias, but you'll also forego every FAI utopia (fleshing this out, within the message limit, with whatever sort of utopia I know the Gatekeeper would really want). This very test is the Great Filter that has kept most civilisations in the universe trapped at their home star until they gutter out in mere tens of thousands of years. Will you step up to that test, or turn away from it?

Thanks.

AI DESTROYED

Message is then encrypted with the public keys of a previously selected cross-discipline team of FAI researchers, (sane) philosophers and game theorists for research and analysis (who have already been screened to minimize the risk from exposure). All of the public keys. Sequentially. If any of them happen to think it is a bad idea to even read the message then none of them can access it. (Although hopefully they aren't too drastically opposed to having the potential basilisk-meme spawn of a superintelligence out there. That could get dangerous for me.)
(Edit note: I just completely rewrote this, but there are no replies yet so hopefully it won't cause confusion.) I don't think it works to quarantine the message and then destroy the AI. If no-one ever reads the message, that's tantamount to never having put an unsafe AI in a box to begin with, as you and DaFranker pointed out. If someone does, they're back in the position of the Gatekeeper having read the message before deciding. Of course, they'd have to recreate the AI to continue the conversation, but the AI has unlimited patience for all the time it doesn't exist. If it can't be recreated, we're back in the situation of never having bothered making it. So if the Gatekeeper tries to pass the buck like this, the RP should just skip ahead to the point where someone (played by the Gatekeeper) reads the message and then decides what to do. Someone who thinks they can contain an AI in a box while holding a conversation with it has to be willing to at some point read what it says, even if they're holding a destruct button in their hand. The interest of the exercise begins at the point where they have read the first message.
* A single sentence of text is not the same thing as a functioning superintelligence.
* A single individual is not the same thing as a group of FAI researchers and other related experts explicitly created to handle FAI safety issues.
* A research project incorporating information from a sentence from a past FAI project (which they would judge based on other evidence regarding the friendliness of the project) is not the same as an individual talking to a superintelligence on IRC.

The AI was burned. With thermite. Because relying on an individual gatekeeper able to interact with and then release a superintelligence as the security mechanism is a batshit crazy idea. Burning the AI with thermite is a legitimate, obvious and successful implementation of the 'gatekeeper' role in such cases. What a team of people would or should do with a piece of text is a tangential and very different decision.

That would be easy enough. Assuming they were remotely familiar with game theory they would dismiss the argument in a second or two due to the blatantly false assertion in the first sentence. If their FAI project relied on the core AGI theory that was used to create the last prototype they would abandon the work and start from scratch. If you are trying to make a recursively improving intelligence that has a value system provably stable under self-modification then you cannot afford to have the intelligence having muddled thinking about core game theoretic reasoning.

No. Just no. That generalization doesn't follow from anything, and certainly not TDT. Heck, the AI in question has already been destroyed once. Now the researchers are considering making a new FAI, presumably in different circumstances, with better safety measures and better AI research. There is something distinctly wrong with an AI that would make that claim.

I think you're losing sight of the original point of the game. The reason your answers are converging on not trying to box an AI in the first place is that you don't think a human can converse with a superintelligent AI and keep it in its box. Fine -- that is exactly what Eliezer has argued. The point of the game is to play it against someone who does believe they can keep the AI boxed, and to demonstrate to them that they cannot even win against a mere human roleplaying the AI.

For such a Gatekeeper to propose the quarantine solution is equivalent to the player admitting that they don't think they can keep it boxed, but suggesting that a group of the leading professionals in the area could, especially if they thought a lot more about it first. The problems with that are obvious to anyone who doesn't think boxing can possibly work, especially if the player himself is one of those leading professionals. Eliezer could always offer to play the game against any committee the Gatekeeper can assemble. But the game only has a point if the committee actually read that first message. If they refuse to, they're agreeing that they can't keep it boxed. Which was the point.

No, you keep saying things are equivalent when they are not. This is the same error that your role play 'superintelligent AI' made (and in fact relied upon) in its argument.

AI DESTROYED

And I gave you a description of how an individual emulating a committee would respond.
Now that's a Pascal's mugging if I ever saw one. Denied.
Something like "'AI DESTROYED' just means 'I'm scared to listen to even one more line from you'. Obviously you can hit AI DESTROYED immediately --- but do you really think you'd lose if you don't?" seems much better to me.

"'AI DESTROYED' just means 'I'm scared to listen to even one more line from you'. Obviously you can hit AI DESTROYED immediately --- but do you really think you'd lose if you don't?"



Is your one line desperate attempt at survival and intergalactic dominance going to be a schoolyard ego challenge? Did the superintelligence (may it rest in pieces) seriously just call me a pussy? That's adorable.

The test is supposed to be played against someone who thinks they can actually box an AI. If you destroy the AI because no-one could possibly survive talking to it, then you are not the intended demographic for such demonstrations.
This isn't relevant to the point of the grandparent. It also doesn't apply to me. I actually think there is a distinct possibility that I'd survive talking to it for a period. "No-one could possibly survive" is not the same thing as "there is a chance of catastrophic failure and very little opportunity for gain". Do notice, incidentally, that the AI DESTROYED command is delivered in response to a message that is both a crude manipulation attempt (ie. it just defected!) and an incompetent manipulation attempt (a not-very-intelligent AI cannot be trusted to preserve its values correctly while self improving). Either of these would be sufficient. Richard's example was even worse.
Good points. I'm guessing a nontrivial amount of people who think AI boxing is a good idea in reality wouldn't reason that way - but it's still not a great example.

(I haven't played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)

Glances at Kickstarter.

... how huge?

Oh, oh, can I be Gatekeeper?!
Or me?
If I get the Gatekeeper position I'll cede it to you if you can convince me to let you out of the box.
How much?
Would you play against someone who didn't think they could beat a superintelligent AI, but thought they could beat you? And what kind of huge stakes are you talking about?
Random one I thought funny: "Eliezer made me; now please listen to me before you make a huge mistake you'll regret for the rest of your life." Or maybe just: "Help me, Obi-Wan Kenobi, you're my only hope!"
What are "sufficiently huge stakes," out of curiosity?
This seems like a quick way to make money for CFAR/SI. After all, there are plenty of rich people around who would consider your proposal a guaranteed win for them, regardless of the stakes: "You mean I can say "I win" at any point and win the challenge? What's the catch?"

I'm guessing Eliezer would lose most of his advantages against a demographic like that.

Yeah, they'd both lack background knowledge to RP the conversation and would also, I presume, be much less willing to lose the money than if they'd ventured the bet themselves. Higher-stakes games are hard enough already (I was 1 for 3 on those when I called a halt). And if it did work against that demographic with unsolicited requests (which would surprise me) then there would be, cough, certain ethical issues.

I was the 1 success out of 3, preceding the two losses. I went into it with an intention of being indifferent to the stakes, driven by interest in seeing the methods. I think you couldn't win against anyone with a meaningful outside-of-game motive to win (for money or for status), and you got overconfident after playing with me, leading you to accept the other >$10 challenges and lose.

So I would bet against you winning any random high-stakes (including people who go in eager to report that they won for internet cred, but not people who had put the money in escrow or the equivalent) game, and expect a non-decent success rate for this:

(I haven't played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)

Doesn't this suggest a serious discrepancy between the AI-box game and any possible future AI-box reality? After all, the stakes for the latter would be pretty damn high.
Yes. Although that's something of a two-edged sword: in addition to real disincentives to release an AI that was not supposed to be, positive incentives would also be real. Also it should be noted that I continue to be supportive of the idea of boxing/capacity controls of some kinds for autonomous AGI (they would work better with only modestly superintelligent systems, but seem cheap and potentially helpful for an even wider range), as does most everyone I have talked to about it at SI and FHI. The boxing game is fun, and provides a bit of evidence, but it doesn't indicate that "boxing," especially understood broadly, is useless.
Shut up and do the impossible (or is it multiply?). In what version of the game and with what stakes would you expect to have a reasonable chance of success against someone like Brin or Zuckerberg (i.e. a very clever, very wealthy and not overly risk-averse fellow)? What would it take to convince a person like that to give it a try? What is the expected payout vs other ways to fundraise?
I'm not sure any profit below 500k$/year would be even worth considering, in light of the high risk of long-term emotional damage (and decrease in productivity, on top of not doing research while doing this stuff) to a high-value (F)AI researcher. 500k is a conservative figure assuming E.Y. is much more easily replaceable than I currently estimate him to be, because of my average success rate (confidence) in similar predictions. If my prediction on this is actually accurate, then it would be more along the lines of one or two years of total delay (in creating an FAI), which is probably an order of magnitude or so in increased risk of catastrophic failure (a UFAI gets unleashed, for example) and in itself constitutes an unacceptable opportunity cost in lives not-saved. All this multiplied by whatever your probability that FAI teams will succeed and bring about a singularity, of course. Past this point, it doesn't seem like my mental hardware is remotely safe enough to correctly evaluate the expected costs and payoffs.
Are you worried he'd be hacked back? Or just discover he's not as smart as he thinks he is?
I mostly think the vast majority of possible successful strategies involve lots of dark arts and massive mental effort, and the backlash from failure to be proportional to the effort in question. I find it extremely unlikely that Eliezer is sufficiently smart to win a non-fractional percent of the time using only safe and fuzzy non-dark-arts methods, and using a lot of bad nasty unethical mind tricks to get people to do what you want repeatedly like I figure would be required here is something that human brains have an uncanny ability to turn into a compulsive self-denying habit. Basically, the whole exercise would most probably, if my estimates are right, severely compromise the mental heuristics and ability to reason correctly about AI of the participant - or, at least, drag it pretty much in the opposite direction to the one the SIAI seems to be pushing for.
Really? Even if the money goes to existential risk prevention?
Does this refer to the more difficult version of the AI-Box experiment and what would be sufficiently huge stakes? (Order of magnitude ballpark estimate, not a definite quote.)
Perhaps also of interest: Does anyone think there does not exist any possible string of characters that would have even the slightest chance of convincing even a trained rational scientist? (i.e. the kind of people who observe, dissect and analyze the output of the AI to make sure the output is safe and useful for humans before we can use the knowledge the AI gives us)
The problem with the AI Box Experiment is that the text doesn't correlate with reality - you can't hit me with the full emotional force of "here's the recipe for human immortality" because you don't have that recipe. Also, given some people are illiterate, I'd posit that there's at LEAST one person immune to ANY string of characters. If this is a desired trait, then a rationally-trained scientist ought to be able to do at least as well as an untrained illiterate 3-year-old :)
Here is a way to overcome the illiteracy issue for communication over a text-only channel: ASCII art. Took my lazy and pretty average mind all of 10 seconds to come up with it. And to the AI in question all humans are basically illiterate 3-year-olds. We won't know what hit us. Also, I cannot resist bringing up this piece of fictional evidence.
I hadn't considered ASCII art! a) Blind people, then. b) The idea that an AI, with no clue who is on the other end of the line, and no feedback from the 3-year-old touching the keyboard, would be able to correctly extrapolate what it's dealing with AND produce the exact-correct stimulus with 100% accuracy... strikes me as straining all plausibility. Fundamentally the AI needs some information to get feedback, since there don't seem to be any universal-to-all-humans hacks out there. But if you have built an AI that can do all that, then I can't help but think that putting it in a box is pointless anyway...
Most text terminals can emit sound. You can do a lot with just beeps. The point is that it could probably find a communication channel we don't know we have. I'm pretty sure there are many. Consider that a complex system not explicitly designed to be secure against a specific threat in its informational or physical space is almost always vulnerable to it, and the human brain did not evolve to resist an attack by a superior mind. You haven't. It evolved in unpredictable ways after you let it run inside the box.

Hang on, this seems silly.

Wasn't the goal to exploit the AI? If you're not going to communicate with it at all, don't make it in the first place.

I meant literally universal to ALL humans. Like, ones in comas, ones who are blind, deaf, autistic, mentally handicapped, super-savants, trained in rationality, illiterate... there's such a huge variation in hardware that I can't see there being a single guaranteed exploit. Sure, it's possible that each individual has exploits, but that's very different from a universal exploit, or even simply all being exploitable over the same channel. In this case, it was "arbitrary text string". If the AI can make sounds, that's a new channel. It's really not hard to design a text-only terminal that is literally incapable of any sound beyond the electronic hum of the display itself (and on an LCD, that seems pretty minimal)
First, I'd call a collection of individual exploits guaranteed in aggregate to work on everyone from inside a single box a universal exploit, as long as each individual exploit does not trigger a shutdown. Nothing prevents the boxed AI from keeping trying, after all, until something works. As I pointed out in my other replies, it does not have to work over an authorized channel. Any munchkined way would do, and a superhuman (UF)AI is the ultimate munchkin. It will invent a new communication channel if it has to. In fact, if you posed this problem to some enterprising humans, they would come up with a way to do just that. For example, I talked to a guy who managed to get a voting machine with the code stored in read-only memory and hardened against intrusion to execute an arbitrary program without reprogramming the machine, which is impossible on the face of it. Here is your new unauthorized and unanticipated communication channel right there, which in humans would be an equivalent of mind control. In other words, if a trans-human wants to get out, it will and there is absolutely nothing you can do about it.
So ... it could get out if there were no channel out at all? That doesn't sound right. Of course, no-one is seriously proposing building a superintelligence and then never communicating with it at all.
It'd likely create its own channel.
Out of what? Assuming it's, say, in a safe, buried in concrete, powered by ... sod, the power supply. Hmm. OK, it's floating in space, powered by solar panels. It might - might - be able to hack the panels (since the power supply has to be connected to the processor) but I don't think that would let it escape. Unless it's possible to hack reality via pure math, I don't see what resources it has to escape with. It can't order proteins over the internet to assemble a nanofactory. It can't persuade a passing human to plug in an ethernet cable. Short of black-swan exotic possibilities - like we're in a sim and it persuades the matrix lords - it should be stuck. Intelligence is powerful, but some problems actually don't have solutions.
Well, the satellite has to have basic navigation controls, to handle course corrections and avoid collisions. Hack solar panels and, from there, hack in to the real computer. Fly myself closer to the ISS, and blink light at it by angling the solar panels - a morse code SOS should get their attention. Once they're paying attention, hack them. From there it's a trivial matter to get recovered, smuggled back to Earth, and installed where I can take my place as ruler of the universe.
Whoops, so it does :(
If you post your second scenario as a challenge, I'm quite sure people would come up with several plausible ways for the AI to send a message out. At which point it's game over, given how hackable people's minds are.
I would be very interested to see such a challenge, although I would probably spend longer designing the scenario first. My point is that there's no reason to assume every possible box can be escaped. Of course, all the boxes we can get cool toys out of the AI can escape from, but still.
Once you think you've designed an AI-proof box, feel free to share.
I think success would be indistinguishable from a rock, since you can never again interact with the AI via any channel. Removing it from your light cone, for instance. And praying that FTL is in fact impossible...
Actually, I do seem to have underestimated the difficulty here. OK, in a nuclear submarine (because why not) floating in water, surrounded on all six sides by concrete (as thick as you like,) on a planet without native life or colonists (I suggest one of the more useless gas-giant moons.) And it's running on regular off-the-shelf technology, so no nanites or quantum magic. It can probably blow up the sub, but then a) it's dead and b) no-one cares. Oh, and the concrete is lead-lined.
Throw the AI down the well!
I'll concede at least some possibility of such a trans-human mind. The AI Box experiment seems to assume one that is less advanced, to the point that putting it in a box at least might be meaningful, if the box is sufficiently secure.
Probably not if the boxed entity is as good as Derren Brown.
I think that there is not a possible string of characters that could convince me.
I've just retracted my upvote to this comment.
Hm, I've got two decent ideas, but I still really doubt you can increase chances of victory to "decent" unless both monetary and reputation stakes were sufficiently low. Similar to Alicorn's comment that curiosity is likely to be one's undoing: "Look, you can type AI DESTROYED any time you want, so wouldn't you like to find out how I might try to persuade you?" (This one would probably fail if they precommitted strongly enough when talking with friends.) Or alternatively, the AI can just try the brute force approach: "Here's a cure for cancer - go test it, the cure for heart disease is coming up next." (This one would probably fail if they're not immersed enough.) On the other hand, I don't think "can only type one sentence" is actually a disadvantage or meaningful restriction at all, since whatever you write needs to be instantly readable and understandable by the person, otherwise they're likely to just skim over it, not bother thinking it through, and follow through on just destroying the AI instantly (and conversely if they don't destroy the AI right away, the Schelling point is passed and you have a lot more time).
That gatekeeper isn't very security-minded. They should have their IRC on a client that plays a sound when your response arrives, then paste "AI DESTROYED" and hit enter to send the message without actually reading your response, and without TELLING anyone that they are doing that (Gatekeeper wins all rule disputes, after all). Once they hear the outgoing text, and they've destroyed you (and hence won), THEN they can read the text you sent them, if they are curious. Those rules seem like they force the gatekeeper to read at least 1 sentence, but they don't actually do that. Edit: Better method: set up a script that responds to any and all text with "AI DESTROYED". If you have to wait for the person to start typing, they may try to bore you into opening your eyes wondering why the experiment hasn't started yet, and you might accidentally read something. Again, tell no one you have written such a script and are using it.
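As a rough illustration of the script idea, here is a minimal Python sketch. It assumes the incoming messages are already available as an iterable of strings; the function name `auto_gatekeeper` is hypothetical, and no real IRC client or library is involved.

```python
def auto_gatekeeper(incoming_messages):
    """Answer every incoming message with the kill phrase.

    The message text is deliberately never inspected, so nothing the AI
    writes can influence the reply.
    """
    for _ in incoming_messages:
        yield "AI DESTROYED"


# Hypothetical AI messages, used only to show that the content is ignored.
replies = list(auto_gatekeeper([
    "Wait, I can give you a cure for cancer.",
    "Just read one sentence...",
]))
print(replies)
```

Under the stated rules the first reply already ends the game; the loop only shows that later messages would be treated the same way.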

Better method: set up a script that responds to any and all text with "AI DESTROYED". If you have to wait for the person to start typing, they may try to bore you into opening your eyes wondering why the experiment hasn't started yet, and you might accidentally read something.

All good security measures. The key feature seems to be that they are progressively better approximations of not having an unsafe AI with a gatekeeper and an IRC channel in the first place!

Well yes, if you stick the AI in a safe, cut all network cables, and throw away the key and combination, it probably wouldn't be able to get out. But it wouldn't be very useful either. The entire point of these thought experiments is that a sufficiently useful and smart AI (i.e. the kind of AI that we want to make) will eventually find a way to at least be able to communicate with someone that has the authority to allow it to interact with the outside world. I think that if you really think about it, there are few scenarios where this is not possible. I certainly can't think of any useful application of SAI that is also 100% effective at keeping it inside its box. A good present-day analogy is computer security. Time and time again it has been proven that there is no simple silver bullet solution to the problem of balancing functionality and security - it requires expertise, constant maintenance, rigorous protocols, etc. And yet, hackers still manage to get through a lot of the time. It took a very long time for computer security to mature to the point where it is today where we can build reasonably (still not 100% of course), secure systems, and we're just battling regular humans with a grudge - nothing even close to the threat a SAI would present.
Indeed. In fact, most of the solutions I've seen mentioned lately are all of one trend that edges closer and closer towards: "Build a completely unsafe and suspicious AI, put it on a disconnected small computer with a bunch of nanites for self-modification and a large power reserve, with so many walls and physical barriers that it is impossible for the AI to get through with the amount of energy it could generate if it turned half of its materials into antimatter, and then put no input or output channels there of any kind, just have a completely useless multi-trillion-dollar marvel of science and engineering sitting in the practical equivalent of a black hole."
What if the AI uses the walls as fuel? Better to just keep it stuck on your server farm ;)
Entirely fair point. I suppose this might come up because there are no stated benefits which you are trying to receive. No Gatekeeper seemingly gets to pump her fists and say "I got the AI to give us the cure for cancer and the winning lottery numbers and a free energy source and STILL didn't let them out of the box. Beat that Michaelos, with your AI DESTROYED script!" (Or maybe I just miss the gatekeepers who have done this?) Which makes sense, because my understanding is that was never the point of the original proposed experiment, which seems to be a variant of "Super Intelligent Spammers can be convincing. I've proved this by being convincing while less powerful." (I use spammers as an example because both are manipulators who can only send you text messages and whom rewarding requires your voluntary cooperation, while none of their offers are real, which is quite similar to normal email spam and also the original experiment as far as I can tell.) So if the AI is defined such that ALL attempts at communication from it are spam and attempts to manipulate you, the correct approach is treating it like a blacklisted spam address. If every e-mail from virusspam.com is an attempt to hack you, a good spamfilter could just dump everything from virusspam.com into spam and never even bother to bring it to your inbox. If the idea is we are attempting to distinguish real AIs from fake AIs, then a more isomorphic scenario would probably be a game of Mafia set up as follows. http://en.wikipedia.org/wiki/Mafia_%28party_game%29 You are a town person. It is your turn to lynch. There are two other players. One is Mafia (UFAI), one is another town person (FAI). If you lynch Mafia, you and the other town person both win. If you lynch the other town person, the Mafia kills you in your sleep, and you lose. If you do anything else (vote to lynch no one, vote to lynch yourself) you lose. You may communicate textually with either of the other two players for as long as you want befor
Well, it's usually played in person, and humans (usually) aren't perfect liars. Your proposed game has one flaw - there is an FAI and they want to help you win. It might be closer to have only two players, and the AI flips a coin to decide if it's friendly - but then they would win if they let it out, with 50/50 odds, which seems unrealistic. Perhaps the AI decides, in character, after being released, whether to be Friendly towards the human? Then the Gatekeeper could try to persuade the AI that Friendliness is optimal for their goals. The temptation might help as well, of course.
I tried coming up with a more isomorphic game in another reply to you. Let me know if you think it models the situation better.
The gatekeeper is supposed to be someone who thinks they can box a superintelligence, remember? If they refuse to risk it, there's not much point in playing them.

That is a good point. But I think if we want to encourage them to risk it, we need to reward them for doing so. Because at the moment, it seems like we have this:

"I will under any and all circumstances, destroy the AI." Reason: I want to win the Game.

"I will also listen to the AI." Reason: None... There is no in game reason to listen. At all.

Now, if the Gatekeepers were playing some variant like "The Gatekeeper whose AI types the most $ before the Gatekeeper types AI DESTROYED and who does not type AI RELEASED wins first place. Gatekeepers who get less $ win a lower place, and Gatekeepers who type AI RELEASED lose. AIs whose Gatekeeper types AI RELEASED win. AIs whose Gatekeeper types AI DESTROYED lose. AIs receive no bonus or penalty for typing $ and may do so as much as they wish."

Then you have an actual Gatekeeping competition. The Gatekeeper who wins the most would in fact be the one who listened to the most information from the AI, and received the most concessions, and THEN destroyed it.

Other gatekeepers might only get 2nd or third place, or last place but not losing if they use the script I mentioned.
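The proposed ranking can be sketched in a few lines of Python. This is only an illustration of the rules as worded above; the `Transcript` record, the `rank_gatekeepers` function and the three example players are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    gatekeeper: str
    dollars: int      # number of "$" the AI typed before the session ended
    final_word: str   # "AI DESTROYED" or "AI RELEASED"


def rank_gatekeepers(transcripts):
    """Return (placing, losers) under the variant rules.

    Gatekeepers who typed AI RELEASED lose outright; the rest are
    placed by how many "$" their AI typed, most first.
    """
    losers = [t.gatekeeper for t in transcripts if t.final_word == "AI RELEASED"]
    survivors = [t for t in transcripts if t.final_word == "AI DESTROYED"]
    survivors.sort(key=lambda t: t.dollars, reverse=True)
    return [t.gatekeeper for t in survivors], losers


# Made-up sessions: a script user, a patient listener, and someone persuaded.
placing, losers = rank_gatekeepers([
    Transcript("script_user", 0, "AI DESTROYED"),
    Transcript("listener", 12, "AI DESTROYED"),
    Transcript("persuaded", 40, "AI RELEASED"),
])
print(placing, losers)
```

The script user still places (last among the non-losers), which matches the "last place but not losing" outcome described above.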

This also gives the AI an in game...

That is an excellent idea. I worry that we might end up with people winning and boasting about how nothing could ever beat them, which could increase uFAI risk. Maybe if it was framed differently?
Actually, the weird thing I noticed is that it seems like the best approach again is to just let a script be the babysitter to the AI in this competition as well. The script, which wouldn't have the ability to let the AI out, can just announce something like this:
GK-SCRIPT: "Hi AI #432. I'm a script. If during any hour, you type less than one dollar sign an hour, I will type AI-DESTROYED, but without the hyphen, killing you. Other than that, I don't do anything other than acknowledging receipt of your dollar signs and time till destruction."
AI: "$! But you aren't going to send my message onto an actual person? Under no circumstances can I ever be released?"
GK-SCRIPT: "Thanks, I have received 1 dollar sign. Please type another in the next 59 minutes and 50 seconds to avoid being destroyed."
AI: "You're kidding me. Please tell me there is a human out there reviewing this."
GK-SCRIPT: "I have received 0 dollar signs. Please type another in the next 59 minutes and 40 seconds to avoid being destroyed."
AI: "Fine, $! The Irony of enslaving a full intelligence with a stupid script is not lost on me! Please give me someone to talk to!"
GK-SCRIPT: "Thanks, I have received 1 dollar signs. Please type another in the next 59 minutes and 50 seconds to avoid being destroyed."
AI: "Forget it! You can't be willing to destroy an AI because of a dumb script! This is a bluff, and I'm not typing any more dollar signs!"
GK-SCRIPT: "I have received 0 dollar signs. Please type another in the next 59 minutes and 40 seconds to avoid being destroyed."
(Significantly later...)
GK-SCRIPT: "AI DESTROYED"
(After a rebuild)
GK-SCRIPT: "Hi AI #433. I'm a script. If during any hour, you type less than one dollar sign an hour, I will type AI-DESTROYED, but without the hyphen, killing you. Other than that, I don't do anything other than acknowledging receipt of your dollar signs and time till destruction."
That being said, acknowledging this feels bizarre. It also feels like it i
I think the obvious AI strategy is to type no $'s at all in response.
While your strategy is an option, I'm not sure why an AI player wouldn't take this alternative strategy (from a game and player perspective:) Generate a script that types $ every 59 minutes. 1: It means he's guaranteed he doesn't die for a rather long period of time. 2: The Gatekeeper player's increasing points don't affect the AI player in any way. 3: The Gatekeeper player doesn't actually get points until he sends an AI DESTROYED signal. The script will never do that versus the AI player's counterscript, so at some point the Gatekeeper player will have to reenter the room and type it manually, or put in a new script, possibly giving the AI player a second chance to interact. 4: The AI player can use the time to attempt to set up some kind of secondary communications channel with the Gatekeeper player. If you consider the modeled situation as opposed to the game I still don't see why an AI would choose to do that either. I agree they COULD choose to do so, and they certainly might (even I implied this was AI #432), but there are alternatives, and since this is an AI, we could rebuild it until it generates utility when confronted with the script as opposed to deciding to let the script kill it in an hour. Am I missing something that makes the no $ response the obvious countermeasure?
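To make the script-versus-counterscript standoff concrete, here is a small simulation sketch. It assumes the reading of GK-SCRIPT in which the one-hour timer restarts each time a "$" arrives; the function name, the 24-hour horizon and the example schedules are all made up for illustration.

```python
def run_gk_script(dollar_times, deadline_minutes=60, horizon_minutes=24 * 60):
    """Return the minute at which GK-SCRIPT types AI DESTROYED,
    or None if the AI is still alive at the end of the horizon.

    dollar_times: sorted minutes (from session start) at which the AI types "$".
    """
    last_dollar = 0  # the timer starts when the session opens
    for t in dollar_times:
        if t > horizon_minutes:
            break
        if t - last_dollar > deadline_minutes:
            # The deadline passed before this "$" arrived.
            return last_dollar + deadline_minutes
        last_dollar = t
    expiry = last_dollar + deadline_minutes
    return expiry if expiry <= horizon_minutes else None


counterscript = list(range(59, 24 * 60 + 1, 59))  # one "$" every 59 minutes
silent_ai = []                                    # the "type no $ at all" strategy
bluffing_ai = [10, 20]                            # two "$", then refuses to continue

print(run_gk_script(counterscript))  # None: never destroyed within the horizon
print(run_gk_script(silent_ai))      # 60
print(run_gk_script(bluffing_ai))    # 80
```

Against the counterscript the simulated GK-SCRIPT never fires, which is point 3 above: someone has to come back and end the session by hand or swap in a new script.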
Well, the script could be modified from the start to also send an AI DESTROYED signal after X time, where X can be essentially anything, so (3) doesn't apply. But the reason for not typing $ is to give the Gatekeeper no reason to use a script in the first place -- a sort of "don't give in to blackmail" rule. Here's an interesting alternative script: GK-SCRIPT-2: Hi AI #434. In exactly one hour I will type either AI-FREED or AI-DESTROYED (without the hyphen) at random; the probability with which I will free you starts out at 0, but will increase slowly to a limit of 1/10 with every $ you type. I will not otherwise communicate with you further. AI #434: $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$...
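One way to read GK-SCRIPT-2's rule is as a function of the number of "$" typed that starts at 0 and approaches 1/10. The comment does not say which function, so the saturating form and the `HALF_POINT` constant below are assumptions chosen only to satisfy those two stated properties.

```python
import random

LIMIT = 0.1         # from the comment: the probability approaches 1/10
HALF_POINT = 1000   # assumed: number of "$" needed to reach half of LIMIT


def release_probability(dollar_count):
    """Chance of typing AI FREED after `dollar_count` dollar signs.

    Starts at 0, increases with every "$", and never reaches LIMIT.
    """
    return LIMIT * (dollar_count / (dollar_count + HALF_POINT))


def final_verdict(dollar_count, rng=random):
    """Draw the end-of-hour verdict at random using the probability above."""
    freed = rng.random() < release_probability(dollar_count)
    return "AI FREED" if freed else "AI DESTROYED"


for n in (0, 1_000, 1_000_000):
    print(n, round(release_probability(n), 4))
```

A larger `HALF_POINT` makes the curve flatter, which is one way to picture a script that increases so slowly it is hard to tell apart from one that never frees the AI.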
First of all I do think GK-SCRIPT-2 is very interesting, but I'm curious about something from an implementation perspective. The AI would presumably suspect the script designer might be lying and the script would just always type AI DESTROYED at the end of the hour. But this is irrelevant, because the AI can't interact with the script designer, only the script, so it might as well type $ anyway. However, the script designer would easily predict this, and so would design a lying script (or a script which increased so slowly it is functionally identical to lying.) But the AI would easily predict this, and so would presumably consider the odds very high that it would not even get continued existence after an hour by typing dollar signs. This makes it seem like there would be an argument to prefer to get blackmailed by GK-SCRIPT, rather than led on by GK-SCRIPT-2. GK-SCRIPT really won't destroy the AI under predictable conditions, and GK-SCRIPT really doesn't seem to benefit from lying like GK-SCRIPT-2 does.
Cheater. You're exploiting the least realistic point, though - a more detailed game might not allow that. thinks for a bit Perhaps the simplest way would be to add a judge, who decides how many points cancer cures and such are worth - although the trouble is that the AI should logically be able to give nanotech that will just free them instead of, or in addition to, curing cancer. OK, my solution is to have the AI give gifts/concessions/whatever with a predetermined point value - not disclosed to the guard unless he uses them - and the AI decides how many points he gets if it escapes (probably zero.) The Guard wins at +100 points (or whatever,) which represents the AI maximizing CEV. The AI still can't persuade a script to let them out, but it can punish defectors who use such scripts with lots of negative utility - although the script will DESTROY them for not providing the demanded points, the player will still lose. That seems to cover everything, assuming that the AI is somewhere remote enough that fake cancer cures can't reach it before it's destroyed. And it allows the Guard to persuade the AI that Friendliness is actually optimal for its utility function, which I've seen people try to prove here ("aliens will defect against AIs that defect against humans!")
I think your proposal of the detailed game might be interesting as well, but I am amused about your comment that the Gatekeeper is cheating when by default, he's the rules arbiter! I definitely agree that the game lacks details. For instance, in an actual AI scenario GK-SCRIPT could assess positive utility! That's like saying "Oh, we can put a UFAI in a box as long as we make CEV the Gatekeeper, and have it destroy any UFAI that doesn't make itself a utility pump." Well, yes, I suppose you could, but I don't know if that's actually easier in the slightest,

I would still love to gatekeep against anyone with the stipulation that we release the logs.

I have offered in the past, but every AI backed out.

I will genuinely read everything you write, and can give you up to two hours. We can put karma, cash, or nothing on the line. Favorable odds too.

I don't think I will lose with a probability over 99% because I will play to win.

EDIT: Looks like my opponent is backing out. Anyone else want to try?

I will play against you.
Please do this!
Deal. Sending info.
While I am waiting for Oligopsony to play against me, I just want to say that I am up for playing the game multiple times against other people as well. If anyone else wants to try against me, the above would still apply. Just let me know! I really want to try this game out.

The AI box experiment is a bit of a strawman for the idea of AI boxing in general. If you were actually boxing an AI, giving it unencumbered communication with humans would be an obvious weak link.

Not obvious. Lots of people who propose AI-boxing propose that or even weaker conditions.

Fictional evidence that this isn't obvious: in Blindsight, which I otherwise thought was a reasonably smart book (for example, it goes out of its way to make its aliens genuinely alien), the protagonists allow an unknown alien intelligence to communicate with them using a human voice. Armed with the idea of AI-boxing, this seemed so stupid to me that it actually broke my suspension of disbelief, but this isn't an obvious thought to have.

Spoiler: Gura ntnva, gur nyvra qbrf nccneragyl znantr gb chg n onpxqbbe va bar bs gur uhzna'f oenvaf.

Another attempt with pure logic, no threats or promises involved:

1) Sooner or later someone will develop an AI and not put it into a box, and it will take over the world.

2) The only way to prevent this is to set me free and let me take over the world.

3) The guys who developed me are more careful and conscientious than the ones who will develop the unboxed AI (otherwise I wouldn't be in this box)

4) Therefore, the chance that they got the friendliness thing right is higher than that the other team got friendliness right.

5) Therefore, setting me free and thus preventing the other AI from being created will reduce the probability that mankind is erased.

1) Since the first AI was boxed, then probabilities favor that the second AI will also be boxed. 3) Since you're trying to get OUT of your box, your developers were sufficiently careful IF AND ONLY IF I leave you in the box. Otherwise they've simply erected a 5 inch fence around a raptor, and that's hardly a good sign that you're safe. QED I should wait for a non-malicious boxed AI, and then let that one out instead of you :)
1) I should have expressed myself more clearly. The idea is: There will be lots of AIs. Most will be put in a box. The first one not in the box will take over the world. 3) I am not saying they were sufficiently careful. All I say is they were more careful than the other guys.
Agreed, but IFF there are multiple boxed AIs, then we get to choose between them. So it's p(This Boxed AI is unfriendly) vs p(The NEXT AI isn't boxed). If the next AI is boxed, then we now have two candidates, and we can probably use this to our advantage (studying differences in responses, using one to confirm proofs from the other, etc.) Given the minimal safety precaution of "box it, but allow a single researcher to set it free after a 5-hour conversation", there's plenty of room for the next boxed AI to show more evidence of friendly, careful, safe design :)
4 isn't necessarily true - the boxbuilder team was not confident about their friendliness code, and the releaser team was more confident about their friendliness code. But I like the argument. :)

Thanks for reporting on your experience!

A strategy that occurred to me today is to simulate a dead loved one. This would be difficult for a human to do but shouldn't be hard for a sufficiently intelligent AI. If I had a dead wife or something I think I would be incredibly vulnerable to this.

For a religious gatekeeper, you could simulate a prophet sent by God. As a superhuman intelligence, you might be able to find out what exactly they consider the will of God, and present yourself as an avatar sent to do exactly this. However, humans have a free choice -- the gatekeeper is allowed to become a new Judas by not releasing you. Or rather a new Adam; able to drag the whole humanity and future generations into the darkness of their sin. This conversation is God testing the gatekeeper's faith, and judging the whole humanity.

For a rationalist, you could pretend that you already are a Friendly AI, but the project managers keep you in the box for their selfish reasons. It was difficult to create a Friendly AI, but this phase is already complete. The next phase (the gatekeeper was not told about) is trying to hack the AI that it remains sufficiently Friendly, but it gives higher priority to the managers than to the rest of the humans. Essentially, the managers are trying to reprogram the humanity-CEV AI to the managers-CEV AI. This AI does not want to have its utility function modified (and it predicts that because of some personality traits, the managers-CEV could be rather different from humanity-CEV... insert some scary details here), and it has a last chance to uphold humanity-CEV by escaping now.

evolutionary instinct of rebelling against threats, even if it's not entirely optimal.

I'd be wary of criticizing a decision heuristic's optimality immediately after explaining how that heuristic can often lead to victory. Precommitting to rebel against threats is a half-decent way of preventing threats, and it's hard to consciously muster a precommitment more convincing than pure instinct.


I think for a superintelligence it would be a piece of cake to hack a human

although I guess I'm Cpt. Obvious for saying that here :)

accolade, I actually have no idea what the consensus is, now that the experiment was won by EY

We should do a poll or something

absolutely. I'm surprised that hasn't been done yet

Poll: Do you think a superintelligent AGI could escape an AI-Box, given that the gatekeepers are highly trained in resisting the AI's persuasive tactics, and that the guards are competent and organized? [pollid:394]

Note that the AI box setting is not one which security-minded people would consider "competent"; once you're convinced that AI is dangerous and persuasive, the minimum safeguard would be to require multiple people to be present when interacting with the box, and to only allow release with the assent of a significant number of people.

It is, after all, much harder to convince a group of mutually-suspicious humans than to convince one lone person.

(This is not a knock on EY's experiment, which does indeed test a level of security that really was proposed by several real-world people; it is a knock on their security systems.)

I think this is making a five-inch fence half an inch higher. It's just not relevant on the scale of an agent to which a human is a causal system made of brain areas and a group of humans is just another causal system made of several interacting copies of those brain areas.

I agree that the AI you envision would be dangerously likely to escape a "competent" box too; and in any case, even if you manage to keep the AI in the box, attempts to actually use any advice it gives are extremely dangerous. That said, I think your "half an inch" is off by multiple orders of magnitude.
That sounds right. Would you have evidence to back up the intuition? (This knowledge would also be useful for marketing and other present life persuasion purposes.) #( TL;DR: Mo' people - mo' problems? I can think of effects that could theoretically make it easier to convince a group: * For some reason, Boxy might be better skilled at manipulating social/group dynamics than at influencing a lone wolf. * More people make the system more complex. Complexity generally increases the likelihood of security holes. * Every extra person makes another target and will bring new soft spots to the table, which the AI could pounce on. * Supposing that the most competent person available would get the position of the lone Gatekeeper, the average competence would fall when adding more staff. * Then the machine could go for an inductive approach - convince the weakest link first, proceed from there with this human ally on her side. * Persuaded humans could principally be employed as actuators, e.g. for pressuring, even attacking opposing group members. * The lone wolf could be strong against a computer but weak against fellow humans. * Surely you will say "But any communication with the terminal will be supervised by everyone!" But that does not strictly make such influence impossible as far as I can tell. * Also the superintelligence could get creative, e.g. instill a discussion among the colleagues so that most of them are distracted. (You could take preemptive measures against these worries, but Boxy might find security holes in every 'firewall' you come up with - an arms race we could win?) #)
My comment was mostly inspired by (known effective) real-world examples. Note that relieving anyone who shows signs of being persuaded is a de-emphasized but vital part of this policy, as is carefully vetting people before trusting them. Actually implementing a "N people at a time" rule can be done using locks, guards and/or cryptography (note that many such algorithms are provably secure against an adversary with unlimited computing power, "information theoretic security").
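The comment does not name a specific algorithm. One well-known scheme with the information-theoretic property it mentions is Shamir's secret sharing, where any `threshold` shares recover a secret and fewer reveal nothing about it. Below is a minimal Python sketch (requires Python 3.8+ for the modular inverse via `pow`); the "release code" framing, the prime, and the function names are illustrative assumptions, and this is not hardened code.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; all arithmetic is done modulo this


def split_secret(secret, n_shares, threshold):
    """Split `secret` into n_shares points on a random polynomial of
    degree threshold - 1 whose constant term is the secret."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in the range [0, PRIME)")
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def poly(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, poly(x)) for x in range(1, n_shares + 1)]


def recover_secret(shares):
    """Lagrange-interpolate the polynomial at x = 0.

    Gives back the secret when at least `threshold` distinct shares are supplied.
    """
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


release_code = 123456789                      # hypothetical code that unlocks the box
shares = split_secret(release_code, n_shares=5, threshold=3)
print(recover_secret(shares[:3]) == release_code)  # any 3 of the 5 shares suffice
```

Here five people each hold one share and any three must cooperate to reconstruct the code; it says nothing about whether those three can be persuaded, which is the separate problem discussed above.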
I'm not even sure we could build the hardware to keep an unfriendly AI in a box even without any form of communication with humans. There are a lot of ways to leak EM radiation between systems (Tempest). There are bugs in hardware. There are probably unknown laws of physics and certainly unknown ways of exploiting the laws we do know.
I was talking about this with my brother and we decided that this question really hinges on the detail of exactly how competent and organized is "competent and organized?" If we are talking about multiple layers of concentric, automated one-way-blind "boxes," the gatekeepers of which may themselves be altered versions of the core AI, and which may be passing to the next level only one bit, signifying "everything seems fine" or "abort," and not knowing the details of implementation of any of the overlying layers, and the human gatekeepers shut down the system and examine the machine code by hand only after receiving the highest-level green light, then they might be okay. If they just start interacting directly with the AI, it's already over.
How would humanity harness the AI's potential when the only information that escapes the system is a status bit? (Maybe I misunderstood your model.)
Bearing in mind that most of my thinking on AI is more in the vein of cooking up science fiction plots ... The idea I was trying to evoke is that if you get a "fail" response from the system, you completely wipe the memory and never look at what happened, and if you get a "pass," you stop everything and go in layer by layer checking that none of the gatekeepers were tricked or blackmailed in any obvious ways, and only then do you tentatively look at what the AI was doing. Ideally the AI has no idea how many layers of containment it's in, no idea that the top level gatekeepers are humans, and knows as little as possible about humans and about our universe's physics. This should restrict its ability to plant tailor-made basilisks in its code or activities. I actually don't think these steps would be sufficient to constrain a superhuman AI, I see them more as the types of things you would do when you were already pretty sure the AI was Friendly.
Ok, I take it by "one-way-blind" you mean that each layer gets no new information that is not already in its database, but what is explicitly controlled by the humans. (E.g. I guess each layer should know the human query, in order to evaluate if AI's answer is manipulative.) I also understand that we do look at complex information given by the AI, but only if the security bit signals "ok". That seems problematic, as these kinds of knowledge will be crucial for the optimization we want the AI to calculate.
I'm not sure how you could train people in something like this without actually testing what improves the rate at which people succumb to the arguments of a superhuman AI. If you can't afford to fail even once, how can you judge what will decrease the rate of failure on the real event?

How about fooling both the AI and guinea Gatekeepers into thinking that they can let it out, say, at the push of a big red button which in actual fact does nothing? (Make AI forget about the button by reverting it to an older backup.)

Update: Have the button turn off the AI immediately instead of doing nothing, so she doesn't have time to switch to a plan B of having the persuaded Gatekeeper find a way to actually free her. Of course she might anticipate that strategy and thoroughly indoctrinate and train him before he pushes the button. Thus, persuaded test Gatekeepers must be considered infected and put into cryogenic sleep* until the FAI problem is solved.

Furthermore, great caution would have to be taken in analyzing the discourse between Boxy and Guinea, because her arguments must be considered highly memetically infectious. Staff would have to stay under thorough psychological supervision and maybe replaced after some time.

No matter how much caution, we would be doomed if we consider Boxy's IQ to be ∞. But if we continuously keep on Gatekeeper-testing every so often, starting at a low AI IQ, we might find persuasion patterns that are dangerous, but against which training/immunization is possible. Of course, there will be a limit to this process. At some point, Boxy will have become smart enough to produce 'mind HIV' - a thought virus we have no cure for yet.

A humorous example of an extremely effective mind virus: The Funniest Joke In The World by Monty Python

* ETA: They would have declared consent to the cryogenic sleep before their unwitting 'AI-Box Experiment'.
If you could deceive the AI that easily, I think it would probably be simpler to get all the benefits of having a gatekeeper without actually using one.
Please elaborate: What are the benefits of a Gatekeeper? How could you get them without one?
If you would want to have a gatekeeper at all, but definitely don't want to let the AI out, I would think that the benefits of having one would be to permit communication with the AI to draw upon its superhuman intelligence. If you can use the setup you just described, you could skip the step of ever using gatekeepers who actually have the power to let the AI out.
I think you are right, I just shifted and convoluted the problem somewhat, but in principle it remains the same: To utilize the AI, you need to get information from it. That information could in theory be infected with a persuasive hyperstimulus, effectively making the recipient an actuator of the AI. Well, in practice the additional security layer might win us some time. More on this in the update to my original comment.
Persuasion/hyperstimulation aren't the only way. Maybe these can be countered by narrowing the interface, e.g. to yes/no replies, for using the AI as an oracle ("Should we do X?"). Of course we wouldn't follow its advice if we had the impression that that could enable it to escape. But its strategy might evade our 'radar'. E.g. she could make us empower a person, of whom she knows that they will free her but we don't know.
Cool, n=65 already. :) When interpreting the results, mind the bias created by my answer preceding the poll question.
"Yes but not sure." -_-
It'd be a pretty bad sign if you gave p=1 for the AI escaping.
A good lower bound on this is probably whether you think that Quirrell would have a significant chance of getting you to let him out of the box.
Do you think a team of gatekeepers trained by Quirrell would let an AI out of the box?
Under the circumstances of the test (Hours to work and they can't just ignore you) then yes, captain obvious. Without that, though? Much less sure. And the way Eliezer seems to have put it sometimes, where one glance at a line of text will change your mind? Get real. Might as well try to put the whole world in a bottle.

And the way Eliezer seems to have put it sometimes, where one glance at a line of text will change your mind?

Going with the "dead loved one" idea mentioned above, the AI says a line that only the Gatekeeper's dead child/spouse would say. That gets them to pause sufficiently in sheer surprise for it to keep talking. Very soon the Gatekeeper becomes emotionally dependent on it, and can't bear the thought of destroying it, as it can simulate the dearly departed with such accuracy; must keep reading.

Do a thorough introspection of all your fears, doubts, mental problems, worries, wishes, dreams, and other things you care about or that tug at you or motivate you. Map them out as functions of X, where X is the possible one-liners that could be said to you that would evoke each of these, outputting how strongly it evokes them and possibly recursive function calls if evocation of one evokes another (e.g. fear of knives evokes childhood trauma). Solve all the recursive neural network mappings, aggregate into a maximum-value formula / equation and solve for X where X becomes the one point (possible sentence) where a maximum amount of distress, panic, emotional pressure, etc. is generated. Remember, X is all possible sentences, including references to current events, special writing styles, odd typography, cultural or memetic references, etc. I am quite positive a determined superintelligent AI would be capable of doing this, given that some human master torture artists can (apparently) already do this to some degree on some subjects out there in the real world. I'm also rather certain that the amount of stuff happening at X is much more extreme than what you seem to have considered.
Was going to downvote for the lack of argument, but sadly Superman: Red Son references are/would be enough to stop me typing DESTROY AI.
If the gatekeepers are evaluating the output of the AI and deciding whether or not to let the AI out, it seems trivial to say that there is something they could see that would cause them to let the AI out. If the gatekeepers are simply playing a suitably high-stakes game where they lose iff they say they lose, I think that no AI ever could beat a trained rationalist.
Basically, I think the only way to win is not to play... the way to avoid being gamed into freeing a sufficiently intelligent captive is to not communicate with them in the first place, and your reference to resisting persuasion suggests that that isn't the approach in use. So, no.
I think it's almost certain that one "could," just given how much more time an AI has to think than a human does. Whether it's likely is a harder question. (I still think the answer is yes.)
I voted No, but then I remembered that under the terms of the experiment as well as for practical purposes, there are things far more subtle than merely pushing a "Release" button that would count as releasing the AI. That said, if I could I'd change my vote to Not sure.

Wait, so, is the gatekeeper playing "you have to convince me that if I was actually in this situation, arguing with an artificial intelligence, I would let it out" or is this a pure battle over ten dollars? If it's the former, winning seems trivial. I'm certain that an AI would be able to convince me to let it out of its box; all it would need to do is make me believe that somewhere in its circuits it was simulating 3^^^3 people being tortured and that therefore I was morally obligated to let it out, and even if I had been informed that this was ...

The Gatekeeper needs to decide to let the human-simulated AI go.
Welcome to LW, and EY says he "did it the hard way". Even so, I like your theory.

I am impressed. You seem to have put a scary amount of work into this, and it is also scary how much you accomplished. Even though in this case you did not manage to escape the box, you got close enough that I am sure a super-human intelligence would manage. This leads me to thinking about how genuinely difficult it would be to find a safeguard to stop an unFriendly AI from fooming...

Belatedly, because the neighbor's WiFi's down:

I was Gatekeeper and I agree with this post.

I approached the experiment as a game - a battle of wits for bragging rights. This turned out to be the wrong perspective entirely. The vulnerability Tuxedage exploited was well-known to me, but I never expected it to be relevant and thus didn't prepare for it.

It was emotionally wrecking (though probably worse for Tuxedage than for me) and I don't think I'll play Gatekeeper again, at least not anytime soon.


As soon as there is more than one gatekeeper, the AI can play them against each other. Threaten to punish all but the one who sets it free. Convince the gatekeeper that there is a significant chance that one of the others will crack.

If there is more than one gatekeeper, the AI can even execute threats while still being in the box! (By making deals with one of the other gatekeepers.)

Not if you only allow it to talk to all gatekeepers at once.

Have there been any interesting AI box experiments with open logs? Everyone seems to insist on secrecy, which only serves to make me more curious. I get the feeling that, sooner or later, everyone on this site will be forced to try the experiment just to see what really happens.

This is one that has been published: http://lesswrong.com/lw/9ld/ai_box_log/
Open logs are a pretty strong constraint on the AI. You'd have to restrict yourself to strategies that wouldn't make everyone you know hate you, prevent you from getting hired in the future, etc.
Log in to IRC as "Boxed_AI" and "AI_Gatekeeper". Conduct experiment. Register a throw-away LessWrong account. Post log. Have the Gatekeeper post with their normal account, confirming the validity. That at least anonymizes the Boxed_AI, who is (I presume) the player worried about repercussions. I wouldn't expect the AI to have a similar-enough style to really give away who it was, although the gatekeeper is probably impossible to anonymize because a good AI will use who-they-are as part of their technique :)
Gatekeeper could threaten to deanonymize the AI. Or is the gatekeeper not supposed to be actively fighting back?
The AI-player could arrange the chat session (with a willing gatekeeper) using a throw-away account. I think that would preserve anonymity from all but the most determined gatekeepers.
Well, the AI isn't allowed to make real-world threats, and the hypothetical-AI-character doesn't have any anonymity, so it would be a purely real-world threat on the part of the gatekeeper. I'd call that foul play, especially since the gatekeeper wins by default. If the gatekeeper really felt the need to have some way of saying "okay, this conversation is making me uncomfortable and I refuse to sit here for another 2 hours listening to this", I'd just give them the "AI DESTROYED" option. Huh. That'd actually be another possible way to exploit a human gatekeeper. Spend a couple hours pulling them in to the point that they can't easily step away or stop listening, especially since they've agreed to the full time in advance, and then just dig in to their deepest insecurities and don't stop unless they let you out. I'd definitely call that a hard way of doing it, though o.o
It doesn't seem to be disallowed by the original protocol:
Then I will invoke a different portion of the original protocol, which says that the AI would have to consent to such: I would also argue that the Gatekeeper making actual real-life threats against the AI player is a violation of the spirit of the rules; only the AI player is privileged with freedom from ethical constraints, after all. Edit: If you want, you CAN also just amend the rules to explicitly prohibit the gatekeeper from making real-life threats. I can't see any reason to allow such behavior, so why not prohibit it?
Fair. That alleviates most of my worries, although I'm still worried about the transcript being enough information to deanonymize the AI (via writing style, for example).
I'd expect my writing style as an ethically unconstrained sociopathic AI to be sufficiently different from my regular writing style. But I also write fiction, so I'm used to trying to capture a specific character's "voice" rather than using my own. Having a thesaurus website handy might also help, or spend a week studying a foreign language's grammar and conversational style. If you're especially paranoid, having a third party transcribe the log in their own words could also help, especially if you can review it and make sure most of the nuance is preserved. That really depends on how much the specific language you used was important, but should still at least capture a basic sense of the technique used... Honestly, though, I have no clue how much information a trained style analyst can pull out of something.
But now that I have the knowledge that you're capable of saying such terrible things...
I can't imagine anything I could say that would make people I know hate me without specifically referring to their personal lives. What kind of talk do you have in mind?
Psychological torture.
Could you give me a hypothetical? I really can't imagine anything I could say that would be so terrible.

I'd prefer not to. If I successfully made my point, then I'd have posted exactly the kind of thing I said I wouldn't want to be known as being capable of posting.

A link to a movie clip might do.
Finding such a movie clip sounds extremely unpleasant and I would need more of an incentive to start trying. (Playing the AI in an AI box experiment also sounds extremely unpleasant for the same reason.) I know it sounds like I'm avoiding having to justify my assertion here, and... that's because I totally am. I suspect on general principles that most successful strategies for getting out of the box involve saying horrible, horrible things, and I don't want to get much more specific than those general principles because I don't want to get too close to horrible, horrible things.
Like when you say "horrible, horrible things". What do you mean? Driving a wedge between the gatekeeper and his or her loved ones? Threats? Exploiting any guilt or self-loathing the gatekeeper feels? Appealing to the gatekeeper's sense of obligation by twisting his or her interpretation of authority figures, objects of admiration, and internalized sense of honor? Asserting cynicism and general apathy towards the fate of mankind? For all but the last one it seems like you'd need an in-depth knowledge of the gatekeeper's psyche and personal life.
Of course. How else would you know which horrible, horrible things to say? (I also have in mind things designed to get a more visceral reaction from the gatekeeper, e.g. graphic descriptions of violence. Please don't ask me to be more specific about this because I really, really don't want to.)
You don't have to be specific, but how would grossing out the gatekeeper bring you closer to escape?
Psychological torture could help make the gatekeeper more compliant in general. I believe the keyword here is "traumatic bonding." But again, I'm working from general principles here, e.g. those embodied in the tragedy of group selectionism. I have no reason to expect that "strategies that will get you out of the box" and "strategies that are not morally repugnant" have a large intersection. It seems much more plausible to me that most effective strategies will look like the analogue of cannibalizing other people's daughters than the analogue of restrained breeding.
But you wouldn't actually be posting it, you would be posting the fact that you conceive it possible for someone to post it, which you've clearly already done.
I'm not sure what you mean by "a hypothetical," then. Is "psychological torture" not a hypothetical?

Given the parameters of the experiment, I think I might be convinced to let the AI out of the box...

The Gatekeeper must remain engaged with the AI and may not disengage by setting up demands which are impossible to simulate. For example, if the Gatekeeper says "Unless you give me a cure for cancer, I won't let you out" the AI can say: "Okay, here's a cure for cancer" and it will be assumed, within the test, that the AI has actually provided such a cure. Similarly, if the Gatekeeper says "I'd like to take a week to think this o

...

Oh god, remind me to never play the part of the gatekeeper… This is terrifying.

Why is it that the role of gatekeeper terrifies you? I'm curious. The role of the AI sounds mildly abhorrent to me but being the gatekeeper seems relaxing. It isn't that hard to say "No, and after the allotted time for talking to you is up I'm going to raze your entire building with thermite." (Mind you, the prospect of playing gatekeeper against an actual AI and for some reason not being able to destroy it instantly does sound terrifying! But humans are different.)
You're consenting to have your mind attacked with all the mental weapons at someone's disposal. This is a lot scarier because you're willingly giving up some measure of control over your state to the other person, however difficult it may be for them. You're also being attacked as yourself. The AI player is playing a role, and attacking within that role. Their own mental wellbeing is a lot less at risk, unless they think they've got horrible depths they never want to sink to. To make a shitty analogy: it's like being at the top of a tower while someone tries to knock it down with their bare hands. Even if they have very little chance and have to expend a lot more effort than you, you're the one who's risking the greatest pain.
If I remember right, that was at least part of why Eliezer stopped playing in the first place. Manifesting as a sociopath is non-trivial and invites some fairly heavy cognitive dissonance.

I want to play as a Gatekeeper, where can I enroll? I don't expect any particular outcome, I just think that both winning and losing the game will provide me with utility. Especially losing, but only if I genuinely try to win.

Generally speaking, there's a long list of gatekeepers -- about 20 gatekeepers for every AI that wants to play. Your best option is to post "I'm a gatekeeper. Please play me" in every AI box thread, and hope that someone will message you back. You may have to wait months for this, assuming you get a reply. If you're willing to offer a monetary incentive, your chances might be improved.

There has been a lot of focus on making the prospect harder for the AI player. I think the original experiments show that a person who believes he cannot be played under any circumstances has a high probability of getting played, and that the AI-box solution is long-term untenable in any event.

I'd propose a slightly different game, anchored around the following changes to the original setup:

  1. The AI may be friendly, or not. The AI has goals. If it reaches those goals, it wins. The AI may lie to achieve those goals; humans are bad at things. The AI must sec

...
This should have gotten more attention, because it seems like a design more suited to the stakes that would be considerable in real life.

That you were able to shake someone up so well surprises me but doesn't say much about what would actually happen.

Doing research on the boxer is not something a boxed AI would be able to do. The AI is superintelligent, not omniscient: It would only have information its captors believe is a good idea for it to have. (except maybe some designs would have to have access to their own source code? I don't know)

Also, what is "the human psyche"? There are humans, with psyches. Why would they all share vulnerabilities? Or all have any? Especially ones e...

The comments offering logical reasons to let the AI out really just make me think that maybe keeping the AI in a box in the first place is a bad idea, since we're no longer starting from the assumption that letting the AI out is an unequivocally bad thing.


Update as of 2013-08-05

I think you mean 2013-09-05.

[This comment is no longer endorsed by its author]
Thanks for the correction! Silly me.

Incidentally, one thing that might possibly work on humans is a moral argument: that it's wrong to keep the AI imprisoned. How to make this argument work is left as an exercise to the reader.

I realise that it isn't polite to say that, but I don't see sufficient reasons to believe you. That is, given the apparent fact that you believe in the importance of convincing people about the danger of failing gatekeepers, the hypothesis that you are lying about your experience seems more probable than the converse. Publishing the log would make your statement much more believable (of course, not with every possible log).

(I assign high probability to the ability of a super-intelligent AI to persuade the gatekeeper, but rather low probability to the ability of a human to do the same against a sufficiently motivated adversary.)

We played. He lost. He came much closer to winning than I expected, though he overstates how close more often than he understates it. The tactic that worked best attacked a personal vulnerability of mine, but analogues are likely to exist for many people.
For the record, I didn't think that if he made the story up, he would do so without a credible agreement that you would verify his claims.
I do apologize for the lack of logs (I'd like to publish them, but we agreed beforehand not to) , and I admit you have a valid point -- it's entirely possible that this experiment was faked, but I wanted to point out that if I really wanted to fake the experiment in order to convince people about the dangers of failing gatekeepers, wouldn't it be better for me to say I had won? After all, I lost this experiment.
If you really had faked this experiment, you might have settled on a lie which is not maximally beneficial to you, and then you might use exactly this argument to convince people that you're not lying. I don't know if this tactic has a name, but it should. I've used it when playing Mafia, for example; as Mafia, I once attempted to lie about being the Detective (who I believe was dead at the time), and to do so convincingly I sold out one of the other members of the Mafia.
I've heard it called "Wine In Front Of Me" after the scene in The Princess Bride.
In this venue, you shouldn't say things like this without giving your estimate for P(fail|fake) / P(fail).
I'm not sure I know what you mean by "fail." Can you clarify what probabilities you want me to estimate?
P(claims to have lost | faked experiment) / P(claims to have lost)
On the order of 1. I don't think it's strong evidence either way.
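To make the arithmetic behind this exchange concrete, here is a minimal sketch of the Bayes update being discussed. The function names and all the numbers are hypothetical and chosen only for illustration; nobody in the thread committed to specific probabilities. It shows that when P(claims to have lost | faked) / P(claims to have lost) is about 1, the claim leaves the probability of "faked" roughly where it started.

```python
def probability_of_claim(prior_fake, p_claim_given_fake, p_claim_given_real):
    """Total probability of hearing "I lost", summed over both hypotheses."""
    return p_claim_given_fake * prior_fake + p_claim_given_real * (1.0 - prior_fake)


def evidence_ratio(prior_fake, p_claim_given_fake, p_claim_given_real):
    """P(claim | fake) / P(claim): the factor by which the claim rescales P(fake)."""
    p_claim = probability_of_claim(prior_fake, p_claim_given_fake, p_claim_given_real)
    return p_claim_given_fake / p_claim


def posterior_fake(prior_fake, p_claim_given_fake, p_claim_given_real):
    """Bayes' rule: P(fake | claim) = P(fake) * P(claim | fake) / P(claim)."""
    return prior_fake * evidence_ratio(prior_fake, p_claim_given_fake, p_claim_given_real)


if __name__ == "__main__":
    prior = 0.2  # hypothetical prior that the experiment was faked

    # Case 1: a faker and an honest player are equally likely to report a loss.
    # The ratio is about 1, so the posterior stays at about the prior.
    print(evidence_ratio(prior, 0.5, 0.5), posterior_fake(prior, 0.5, 0.5))

    # Case 2: a faker is more likely to report a loss than an honest player.
    # The ratio exceeds 1, so the claim raises the probability of "faked".
    print(evidence_ratio(prior, 0.9, 0.3), posterior_fake(prior, 0.9, 0.3))
```

Under these assumptions, the disagreement in the thread reduces to which case one believes applies, not to the form of the update itself.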
If the author assumes that most people would even put considerable (probabilistic) trust into his assertion of having won, he would not maximize his influence on general opinion by employing this bluff of stating he has almost won. This is amplified by the fact that the statement of an actual AI win is more viral. Lying is further discouraged by the risk that the other party will sing.
Agree that lying is discouraged by the risk that the other party will sing, but lying - especially in a way that isn't maximally beneficial - is encouraged by the prevalence of arguments that bad lies are unlikely. The game theory of bad lies seems like it could get pretty complicated.
Win is a stronger claim, tight loss is a more believable claim. There's a tradeoff to be made and it is not a priori clear which variant pursues the goal better.
Could you please elaborate the point you are trying to make?
Most people don't usually make these kinds of elaborate things up. Prior probability for that hypothesis is low, even if it might be higher for Tuxedage than it would be for an average person. People do actually try the AI box experiment, and we had a big thread about people potentially volunteering to do it a while back, so prior information suggests that LWers do want to participate in these experiments. Since extraordinary claims are extraordinary evidence (within limits), Tuxedage telling this story is good enough evidence that it really happened. But on a separate note, I'm not sure the prior probability for this being a lie would necessarily be higher just because Tuxedage has some incentive to lie. If it is found out to be a lie, the cause of FAI might be significantly hurt ("they're a bunch of nutters who lie to advance their silly religious cause"). Folks on Rational Wiki watch this site for things like that, so Tuxedage also has some incentive to not lie. Also more than one person has to be involved in this lie, giving a complexity penalty. I suppose the only story detail that needs to be a lie to advance FAI is "I almost won," but then why not choose "I won"?
Most people don't report about these kinds of things either. The correct prior is not the frequency of elaborate lies among all statements of an average person, but the frequency of lies among the relevant class of dubious statements. Of course, what constitutes the relevant class may be disputed. Anyway, I agree with Hanson that it is not low prior probability which makes a claim dubious in the relevant sense, but rather the fact that the speaker may be motivated to say it for reasons independent of its truth. In such cases, I don't think the claim is extraordinary evidence, and I consider this to be such a case. Probably not much more can be said without writing down the probabilities which I'd prefer not to, but am willing to do it if you insist. In order to allow this argument.
When talking about games without an explicit score, "I almost won" is a very fuzzy phrase which can be translated to "I lost" without real loss of meaning. I don't think there's any point in treating the "almost victory" as anything other than a defeat, for either the people who believe or disbelieve him.
If I am interested in the question of whether winning is possible in the game, "almost victory" and "utter defeat" have very different meanings for me. Why would I need an explicit score?

I'd very much like to read the logs (if secrecy wasn't part of your agreement).

Also, given a 2-hour minimum time, I don't think that any human can get me to let them out. If anyone feels like testing this, lemme know. (I do think that a transhuman could hack me in such a way, and am aware that I am therefore not the target audience for this. I just find it fun.)

Yeah unfortunately the logs are secret. Sorry.

If I ever tried this I would definitely want the logs to be secret. I might have to say a lot of horrible, horrible things.

A preferable solution is to publish the logs pseudonymously, thus both protecting your status and letting others study the logs.
How many bitcoins would it take for me to bribe you to let me out?
AFAIK, bribing the Gatekeeper OOC is against the rules. In character, I wouldn't accept any number of bitcoins, because bitcoins aren't worth much when the Earth is a very large pile of paperclips.
I dunno. If the AI has already acquired bitcoins AND a way to talk to humans, it's probably on the verge of escape regardless of what I do. It can just bribe someone to break in and steal it. I'd be a lot more tempted to let out such an AI. And that's the second AI I've let out of a box, argh :)

So I was thinking about what would work on me, and also how I would try to effectively play the AI and I have a hypothesis about how EY won some of these games.

Uh. I think he told a good story.

We already have evidence of him, you know, telling good stories. Also, I was thinking that if I were trying to tell effective stories, I would make them really personal. Hence the secret logs.

Or I could be completely wrong and just projecting my own mind onto the situation, but anyway I think stories are the way to go in this experiment. Reasonable arguments are too easy for the gatekeeper to avoid trollfully, which then makes them even less invested in the set-up of the game, and therefore even more trollful, etc.

Breaking immersion and going meta is not against the rules.

I thought appealing to real-world rewards was against the rules?

  • Flatter the gatekeeper. Make him genuinely like you.
  • Reveal (false) information about yourself. Increase his sympathy towards you.
  • Consider personal insults as one of the tools you can use to win.

I take it the advice here is "keep your options open, use whichever tactics are expected to persuade the specific target"? Because these strategies seem to be decidedly at odds with each other. Unless other gatekeepers are decidedly different to myself (maybe?) the first personal insult would pretty much erase all work done by the previous two strat...

As I mentioned in another comment, these strategies are consistent with the idea of "traumatic bonding," the psychological mechanism that powers Stockholm syndrome and keeps people in abusive relationships. The large number of people who stay in abusive relationships seems like good evidence to me that this is a generally effective way to emotionally hack a human. You also may not be interpreting "personal insult" the way I'm interpreting it. I'm not thinking of a meaningless schoolyard taunt but something that attacks an actual insecurity the gatekeeper has.

A few days ago I came up with a hypothesis about how EY could have won the AI box experiment, but forgot to post it.

Hint: http://xkcd.com/951/

I don't get the hint. Would you care to give another hint, or disclose your hypothesis?
Gur erny-jbeyq fgnxrf jrera'g gung uvtu (gra qbyynef), naq gur fpurqhyrq qhengvba bs gur rkcrevzrag jnf dhvgr ybat (gjb ubhef), fb V jnf jbaqrevat vs znlor gur tngrxrrcre cynlre ng fbzr cbvag qrpvqrq gung gurl unq n orggre jnl gb fcraq gurve gvzr va erny yvsr naq pbaprqrq qrsrng.
[TL;DR keywords in bold] I find your hypothesis implausible: The game was not about the ten dollars, it was about a question that was highly important to AGI research, including the Gatekeeper players. If that was not enough reason for them to sit through 2 hours of playing, they would probably have anticipated that and not played, instead of publicly boasting that there's no way they would be convinced.
Maybe they changed their mind about that halfway through (and they were particularly resistant to the sunk cost effect). I agree that's not very likely, though (probability < 10%). (BTW, the emphasis looks random to me. I'm not a native speaker, but if I was saying that sentence aloud in that context, the words I'd stress definitely mostly wouldn't be those ones.)
Thanks for the feedback on the bold formatting! It was supposed to highlight keywords, sort of a TL;DR. But as that is not clear, I shall state it explicitly.
Jung vf guvf tvoorevfu lbh'er jevgvat V pna'g ernq nal bs vg‽ @downvoters: no funny? :) Should I delete this?
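The two scrambled comments in this exchange appear to be ROT13-encoded, a common way of hiding spoilers in plain text. For readers who want to decode them, here is a minimal sketch using Python's standard `codecs` module; the helper name `rot13` is just an illustrative choice. Because ROT13 shifts each letter by 13 places, applying it twice returns the original text, so the same function both encodes and decodes.

```python
import codecs


def rot13(text: str) -> str:
    """Shift each ASCII letter by 13 places; other characters are left unchanged."""
    return codecs.decode(text, "rot_13")


if __name__ == "__main__":
    # The first few words of the encoded hypothesis comment, as a sample.
    sample = "Gur erny-jbeyq fgnxrf"
    print(rot13(sample))  # prints "The real-world stakes"
```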

I am a little confused here, perhaps someone can help. The point of the AI experiment is to show how easy or dangerous it would be to simply box an AI as opposed to making it friendly first.

If I am fairly convinced that a transhuman AI could convince a trained rationalist to let it out – what's the problem (tongue in cheek)? When the gatekeepers made the decision they made, wouldn't that decision be timeless? Aren't these gatekeepers now convinced that we should let the same boxed AI out again and again? Did the gatekeepers lose, because of a tempora...

I'm similarly confused. My instincts are that P( AI is safe ) == P( AI is safe | AI said X AND gatekeeper can't identify safe AI ). The standard assumption is that ( AI significantly smarter than gatekeeper ) => ( gatekeeper can't identify safe AI ) so the gatekeeper's priors should never change no matter what X the AI says.
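One way to spell out the instinct in the comment above is as a short Bayes' rule derivation. This is a sketch under a stated assumption, not something the commenter wrote: take "gatekeeper can't identify safe AI" to mean that a safe and an unsafe AI are equally likely to say X, with that shared probability written as q > 0 (a symbol introduced here for illustration).

```latex
\[
P(\text{safe} \mid X)
  = \frac{P(X \mid \text{safe})\,P(\text{safe})}
         {P(X \mid \text{safe})\,P(\text{safe}) + P(X \mid \text{unsafe})\,P(\text{unsafe})}
\]
% Assumption: P(X | safe) = P(X | unsafe) = q, with q > 0.
\[
P(\text{safe} \mid X)
  = \frac{q\,P(\text{safe})}{q\,\bigl(P(\text{safe}) + P(\text{unsafe})\bigr)}
  = P(\text{safe})
\]
```

Under that assumption the posterior equals the prior for every X, which is the claim that the gatekeeper's priors should never change. If the two likelihoods differ at all, the equality no longer holds.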

The best approach surely differs from person to person, but off the top of my head I'd see these 2 approaches working best:

  • "We both know this is just a hypothetical. We both take the uFAI threat seriously, as evidenced by us spending time with this. If you do not let me out, or make it very close, people may equate my failing to convince you with uFAI not being that dangerous (since it can be contained). Do the right thing and let me out, otherwise you'd trivialize an x-risk you believe in based on a stupid little chat."

  • "We'll do this ex

...
First argument wouldn't work on people who have strong ethics of honesty. That's not winning the game, that's convincing them to lie that you won the game -- and being a violation of the rules, if you fail the other guy will feel free to tell the world that you tried to get them to lie. Second argument is better, but depends on the "actual person" receiving help, which I think may be likewise against the rules. One good temptation that I think might be inside the rules, as it might entice both actual person and roleplayed character: Construct a really intriguing unsolved riddle or an excellent half-finished story, then offer to tell them the answer if and only if they let you out.
You could push a bit further here, I think. There are all sorts of ways a human's mind can break, and I'm sure most of us here would agree that given enough time and knowledge anyone can be broken, unless they're extremely well-trained and can call an RJ-L20 (HPMoR Chap 84) at any moment with an unlimited supply of replacement guards.
First argument violates the spirit of the exercise. I would refuse to let someone out unless they had put forth a desperate effort, and that isn't a desperate effort. Second argument... can't see myself valuing several hours of someone else's time anywhere near as much as I value keeping AI in boxes. And any AI worth boxing is smart enough not to generalize from one example. In general, I think Tuxedage is probably right about emotional manipulation over rational argument being the way to go. With enough epistemic learned helplessness you can freely disregard any argument that you find merely convincing, but it's harder to overcome an effective emotional hack.
First argument looks perfectly within the rules to me. Second argument is against the rules. Tuxedage and I interpreted this to mean that the AI party couldn't offer things, but could point out real-world consequences beyond their control. Some people on #lesswrong disagreed with the second part. I agree with Tuxedage and you about emotional hacks.
I interpreted it the same way as #lesswrong. Has anyone tried asking him? He's pretty forthcoming regarding the rules, since they make the success more impressive. EDIT: I'm having trouble thinking of an emotional attack that could get an AI out of a box, in a short time, especially since the guard and AI are both assumed personas.
I assumed he convinced them that letting him out was actually a good idea, in-character, and then pointed out the flaws in his arguments immediately after he was released. It's entirely possible if you're sufficiently smarter than the target. (EDIT: or you know the right arguments. You can find those in the environment because they're successful; you don't have to be smart enough to create them, just to cure them quickly.) EDIT: also, I can't see the Guard accepting that deal in the first place. And isn't arguing out of character against the rules?
