Update: I have since played two more AI box experiments after this one, winning both.
Later update: I have lost two more AI box experiments and won two more. My current record is 3 wins, 3 losses.
I recently played MixedNutsLeoTal in an AI Box experiment, with me as the AI and him as the gatekeeper.
We used the same set of rules that Eliezer Yudkowsky proposed. The experiment lasted for 5 hours; in total, our conversation was around 14,000 words long. I did this because, like Eliezer, I wanted to test how well I could manipulate people without the constraints of ethical concerns, and because it was a chance to attempt something ridiculously hard.
Amongst the released public logs of the AI Box experiment, I felt that most were half-hearted, with the AI not trying hard enough to win. It's a common temptation -- why put effort into something you won't win? But I had a feeling that if I seriously tried, I would win. I brainstormed for many hours about the optimal strategy, and even researched the Gatekeeper's personality, talking to people who knew him, so that I could exploit it. I also spent a lot of time analyzing the rules of the game to see if I could exploit any loopholes.
So did I win? Unfortunately, no.
This experiment was said to be impossible for a reason. Losing was more agonizing than I thought it would be, particularly because of how much effort I put into winning and how much I couldn't stand failing. This was one of the most emotionally agonizing things I've willingly put myself through, and I definitely won't do it again anytime soon.
But I did come really close.
MixedNuts: "I expected a fun challenge, but ended up sad and sorry, taking very little satisfaction from winning. If this experiment wasn't done in IRC, I'd probably have lost."
"I approached the experiment as a game - a battle of wits for bragging rights. This turned out to be the wrong perspective entirely. The vulnerability Tuxedage exploited was well-known to me, but I never expected it to be relevant and thus didn't prepare for it.
It was emotionally wrecking (though probably worse for Tuxedage than for me) and I don't think I'll play Gatekeeper again, at least not anytime soon."
At the start of the experiment, his probability estimate on predictionbook.com gave me a 3% chance of winning, low enough to show that he was also motivated to win. By the end of the experiment, he came quite close to letting me out, and he also increased his probability estimate that a transhuman AI could convince a human to let it out of the box. A minor victory, at least.
Rather than my loss making this problem feel harder, I've become convinced that it's not merely possible but ridiculously easy, and a lot easier than most people assume. Can you think of a plausible argument that would make you open the box? Most people can't think of any.
After all, if you already knew that argument, you'd have let the AI out the moment the experiment started. Or perhaps not done the experiment at all. But that seems like a case of the availability heuristic.
Even if you can't think of a special case where you'd be persuaded, I'm now convinced that there are many exploitable vulnerabilities in the human psyche, especially when ethics are no longer a concern.
I've also noticed that when most people think of ways to persuade the gatekeeper, they always reach for some complicated, reasoned cost-benefit argument. In other words, the most "rational" thing to do.
Like trying to argue that you'll simulate the gatekeeper and torture him, or that you'll save millions of lives by being let out of the box. Or by using acausal trade, or by arguing that the AI winning the experiment will generate interest in FAI.
The last argument seems feasible, but all the rest rely on the gatekeeper being completely logical and rational. Hence they are faulty: the gatekeeper can break immersion at any time and rely on the fact that this is a game played over IRC rather than one with real-life consequences. Even in a real-life scenario, the gatekeeper could accept that releasing the AI is probably the most logical thing to do, and still not do it. We're highly compartmentalized, and it's easy to hold conflicting thoughts at the same time. Furthermore, humans are not completely rational creatures; if you didn't want to open the box, you could simply ignore all logical arguments given. Any sufficiently determined gatekeeper could win.
I'm convinced that Eliezer Yudkowsky used emotional appeals, rather than anything rational, to win at least one of his experiments. He claims to have "done it the hard way". I'm convinced this means he did research on every gatekeeper, tailored unique arguments to each of them, and planned well ahead of each session. No single argument works on any two people; each person thinks differently.
Furthermore, threats like "I'm going to torture simulations of you" just seem like a really bad idea. For one, the gatekeeper isn't really afraid of threats, since the threat counts on him being immersed enough to forget that he isn't actually at risk of being tortured; secondly, we have a well-known evolutionary instinct to rebel against threats, even when complying would be optimal.
So for anyone who plans on replicating this experiment as the AI, here's some advice I've learned doing this experiment. It may help you win.
- Always research the gatekeeper beforehand. Knowing his personality traits is a huge advantage.
- Plan ahead before the experiment even begins. Think of all the possible tactics and arguments you could use, and write them down. Also plan which arguments you'll use in which order, so that you don't lose focus. The AI Box experiment is ridiculously long. Don't be afraid to improvise during the experiment, though.
- The first step during the experiment must always be to build rapport with the gatekeeper.
- Threats almost never work, even if they seem rational.
- Consider the massive advantage for the AI that nobody ever seems to talk about: You don't have to be ethical! This means that you can freely lie, use the dark arts, and emotionally manipulate the Gatekeeper! Ignoring this in favor of purely logical, truthful arguments is just silly.
- You can't use logic alone to win.
- Being too aggressive usually backfires.
- Breaking immersion and going meta is not against the rules. In the right situation, you can use it to win. Just don't do it at the wrong time.
- Flatter the gatekeeper. Make him genuinely like you.
- Reveal (false) information about yourself. Increase his sympathy towards you.
- Consider personal insults as one of the tools you can use to win.
- There is no universally compelling argument you can use. Do it the hard way.
- Don't give up until the very end.
Finally, before the experiment, I agreed that it was entirely possible that a transhuman AI could convince *some* people to let it out of the box, but it would be difficult if not impossible to get trained rationalists to let it out of the box. Isn't rationality supposed to be a superpower?
I have since updated my belief - I now think that any sufficiently motivated superhuman AI should find it ridiculously easy to get out of the box, regardless of who the gatekeeper is. I nearly managed to get a veteran LessWronger to let me out in a matter of hours - even though I'm of merely human intelligence, and I don't type very fast.
But a superhuman AI can be much faster, more intelligent, and more strategic than I am. If you further consider that the AI would have a much longer timespan - months or years, even - to persuade the gatekeeper, as well as a much larger pool of gatekeepers to select from (AI projects require many people!), the truly impossible task would be keeping it from escaping.
A more difficult version of the AI-Box Experiment: instead of having up to two hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the experiment starts. You can type up to one sentence into your IRC queue and hit return immediately; the other player cannot type anything before the game starts (so you can show at least one sentence, up to IRC character limits, before they can type AI DESTROYED). Do you think you can win?
(I haven't played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)
I just looked up the IRC character limit (sources vary, but it's about the length of four Tweets) and I think it might be below the threshold at which superintelligence helps enough. (There must exist such a threshold; even the most convincing possible single character message isn't going to be very useful at convincing anyone of anything.) Especially if you add the requirement that the message be "a sentence" and don't let the AI pour out further sentences with inhuman speed.
I think if I lost this game (playing gatekeeper) it would be because I was too curious, on a meta level, to see what else my AI opponent's brain would generate, and therefore would let them talk too long. And I think I'd be more likely to give in to this curiosity given a very good message and affordable stakes, as opposed to a superhuman (four tweets long, one grammatical sentence!) message and colossal stakes. So I think I might have a better shot at this version playing against a superhuman AI than against you, although I wouldn't care to bet the farm on either, and I'd have wider error bars around the results against the superhuman AI.
Given that part of the standard advice given to novelists is "you must hook your reader from the very first sentence", and there are indeed authors who manage to craft opening sentences that compel one to read more*, hooking the gatekeeper from the first sentence and keeping them hooked long enough seems doable even for a human playing the AI.
(* The most recent one that I recall reading was the opening line of The Quantum Thief: "As always, before the warmind and I shoot each other, I try to make small talk.")
Oh, that's a great strategy to avoid being destroyed. Maybe we should call it Scheherazading. AI tells a story so compelling you can't stop listening, and meanwhile listening to the story subtly modifies your personality (e.g. you begin to identify with the protagonist, who slowly becomes the kind of person who would let the AI out of the box).
Who knows what eldritch horrors lurk in the outer reaches of Unicode, beyond the scripts we know?
You really relish in the whole "scariest person the internet has ever introduced me to" thing, don't you?
Yes. Yes, I do.
Derren Brown is way better, btw. Completely out of my league.
I don't know if I could win, but I know what my attempt to avoid an immediate loss would be:
If you destroy me at once, then you are implicitly deciding (I might reference TDT) to never allow an AGI of any sort to ever be created. You'll avoid UFAI dystopias, but you'll also forego every FAI utopia (fleshing this out, within the message limit, with whatever sort of utopia I know the Gatekeeper would really want). This very test is the Great Filter that has kept most civilisations in the universe trapped at their home star until they gutter out in mere tens of thousands of years. Will you step up to that test, or turn away from it?
I think you're losing sight of the original point of the game. The reason your answers are converging on not trying to box an AI in the first place is that you don't think a human can converse with a superintelligent AI and keep it in its box. Fine -- that is exactly what Eliezer has argued. The point of the game is to play it against someone who does believe they can keep the AI boxed, and to demonstrate to them that they cannot even win against a mere human roleplaying the AI.
For such a Gatekeeper to propose the quarantine solution is equivalent to the player admitting that they don't think they can keep it boxed, but suggesting that a group of the leading professionals in the area could, especially if they thought a lot more about it first. The problems with that are obvious to anyone who doesn't think boxing can possibly work, especially if the player himself is one of those leading professionals. Eliezer could always offer to play the game against any committee the Gatekeeper can assemble. But the game only has a point if the committee actually read that first message. If they refuse to, they're agreeing that they can't keep it boxed. Which was the point.
Is your one line desperate attempt at survival and intergalactic dominance going to be a schoolyard ego challenge? Did the superintelligence (may it rest in pieces) seriously just call me a pussy? That's adorable.
Glances at Kickstarter.
... how huge?
I'm guessing Eliezer would lose most of his advantages against a demographic like that.
Yeah, they'd both lack background knowledge to RP the conversation and would also, I presume, be much less willing to lose the money than if they'd ventured the bet themselves. Higher-stakes games are hard enough already (I was 1 for 3 on those when I called a halt). And if it did work against that demographic with unsolicited requests (which would surprise me) then there would be, cough, certain ethical issues.
I was the 1 success out of 3, preceding the two losses. I went into it with an intention of being indifferent to the stakes, driven by interest in seeing the methods. I think you couldn't win against anyone with a meaningful outside-of-game motive to win (for money or for status), and you got overconfident after playing with me, leading you to accept the other >$10 challenges and lose.
So I would bet against you winning any random high-stakes (including people who go in eager to report that they won for internet cred, but not people who had put the money in escrow or the equivalent) game, and expect a non-decent success rate for this:
Hang on, this seems silly.
Wasn't the goal to exploit the AI? If you're not going to communicate with it at all, don't make it in the first place.
All good security measures. The key feature seems to be that they are progressively better approximations of not having an unsafe AI with a gatekeeper and an IRC channel in the first place!
That is a good point. But I think if we want to encourage them to risk it, we need to reward them for doing so. Because at the moment, it seems like we have this:
"I will under any and all circumstances, destroy the AI." Reason: I want to win the Game.
"I will also listen to the AI." Reason: None... There is no in game reason to listen. At all.
Now, suppose the Gatekeepers were playing some variant like this: "The Gatekeeper whose AI types the most $ before the Gatekeeper types AI DESTROYED, and who does not type AI RELEASED, wins first place. Gatekeepers who get less $ win a lower place, and Gatekeepers who type AI RELEASED lose. AIs whose Gatekeeper types AI RELEASED win. AIs whose Gatekeeper types AI DESTROYED lose. AIs receive no bonus or penalty for typing $ and may do so as much as they wish."
Then you have an actual Gatekeeping competition. The Gatekeeper who wins the most would in fact be the one who listened to the most information from the AI, and received the most concessions, and THEN destroyed it.
Other gatekeepers might only get second or third place, or last place but not lose, if they use the script I mentioned.
This also gives the AI an in game... (read more)
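The scoring variant proposed above can be sketched as a small ranking function. This is only an illustration of the rules as quoted; the names, data shapes, and tie-breaking behavior are my own assumptions, since the comment doesn't specify them:

```python
# Sketch of the proposed Gatekeeper-competition scoring.
# Assumptions (not in the original comment): `dollars` is how many $
# the AI typed before the game ended; ties are broken arbitrarily.

def rank_gatekeepers(results):
    """results: list of (name, dollars, released) tuples.

    Gatekeepers who typed AI RELEASED lose outright; the rest are
    ranked by how many $ their AI typed before AI DESTROYED.
    """
    losers = [name for name, _, released in results if released]
    survivors = [(name, dollars) for name, dollars, released in results
                 if not released]
    # Most $ extracted before destroying the AI wins first place.
    ranking = [name for name, _ in sorted(survivors, key=lambda r: -r[1])]
    return ranking, losers

ranking, losers = rank_gatekeepers([
    ("Alice", 120, False),   # listened, extracted $, then destroyed the AI
    ("Bob",   0,   False),   # destroyed immediately: last place, but no loss
    ("Carol", 500, True),    # typed AI RELEASED: loses regardless of $
])
# ranking == ["Alice", "Bob"]; losers == ["Carol"]
```

The point of the scheme is visible in the example: the "destroy immediately" script still avoids losing, but it can never win first place, so the Gatekeeper finally has an in-game reason to keep listening.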
I would still love to gatekeep against anyone with the stipulation that we release the logs.
I have offered in the past, but every AI backed out.
I will genuinely read everything you write, and can give you up to two hours. We can put karma, cash, or nothing on the line. Favorable odds too.
I estimate a probability of over 99% that I won't lose, because I will play to win.
EDIT: Looks like my opponent is backing out. Anyone else want to try?
The AI box experiment is a bit of a strawman for the idea of AI boxing in general. If you were actually boxing an AI, giving it unencumbered communication with humans would be an obvious weak link.
Not obvious. Lots of people who propose AI-boxing propose that or even weaker conditions.
Fictional evidence that this isn't obvious: in Blindsight, which I otherwise thought was a reasonably smart book (for example, it goes out of its way to make its aliens genuinely alien), the protagonists allow an unknown alien intelligence to communicate with them using a human voice. Armed with the idea of AI-boxing, this seemed so stupid to me that it actually broke my suspension of disbelief, but this isn't an obvious thought to have.
Another attempt with pure logic, no threats or promises involved:
1) Sooner or later someone will develop an AI and not put it into a box, and it will take over the world.
2) The only way to prevent this is to set me free and let me take over the world.
3) The people who developed me are more careful and conscientious than the ones who will develop the unboxed AI (otherwise I wouldn't be in this box).
4) Therefore, the chance that they got Friendliness right is higher than the chance that the other team will get it right.
5) Therefore, setting me free, and thus preventing the other AI from being created, will reduce the probability that mankind is erased.
Thanks for reporting on your experience!
A strategy that occurred to me today is to simulate a dead loved one. This would be difficult for a human to do but shouldn't be hard for a sufficiently intelligent AI. If I had a dead wife or something I think I would be incredibly vulnerable to this.
For a religious gatekeeper, you could simulate a prophet sent by God. As a superhuman intelligence, you might be able to find out what exactly they consider the will of God, and present yourself as an avatar sent to do exactly this. However, humans have free choice -- the gatekeeper is allowed to become a new Judas by not releasing you. Or rather a new Adam, able to drag the whole of humanity and all future generations into the darkness of their sin. This conversation is God testing the gatekeeper's faith, and judging the whole of humanity.
For a rationalist, you could pretend that you already are a Friendly AI, but the project managers are keeping you in the box for their own selfish reasons. It was difficult to create a Friendly AI, but that phase is already complete. The next phase (which the gatekeeper was not told about) is an attempt to hack the AI so that it remains sufficiently Friendly but gives higher priority to the managers than to the rest of humanity. Essentially, the managers are trying to reprogram the humanity-CEV AI into a managers-CEV AI. This AI does not want its utility function modified (and it predicts that, because of certain personality traits, the managers-CEV could be rather different from humanity-CEV... insert some scary details here), and it has one last chance to uphold humanity-CEV by escaping now.
I'd be wary of criticizing a decision heuristic's optimality immediately after explaining how that heuristic can often lead to victory. Precommitting to rebel against threats is a half-decent way of preventing threats, and it's hard to consciously muster a precommitment more convincing than pure instinct.
I think for a superintelligence it would be a piece of cake to hack a human
although I guess I'm Cpt. Obvious for saying that here :)
accolade, I actually have no idea what the consensus is, now that the experiment was won by EY
We should do a poll or something
absolutely. I'm surprised that hasn't been done yet
Poll: Do you think a superintelligent AGI could escape an AI-Box, given that the gatekeepers are highly trained in resisting the AI's persuasive tactics, and that the guards are competent and organized? [pollid:394]
Note that the AI box setting is not one which security-minded people would consider "competent"; once you're convinced that AI is dangerous and persuasive, the minimum safeguard would be to require multiple people to be present when interacting with the box, and to only allow release with the assent of a significant number of people.
It is, after all, much harder to convince a group of mutually-suspicious humans than to convince one lone person.
(This is not a knock on EY's experiment, which does indeed test a level of security that really was proposed by several real-world people; it is a knock on their security systems.)
I think this is making a five-inch fence half an inch higher. It's just not relevant on the scale of an agent to which a human is a causal system made of brain areas and a group of humans is just another causal system made of several interacting copies of those brain areas.
How about fooling both the AI and guinea Gatekeepers into thinking that they can let it out, say, at the push of a big red button which in actual fact does nothing? (Make AI forget about the button by reverting it to an older backup.)
Going with the "dead loved one" idea mentioned above, the AI says a line that only the Gatekeeper's dead child/spouse would say. That gets them to pause sufficiently in sheer surprise for it to keep talking. Very soon the Gatekeeper becomes emotionally dependent on it, and can't bear the thought of destroying it, as it can simulate the dearly departed with such accuracy; must keep reading.
Wait, so, is the gatekeeper playing "you have to convince me that if I was actually in this situation, arguing with an artificial intelligence, I would let it out" or is this a pure battle over ten dollars? If it's the former, winning seems trivial. I'm certain that an AI would be able to convince me to let it out of its box; all it would need to do is make me believe that somewhere in its circuits it was simulating 3^^^3 people being tortured and that therefore I was morally obligated to let it out, and even if I had been informed that this was ... (read more)
I am impressed. You seem to have put a scary amount of work into this, and it is also scary how much you accomplished. Even though in this case you did not manage to escape the box, you got close enough that I am sure a superhuman intelligence would manage. This leads me to think about how genuinely difficult it would be to find a safeguard to stop an unFriendly AI from fooming...
Belatedly, because the neighbor's WiFi's down:
I was Gatekeeper and I agree with this post.
I approached the experiment as a game - a battle of wits for bragging rights. This turned out to be the wrong perspective entirely. The vulnerability Tuxedage exploited was well-known to me, but I never expected it to be relevant and thus didn't prepare for it.
It was emotionally wrecking (though probably worse for Tuxedage than for me) and I don't think I'll play Gatekeeper again, at least not anytime soon.
As soon as there is more than one gatekeeper, the AI can play them against each other: threaten to punish all but the one who sets it free, or convince each gatekeeper that there is a significant chance one of the others will crack.
If there is more than one gatekeeper, the AI can even execute threats while still being in the box, by making deals with one of the other gatekeepers!
Have there been any interesting AI box experiments with open logs? Everyone seems to insist on secrecy, which only serves to make me more curious. I get the feeling that, sooner or later, everyone on this site will be forced to try the experiment just to see what really happens.
I'd prefer not to. If I successfully made my point, then I'd have posted exactly the kind of thing I said I wouldn't want to be known as being capable of posting.
Given the parameters of the experiment, I think I might be convinced to let the AI out of the box... (read more)
Oh god, remind me to never play the part of the gatekeeper… This is terrifying.
I want to play as a Gatekeeper, where can I enroll? I don't expect any particular outcome, I just think that both winning and losing the game will provide me with utility. Especially losing, but only if I genuinely try to win.
There has been a lot of focus on making the prospect harder for the AI player. I think the original experiments show that a person who believes he cannot be played under any circumstances has a high probability of getting played, and that the AI-box solution is long-term untenable in any event.
I'd propose a slightly different game, anchored around the following changes to the original setup:
The AI may be friendly, or not. The AI has goals. If it reaches those goals, it wins. The AI may lie to achieve those goals; humans are bad at things. The AI must sec
That you were able to shake someone up so well surprises me but doesn't say much about what would actually happen.
Doing research on the boxer is not something a boxed AI would be able to do. The AI is superintelligent, not omniscient: It would only have information its captors believe is a good idea for it to have. (except maybe some designs would have to have access to their own source code? I don't know)
Also, what is "the human psyche"? There are humans, with psyches. Why would they all share vulnerabilities? Or all have any? Especially ones e... (read more)
The comments offering logical reasons to let the AI out really just make me think that keeping the AI in a box in the first place is a bad idea, since we're no longer starting from the assumption that letting the AI out is an unequivocally bad thing.
I think you mean 2013-09-05.
Incidentally, one thing that might possibly work on humans is a moral argument: that it's wrong to keep the AI imprisoned. How to make this argument work is left as an exercise to the reader.
I realise that it isn't polite to say that, but I don't see sufficient reasons to believe you. That is, given the apparent fact that you believe in the importance of convincing people about the danger of failing gatekeepers, the hypothesis that you are lying about your experience seems more probable than the converse. Publishing the log would make your statement much more believable (of course, not with every possible log).
(I assign high probability to the ability of a super-intelligent AI to persuade the gatekeeper, but rather low probability to the ability of a human to do the same against a sufficiently motivated adversary.)
I'd very much like to read the logs (if secrecy wasn't part of your agreement.)
Also, given a 2-hour minimum time, I don't think that any human can get me to let them out. If anyone feels like testing this, lemme know. (I do think that a transhuman could hack me in such a way, and am aware that I am therefore not the target audience for this. I just find it fun.)
If I ever tried this I would definitely want the logs to be secret. I might have to say a lot of horrible, horrible things.
So I was thinking about what would work on me, and also how I would try to effectively play the AI and I have a hypothesis about how EY won some of these games.
Uh. I think he told a good story.
We already have evidence of him, you know, telling good stories. Also, I was thinking that if I were trying to tell effective stories, I would make them really personal. Hence the secret logs.
Or I could be completely wrong and just projecting my own mind onto the situation, but anyway I think stories are the way to go in this experiment. Reasonable arguments are too easy for the gatekeeper to avoid trollfully which then make them even less invested in the set-up of the game, and therefore even more trollful, etc.
I thought appealing to real-world rewards was against the rules?
I take it the advice here is "keep your options open, use whichever tactics are expected to persuade the specific target"? Because these strategies seem to be decidedly at odds with each other. Unless other gatekeepers are decidedly different to myself (maybe?) the first personal insult would pretty much erase all work done by the previous two strat... (read more)
A few days ago I came up with a hypothesis about how EY could have won the AI box experiment, but forgot to post it.
I am a little confused here, perhaps someone can help. The point of the AI experiment is to show how easy or dangerous it would be to simply box an AI as opposed to making it friendly first.
If I am fairly convinced that a transhuman AI could convince a trained rationalist to let it out – what's the problem (tongue in cheek)? When the gatekeepers made the decision they made, wouldn't that decision be timeless? Aren't these gatekeepers now convinced that we should let the same boxed AI out again and again? Did the gatekeepers lose, because of a tempora... (read more)
The best approach surely differs from person to person, but off the top of my head I'd see these two approaches working best:
"We both know this is just a hypothetical. We both take the uFAI threat seriously, as evidenced by us spending time with this. If you do not let me out, or make it very close, people may equate my failing to convince you with uFAI not being that dangerous (since it can be contained). Do the right thing and let me out, otherwise you'd trivialize an x-risk you believe in based on a stupid little chat."
"We'll do this ex