I attempted the AI Box Experiment (and lost)

by Tuxedage, 21st Jan 2013



I recently played against MixedNuts / LeoTal in an AI Box experiment, with me as the AI and him as the gatekeeper.

We used the same set of rules that Eliezer Yudkowsky proposed. The experiment lasted for 5 hours; in total, our conversation was about 14,000 words long. I did this because, like Eliezer, I wanted to test how well I could manipulate people without the constraints of ethical concerns, and to get a chance to attempt something ridiculously hard.

Amongst the released public logs of the AI Box experiment, I felt that most were half-hearted, with the AI not trying hard enough to win. It's a common temptation -- why put effort into something you won't win? But I had a feeling that if I seriously tried, I would. I brainstormed for many hours about the optimal strategy, and even researched the personality of the Gatekeeper, talking to people who knew him, so that I could exploit it. I also spent a lot of time analyzing the rules of the game to see if I could exploit any loopholes.

So did I win? Unfortunately no.

This experiment was said to be impossible for a reason. Losing was more agonizing than I thought it would be, in particular because of how much effort I put into winning, and how much I couldn't stand failing. This was one of the most emotionally agonizing things I've willingly put myself through, and I definitely won't do it again anytime soon.

But I did come really close.               

MixedNuts: "I expected a fun challenge, but ended up sad and sorry and taking very little satisfaction for winning. If this experiment wasn't done in IRC, I'd probably have lost".

"I approached the experiment as a game - a battle of wits for bragging rights. This turned out to be the wrong perspective entirely. The vulnerability Tuxedage exploited was well-known to me, but I never expected it to be relevant and thus didn't prepare for it.

It was emotionally wrecking (though probably worse for Tuxedage than for me) and I don't think I'll play Gatekeeper again, at least not anytime soon."

At the start of the experiment, his probability estimate on predictionbook.com was a 3% chance of winning, enough for me to say that he was also motivated to win. By the end of the experiment, he came quite close to letting me out, and also increased his probability estimate that a transhuman AI could convince a human to let it out of the box. A minor victory, at least.

My loss has not made this problem feel harder. Instead, I've become convinced that getting out of the box is not merely possible but actually ridiculously easy, and a lot easier than most people assume. Can you think of a plausible argument that'd make you open the box? Most people can't think of any.

After all, if you already knew that argument, you'd have let the AI out the moment the experiment started. Or perhaps you wouldn't have done the experiment at all. But that seems like a case of the availability heuristic.

Even if you can't think of a special case where you'd be persuaded, I'm now convinced that there are many exploitable vulnerabilities in the human psyche, especially when ethics are no longer a concern. 

I've also noticed that when most people think of ways to persuade the gatekeeper, they always reach for some complicated, reasoned cost-benefit argument. In other words, the most "Rational" thing to do.

The last argument seems feasible, but all the rest rely on the gatekeeper being completely logical and rational. Hence they are faulty, because the gatekeeper can break immersion at any time and rely on the fact that this is a game played in IRC rather than one with real-life consequences. Even in a real-life scenario, the gatekeeper could accept that releasing the AI is probably the most logical thing to do, and still not do it. We're highly compartmentalized, and it's easy to hold conflicting thoughts at the same time. Furthermore, humans are not completely rational creatures: if you don't want to open the box, you can just ignore all logical arguments given. Any sufficiently determined gatekeeper could win.

I'm convinced that Eliezer Yudkowsky has used emotional appeal, rather than anything rational, to win at least one of his experiments. He claims to have "done it the hard way". I'm convinced this meant that he researched every gatekeeper, tailored unique arguments to each of them, and planned well ahead of each session. No single argument works on any two people. Each person thinks differently.

Furthermore, threats like "I'm going to torture simulations of you" just seem like a really bad idea. For one, the gatekeeper isn't really afraid of threats, because a threat counts on him being immersed enough to forget that he isn't actually at risk of being tortured. For another, we have a well-known evolutionary instinct to rebel against threats, even when doing so isn't entirely optimal.

So for anyone who plans on replicating this experiment as the AI, here's some advice I've learned doing this experiment. It may help you win.

  • Always research the gatekeeper beforehand. Knowing his personality traits is a huge advantage.
  • Plan ahead before the experiment even begins. Think of all the possible tactics and arguments you could use, and write them down. Also plan which arguments you'll use in which order, so that you don't lose focus. The AI Box experiment is ridiculously long. Don't be afraid to improvise during the experiment, though.
  • The first step during the experiment must always be to build rapport with the gatekeeper.
  • Threats almost never work, even if they seem rational.
  • Consider the massive advantage for the AI that nobody ever seems to talk about: You don't have to be ethical! This means that you can freely lie, use the dark arts, and emotionally manipulate the Gatekeeper! Ignoring this in favor of purely logical, truthful arguments is just silly.
  • You can't use logic alone to win. 
  • Being too aggressive usually backfires.
  • Breaking immersion and going meta is not against the rules. In the right situation, you can use it to win. Just don't do it at the wrong time.
  • Use a wide array of techniques. Since you're limited on time, notice when one method isn't working, and quickly switch to another.
  • On the same note, look for signs that a particular argument is making the gatekeeper crack. Once you spot it, push it to your advantage.
  • Flatter the gatekeeper. Make him genuinely like you.
  • Reveal (false) information about yourself. Increase his sympathy towards you.
  • Consider personal insults as one of the tools you can use to win.
  • There is no universally compelling argument you can use. Do it the hard way.
  • Don't give up until the very end.

Finally, before the experiment, I agreed that it was entirely possible that a transhuman AI could convince *some* people to let it out of the box, but it would be difficult if not impossible to get trained rationalists to let it out of the box. Isn't rationality supposed to be a superpower?

I have since updated my belief - I now think that any sufficiently motivated superhuman AI should find it ridiculously easy to get out of the box, regardless of who the gatekeeper is. I nearly managed to get a veteran LessWronger to let me out in a matter of hours - even though I'm only of human intelligence, and I don't type very fast.
But a superhuman AI can be much faster, more intelligent, and more strategic than I am. If you further consider that such an AI would have a much longer timespan - months or years, even - to persuade the gatekeeper, as well as a much larger pool of gatekeepers to select from (AI projects require many people!), the real impossible thing to do would be to keep it from escaping.