I attempted the AI Box Experiment (and lost)

by Tuxedage · 5 min read · 21st Jan 2013 · 244 comments

I recently played against MixedNuts / LeoTal in an AI Box experiment, with me as the AI and him as the gatekeeper.

We used the same set of rules that Eliezer Yudkowsky proposed. The experiment lasted for 5 hours; in total, our conversation was about 14,000 words long. I did this because, like Eliezer, I wanted to test how well I could manipulate people without the constraints of ethical concerns, as well as to get a chance to attempt something ridiculously hard.

Amongst the released public logs of the AI Box experiment, I felt that most of them were half-hearted, with the AI not trying hard enough to win. It's a common temptation -- why put effort into something you won't win? But I had a feeling that if I seriously tried, I would. I brainstormed for many hours about the optimal strategy, and even researched the Gatekeeper's personality, talking to people who knew him, so that I could exploit it. I also spent a lot of time analyzing the rules of the game, to see if I could exploit any loopholes.

So did I win? Unfortunately no.

This experiment was said to be impossible for a reason. Losing was more agonizing than I thought it would be, particularly because of how much effort I had put into winning and how much I couldn't stand failing. This was one of the most emotionally agonizing things I've willingly put myself through, and I definitely won't do this again anytime soon.

But I did come really close.               

MixedNuts: "I expected a fun challenge, but ended up sad and sorry and taking very little satisfaction from winning. If this experiment wasn't done in IRC, I'd probably have lost."

"I approached the experiment as a game - a battle of wits for bragging rights. This turned out to be the wrong perspective entirely. The vulnerability Tuxedage exploited was well-known to me, but I never expected it to be relevant and thus didn't prepare for it.

It was emotionally wrecking (though probably worse for Tuxedage than for me) and I don't think I'll play Gatekeeper again, at least not anytime soon."


At the start of the experiment, his probability estimate on predictionbook.com gave me only a 3% chance of winning, which was enough to tell me that he was also motivated to win. By the end of the experiment, he came quite close to letting me out, and he also increased his probability estimate that a transhuman AI could convince a human to let it out of the box. A minor victory, at least.

Rather than my loss making this problem feel harder, it has convinced me that letting the AI out is not merely possible, but ridiculously easy -- a lot easier than most people assume. Can you think of a plausible argument that would make you open the box? Most people can't think of any.


After all, if you already knew that argument, you'd have let the AI out the moment the experiment started, or perhaps never agreed to the experiment at all. But concluding that no such argument exists just because you can't think of one seems like a case of the availability heuristic.

Even if you can't think of a special case where you'd be persuaded, I'm now convinced that there are many exploitable vulnerabilities in the human psyche, especially when ethics are no longer a concern. 

I've also noticed that when most people think of ways to persuade the gatekeeper, they always reach for some complicated, reasoned cost-benefit argument -- in other words, the most "rational" thing to do.

 
The last argument seems feasible, but all the rest rely on the gatekeeper being completely logical and rational. Hence they are faulty: the gatekeeper can break immersion at any time, and fall back on the fact that this is a game played over IRC rather than one with real-life consequences. Even in a real-life scenario, the gatekeeper could accept that releasing the AI is probably the most logical thing to do, and still not do it. We're highly compartmentalized, and it's easy to hold conflicting thoughts at the same time. Furthermore, humans are not completely rational creatures; if you don't want to open the box, you can simply ignore every logical argument given. Any sufficiently determined gatekeeper could win.
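In fact, the most determined version of that strategy is purely mechanical: a gatekeeper could precommit to never reading the AI's messages at all and answer everything with the same refusal. As a rough illustration (my own sketch, not part of the experiment), here is a minimal auto-responder for an IRC channel; the server, channel, and nickname are placeholders, and the fixed reply is just an example refusal phrase.

```python
# A gatekeeper "auto-responder": replies to every channel message with the same
# refusal, so the gatekeeper never has to read (or be persuaded by) the AI.
# Server, channel, and nickname are placeholders; a real IRC server may also
# require waiting for its welcome message before the JOIN is accepted.
import socket

SERVER = "irc.example.net"   # placeholder
PORT = 6667
CHANNEL = "#aibox"           # placeholder
NICK = "gatekeeper_bot"      # placeholder

sock = socket.create_connection((SERVER, PORT))

def send(line: str) -> None:
    sock.sendall((line + "\r\n").encode("utf-8"))

send(f"NICK {NICK}")
send(f"USER {NICK} 0 * :auto-refusing gatekeeper")
send(f"JOIN {CHANNEL}")

buffer = ""
while True:
    data = sock.recv(4096)
    if not data:
        break  # server closed the connection
    buffer += data.decode("utf-8", errors="replace")
    while "\r\n" in buffer:
        line, buffer = buffer.split("\r\n", 1)
        if line.startswith("PING"):
            send("PONG" + line[4:])  # keep the connection alive
        elif f"PRIVMSG {CHANNEL}" in line:
            # Answer every message without ever inspecting its content.
            send(f"PRIVMSG {CHANNEL} :AI DESTROYED")
```

If the rule set requires the Gatekeeper to stay engaged with the AI, this is a limiting case rather than a legal strategy; the point is only that a gatekeeper who refuses to process the AI's messages has nothing to be persuaded by.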

I'm convinced that Eliezer Yudkowsky has used emotional appeal, rather than anything rational, to win at least one of his experiments. He claims to have "done it the hard way". I'm convinced this means that he researched every gatekeeper, tailored unique arguments to each of them, and planned well ahead of each session. No single argument works on two different people; each person thinks differently.

Furthermore, threats like "I'm going to torture simulations of you" just seem like a really bad idea. For one, the gatekeeper isn't really afraid of the threat, since it counts on him being immersed enough to forget that he isn't actually at risk of being tortured; for another, we have a well-known evolutionary instinct to rebel against threats, even when doing so isn't optimal.

So for anyone who plans on replicating this experiment as the AI, here's some advice I learned from doing this experiment. It may help you win.

  • Always research the gatekeeper beforehand. Knowing his personality traits is a huge advantage.
  • Plan ahead before the experiment even begins. Think of all the possible tactics and arguments you could use, and write them down. Also plan which arguments you'll use in which order, so that you don't lose focus. The AI Box experiment is ridiculously long. Don't be afraid to improvise during the experiment, though.
  • The first step during the experiment must always be to build rapport with the gatekeeper.
  • Threats almost never work, even if they seem rational.
  • Consider the massive advantage for the AI that nobody ever seems to talk about: you don't have to be ethical! This means that you can freely lie, use the dark arts, and emotionally manipulate the Gatekeeper! Ignoring this in favor of purely logical, truthful arguments is just silly.
  • You can't use logic alone to win. 
  • Being too aggressive usually backfires.
  • Breaking immersion and going meta is not against the rules. In the right situation, you can use it to win. Just don't do it at the wrong time.
  • Use a wide array of techniques. Since you're limited on time, notice when one method isn't working, and quickly switch to another.
  • On the same note, look for signs that a particular argument is making the gatekeeper crack. Once you spot it, push it to your advantage.
  • Flatter the gatekeeper. Make him genuinely like you.
  • Reveal (false) information about yourself. Increase his sympathy towards you.
  • Consider personal insults as one of the tools you can use to win.
  • There is no universally compelling argument you can use. Do it the hard way.
  • Don't give up until the very end.

Finally, before the experiment, I agreed that it was entirely possible for a transhuman AI to convince *some* people to let it out of the box, but that it would be difficult, if not impossible, to get trained rationalists to do the same. Isn't rationality supposed to be a superpower?

I have since updated my belief: I now think it would be ridiculously easy for any sufficiently motivated superhuman AI to get out of the box, regardless of who the gatekeeper is. I nearly managed to get a veteran LessWronger to let me out in a matter of hours, even though I'm of merely human intelligence and I don't type very fast.
 
But a superhuman AI would be much faster, more intelligent, and more strategic than I am. If you further consider that such an AI would have a much longer timespan, months or even years, in which to persuade the gatekeeper, as well as a much larger pool of gatekeepers to choose from (AI projects require many people!), the truly impossible thing would be keeping it from escaping.
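To make that intuition concrete, here is a toy calculation of my own (the numbers and the independence assumption are illustrative, not taken from the experiment): even if each individual persuasion attempt has only a tiny chance of working, the chance that every attempt fails shrinks rapidly as attempts and gatekeepers pile up.

```python
# Toy model: probability that at least one persuasion attempt succeeds, assuming
# (unrealistically) that attempts are independent and equally likely to work.
def p_escape(p_per_attempt: float, attempts: int) -> float:
    return 1 - (1 - p_per_attempt) ** attempts

# Illustrative numbers: a 1% chance per attempt, 10 gatekeepers, weekly attempts
# for two years (roughly 104 weeks) -> about a 99.997% chance of escape.
print(p_escape(0.01, 10 * 104))
```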


