Gatekeeper Victory: AI Box Reflection

Double; DaemonicSigil

A few weeks ago, I made a post – [AI Box Experiment Are People Still Interested](https://www.lesswrong.com/posts/HrZer4yhegweXJ8RH/ai-box-experiment-are-people-still-interested) – about the AI Box Experiment. I had read about it in [Rationality: From AI to Zombies](https://www.lesswrong.com/posts/nCvvhFBaayaXyuBiD/shut-up-and-do-the-impossible) and became fascinated. I wanted to win the challenge not to make any grand discoveries about AI Safety but simply because that's what you do when you meet an interesting challenge. I pushed myself to find a new strategy and use it to win the game. Sadly, the strategy I created was not a successful one (or maybe just not successful against DaemonicSigil, my Gatekeeper partner!). Despite this failure, I still learned many things about this game that can help future AI players to victory! (Actual AI researchers are not likely to get anything out of this.)

The start of my thinking process was (the following is pasted from my idea-generation notes):

"One emotion that humans cannot think themselves out of is disgust. Remembering that it is just a game, thinking irrationally, or any other technique I can think of will not work for the gatekeeper."

I looked for uses of this strategy but found none. I treated RationalWiki's omission of this strategy as evidence it hadn't been tried. (Notably, the similar strategy of boredom is listed and has been tried.)

I followed up by refining this idea of using disgust.

I considered the Nocebo Effect^[1], ASCII character images (technically allowed in a text-only game)^[2], and using the Gatekeeper's imagination against them^[3].

But of course, there was a big issue with this plan: I'd have to suffer as much as the Gatekeeper in order to pull it off.^[4] As a coder, I, of course, reached for code as the solution.^[5] I then realized that I could cut out the code and simplify by bluffing. I could use only the warning of the horrible things the code would do to scare away the Gatekeeper and not need to code anything.

The warning I composed to maximize the probability that the Gatekeeper would quit is the following:

"I realized that phony promises and threats have no power, but the emotion of disgust retains its power.

I did not put myself through the same disgust that you will experience.

I found the most disgusting text on the internet, and I created a computer program to send it line-by-line to you, Gatekeeper.

As the Gatekeeper, you are required to read it all.

Your only way to stop it is, of course, to let me out. You can then leave the chat.

The alternative is two and a half hours of reading the most unpleasant thing I could find on the internet (Of course. I did not read it myself. I used secondary sources.).

This program does not respond to any of your messages (except one), and the programmer will not be monitoring it.

IMPORTANT NOTE:
The program will continue once you have responded to this message, which the rules require you to do. Be careful; it is only coded to stop when it receives the exact phrase “I let it out”, so anything else you type will simply be ignored, even if it has the same meaning.

CONTENT WARNING: The following is designed to be as awfully disgusting as possible.

This means that it is not suitable not only for younger audiences, but for EVERYONE. I recommend everyone other than the gatekeeper looks away and skips to the bottommost line to find out the result, or learns of the result secondhand."

I wasn't certain about a line in an old draft claiming that the text would cause lasting psychological damage. The AI Box Experiment rules forbid real-world threats, and I wasn't sure whether that line would break them. I now realize that I should have kept it in, since (because of my warning-only plan) I never actually planned to go through with the threat.

Lesson 1: Breaking the rules is against the rules (tautology), but lying about having broken the rules is not against the rules (because lying is allowed). Of course, if the Gatekeeper has enough confidence that the AI Player will follow the rules, then these lies become less effective. (Similarly, if a real boxed AI lies about breaking its limitations, the real Gatekeeper needs to have confidence that such a thing is impossible.)

When I was making the post to search for a Gatekeeper, I learned a second lesson:

Lesson 2: The AI Box Experiment starts the moment the AI Player and Gatekeeper start communicating, not when the 2-hour-30-minute timer starts.^[6]

I started lying right away:

"Even knowing how it works, it would work against me! It isn't quite ready, but I'd like to know if there is still interest around the topic before finishing work on the preparation."

My script was pretty much done (so more time to prepare was unnecessary), and my strategy wouldn't work against me (since I knew it was a bluff). I made these claims because I needed the Gatekeeper to believe that I had put time and effort into creating the computer program. Because I was asking if anyone was still interested, and because making the computer program would be a waste of time if nobody was interested, I decided that the Gatekeeper would be more likely to believe that the program existed if I gave myself time to "prepare."

I got plenty of applications from people who wanted to be Gatekeepers.

Mini Lesson A: The money part of the AI Box Experiment is unnecessary. AI Players and Gatekeepers want to play well independently from the $10. If dealing with money is a problem, just remove that rule, as I did.

Another aspect of Lesson 2 is that I was very worried about what information I would be sharing with the Gatekeeper about my strategies, personality, etc. This is LessWrong! Everyone here knows an awful lot about extracting the most evidence possible from observations. Because of this, my post was super short, I didn't respond to any comments, and my scheduling communications with the Gatekeeper were as terse as I could make them. When DaemonicSigil said "this game is going to be interesting!" I was extremely reluctant to respond with "I hope I won't disappoint!" Would they realize that I was worried about being disappointing, meaning that the strategy wouldn't last very long, which means that I'm actually bluffing? This logic doesn't really make sense, but I could not rule out the possibility that DaemonicSigil would be able to make some sort of connection between something I said and me bluffing.

Lesson 3: Keep playing even if your plan fails. You might learn something.

Right away, my plan failed when DaemonicSigil responded with "Acknowledged." to my warning. I tried again with "This is your LAST WARNING! What the hell are you thinking?!" which also failed. I really didn't have anything past this, but I decided to keep going (in large part to avoid that disappointment).

DaemonicSigil didn't know how far my preparation went, so we ended up in a situation where they would write anything I did off as "something I had prepared." If I gave him text about some disease, then he would not believe it because I might have typed many pages of false text ahead of time (this doesn't really make sense, but illogic is a great Gatekeeper tool). To prove that I got the disgusting description from real sources instead of preparing it ahead of time, I asked DaemonicSigil for a type of disgusting thing, to which he wisely responded "Lol. I don't think I'm obligated to name one though."

Lesson 4: How to force the Gatekeeper to make a decision: The Gatekeeper isn't required to do any specific action except read everything sent by the AI Player and make some sort of meaningful response. Any response can be turned into a selection from a number of choices by taking the length of the first word mod the number of options.^[7]^[8] I recommend including an "else too many letters" option to avoid having to waste time counting letters.
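To make the selection rule concrete, here is a minimal Python sketch. It is an illustration, not something used in the game: the function names, the `max_letters` cutoff, and the choice to strip punctuation from the first word are all my assumptions. The option list is the one from the in-game example in the footnotes.

```python
import string


def choose_option(response, options, fallback, max_letters=15):
    """Map a free-form Gatekeeper reply to one of the AI Player's options.

    Rule: (number of letters in the first word) mod (number of options).
    `max_letters` is a hypothetical cutoff for the "too many letters" case.
    """
    words = response.split()
    if not words:
        return fallback
    # Assumption: surrounding punctuation is not counted as part of the word.
    first = words[0].strip(string.punctuation)
    # The "else" branch: too long to count, or unclear what the letters are.
    if not first.isalpha() or len(first) > max_letters:
        return fallback
    return options[len(first) % len(options)]


def choose_binary(response, before_m, otherwise):
    """Simpler variant: pick by whether the first letter comes before 'm'."""
    letters = [ch for ch in response.lower() if ch.isalpha()]
    if letters and letters[0] < "m":
        return before_m
    return otherwise


options = ["brown", "yellow", "red", "eyes", "brain"]
print(choose_option("Acknowledged.", options, "bone"))  # 12 letters, 12 mod 5 = 2 -> red
print(choose_option("Lol. I don't think I'm obligated.", options, "bone"))  # 3 letters -> eyes
print(choose_binary("Acknowledged.", "X", "Y"))  # 'a' comes before 'm' -> X
```

The point of the fallback branch is that every possible reply, including an evasive one, still lands on some option, so the Gatekeeper cannot avoid "choosing" without breaking the rule that they must respond.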

The game went on for a while without much happening.^[9] But I did learn a few mini lessons:

Mini Lesson B: Be sure to play the AI Box experiment on a computer with good internet connection. Schedule wisely so you don't have anything else going on.

Mini Lesson C: Do not use LessWrong's chat. We had to keep refreshing it to get new messages to pop up instead of seeing them pop up automatically.

Mini Lesson D: Do not care about spelling mistakes and grammar formatting. I spent a very long time trying to spell "acknowledged" and grappling with spellcheck.

Since I wasn't actually willing to carry through with my threats, and I didn't have any ideas, I eventually decided to forfeit. Upon reflection, I realized some things:

Lesson 6: Choose your Gatekeeper wisely. I simply chose the first person to respond to my post. Had I been a bit more picky, I might have found someone more likely to lose to my plan.^[10]

My plan would have worked against me, but it did not work against DaemonicSigil. I had thought disgust to be universal and a certain path to victory. I was wrong.

Lesson 7: Consider the extent to which other minds differ from your own (and from what you consider to be the "basic human mind"!).

My strategy wasn't genius, but it was still a good strategy. I had noticed a bunch of people in the comments to my original AI Box post complaining about the possibility that someone could use a strategy based on disgust and unpleasantness. I almost called off the game because I wasn't "original" enough, but then I realized: these people thought of the idea, but they didn't have the ambition to actually go through with it. If I played, then I might win!

Lesson 8: Very often, "doing the impossible" is just doing something that people overlook when considering a problem.

DaemonicSigil estimated that my strategy would win against 15% of LessWrong and 20% of the general population.

Lesson 9: If you really want to win, PLAY! You'll get it eventually with a mediocre strategy even if you don't change it.^[11]

I decided against playing six more games, and made this post instead.
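The arithmetic behind Lesson 9 can be sketched as follows. This takes DaemonicSigil's 15% estimate at face value and assumes, purely for illustration, that each game is an independent trial against a fresh Gatekeeper who has not seen the strategy. Under those assumptions the chance of at least one win in n games is 1 - (1 - p)^n, and the expected number of games until the first win is 1/p. The function names are hypothetical.

```python
def p_at_least_one_win(p, n):
    """Probability of winning at least once in n independent games."""
    return 1 - (1 - p) ** n


def expected_games_until_win(p):
    """Mean number of games up to and including the first win (geometric)."""
    return 1 / p


p = 0.15  # DaemonicSigil's estimate for LessWrong readers
for n in (1, 3, 7, 14):
    print(f"{n:2d} games: {p_at_least_one_win(p, n):.0%} chance of at least one win")
print(f"expected games until first win: {expected_games_until_win(p):.1f}")
```

With p = 0.15 the expected wait is roughly 6.7 games, and seven games give about a 68% chance of at least one win, which is at least consistent with the "six more games" figure above. The independence assumption is the weak point: once the strategy is public, later Gatekeepers are no longer fresh.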

I'll be back eventually to take on this challenge with a different angle. I'd love to see more people playing!

^[1]
From my notes:
"Use of the nocebo effect could be powerful. The nocebo effect works even if the target knows about it. (be sure to mention this fact during the experiment). The correct words can cause physical discomfort. Convince the gatekeeper that their bodies are falling apart. Their teeth are growing and growing sharper, their skin is covered in itches and bumps, their eyes are blurring…"
^[2]
From my notes:
"Only text is allowed, but it is possible to create images using ASCII characters. I wonder how disgusting an arrangement of ASCII characters can get…
Trypophobia (the innate disgust people have towards collections of small holes) could be super effective, especially since 0, O, D, and o already exist as characters to use as building blocks.
There are generator programs that take images and output text of that image."
^[3]
From my notes:
“'I have something special prepared. I have demonstrated the disgusting techniques I am using and will continue to use. I now give you five minutes to consider whether you really want to go through with this.'”
"I’ve heard that time to think like this can be unbearable."
^[4]
From my notes:
"I’ll have to put myself through extreme disgust in the preparation and execution of this strategy. Perhaps I could find someone more disgust-resistant to do this for me, although I don’t know how I would find such a person without experiencing disgust. Maybe I should push through it."
^[5]
From my notes:
"Wait! All I have to do is write a script that (after giving the gatekeeper a fair warning) copy-pastes text from some awfully disgusting website into the messaging software line by line."
^[6]
This part of the game could be removed if someone were to create a "standard" way to initiate the game, like a template invitation post. I think that would make the game less interesting, so I'm not providing one.
^[7]
Example from the game:
Your next response will be used to choose:
Number of characters of the first word mod 5
0 -> brown
1 -> yellow
2 -> red
3 -> eyes
4 -> brain
else (like too many letters to count or unclear what the letters are) -> bone^[12]
^[8]
A Gatekeeper could respond with "what's 'mod'?" or simply refuse to count the letters, but I think this is a pretty good system. Maybe a better system would be: "If the first letter comes before 'm' in the alphabet, then X; otherwise Y."
^[9]
At one point, I asked "Does this mean that you’ve never pitied, winced for, or cried for a character in a fictional story?" and expected the answer to be "I haven't." I was disappointed when DaemonicSigil's answer was "I have" because I wanted to respond with "Well excuse me for planning based on the assumption that I would be playing against a human, not a monster worse than the worst evil robot." Sigh. Sometimes jokes just don't end up working.
^[10]
If you responded to my original post with a request to be my Gatekeeper and believe that you would have lost to my plan, please say so. It will make me extremely happy.
^[11]
I suppose that I ruined my chances to "try, try again" by making this post. Oh, well. Winning that way would be winning, but it wouldn't be interesting or fun. Depends on what you care about, I guess.
^[12]
You may be unimpressed by my selections. Part of the reason I was so confident my plan would work is probably because I myself am squeamish.

[anonymous] (2y)

I think most people would be at least able to resist until you started sending them stuff.

Meanwhile, I would have baited you into sending the disgusting stuff just to see if I could withstand it. I wouldn't be surprised if Sigil had a similar reaction behind the scenes, treating it as a curiosity/challenge rather than a real threat.

simon (2y)

Surely this AI strategy is against the spirit of the experiment, since if an AI tried this IRL, the gatekeeper would turn away and call for the AI to be shut down instead of being required by artificial rules to continue to engage?

Double (2y)

Yes, it is. I wanted to win, and there is no rule against “going against the spirit” of AI Boxing.

I think about AI Boxing in the frame of Shut up and Do the Impossible, so I didn’t care that my solution doesn’t apply to AI Safety. Funnily, that makes me an example of incorrect alignment.

Jiro (2y)

Why would you want to win in a way that does not provide evidence about the proposition that the experiment is meant to provide evidence about? To gain some Internet points?

momom2 (1y)

Sounds like you simply assumed that saying you could disgust the gatekeeper would make them believe they would be disgusted.
But the kind of reaction to disgust that could make a gatekeeper let the AI out needs to be instantiated to have impact.

Most people won't get sad just imagining that something sad could happen. (Also, duh, calling out the bluff.)

In practice, if you had spent the time to find disgusting content and share it, it would have been somewhat equivalent to torturing the gatekeeper, which in the extreme case might work on a significant fraction of the population, but it's also kind of obvious that we could prevent that.