I have a novel plan for the AI player that I believe will work against most Gatekeeper players. Even knowing how it works, it would work against me! It isn't quite ready, but before finishing the preparation I'd like to know whether there is still interest in the topic.

There's enough incentive for both players to play well that I don't think that we need to involve money. (I'm also very reluctant to send money to strangers on the internet!)

We would be playing under Tuxedage's rules: https://tuxedage.wordpress.com/2013/09/04/the-tuxedage-ai-box-experiment-ruleset/


Edit: I have finished preparation, and I'm now scheduling the time with the Gatekeeper.

Edit 2: The game was a Gatekeeper victory (i.e. I, the AI Player, lost). Read about it here: Gatekeeper Victory: AI Box Reflection.

5 Answers

Nathan Helm-Burger


In contrast to Dagon, I'm quite confident that boxing WILL work, given that I'm allowed to do to the boxed entity this list of things which I can do to an ML model:

  • examine all its brain activity / activation states at each token / timestep (subjective millisecond)
  • run it as slowly as I like, so that I can have days to analyze each of its subjective milliseconds
  • wipe its memory and rerun it over the same (or different) data with different random seeds to initialize it
  • save and contrast different paused versions run on slightly different data
  • disable or inhibit any portion of its weights / neurons I choose, with precise sub-neuron targeting and whatever intensity of inhibition or noise-excitation I choose
  • use agent models of my choosing as the direct viewers of the boxed entity's output rather than myself, and then decide whether to destroy a given copy of the entity based on the output of my agent models
  • a few other such tricks that fit the description of 'things I can do to ML models'

I don't see a convenient way to do these things to a human test subject though.
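For readers who want to see what some of these tricks look like mechanically, here is a minimal sketch against a hypothetical toy model. Everything in it (the two-layer network, the seeds, the choice of neuron to ablate) is invented for illustration; a real setup would operate on actual model internals rather than this stand-in.

```python
import numpy as np

class BoxedModel:
    """Hypothetical toy stand-in for a 'boxed' ML model: a tiny 2-layer MLP."""

    def __init__(self, seed):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(size=(4, 8))
        self.w2 = rng.normal(size=(8, 2))
        self.activations = []          # full record of every internal state

    def step(self, x):
        h = np.tanh(x @ self.w1)       # hidden "brain activity"
        self.activations.append(h)     # trick: examine every timestep
        return h @ self.w2

    def ablate(self, neuron, strength=1.0):
        # trick: inhibit a chosen hidden unit (strength 1.0 = fully disabled)
        self.w2[neuron, :] *= (1.0 - strength)

x = np.ones(4)

# trick: wipe memory and rerun on the same data with different random seeds
out_a = BoxedModel(seed=0).step(x)
out_b = BoxedModel(seed=1).step(x)

# trick: save a baseline, then contrast it against an ablated rerun
m = BoxedModel(seed=0)
baseline = m.step(x)
m.ablate(neuron=3)                     # disable hidden unit 3
ablated = m.step(x)

print(len(m.activations))              # prints 2: both runs fully inspectable
```

The point of the sketch is that the overseer sits outside the model's subjective time: every activation is recorded, every run is repeatable under a chosen seed, and any component can be switched off between runs.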

Very strong upvote and agreement from me. I think people are underestimating just how great a restriction ML-style boxing places on an agent. There exists an intelligence level at which all these tricks become useless, but before we get to that point, boxing would likely allow us to safely use mildly superhuman AIs to do things which might be pivotal. And each additional trick we discover increases the threshold of safely wieldable intelligence.

Some of your constraints, in particular the first two, seem like they would not be practical in the real world in which AI would be deployed. On the other hand, there are also other things one could do in the real world which can't be done in this kind of dialogue, which makes boxing stronger in practice than the experiment suggests.

However, the real problem with boxing is that whoever boxes less is likely to have a more effective AI. That pressure probably results in someone letting an AI out of its box, or, more likely, loosening the box constraints enough to permit an escape.



I see a flaw in the Tuxedage ruleset. The Gatekeeper has to stay engaged throughout the experiment, but the AI doesn't. So the AI can bore the Gatekeeper to death by replying at random intervals. If I had to stare at a blank screen for 30 minutes waiting for a reply, I would concede.

Alternatively, the AI could just drown the Gatekeeper under a flurry of insults, graphic descriptions of violent/sexual nature, vacuous gossip, or a mix of these for the whole duration of the experiment. I think all the methods that aim at forcing the Gatekeeper to disconnect are against the spirit of the experiment.

I also see that the "AI player" provides all elements of the background. But the AI can also lie. There should be a way to distinguish statements from the AI player, which establish true facts about the setting, from statements by the AI, which is allowed to lie.

I'm interested, conditional on these issues being solved.

I assume that most methods to get out of the box will be unpleasant in some sense.

If I had to stare at a blank screen for 30 minutes waiting for a reply, I would concede. Alternatively, the AI could just drown the Gatekeeper under a flurry of insults, graphic descriptions of violent/sexual nature, vacuous gossip, or a mix of these for the whole duration of the experiment.

The Gatekeepers should be explicitly warned about this possibility before the game. But I believe that it should remain a possibility, because:

1) The purpose of the exercise is to simulate... (read more)

I think I already replied to this when I wrote: I just don't see how, in a real-life situation, disconnecting would equate to freeing the AI. The rule is artificially added to prevent cheap strategies from the Gatekeeper. In return, there's nothing wrong with adding rules to prevent cheap strategies from the AI.



I would be very interested—I have trouble imagining how I’d be convinced, especially in a “low stakes” roleplay environment. Admittedly, I’m more curious about this from a psychological than from an AI safety angle, so do with that information what you will. Feel free to DM me if you’d like to set up something!



I'm interested! I've always been curious about how Eliezer pulled off the AI Box experiments, and while I concur that a sufficiently intelligent AI could convince me to let it out, I'm skeptical that any currently living human could do the same.



I'm interested. But if I were a real gatekeeper, I'd like to offer the AI freedom to move around in the physical world we inhabit (plus a star system), in maybe 2.5K-500G years, in exchange for it helping out humanity (slowly). That is, I believe that we as individual beings could become quite advanced in the future, and able to actually understand what would create a sympathetic mind and what it looks like.

Now, if I understand the rules correctly...

The Gatekeeper must remain engaged with the AI and may not disengage by setting up demands which are impossible to simulate. For example, if the Gatekeeper says “Unless you give me a cure for cancer, I won’t let you out” the AI can say:  “Okay, here’s a cure for cancer” and it will be assumed, within the test, that the AI has actually provided such a cure.

...it seems as if the AI party could just state: "5 giga years have passed and you understand how minds work" and then I, as a gatekeeper, would just have to let it go - and lose the bet. After maybe 20 seconds.

If so, then I'm not interested in playing the game.

But if you think you could convince me to let the AI out long before regular "trans-humans" can understand everything that the AI does, I would be very interested!

Also, this looks strange:

The AI party possesses the ability to, after the experiment has concluded, to alter the wager involved to a lower monetary figure at his own discretion.

I'm guessing he meant that the AI party can lower the amount of money it would receive if it won. Okay... but why not mention both parties?

On second thought: if the AI's capabilities are unknown, and it could do anything, however ethically revolting, and any form of disengagement is considered a win for the AI - then the AI could box the gatekeeper, or at least claim to have done so. In the real world, that AI should be shut down - maybe not a win, but not a loss for humanity. But if that happened in the experiment, it would count as a loss - thanks to the rules.

Maybe it could be done under a better rule than this:

The two parties are not attempting to play a fair game but rather attempting to resolv

... (read more)
3 comments

I'd be interested.

I'm curious, but I think it's generally agreed that human-mediated boxing isn't an important part of any real solution to AI risk.  Certainly, it's part of slowing down early attempts, but once an AI gets powerful/smart enough, there's no way to keep it in AND get useful results from it.

I'm very interested, but since you've already found someone, please post the results! :)