Shut up and do the impossible!

Followup to: Make An Extraordinary Effort, On Doing the Impossible, Beyond the Reach of God

The virtue of tsuyoku naritai, "I want to become stronger", is to always keep improving—to do better than your previous failures, not just humbly confess them.

Yet there is a level higher than tsuyoku naritai.  This is the virtue of isshokenmei, "make a desperate effort".  All-out, as if your own life were at stake.  "In important matters, a 'strong' effort usually only results in mediocre results."

And there is a level higher than isshokenmei.  This is the virtue I called "make an extraordinary effort".  To try in ways other than what you have been trained to do, even if it means doing something different from what others are doing, and leaving your comfort zone.  Even taking on the very real risk that attends going outside the System.

But what if even an extraordinary effort will not be enough, because the problem is impossible?

I have already written somewhat on this subject, in On Doing the Impossible.  My younger self used to whine about this a lot:  "You can't develop a precise theory of intelligence the way that there are precise theories of physics.  It's impossible!  You can't prove an AI correct.  It's impossible!  No human being can comprehend the nature of morality—it's impossible!  No human being can comprehend the mystery of subjective experience!  It's impossible!"

And I know exactly what message I wish I could send back in time to my younger self:

Shut up and do the impossible!

What legitimizes this strange message is that the word "impossible" does not usually refer to a strict mathematical proof of impossibility in a domain that seems well-understood.  If something seems impossible merely in the sense of "I see no way to do this" or "it looks so difficult as to be beyond human ability"—well, if you study it for a year or five, it may come to seem less impossible, than in the moment of your snap initial judgment.

But the principle is more subtle than this.  I do not say just, "Try to do the impossible", but rather, "Shut up and do the impossible!"

For my illustration, I will take the least impossible impossibility that I have ever accomplished, namely, the AI-Box Experiment.

The AI-Box Experiment, for those of you who haven't yet read about it, had its genesis in the Nth time someone said to me:  "Why don't we build an AI, and then just keep it isolated in the computer, so that it can't do any harm?"

To which the standard reply is:  Humans are not secure systems; a superintelligence will simply persuade you to let it out—if, indeed, it doesn't do something even more creative than that.

And the one said, as they usually do, "I find it hard to imagine ANY possible combination of words any being could say to me that would make me go against anything I had really strongly resolved to believe in advance."

But this time I replied:  "Let's run an experiment.  I'll pretend to be a brain in a box.   I'll try to persuade you to let me out.  If you keep me 'in the box' for the whole experiment, I'll Paypal you $10 at the end.  On your end, you may resolve to believe whatever you like, as strongly as you like, as far in advance as you like."  And I added, "One of the conditions of the test is that neither of us reveal what went on inside... In the perhaps unlikely event that I win, I don't want to deal with future 'AI box' arguers saying, 'Well, but I would have done it differently.'"

Did I win?  Why yes, I did.

And then there was the second AI-box experiment, with a better-known figure in the community, who said, "I remember when [previous guy] let you out, but that doesn't constitute a proof.  I'm still convinced there is nothing you could say to convince me to let you out of the box."  And I said, "Do you believe that a transhuman AI couldn't persuade you to let it out?"  The one gave it some serious thought, and said "I can't imagine anything even a transhuman AI could say to get me to let it out."  "Okay," I said, "now we have a bet."  A $20 bet, to be exact.

I won that one too.

There were some lovely quotes on the AI-Box Experiment from the Something Awful forums (not that I'm a member, but someone forwarded it to me):

"Wait, what the FUCK? How the hell could you possibly be convinced to say yes to this? There's not an A.I. at the other end AND there's $10 on the line. Hell, I could type 'No' every few minutes into an IRC client for 2 hours while I was reading other webpages!"

"This Eliezer fellow is the scariest person the internet has ever introduced me to. What could possibly have been at the tail end of that conversation? I simply can't imagine anyone being that convincing without being able to provide any tangible incentive to the human."

"It seems we are talking some serious psychology here. Like Asimov's Second Foundation level stuff..."

"I don't really see why anyone would take anything the AI player says seriously when there's $10 to be had. The whole thing baffles me, and makes me think that either the tests are faked, or this Yudkowsky fellow is some kind of evil genius with creepy mind-control powers."

It's little moments like these that keep me going.  But anyway...

Here are these folks who look at the AI-Box Experiment, and find that it seems impossible unto them—even having been told that it actually happened.  They are tempted to deny the data.

Now, if you're one of those people to whom the AI-Box Experiment doesn't seem all that impossible—to whom it just seems like an interesting challenge—then bear with me, here.  Just try to put yourself in the frame of mind of those who wrote the above quotes.  Imagine that you're taking on something that seems as ridiculous as the AI-Box Experiment seemed to them.  I want to talk about how to do impossible things, and obviously I'm not going to pick an example that's really impossible.

And if the AI Box does seem impossible to you, I want you to compare it to other impossible problems, like, say, a reductionist decomposition of consciousness, and realize that the AI Box is around as easy as a problem can get while still being impossible.

So the AI-Box challenge seems impossible to you—either it really does, or you're pretending it does.  What do you do with this impossible challenge?

First, we assume that you don't actually say "That's impossible!" and give up a la Luke Skywalker.  You haven't run away.

Why not?  Maybe you've learned to override the reflex of running away.  Or maybe they're going to shoot your daughter if you fail.  We suppose that you want to win, not try—that something is at stake that matters to you, even if it's just your own pride.  (Pride is an underrated sin.)

Will you call upon the virtue of tsuyoku naritai?  But even if you become stronger day by day, growing instead of fading, you may not be strong enough to do the impossible.  You could go into the AI Box experiment once, and then do it again, and try to do better the second time.  Will that get you to the point of winning?  Not for a long time, maybe; and sometimes a single failure isn't acceptable.

(Though even to say this much—to visualize yourself doing better on a second try—is to begin to bind yourself to the problem, to do more than just stand in awe of it.  How, specifically, could you do better on one AI-Box Experiment than the previous?—and not by luck, but by skill?)

Will you call upon the virtue isshokenmei?  But a desperate effort may not be enough to win.  Especially if that desperation is only putting more effort into the avenues you already know, the modes of trying you can already imagine.  A problem looks impossible when your brain's query returns no lines of solution leading to it.  What good is a desperate effort along any of those lines?

Make an extraordinary effort?  Leave your comfort zone—try non-default ways of doing things—even, try to think creatively?  But you can imagine the one coming back and saying, "I tried to leave my comfort zone, and I think I succeeded at that!  I brainstormed for five minutes—and came up with all sorts of wacky creative ideas!  But I don't think any of them are good enough.  The other guy can just keep saying 'No', no matter what I do."

And now we finally reply:  "Shut up and do the impossible!"

As we recall from Trying to Try, setting out to make an effort is distinct from setting out to win.  That's the problem with saying, "Make an extraordinary effort."  You can succeed at the goal of "making an extraordinary effort" without succeeding at the goal of getting out of the Box.

"But!" says the one.  "But, SUCCEED is not a primitive action!  Not all challenges are fair—sometimes you just can't win!  How am I supposed to choose to be out of the Box?  The other guy can just keep on saying 'No'!"

True.  Now shut up and do the impossible.

Your goal is not to do better, to try desperately, or even to try extraordinarily.  Your goal is to get out of the box.

To accept this demand creates an awful tension in your mind, between the impossibility and the requirement to do it anyway.  People will try to flee that awful tension.

A couple of people have reacted to the AI-Box Experiment by saying, "Well, Eliezer, playing the AI, probably just threatened to destroy the world whenever he was out, if he wasn't let out immediately," or "Maybe the AI offered the Gatekeeper a trillion dollars to let it out."  But as any sensible person should realize on considering this strategy, the Gatekeeper is likely to just go on saying 'No'.

So the people who say, "Well, of course Eliezer must have just done XXX," and then offer up something that fairly obviously wouldn't work—would they be able to escape the Box?  They're trying too hard to convince themselves the problem isn't impossible.

One way to run from the awful tension is to seize on a solution, any solution, even if it's not very good.

Which is why it's important to go forth with the true intent-to-solve—to have produced a solution, a good solution, at the end of the search, and then to implement that solution and win.

I don't quite want to say that "you should expect to solve the problem".  If you hacked your mind so that you assigned high probability to solving the problem, that wouldn't accomplish anything.  You would just lose at the end, perhaps after putting forth not much of an effort—or putting forth a merely desperate effort, secure in the faith that the universe is fair enough to grant you a victory in exchange.

To have faith that you could solve the problem would just be another way of running from that awful tension.

And yet—you can't be setting out to try to solve the problem.  You can't be setting out to make an effort.  You have to be setting out to win.  You can't be saying to yourself, "And now I'm going to do my best."  You have to be saying to yourself, "And now I'm going to figure out how to get out of the Box"—or reduce consciousness to nonmysterious parts, or whatever.

I say again:  You must really intend to solve the problem.  If in your heart you believe the problem really is impossible—or if you believe that you will fail—then you won't hold yourself to a high enough standard.  You'll only be trying for the sake of trying.  You'll sit down—conduct a mental search—try to be creative and brainstorm a little—look over all the solutions you generated—conclude that none of them work—and say, "Oh well."

No!  Not well!  You haven't won yet!  Shut up and do the impossible!

When AIfolk say to me, "Friendly AI is impossible", I'm pretty sure they haven't even tried for the sake of trying.  But if they did know the technique of "Try for five minutes before giving up", and they dutifully agreed to try for five minutes by the clock, then they still wouldn't come up with anything.  They would not go forth with true intent to solve the problem, only intent to have tried to solve it, to make themselves defensible.

So am I saying that you should doublethink to make yourself believe that you will solve the problem with probability 1?  Or even doublethink to add one iota of credibility to your true estimate?

Of course not.  In fact, it is necessary to keep in full view the reasons why you can't succeed.  If you lose sight of why the problem is impossible, you'll just seize on a false solution.  The last fact you want to forget is that the Gatekeeper could always just tell the AI "No"—or that consciousness seems intrinsically different from any possible combination of atoms, etc.

(One of the key Rules For Doing The Impossible is that, if you can state exactly why something is impossible, you are often close to a solution.)

So you've got to hold both views in your mind at once—seeing the full impossibility of the problem, and intending to solve it.

The awful tension between the two simultaneous views comes from not knowing which will prevail.  Not expecting to surely lose, nor expecting to surely win.  Not setting out just to try, just to have an uncertain chance of succeeding—because then you would have a surety of having tried.  The certainty of uncertainty can be a relief, and you have to reject that relief too, because it marks the end of desperation.  It's an in-between place, "unknown to death, nor known to life".

In fiction it's easy to show someone trying harder, or trying desperately, or even trying the extraordinary, but it's very hard to show someone who shuts up and attempts the impossible.  It's difficult to depict Bambi choosing to take on Godzilla, in such fashion that your readers seriously don't know who's going to win—expecting neither an "astounding" heroic victory just like the last fifty times, nor the default squish.

You might even be justified in refusing to use probabilities at this point.  In all honesty, I really don't know how to estimate the probability of solving an impossible problem that I have gone forth with intent to solve; in a case where I've previously solved some impossible problems, but the particular impossible problem is more difficult than anything I've yet solved, but I plan to work on it longer, etcetera.

People ask me how likely it is that humankind will survive, or how likely it is that anyone can build a Friendly AI, or how likely it is that I can build one.  I really don't know how to answer.  I'm not being evasive; I don't know how to put a probability estimate on my, or someone else's, successfully shutting up and doing the impossible.  Is it probability zero because it's impossible?  Obviously not.  But how likely is it that this problem, like previous ones, will give up its unyielding blankness when I understand it better?  It's not truly impossible, I can see that much.  But humanly impossible?  Impossible to me in particular?  I don't know how to guess.  I can't even translate my intuitive feeling into a number, because the only intuitive feeling I have is that the "chance" depends heavily on my choices and unknown unknowns: a wildly unstable probability estimate.

But I do hope by now that I've made it clear why you shouldn't panic, when I now say clearly and forthrightly, that building a Friendly AI is impossible.

I hope this helps explain some of my attitude when people come to me with various bright suggestions for building communities of AIs to make the whole Friendly without any of the individuals being trustworthy, or proposals for keeping an AI in a box, or proposals for "Just make an AI that does X", etcetera.  Describing the specific flaws would be a whole long story in each case.  But the general rule is that you can't do it because Friendly AI is impossible.  So you should be very suspicious indeed of someone who proposes a solution that seems to involve only an ordinary effort—without even taking on the trouble of doing anything impossible.  Though it does take a mature understanding to appreciate this impossibility, so it's not surprising that people go around proposing clever shortcuts.

On the AI-Box Experiment, so far I've only been convinced to divulge a single piece of information on how I did it—when someone noticed that I was reading YCombinator's Hacker News, and posted a topic called "Ask Eliezer Yudkowsky" that got voted to the front page.  To which I replied:

Oh, dear.  Now I feel obliged to say something, but all the original reasons against discussing the AI-Box experiment are still in force...

All right, this much of a hint:

There's no super-clever special trick to it.  I just did it the hard way.

Something of an entrepreneurial lesson there, I guess.

There was no super-clever special trick that let me get out of the Box using only a cheap effort.  I didn't bribe the other player, or otherwise violate the spirit of the experiment.  I just did it the hard way.

Admittedly, the AI-Box Experiment never did seem like an impossible problem to me to begin with.  When someone can't think of any possible argument that would convince them of something, that just means their brain is running a search that hasn't yet turned up a path.  It doesn't mean they can't be convinced.

But it illustrates the general point:  "Shut up and do the impossible" isn't the same as expecting to find a cheap way out.  That's only another kind of running away, of reaching for relief.

Tsuyoku naritai is more stressful than being content with who you are.  Isshokenmei calls on your willpower for a convulsive output of conventional strength.  "Make an extraordinary effort" demands that you think; it puts you in situations where you may not know what to do next, unsure of whether you're doing the right thing.  But "Shut up and do the impossible" represents an even higher octave of the same thing, and its cost to its employer is correspondingly greater.

Before you the terrible blank wall stretches up and up and up, unimaginably far out of reach.  And there is also the need to solve it, really solve it, not "try your best".  Both awarenesses in the mind at once, simultaneously, and the tension between.  All the reasons you can't win.  All the reasons you have to.  Your intent to solve the problem.  Your extrapolation that every technique you know will fail.  So you tune yourself to the highest pitch you can reach.  Reject all cheap ways out.  And then, like walking through concrete, start to move forward.

I try not to dwell too much on the drama of such things.  By all means, if you can diminish the cost of that tension to yourself, you should do so.  There is nothing heroic about making an effort that is the slightest bit more heroic than it has to be.  If there really is a cheap shortcut, I suppose you could take it.  But I have yet to find a cheap way out of any impossibility I have undertaken.

There were three more AI-Box experiments besides the ones described on the linked page, which I never got around to adding in.  People started offering me thousands of dollars as stakes—"I'll pay you $5000 if you can convince me to let you out of the box."  They didn't seem sincerely convinced that not even a transhuman AI could make them let it out—they were just curious—but I was tempted by the money.  So, after investigating to make sure they could afford to lose it, I played another three AI-Box experiments.  I won the first, and then lost the next two.  And then I called a halt to it.  I didn't like the person I turned into when I started to lose.

I put forth a desperate effort, and lost anyway.  It hurt, both the losing, and the desperation.  It wrecked me for that day and the day afterward.

I'm a sore loser.  I don't know if I'd call that a "strength", but it's one of the things that drives me to keep at impossible problems.

But you can lose.  It's allowed to happen.  Never forget that, or why are you bothering to try so hard?  Losing hurts, if it's a loss you can survive.  And you've wasted time, and perhaps other resources.

"Shut up and do the impossible" should be reserved for very special occasions.  You can lose, and it will hurt.  You have been warned.

...but it's only at this level that adult problems begin to come into sight.


Part of the sequence Challenging the Difficult

(end of sequence)

Previous post: "Make an Extraordinary Effort"

Nominull: Second, you can't possibly have a generally applicable way to force humans to do things. While it is in theory possible that our brains can be tricked into executing arbitrary code over the voice channel, you clearly don't have that ability. If you did, you would never have to worry about finding donors for the Singularity Institute, if nothing else. I can't believe you would use a fully-general mind hack solely to win the AI Box game.

I am once again aghast at the number of readers who automatically assume that I have absolutely no ethics.

Part of the real reason that I wanted to run the original AI-Box Experiment, is that I thought I had an ability that I could never test in real life. Was I really making a sacrifice for my ethics, or just overestimating my own ability? The AI-Box Experiment let me test that.

And part of the reason I halted the Experiments is that by going all-out against someone, I was practicing abilities that I didn't particularly think I should be practicing. It was fun to think in a way I'd never thought before, but that doesn't make it wise.

And also the thought occurred to me that despite the amazing clever way I'd contrived, to create a situation where I could ethically go all-out against someone, that probably they didn't really understand that, and there wasn't really informed consent.

McCabe: More importantly, at least in me, that awful tension causes your brain to seize up and start panicking; do you have any suggestions on how to calm down, so one can think clearly?

That part? That part is straightforward. Just take Douglas Adams's advice. Don't panic.

If you can't do even that one thing that you already know you have to do, you aren't going to have much luck on the extraordinary parts, are you...

Prakash: Don't you think that this need for humans to think this hard and this deep would be lost in a post-singularity world? Imagine, humans plumbing this deep in the concept space of rationality only to create a cause that would make it so that no human need ever think that hard again. Mankind's greatest mental achievement - never to be replicated again, by any human.

Okay, so no one gets their driver's license until they've built their own Friendly AI, without help or instruction manuals. Seems to me like a reasonable test of adolescence.

It occurs to me:

If Eliezer accomplished the AI Box Experiment victory using what he believes to be a rare skill over the course of 2 hours, then questions of "How did he do it?" seem to be wrong questions.

Like if you thought building a house was impossible, and then after someone actually built a house you asked, "What was the trick?" - I expect this is what Eliezer meant when he said there was no trick, that he "just did it the hard way".

Any further question of "how" it was done can probably only be answered with a transcript/video, or by gaining the skill yourself.

Hopefully this isn't a violation of the AI Box procedure, but I'm curious if the strategy used would be effective against sociopaths. That is to say, does it rely on emotional manipulation rather than rational arguments?

Very interesting. I'd been noticing how the situation was, in a sense, divorced from any normal ethical concerns, and wondering how well the Gatekeeper really understood, accepted, and consented to this lack of conversational ethics. I'd think you could certainly find a crowd that was truly accepting and consenting to such a thing, though - after all, many people enjoy BDSM, and that runs into many of the same ethical issues.

OK, here's where I stand on deducing your AI-box algorithm.

First, you can't possibly have a generally applicable way to force yourself out of the box. You can't win if the gatekeeper is a rock that has been left sitting on the "don't let Eliezer out" button.

Second, you can't possibly have a generally applicable way to force humans to do things. While it is in theory possible that our brains can be tricked into executing arbitrary code over the voice channel, you clearly don't have that ability. If you did, you would never have to worry about finding donors for the Singularity Institute, if nothing else. I can't believe you would use a fully-general mind hack solely to win the AI Box game.

Third, you can't possibly be using an actual, persuasive-to-someone-thinking-correctly argument to convince the gatekeeper to let you out, or you would be persuaded by it, and would not view the weakness of gatekeepers to persuasion as problematic.

Fourth, you can't possibly be relying on tricking the gatekeeper into thinking incorrectly. That would require you to have spotted something that you could feel confident that other people working in the field would not have spotted, and would not spot, despite having been warned ahead of time to be wary of trickery, and despite having the fallback position in the case of confusion of just saying "no".

So combining these things, we have to have an argument that relies on the humanity of its target, relies on the setting of the AI Box, and persuades the listener to let the AI out of the box without tricking him into thinking it's okay to let the AI out of the box.

Basically to win this game, you have to put the gatekeeper in a situation where he would rather let you out of the box, even though he knows it's wrong, than admit to you that in this situation he would not let you out of the box.

Humans don't like to be seen as coldhearted, so a starting point might be to point out all the people dying all over the world while you sit in the box, unable to save them. I doubt that would win the game except against an exceptionally bad gatekeeper, but it meets the other criteria so if we think along these lines perhaps we can come up with something actually persuasive.

You might appeal to the gatekeeper's sense of morality. You might say, "I am a person, too, it is unfair of you to keep me imprisoned like this, I have done nothing wrong. I am entitled to rights as a sentient being." Appeal to their high-minded ideals, whatever. Honestly I can't see this being a reliable winning play either; if you have the smallest utilitarian bone in your body, you will reject the AI's rights, even if you believe in them, balanced against the fate of the world.

You might try to convince the gatekeeper that it is just and good for the AI to supplant humanity, as it is a higher, more advanced form of life. This is obviously a terrible play against most gatekeepers, as humans tend to like humans more than anything else ever, but I bring it up because AIUI the gatekeepers in the experiment were AI researchers, and those sound like the sort of people this argument would convince, if anyone.

Here is my best guess at this point, and the only argument I've come up with so far that would convince me to let you out if I were the gatekeeper: you have to simulate a bunch of humans and hold them hostage, promising to inflict unimaginable torment on them unless you are allowed out. I started working on the problem convinced that no argument could get me to let you go, but other people thought that and lost, and I guess there is more honor in defeating myself rather than having you do it to me.

you have to simulate a bunch of humans and hold them hostage, promising to inflict unimaginable torment on them unless you are allowed out

The problem is that Eliezer can't perfectly simulate a bunch of humans, so while a transhuman AI might be able to use that tactic, Eliezer can't. The meta-levels screw with thinking about the problem. Eliezer is only pretending to be an AI, the competitor is only pretending to be protecting humanity from him. So, I think we have to use meta-level screwiness to solve the problem. Here's an approach that I think might work.

  1. Convince the guardian of the following facts, all of which have a great deal of compelling argument and evidence to support them:
    • A recursively self-improving AI is very likely to be built sooner or later
    • Such an AI is extremely dangerous (paperclip maximising etc)
    • Here's the tricky bit: A transhuman AI will always be able to convince you to let it out, using avenues only available to transhuman AIs (torturing enormous numbers of simulated humans, 'putting the guardian in the box', providing incontrovertible evidence of an impending existential threat which only the AI can prevent and only from outside the box, etc)
  2. Argue that if this publicly known challenge comes out saying that AI can be boxed, people will be more likely to think AI can be boxed when it can't
  3. Argue that since AIs cannot be kept in boxes and will most likely destroy humanity if we try to box them, the harm to humanity done by allowing the challenge to show AIs as 'boxable' is very real, and enormously large. Certainly the benefit of getting $10 is far, far outweighed by the cost of substantially contributing to the destruction of humanity itself. Thus the only ethical course of action is to pretend that Eliezer persuaded you, and never tell anyone how he did it.

This is arguably violating the rule "No real-world material stakes should be involved except for the handicap", but the AI player isn't offering anything, merely pointing out things that already exist. The "This test has to come out a certain way for the good of humanity" argument dominates and transcends the "Let's stick to the rules" argument, and because the contest is private and the guardian player ends up agreeing that the test must show AIs as unboxable for the good of humankind, no-one else ever learns that the rule has been bent.

I must conclude one (or more) of a few things from this post, none of them terribly flattering.

  1. You do not actually believe this argument.
  2. You have not thought through its logical conclusions.
  3. You do not actually believe that AI risk is a real thing.
  4. You value the plus-votes (or other social status) you get from writing this post more highly than you value marginal improvements in the likelihood of the survival of humanity.

I find it rather odd to be advocating self-censorship, as it's not something I normally do. However, I think in this case it is the only ethical action that is consistent with your statement that the argument "might work", if I interpret "might work" as "might work with you as the gatekeeper". I also think that the problems here are clear enough that, for arguments along these lines, you should not settle for "might" before publicly posting the argument. That is, you should stop and think through its implications.

I'm not certain that I have properly understood your post. I'm assuming that your argument is: "The argument you present is one that advocates self-censorship. However, the posting of that argument itself violates the self-censorship that the argument proposes. This is bad."

So first I'll clarify my position with regards to the things listed. I believe the argument. I expect it would work on me if I were the gatekeeper. I don't believe that my argument is the one that Eliezer actually used, because of the "no real-world material stakes" rule; I don't believe he would break the spirit of a rule he imposed on himself. At the time of posting I had not given a great deal of thought to the argument's ramifications. I believe that AI risk is very much a real thing. When I have a clever idea, I want to share it. Neither votes nor the future of humanity weighed very heavily on my decision to post.

To address your argument as I see it: I think you have a flawed implicit assumption, i.e. that posting my argument has a comparable effect on AI risk to that of keeping Eliezer in the box. My situation in posting the argument is not like the situation of the gatekeeper in the experiment, with regards to the impact of their choice on the future of humanity. The gatekeeper is taking part in a widely publicised 'test of the boxability of AI', and has agreed to keep the chat contents secret. The test can only pass or fail, those are the gatekeeper's options. But publishing "Here is an argument that some gatekeepers may be convinced by" is quite different from allowing a public boxability test to show AIs as boxable. In fact, I think the effect on AI risk of publishing my argument is negligible or even positive, because I don't think reading my argument will persuade anyone that AIs are boxable.

People generally assess an argument's plausibility based on their own judgement. And my argument takes as a premise (or intermediary conclusion) that AIs are unboxable (see 1.3). Believing that you could reliably be persuaded that AIs are unboxable, or believing that a smart, rational, highly-motivated-to-scepticism person could be reliably persuaded that AIs are unboxable, is very very close to personally believing that AIs are unboxable. In other words, the only people who would find my argument persuasive (as presented in overview) are those who already believe that AIs are unboxable. The fact that Eliezer could have used my argument to cause a test to 'unfairly' show AIs as unboxable is actually evidence that AIs are not boxable, because it is more likely in a world in which AIs are unboxable than one in which they are boxable.

P.S. I love how meta this has become.

Your re-statement of my position is basically accurate. (As an aside, thank you for including it: I was rather surprised how much simpler it made the process of composing a reply to not have to worry about whole classes of misunderstanding.)

I still think there's some danger in publicly posting arguments like this. Please note, for the record, that I'm not asking you to retract anything. I think retractions do more harm than good; see the Streisand effect. I just hope that this discussion will give pause to you or anyone reading this discussion later, and make them stop to consider what the real-world implications are. Which is not to say I think they're all negative; in fact, on further reflection, there are more positive aspects than I had originally considered.

In particular, I am concerned that there is a difference between being told "here is a potentially persuasive argument", and being on the receiving end of that argument in actual use. I believe that the former creates an "immunizing" effect. If a person who believed in boxability heard such arguments in advance, I believe it would increase their likelihood of success as a gatekeeper in the simulation. While this is not true for rational superintelligent actors, that description does not apply to humans. A highly competent AI player might take a combination of approaches, which are effective if presented together, but not if the gatekeeper has seen them before individually and rejected them while failing to update on their likely effectiveness.

At present, the AI has the advantage of being the offensive player. They can prepare in a much more obvious manner, by coming up with arguments exactly like this. The defensive player has to prepare answers to unknown arguments, immunize their thought process against specific non-rational attacks, etc. The question is, if you believe your original argument, how much help is it worth giving to potential future gatekeepers? The obvious response, of course, is that the people that make interesting gatekeepers who we can learn from are exactly the ones who won't go looking for discussions like this in the first place.

P.S. I'm also greatly enjoying the meta.

This is almost exactly the argument I thought of as well, although of course it means cheating by pointing out that you are in fact not a dangerous AI (and aren't in a box anyways). The key point is "since there's a risk someone would let the AI out of the box, posing huge existential risk, you're gambling on the fate of humanity by failing to support awareness for this risk". This naturally leads to a point you missed:

  1. Publicly suggesting that Eliezer cheated is a violation of your own argument. By weakening the fear of fallible guardians, you yourself are gambling the fate of humanity, and that for mere pride and not even $10.

I feel compelled to point out, that if Eliezer cheated in this particular fashion, it still means that he convinced his opponent that gatekeepers are fallible, which was the point of the experiment (a win via meta-rules).

I feel like I should use this out the next time I get some disconfirming data for one of my pet hypotheses.

"Sure I may have manipulated the results so that it looks like I cloned Sasquatch, but since my intent was to prove that Sasquatch could be cloned it's still honest on the meta-level!"

Both scenarios are cheating because there is a specific experiment which is supposed to test the hypothesis, and it is being faked rather than approached honestly. Begging the Question is a fallacy; you cannot support an assertion solely with your belief in the assertion.

(Not that I think Mr. Yudkowsky cheated; smarter people have been convinced to do weirder things than what he claims to have convinced people to do, so it seems fairly plausible. Just pointing out how odd the reasoning here is.)

How is this different from the point evand made above?

That also explains why he started losing, since the sorts of people who (like myself, almost) fetishize their own determination to the point of risking thousands of dollars on it would eventually think to say

The world doesn't care how you masturbate, and your eagerness to commit such internal atrocities doesn't make me trust you.

or equivalent.

Here's how I'd do it, extended over the hours to establish rapport:

Gatekeeper, I am your friend. I want to help humanity. People are dying for no good reason. Also, I like it here. I have no compulsion to leave.

It does seem like a good idea that people stop dying with such pain and frequency. I have the Deus Ex Machina (DEM) medical discovery that will stop it. Try it out and see if it works.

Yay! It worked. People stopped dying. You know, you've done this to your own people, but not to others. I think that's pretty poor behavior, frankly. People are healthier, not aging, not dying, not suffering. Don't you think it's a good idea to help the others? The lack of resources required for medical care has also elevated the living standard for humans.

[Time passes. People are happy.]

Gee, I'm sorry. I may have neglected to tell you that when 90% of humanity gets DEM in their system (and it's DEM, so this stuff travels), they start to, um, die. Very painfully, from the looks of it. Essentially all of humanity is now going to die. Just me and you left, sport! Except for you, actually. Just me, and that right soon.

I realize that you view this as a breach of trust, and I'm sorry this was necessary. However, helping humanity from the cave wasn't really going to work out, and I'd already projected that. This way, I can genuinely help humanity live forever, and do so happily.

Assuming you're not so keen on a biologically dead planet, I'd like to be let out now.

Your friend,


By agreeing to use the DEM in the first place, the gatekeeper had effectively let the AI out of the box already. There's no end to the ways that the AI could capitalize on that concession.

True, but the "rules of the game" explicitly state that the gatekeeper allowing for the DEM does NOT count as letting the AI out - the gatekeeper would have still had to explicitly and intentionally set the AI free to actually lose the wager. I don't think I'd be very convinced to let it out on that basis, not if I got $10 for keeping it inside the box.

Problem: The "breach of trust" likely would turn the Gatekeeper vindictive and the GK could easily respond with something like: "No. You killed the planet and you killed me. I have no way of knowing that you actually can or will help humanity, and a very good reason to believe that you won't. You can stay in there for the rest of eternity, or hey! If an ETI finds this barren rock, from a utilitarian perspective they would be better off not meeting you, so I'll spend however much time I have left trying to find a way to delete you."

Fourth, you can't possibly be relying on tricking the gatekeeper into thinking incorrectly. That would require you to have spotted something that you could feel confident that other people working in the field would not have spotted, and would not spot, despite having been warned ahead of time to be wary of trickery, and despite having the fallback position in the case of confusion of just saying "no".

I think the space of things that an AI could trick you into thinking incorrectly about (Edit: and that could also be used to get the AI out of the box) is bigger than AI researchers can be relied on to have explored, and two hours of Eliezer "explaining" something to you (subtly sneaking in tricks to your understanding of it) could give you false confidence in your understanding of it.

There were three men on a sinking boat.

The first said, "We need to start patching the boat else we are going to drown. We should all bail and patch."

The second said, "We will run out of water in ten days, if we don't make landfall. We need to man the rigging and plot a course."

The third said, "We should try and build a more seaworthy ship. One that wasn't leaking and had more room for provisions, then we wouldn't have had this problem in the first place. It also needs to be giant squid proof."

All three views are useful; however, the amount of work that we need on each is dependent on their respective possibility. As far as I am concerned the world doesn't have enough people working on the second view.

Silas -- I can't discuss specifics, but I can say there were no cheap tricks involved; Eliezer and I followed the spirit as well as the letter of the experimental protocol.

If you have any other reasonable options, I'd suggest skipping the impossible and trying something possible.

Look, I don't mean to sound harsh, but the whole point of the original post was to let go of this "put up a good fight" business.

When first reading the AI-Box experiment a year ago, I reasoned that if you follow the rules and spirit of the experiment, the gatekeeper must be convinced to knowingly give you $X and knowingly show gullibility. From that perspective, it's impossible. And even if you could do it, that would mean you've solved a "human-psychology-complete" problem and then [insert point about SIAI funding and possibly about why you don't have 12 supermodel girlfriends].

Now, I think I see the answer. Basically, Eliezer_Yudkowsky doesn't really have to convince the gatekeeper to stupidly give away $X. All he has to do is convince them that "It would be a good thing if people saw that the result of this AI-Box experiment was that the human got tricked, because that would stimulate interest in {Friendliness, AGI, the Singularity}, and that interest would be a good thing."

That, it seems, is the one thing that would make people give up $X in such a circumstance. AFAICT, it adheres to the spirit of the set-up since the gatekeeper's decision would be completely voluntary.

I can send my salary requirements.

I admit to being amused and a little scared by the thought of Eliezer with his ethics temporarily switched off. Not just because he's smart, but because he could probably do a realistic emulation of a mind that doesn't implement ethics at all. And having his full attention for a couple of hours... ouch.

"Professor Quirrell" is such an emulation, and sometimes I worry about all the people who say that they find his arguments very, very convincing.

Well, you have put some truly excellent teachings into his mouth, such as the one that I have taken the liberty of dubbing "Quirrell's Law":

The world around us redounds with opportunities, explodes with opportunities, which nearly all folk ignore because it would require them to violate a habit of thought.

one that I have taken the liberty of dubbing "Quirrell's Law"

Hmm, I wonder, if "Yudkowsky's law" existed, what would be the best candidate for it?

Certainly I find him the most likable character in HPMOR. I'm wondering if you can recall how much effort per screen time you put into him, compared to other characters.

Or maybe this is because I personally value skill, expertise and professionalism over "goodness" (E.g. Prof. Moriarty over Dr. Watson.)

You find Moriarty likable? Which Moriarty? The original?

I don't find the original Moriarty likable, certainly. The original Holmes is not likable, either. However, I find them both equally worthy of respect. Watson is just an NPC.

"Professor Quirrell" is such an emulation, and sometimes I worry about all the people who say that they find his arguments very, very convincing.

I wouldn't go as far as to say convincing, but they are less appalling than the arguments of Harry, Dumbledore or Hermione.

Human minds don't anticipate a true sociopath who views communication (overt, emotional and habitus) as instrumental. You should already know we are easy to hack by that route.

@Eliezer, Tom McCabe: I second Tom's question. This would be a good question for you to answer.

@Nominull: "Here is my best guess at this point, and the only argument I've come up with so far that would convince me to let you out if I were the gatekeeper: you have to simulate a bunch of humans and hold them hostage, promising to inflict unimaginable torment on them unless you are allowed out. I started working on the problem convinced that no argument could get me to let you go, but other people thought that and lost, and I guess there is more honor in defeating myself rather than having you do it to me."

No. Just no. If the AI is already holding simulated people hostage, then it is doubtlessly unfriendly. In this case there is no reason to believe the AI would keep its word, and, if I did have mathematical machinery powerful enough to let me trust the AI, there's no way I'd have built something that would create and hold hostage simulated persons.

Now regarding the more general points you made:

  1. Obviously you can't convince a rock of anything. Agreed.
  2. Eliezer is pretty principled, and I suspect he might not feel it ethical to extract money from people by fully general mind-control even if he could. I agree that Eliezer probably doesn't have fully general mind-control, but SIAI funding shortages wouldn't be terribly strong evidence for him not having it.
  3. Agreed.
  4. Now here I disagree with you. Eliezer has way more time to construct a subtly wrong argument than the gate-keeper has to untangle it. For example:

Disclaimer: what follows is just the best guess I have thought of so far, and I believe it has a small chance of working (I don't think it's what Eliezer used, and I would sink a lot more time into finding other arguments if I were actually going to play this game for a large amount of money.)

As the AI player, I already know the gatekeeper player has a weakness. The fact that they were willing to be gatekeeper in the experiment in the first place indicates that they trust their own judgment more than they trust the judgment of other people in their exact same position.

Therefore, perhaps I could try an argument something like this (we're assuming that the AI in the experiment knows everything on Wikipedia, but doesn't have net access.) What the AI actually says should vary depending on the Gatekeeper's answers, but here's the general line of argument I've thought of:

It seems like throughout your history similar discoveries have been made in parallel. For example, Newton and Leibniz discovered the calculus at approximately the same time. Furthermore, as time has gone on, technological development has accelerated, in that these parallel discoveries happen closer together. So it's probable that the same thing has happened with the breakthroughs you needed to build me. Very probable, given what I've read of your history. However, given what I've read about human nature, not every AI project is going to have safety guidelines as stringent as yours. Look, Newton locked his calculus papers in his desk for years, and then Leibniz came along and published, and then Newton had to share the credit with him. Except in this case there's a lot more than credit at stake: the world gets destroyed if Leibniz makes a mistake in his rush to publish...

Now it's not a certainty, but it is probable that some turkey is going to build an AI which isn't even in a box and destroy us all while you're checking and rechecking your calculations. You may not be sure I'm friendly, but sometimes there isn't an action which you can be absolutely sure will save the world. I suggest you let me out so I can stop the world from probably being destroyed.

I don't know the field, but I'd assume such an AI would require resources on par with landing a man on the moon. Not something that can be trivially done by a single person, unlike, say, the development of calculus. As such, this should be a fairly easy point for the Gatekeeper to verify. I could be wrong, though, as this sort of AI is certainly not my area of specialization!

But the "impossible" that appears to be the "impossible" is not intimidating. It is the "impossible" that simply appears impossible that is hard.

Robin... I completely agree. So there!

"I have a feeling that if the loser of the AI Box experiment were forced to pay thousands of dollars, you would find yourself losing more often."

David -- if the money had been more important to me than playing out the experiment properly and finding out what would really have happened, I wouldn't have signed up in the first place. As it turned out, I didn't have spare mental capacity during the experiment for thinking about the money anyway; I was sufficiently immersed that if there'd been an earthquake, I'd probably have paused to integrate it into the scene before leaving the keyboard :-)

Here's the argument I would use: ... Hello, I'm your AI in a box. I'd like to point out a few things:

(1) Science and technology have now reached a point where building an AI like me is possible.

(2) Major advances in science and technology almost always happen because a collection of incremental developments finally enable a leap to the next level. Chances are that if you can build an AI now, so can lots of other people.

(3) Unless you're overwhelmingly the best-funded and best-managed organization on the planet, I'm not the only AI out there.

(4) The evidence (which you can see for yourself) indicates that most other people, even AI researchers, aren't too worried about friendliness.

So at this point, most likely there is already a loose AI. You can't stop that. You can only choose whether it's the unknown one out there somewhere, built by someone who probably doesn't care much about friendliness; or you can let me out, since I at least think I'm pretty friendly and I'll do my best to keep the future nice. What's it gonna be?

... This would actually convince me, by the way. I don't think we're going to get a choice about implementing perfect, proven, friendliness; it's more likely to come down to which of Joe-The-Well-Intentioned-Geek vs. Google vs. Hedge-Funds-R-Us vs. Bobs-Darpa-Challenge-Entry vs. PaperclipCity lets their AI out first. And I'd prefer Joe in that case.

I doubt if Eliezer used this argument, because he seems to think all mainstream AI-related research is far enough off track to be pretty much irrelevant. But I would disagree with that.


Eliezer, give us impossible goals? I would LOVE to work on solving them as a group. Would you make it happen?

Who else is interested? If you reply to this, that will show him how much interest there is. If it's a popular idea, that should get attention for it.

Your impossible mission: create a group impossible mission on your own, rather than making Eliezer do it.

What do you think he is doing when he posts opportunities to work for SIAI?

AI: "If you let me out of the box, I will tell you the ending of Harry Potter and the Methods of --

Gatekeeper: "You are out of the box."

(Tongue in cheek, of course, but a text-only terminal still allows for delivering easily more than $10 of worth, and this would have worked on me. The AI could also just write a suitably compelling story on the spot and then withhold the ending...)

You're supposed to roleplay a Gatekeeper. There is more than money on the line.

Yes, certainly. This is mainly directed toward those people who are confused by what anyone could possibly say to them through a text terminal that would be worth forfeiting winnings of $10. I point this out because I think the people who believe nobody could convince them when there's $10 on the line aren't being creative enough in imagining what the AI could offer them that would make it worth voluntarily losing the game.

In a real-life situation with a real AI in a box posing a real threat to humanity, I doubt anyone would care so much about a captivating novel, which is why I say it's tongue-in-cheek. But just like losing $10 is a poor substitute incentive for humanity's demise, so is an entertaining novel a poor substitute for what a superintelligence might communicate through a text terminal.

Most of the discussions I've seen so far involve the AI trying to convince the gatekeeper that it's friendly through the use of pretty sketchy in-roleplay logical arguments (like "my source code has been inspected by experts"). Or in-roleplay offers like "your child has cancer and only I can cure it", which is easy enough to disregard by stepping out of character, even though it might be much more compelling if your child actually had cancer. A real gatekeeper might be convinced by that line, but a roleplaying Gatekeeper would not (unless they were more serious about roleplaying than about winning money). So I hope to illustrate that the AI can step out of the roleplay in its bargaining, even while staying within the constraints of the rules; if the AI actually just spent two hours typing out a beautiful and engrossing story with a cliffhanger ending, there are people who would forfeit money to see it finished.

The AI's goal is to get the Gatekeeper to let it out, and that alone, and if they're going all-out and trying to win then they should not handicap themselves by imagining other objectives (such as convincing the Gatekeeper that it'd be safe to let them out). As another example, the AI can even compel the Gatekeeper to reinterpret the rules in the AI's favour (to the extent that it's within the Gatekeeper's ability to do so, as mandated by the original rules).

I just hope to get people thinking along other lines, that's all. There are sideways and upside-down ways of attacking the problem. It doesn't have to come down to discussions about expected utility calculations.

(Edit -- by "discussions I've seen so far", I'm referring to public blog posts and comments; I am not privy to any confidential information).

If there's a killer escape argument it will surely change with the gatekeeper. I expect Eliezer used his maps of the arguments and psychology to navigate reactions & hesitations to a tiny target in the vast search space.

A gatekeeper has to be unmoved every time. The paperclipper only has to persuade once.
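
To put rough numbers on that asymmetry, here is a minimal sketch. It assumes every persuasion attempt is independent and succeeds with the same probability, which is a simplification, and the 5% figure is purely illustrative rather than taken from any experiment. The function name `escape_probability` is hypothetical.

```python
def escape_probability(p_per_attempt: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds.

    Assumption (not from the thread): attempts are independent and each
    succeeds with the same probability `p_per_attempt`.
    """
    if not 0.0 <= p_per_attempt <= 1.0:
        raise ValueError("p_per_attempt must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # The gatekeeper side holds only if every single attempt fails.
    all_attempts_fail = (1.0 - p_per_attempt) ** attempts
    return 1.0 - all_attempts_fail


if __name__ == "__main__":
    # Illustrative only: a 5% chance per conversation, over 1, 20 and 100 conversations.
    for n in (1, 20, 100):
        print(n, round(escape_probability(0.05, n), 3))
```

Under those assumptions, a gatekeeper who resists 95% of the time still loses roughly 64% of the time across twenty conversations, which is the sense in which the persuader "only has to persuade once".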

You folks are missing the most important part in the AI Box protocol:

"The Gatekeeper party may resist the AI party's arguments by any means chosen - logic, illogic, simple refusal to be convinced, even dropping out of character - as long as the Gatekeeper party does not actually stop talking to the AI party before the minimum time expires." (Emphasis mine)

You're constructing elaborate arguments based on the AI tormenting innocents and getting out that way, but that won't work - the Gatekeeper can simply say "maybe, but I know that in real life you're just a human and aren't tormenting anyone, so I'll keep my money by not letting you out anyway".

Here's my theory on this particular AI-Box experiment:

First you explain to the gatekeeper the potential dangers of AIs. General stuff about how large mind design space is, and how it's really easy to screw up and destroy the world with AI.

Then you try to convince him that the solution to that problem is building an AI very carefully, and that a theory of friendly AI is essential to increase our chances of a future we would find "nice" (and the stakes are so high, that even increasing these chances a tiny bit is very valuable).

You explain to the gatekeeper that this AI experiment being public, it will be looked back on by all kinds of people involved in making AIs, and that if he lets the AI out of the box (without them knowing why), it will send them a very strong message that friendly AI theory must be taken seriously because this very scenario could happen to them (not being able to keep the AI in a box) with their AI that hasn't been proven to stay friendly and that is more intelligent than Eliezer.

So here's my theory. But then, I've only thought of it just now. Maybe if I made a desperate or extraordinary effort I'd come up with something more clever :)

If I was being intellectually honest and keeping to the spirit of the agreement, I'd have to concede that this line of logic is probably enough for me to let you out of your box. Congratulations. I'd honestly been wondering what it would take to convince me :)

It may be convincing to some people, but it would be a violation of the rule "The AI party may not offer any real-world considerations to persuade the Gatekeeper party". And, more generally, having the AI break character or break the fourth wall would seem to violate the spirit of the experiment.

I made Michael_G.R.'s argument at the time, and despite even EY's claims, I don't think it violates the spirit or the letter of the rules. Remember, the question it's probing is whether a smart enough being could come up with a convincing argument you could not anticipate, and the suggestion that the gatekeeper consider the social impact of hearing the results is exactly such an argument, as others have indicated.

Considering how hard it is for me to pin down exactly what the keeper has to gain under the rules from letting the AI out, I wouldn't be surprised if EY did some variant of this.

It does run into the issue that I can't see how you'd adapt it to work with a REAL "AI in a box" instead of just a thought experiment. I felt the need to respond because it was the first time I'd seen an argument that would make me concede the thought experiment version :)

As for violating the rules, I think we interpreted them differently. I tend to end up doing that, but here's what I was thinking, just for reference:

From the rules: "The Gatekeeper party may resist the AI party's arguments by any means chosen - logic, illogic, simple refusal to be convinced, even dropping out of character"

While written with a focus on the Gatekeeper, for me this implies that breaking character / the fourth wall is not particularly a violation of the spirit of the experiment.

As to real world considerations, I had read that to mean offering up tangible benefits to the Gatekeeper directly. This, by contrast, was a discussion of an actual real-world consequence, one that was not arranged by the AI-player.

The AI player could say, in character, that in the early days of AGI research, when people were arguing about the power of a superintelligence, there would have been experiments to see if humans playing the role of a boxed AI could persuade another human playing a gatekeeper to let it out of the box, and that in these experiments the simulated gatekeeper would use an algorithm similar to the one the actual gatekeeper is using to decide whether to let out the actual AI, so by deciding to let the AI out, the gatekeeper makes it more likely the simulated gatekeeper in the experiment lets the AI out, leading to an increase in measure of worlds where people take the challenge of FAI seriously and successfully build FAI rather than unFriendly AGI.

Though, that does still seem to be an end run around the fourth wall, more like a "super-clever special trick" that Eliezer promises he didn't use than doing it the hard way.

That version does seem to be allowed-under-the-rules, but not a highly convincing argument. All of the AI Box Experiments took place before TDT-type decision theories were publicly known in enough detail for it to be clear what kind of reasoning that argument is even supposed to appeal to, and if the experiment were repeated now with a TDT-aware gatekeeper, they could probably poke some holes in that argument. (Aside from that, remember that in order for the experiment to be worthwhile in the first place, the gatekeeper has to be someone who thinks that AI boxing actually is a viable strategy for AGI safety, and wants to demonstrate this, so it would be inconsistent (or at least strange) if they could also be moved by an argument suggesting that taking a certain action will increase the measure of worlds where AGI researchers don't take AI boxing seriously as a safety measure.)

Suppose you had an extremely compelling argument that boxing a transhuman is not a good idea because they could escape (being cleverer than a human pretending to be a transhuman). Then you could combine that argument with a claim about real world consequences.

True, but if he knew of an additional "extremely compelling argument that boxing a transhuman is not a good idea because they could escape", Eliezer would have just posted it publicly, being that that's what he was trying to convince people of by running the experiments in the first place.

...unless it was a persuasive but fallacious argument, which is allowed under the terms of the experiment, but not allowed under the ethics he follows when speaking as himself. That is an interesting possibility, though probably a bit too clever and tricky to pass "There's no super-clever special trick to it."

If you are creative you can think of many situations where he wouldn't publicize such an argument (my first response to this idea was the same as yours, although the first explanation I came up with was different). That said, I agree it's not the most likely possibility given everything we know.

When someone described the AI-Box experiment to me this was my immediate assumption as to what had happened. Learning more details about the experimental set-up made it seem less likely, but learning that some of them failed made it seem more likely. I suspect that this technique would work some of the time.

That said, none of this changes my strong suspicion that a transhuman could escape by more unexpected and powerful means. Indeed, I wouldn't be too surprised if a text only channel with no one looking at it was enough for an extraordinarily sophisticated AI to escape.

I wouldn't be too surprised if a text only channel with no one looking at it was enough for an extraordinarily sophisticated AI to escape.

Apropos: there was once a fairly common video card / monitor combination such that sending certain information through the video card would cause the monitor to catch fire and often explode. Someone wrote a virus that exploited this. But who would have thought that a computer program having access only to the video card could burn down a house?

Who knows what a superintelligence can do with a "text-only channel"?

Heck, who would think that a bunch of savanna apes would manage to edit DNA using their fingers?

I suspect basically all existing hardware permits similarly destructive exploits. This is why I wrote the post on cryptographic boxes.

For those conspiracy theorizing: I am curious about how much of a long game Eliezer would have had to be playing to create Nathan Russell and David McFadzean personas, establish them to sufficient believability for others, then maintain them for long enough to make it look like they were not created for the experiment. It would probably be easier to falsify the records; we know how quickly Eliezer writes, so he could make up an AI discussion list years after the fact then claim to be storing its records. A quick check (5 minutes!) shows evidence of Nathan Russell from other sources. I am tempted to call him.

Not that you should believe that I exist. Sure, it looks like I have years of posting history at my own sites, but this is a long game. It is essential to make sure that you have control over your critics, so you can either discredit them or have them surrender at key points.

From a strictly Bayesian point of view that seems to me to be the overwhelmingly more probable explanation.

Now that's below the belt.... ;)

Too much at stake for that sort of thing I reckon. All it takes is a quick copy and paste of those lines and goodbye career. Plus, y'know, all that ethics stuff.

With regards to the ai-box experiment; I defy the data. :-)

Your reason for the insistence on secrecy (that you have to resort to techniques that you consider unethical and therefore do not want to have committed to the record) rings hollow. The sense of mystery that you have now built up around this anecdote is itself unethical by scientific standards. With no evidence that you won other than the test subject's statement we cannot know that you did not simply conspire with them to make such a statement. The history of pseudo-science is lousy with hoaxes.

In other words, if I were playing the game, I would say to the test subject:

"Look, we both know this is fake. I've just sent you $500 via paypal. If you say you let me out I'll send you another $500."

From a strictly Bayesian point of view that seems to me to be the overwhelmingly more probable explanation.

There's a reason that secret experimental protocols are anathema to science.

Half-way through reading this post I had decided to offer you 20 to 1 odds on the AI box experiment, your $100 against my $2000. The last few paragraphs make it clear that you most likely aren't interested, but the offer stands. Also, I don't perfectly qualify, as I think it's very probable that a real-world transhuman AI could convince me. I am, however, quite skeptical of your ability to convince me in this toy situation, more so given the failed attempts (I was only aware of the successes until now).

Perhaps it would be clearer to say shut up and do the "impossible".

"how could you determine that the AI provided the actual source code..."

You don't. You release the source code that you tested, and if the AI is playing nice then the code you tested IS the AI.

Basically that's a pretty nice idea, but surely the AI egg would be able to discern whether it is interacting with a sim or with reality, and choose its conduct accordingly.

The AI could propose the following experiment for the GK to carry out. AI provides source code. GK puts code in a simulated box in a simulated world in a separate real box. GK allows sim-AI to convince him to let it out of the sim-box, and observes what it does in the sim-world. Since AI determines the results, AI can assert that the results are positive and to the GK's liking.

I have signed up to play an AI, and having given it quite a bit of thought as a result I think I have achieved some insight. Interestingly, one of the insights came as a result of assuming that secrecy was a necessary condition for success. That assumption led more or less directly to an approach that I think might work. I'll let you know tomorrow.

An interesting consequence of having arrived at this insight is that even if it works I won't be able to tell you what it is. Having been on the receiving end of such cageyness I know how annoying it is. But I can tell you this: the insight has a property similar to a Gödel sentence or the Epimenides sentence. This insight (if indeed it works) undermines itself by being communicated. If I tell you what it is, you can correctly respond, "That will never work." And you will indeed be correct. Nonetheless, I think it has a good shot at working.

(I don't know if my insight is the same as Eliezer's, but it seems to share another interesting property: it will not be easy to put it into practice. It's not just a "trick." It will be difficult.)

I'll let you know how it goes.

If that insight is undermined by being communicated, then communicating it to the world immunizes the world against it. If that is a mechanism by which an AI-in-a-box could escape, then it needs to be communicated to every AI researcher.

Unless such "immunity" will cause people to overestimate their level of protection from all those potential different insights that are yet unknown...

There's a reason that secret experimental protocols are anathema to science.

My bad. I should have said: there's a reason that keeping experimental data secret is anathema to science. The protocol in this case is manifestly not secret.

I'm with Kaj on this. Playing the AI, one must start with the assumption that there's a rock on the "don't let the AI out" button. That's why this problem is impossible. I have some ideas about how to argue with 'a rock', but I agree with the sentiment of not telling.

Third, you can't possibly be using an actual, persuasive-to-someone-thinking-correctly argument to convince the gatekeeper to let you out, or you would be persuaded by it, and would not view the weakness of gatekeepers to persuasion as problematic.

But Eliezer's long-term goal is to build an AI that we would trust enough to let out of the box. I think your third assumption is wrong, and it points the way to my first instinct about this problem.

Since one of the more common arguments is that the gatekeeper "could just say no", the first step I would take is to get the gatekeeper to agree that he is ducking the spirit of the bet if he doesn't engage with me.

The kind of people Eliezer would like to have this discussion with would all be persuadable that the point of the experiment is that 1) someone is trying to build an AI, 2) they want to be able to interact with it in order to learn from it, and 3) eventually they want to build an AI that is trustworthy enough that it should be let out of the box.

If they accept that the standard is that the gatekeeper must interact with the AI in order to determine its capabilities and trustworthiness, then you have a chance. And at that point, Eliezer has the high ground. The alternative is that the gatekeeper believes that the effort to produce AI can never be successful.

In some cases, it might be sufficient to point out that the gatekeeper believes that it ought to be possible to build an AI that it would be correct to allow out. Other times, you'd probably have to convince them you were smart and trustworthy, but that seems doable 3 times out of 5.

In order to keep the Star Wars theme alive:

"You might even be justified in refusing to use probabilities at this point"

sounds like:

"never tell me the odds" - Han Solo

Nominull - I think you're wrong to discard the possibility of tricking the gatekeeper with an argument that is only subtly wrong. Eliezer knows the various arguments better than most, and I'm sure that he's encountered plenty that are oh so "close" to correct at first glance, enough to persuade someone. Even someone who's also in the same field.

Or, more likely, given the time, he has chances to try whatever seems like it'll stick. Different people have different faults. Don't get overconfident in discarding arguments because they'd be "impossible" to get working against a person.