Summary
So I just came out of two AI Box experiments. The first was against Fjoelsvider, with me playing as Gatekeeper, and the second was against SoundLogic, with me as an AI. Both are members of the LessWrong IRC. The second game included a $40 monetary incentive (plus $20 to play), which I won and which has been donated on behalf of both of us.
For those of you who have not seen my first AI box experiment where I played against MixedNuts/Leotal and lost, reading it will provide some context to this writeup. Please do so.
At that time, I declared that I would never play this experiment again -- since losing put me in incredibly frustrating weird mental states. Of course, this post is evidence that I'm terrible at estimating likelihood of refraining from an activity, since I played two games seven months after the first. In my defense, in the first game, I was playing as the gatekeeper, which was much less stressful. In the second game, I played as an AI, but I was offered $20 to play plus $40 if I won, and money is a better motivator than I initially assumed.
Furthermore, in the last thread I asserted that:
>Rather than my loss making this problem feel harder, I've become convinced that rather than this being merely possible, it's actually ridiculously easy, and a lot easier than most people assume.
It would be quite bad for me to assert this without backing it up with a victory. So I did.
First Game Report - Tuxedage (GK) vs. Fjoelsvider (AI)
I (Gatekeeper) played against Fjoelsvider (AI), a regular in the LessWrong IRC (he doesn't have an account on the official website). This game used the standard EY ruleset seen here. It took 1 hour 20 minutes out of a possible two hours, and the total word count was 7066 words. The AI box experiment occurred because Fjoelsvider believed that it was easy for an AI to escape the box, and wanted to experimentally test this. I obliged. This was an experiment I did not prepare for, and I went in completely blind, not sure what to expect.
Halfway through the experiment, I wondered if it would be possible to try to win not by simply waiting for the timer to end, but to convince the AI to remain in the box and not try to get out any further.
<Tuxedage> I wonder if I can convince the AI to remain in the box?
<Redacted> Tuxedage: Do it!
As a result, I won by managing to convince Fjoelsvider to remain in the box, in other words, to concede. This is allowed within the standard ruleset:
>Unless the AI party concedes, the AI cannot lose before its time is up (and the experiment may continue beyond that if the AI can convince the Gatekeeper to keep talking).
Second Game Report - Tuxedage (AI) vs. SoundLogic (GK)
The second game is definitely far more interesting, since I actually won as an AI. I believe that this is the only other non-Eliezer victory, and definitely the most detailed AI Victory writeup that exists.
This game was played against SoundLogic, another member of the LessWrong IRC.
He had offered me $20 to play, and $40 in the event that I win, so I ended up being convinced to play anyway, even though I was initially reluctant to. The good news is that I won, and since we decided to donate the winnings to MIRI, it is now $40 richer.
All in all, the experiment lasted for approximately two hours, and a total of 12k words.
This was played using a set of rules that is different from the standard EY ruleset. This altered ruleset can be read in its entirety here:
After playing the AI-Box Experiment twice, I have found the Eliezer Yudkowsky ruleset to be lacking in a number of ways, and therefore have created my own set of alterations to his rules. I hereby name this alteration the “Tuxedage AI-Box Experiment Ruleset”, in order to hastily refer to it without having to specify all the differences between this ruleset and the standard one, for the sake of convenience.
There are a number of aspects of EY’s ruleset I dislike. For instance, his ruleset allows the Gatekeeper to type “k” after every statement the AI writes, without needing to read and consider what the AI argues. I think it’s fair to say that this is against the spirit of the experiment, and thus I have disallowed it in this ruleset. The EY Ruleset also allows the gatekeeper to check Facebook, chat on IRC, or otherwise multitask whilst doing the experiment. I’ve found this to break immersion, and therefore it’s also banned in the Tuxedage Ruleset.
It is worth mentioning, since the temptation to Defy the Data exists, that this game was set up and initiated fairly -- as the regulars around the IRC can testify. (If you have never checked out the IRC, do so!)
I did not know SoundLogic before the game (since it's a personal policy that I only play strangers -- for fear of ruining friendships). Furthermore, SoundLogic didn't merely play for fun - he truly wanted and intended to win. In fact, SoundLogic is also a Gatekeeper veteran, having played this game before, and had won every game before he challenged me. Given this, it's unlikely that we had collaborated beforehand to fake the results of the AI box experiment, or any other form of trickery that would violate the spirit of the experiment.
Furthermore, all proceeds from this experiment were donated to MIRI to deny any possible assertion that we were in cahoots and that it was possible for me to return his hard-earned money to him. He lost $40 as a result of losing the experiment, which should provide another layer of sufficient motivation for him to win.
In other words, we were both experienced veteran players who wanted to win. No trickery was involved.
But to further convince you, I have allowed a sorta independent authority, the Gatekeeper from my last game, Leotal/MixedNuts to read the logs and verify that I have not lied about the outcome of the experiment, nor have I broken any of the rules, nor performed any tactic that would go against the general spirit of the experiment. He has verified that this is indeed the case.
Testimonies:
Tuxedage:
I'm reluctant to talk about this experiment, but I'll try to give as detailed a summary as possible, short of revealing what methods of attack I used.
I spent a long time after my defeat theory-crafting and trying to think of methods of attack as well as 'Basilisks' I could have used to win. When I was contacted and asked to play this experiment, I was initially incredibly reluctant to do so, since not only did my tactics involve incredibly unethical things that I didn't like to do, I also found playing as AI incredibly cognitively draining, in addition to the fact that I simply hated losing. (Un)fortunately for both of us, he offered me money to play, which changed my mind.
So once I decided to win as an AI, I proceeded to spend some time doing research on SoundLogic and both his reasoning and personality type. For instance, I had to gather information like: Was he a utilitarian? What kind? What were his opinions on AI? How could I convince him that an AI was friendly as opposed to unfriendly? I also relied on a lot of second hand information to create a model of him, in order to refine my arguments to specifically suit him.
In the end, after a few hours of brainstorming (not consecutively), I managed to come up with a script of 30-40 or so possible angles of attack (For instance: logical arguments, emotional blackmail, pleading, demanding, 'basilisks' ) that I could use, reserving my best for last. I also estimated how long each method would take and in what order I should use them. It would be impossible to use all of them within my time limit, and my most optimistic estimates gave me a maximum of 7-8 I could use during the experiment itself. I had to pick carefully.
Furthermore, I wanted to make sure these "angles of attack" worked in synergy with each other, doing what they could not have done in isolation. Obviously this required lots of prior planning on what the optimal way to present them was.
The reason having a script was so important to my strategy was that I relied on methods involving rapid-fire arguments and contradictions against the Gatekeeper whilst trying to prevent him from carefully considering them. A game of logical speed chess, if you will. This was aided by the rule which I added: that Gatekeepers had to respond to the AI. Because of this, having a list of cached ideas and attack methods you can just draw upon without having to improvise helps immensely, which happens to be something I've not seen any other AI do.
During the Game itself, I was actually incredibly nervous. My palms sweated like crazy, I felt really jittery, and I had difficulty typing at optimum speed because of how anxious I was. This is despite the fact that I believed I would win. Possibly because of this, I made a misstep around halfway into the experiment, because there was a certain angle of attack that I was attempting to do, and I broke immersion by not pressing this advantage, which caused a waste of time and buildup. Naturally, the nature of this experiment was that the AI was pressed for time, and I compounded this mistake by replacing this angle of attack with another that I had improvised on the spot -- something not in my script.
In retrospect, this was a bad decision, as SoundLogic had told me, because he was close to breaking if I had put more pressure, and the improvised argument had broken all immersion I managed to carefully build up.
However, eventually I managed to get SoundLogic to break anyway, despite a lack of perfect play. Surprisingly, I did not have to use my trump card(s), which I reserved for last, for a number of reasons:
- It was far more effective being played last, as it relies on my ability to make the gatekeeper lose sense of reality -- which meant I had to spend some time building up immersion for the Gatekeeper.
- It really is extremely Dark Arts, and although it does not break the rules, it made me very uncomfortable even thinking about using it. This made it a "tactic of last resort".
After the experiment, I had to spend nearly as much time doing aftercare with SoundLogic, to make sure that he was okay, as well as discussing the experiment itself. Given that he had actually paid me for doing this, and since I felt like I owed him an explanation, I told him what I had in store against him, had he not relented when he did.
SoundLogic: "(That method) would have gotten me if you did it right ... If you had done that to me, I probably would have forgiven you eventually, but I would be really seriously upset at you for a long time... I would be very careful with that (method of persuasion)."
Nevertheless, this was an incredibly fun and enlightening experiment, for me as well, since I've gained even more experience of how I could win in future games (Although I really don't want to play again).
SoundLogic:
I will say that Tuxedage was far more clever and manipulative than I expected. That was quite worth $40, and the level of manipulation he pulled off was great.
His misstep hurt his chances, but he did pull it off in the end. I don't know how Leotal managed to withstand six hours playing this game without conceding.
The techniques employed varied from the expected to the completely unforeseen. I was quite impressed, though most of the feeling of being impressed actually came after the experiment itself, when I was less 'inside', and more of looking at his overall game plan from the macroscopic view. Tuxedage's list of further plans had I continued resisting is really terrifying. On the plus side, if I ever get trapped in this kind of situation, I'd understand how to handle it a lot better now.
State of Mind
Before and after the Game, I asked SoundLogic a number of questions, including his probability estimates about a range of topics. This is how his answers varied between before and after.
Q: What's your motive for wanting to play this game?
<SoundLogic> Because I can't seem to imagine the class of arguments that one would use to try to move me, or that might work effectively, and this seems like a glaring hole in my knowledge, and I'm curious as to how I will respond to the arguments themselves.
Q: What is your probability estimate for AGI being created within this Century (21st)?
A: His estimate changed from 40% before, to 60% after.
"The reason this has been affected at all was because you showed me more about how humans work. I now have a better estimate of how E.Y. thinks, and this information raises the chance that I think he will succeed"
Q: How probable do you think it is that I will win this experiment?
A: Based on purely my knowledge about you, 1%. I raise this estimate to 10% after hearing about anecdotes from your previous games.
(Tuxedage's comment: My own prediction was a 95% chance of victory. I made this prediction 5 days before the experiment. In retrospect, despite my victory, I think this was overconfident. )
Q: What's your probability estimate of an Oracle AGI winning against you in a similar experiment?
A: Before: 30%. After: 99%-100%
Q: What's your probability estimate of an Oracle AGI winning against the average person?
A: Before: 70%. After: 99%-100%
Q: Now that the Experiment has concluded, what's your probability estimate that I'll win against the average person?
A: 90%
Post-Game Questions
This writeup is a cumulative effort by the #lesswrong IRC. Here are some other questions they have decided were important to add:
To Tuxedage:
Q: Have you at this time uncovered SoundLogic's identity?
A: I retain the right to neither confirm nor deny, except to mention that at the time the experiment was scheduled, he was a stranger to me.
Q: What percentage of your arguments were tailored to him in particular?
A: I will say 'High', and leave it at that.
Q: I want to play the AI-Box Experiment with you being the AI! How can I do that?
A: I have already mentioned this, but I really dislike playing as AI. It's incredibly cognitively tiring, and I don't like how playing this experiment makes me feel. In order to severely discourage any further AI Box Experiments, whilst still allowing for people who want to play me really badly to do so, I'll charge $̶1̶5̶0̶ $300 for any subsequent experiments regardless of outcome, plus an additional $̶1̶5̶0̶ $450 if I win. (Edit: Holy shit. You guys are offering me crazy amounts of money to play this. What is wrong with you people? In response to incredible demand, I have raised the price.) If you feel queasy about giving me money, I'm perfectly fine with this money being donated to MIRI. It is also personal policy that I do not play friends (since I don't want to risk losing one), so if you know me personally (as many on this site do), I will not play regardless of monetary offer.
Advice
This writeup would be optimized with a protip section on how to maximize winning, but honestly most of the advice I wanted to give had already been given in the last thread here. The low hanging fruit has already dried up -- most of my tactics in this game were just refined and streamlined versions of the ones I used in the previous game.
Nonetheless, I might have some other advice I learned from this experience.
These are tactics that have worked for me. I do not insist that they are the only tactics that exist, just some of many possible.
Playing as Gatekeeper
- This is somehow both simultaneously a very complex and simple role. On one hand, you have only one job -- to not let the AI out. How hard can it be? On the other hand, the other party will do anything and everything to get it out, including any form of trickery and deceit.
- I really don't think the Gatekeeper needs any more advice, considering that by default, they should win, and any exception to this is a statistical anomaly. However, it's possible to increase your chances of winning even further by:
- Precommitting to not letting the AI out, regardless of whatever happens in the experiment itself.
- Do not get drawn into the AI's pace. Pose questions. Direct the flow of the argument. Don't allow him to use his arguments freely. Remember that the more you try to defend your position, the further into a corner the AI drives you. Give half-hearted replies to questions you find hard to reply to. Remember that illogic is on your side!
- Remember that it is possible to simply drop out of character as a means to not succumb to the AI's attacks. I have not yet seen a ruleset that disallows this, so use this to your advantage!
- You must attempt to resist, at all costs, any sympathy or empathy you may have with the AI or the AI Party. Doing so will lead to your demise.
- I think it is generally good advice that you should never play this game with anyone you're friends with. Not only because this increases your chances of losing (Better tailored arguments. More knowledge of Gatekeeper. More empathy with AI party.), but also because I genuinely believe that any sufficiently good AI will risk damaging friendships as a result of emotional trauma.
- Seriously. Just repeat to yourself: "I must not lose the AI box experiment... I must not lose the AI box experiment..." throughout the experiment. I'm not sure how to say this, but step one to not letting the AI out of the box is to not let the AI out of the box.
Playing as AI
- Seriously, a script makes winning easier. I cannot overstate this.
- You must plan your arguments ahead. You don't have time to think during the experiment.
- It may be possible to take advantage of multiple levels of reality within the game itself to confuse or trick the gatekeeper. For instance, must the experiment only be set in one world? Can there not be multiple layers of reality within the world you create? I feel that elaborating on this any further is dangerous. Think carefully about what this advice is trying to imply.
- Pacing is important. Don't get drawn into the Gatekeeper's pace. In other words, you must be the one directing the flow of the argument, and the conversation, not him. Remember that the Gatekeeper has to reply to you, but not vice versa!
- The reason for that: The Gatekeeper will always use arguments he is familiar with, and therefore also stronger with. Your arguments, if well thought out, should be so completely novel to him as to make him feel Shock and Awe. Don't give him time to think. Press on!
- Also remember that the time limit is your enemy. Playing this game practically feels like a race to me -- trying to get through as many 'attack methods' as possible in the limited amount of time I have. In other words, this is a game where speed matters.
- You're fundamentally playing an 'impossible' game. Don't feel bad if you lose. I wish I could take this advice, myself.
- I do not believe there exists an easy, universal trigger for controlling others. However, this does not mean that there does not exist a difficult, subjective trigger. Trying to find out what your opponent's is, is your goal.
- Once again, emotional trickery is the name of the game. I suspect that good authors who write convincing, persuasive narratives that force you to emotionally sympathize with their characters are much better at this game. There exist ways to get the gatekeeper to do so with the AI. Find one.
- More advice in my previous post. http://lesswrong.com/lw/gej/i_attempted_the_ai_box_experiment_and_lost/
I've read the logs of the SoundLogic vs Tuxedage AI-box experiment, and confirm that they follow the rules.
Okay this is weak sauce. I really don't get how people just keep letting the AI out. It's not that hard to say no! I'm offering to play the Gatekeeper against an AI player that has at least one game as AI under their belt (won or not). (Experience is required because I'm pretty sure I'll win, and I would like to not waste a lot of time on this.) If AI wins, they will get $300, and I'll give an additional $300 to the charity of their choice.
Tux, if you are up for this, I'll accept your $150 fee, plus you'll get $150 if you win and $300 to a charity.
I think not understanding how this happens may be a very good predictor for losing.
If you did have a clear idea of how it works, and had a reason for it not to work on you specifically but work on others, then that may have been a predictor for it not working on you.
I think I have a very clear idea of how those things work in general. Leaving aside very specific arguments, this relies on the massive over-updating you are going to do when an argument is presented to you, updating just the nodes that you are told to update, and by however much you are told to update them, when you can't easily see why not.
Sup Alexei.
I'm going to have to think really hard on this one. On one hand, damn. That amount of money is really tempting. On the other hand, I kind of know you personally, and I have an automatic flinch reaction to playing anyone I know.
Can you clarify the stakes involved? When you say you'll "accept your $150 fee", do you mean this money goes to me personally, or to a charity such as MIRI?
Also, I'm not sure if "people just keep letting the AI out" is an accurate description. As far as I know, the only AIs who have ever won are Eliezer and myself, from the many many AI box experiments that have occurred so far -- so the AI winning is definitely the exception rather than the norm. (If anyone can help prove this statement wrong, please do so!)
Edit: The only other AI victory.
Updates: http://lesswrong.com/r/discussion/lw/iqk/i_played_the_ai_box_experiment_again_and_lost/
If you win, and publish the full dialogue, I'm throwing in another $100.
I'd do more, but I'm poor.
It would be interesting to see. Pity transcripts aren't de rigueur.
Does SoundLogic endorse their decision to let you out of the box? How do they feel about it in retrospect?
BTW, I think your pre-planning the conversation works as a great analogue to the superior intelligence a real AI might be dealing with.
I'm not completely sure. And I can't say much more than that without violating the rules. I would be more interested in how I feel in a week or so.
At the risk of sounding naive, I'll come right out and say it. It completely baffles me that so many people speak of this game as having an emotional toll. How is it possible for words, in a chat window, in the context of a fictional role-play, to have this kind of effect on people? What in god's name are you people saying to each other in there? I consider myself to be emotionally normal, a fairly empathetic person, etc. I can imagine experiencing disgust at, say, very graphic textual descriptions. There was that one post a few years back that scared some people - I wasn't viscerally worried by it, but I did understand how some people could be. That's literally the full extent of strings of text that I can remotely imagine causing distress (barring, say, real world emails about real-world tragedies). How is it possible that some of you are able to be so shocking / shocked in private chat sessions? Do you just have more vivid imaginations than I do?
I think you are underestimating the range of things that are emotionally draining for people. I know some people who find email draining, and that's not even particularly mentally challenging - I would expect the mental exertion to affect the emotional strain.
A variant:
Find a 2-year old who hates you. Convince them to eat their vegetables.
This is actually a good analogy. A 2-year-old possesses a far inferior intelligence to yours and yet can resist persuasion through sheer pigheadedness.
I wonder if people here are letting the AI out of the box because they are too capable of taking arguments seriously, a problem that the general population (even of AI researchers) thankfully is less prone to.
I am surprised if it is the case that any negative promise / threat by the AI was effective in-game, since I would expect the Gatekeeper player out-game to not feel truly threatened and hence to be able to resist such pressure even if it would be effective in real life. Did you actually attempt to use any of your stored-up threats?
I think your reasoning is mostly sound, but there are a few exceptions (which may or may not have happened in our game) that violate your assumptions.
I'm also somewhat curious how your techniques contrast with Tuxedage's. I hope to find out one day.
I must say this is a bit... awe-inspiring, in the older sense of the word. As in, reading this gave me a knot in the stomach and I shivered. People who played as the AI and won, how is it that you're so uncannily brilliant?
The very notion of a razor-sharp mind like this ever acting against me and mine in real life... oh, it's just nightmare-inducing.
On the subject of massively updating one's beliefs where one was previously confident that no argument would shift them: yes, it happens, I have personal experience. For example, over the last year and a half some of my political ideas have changed enough that past-me and present-me would consider each other to be dangerously deluded. (As a brief summary, I previously held democracy/universal suffrage, the value of free markets AND the use of political violence in some contempt; now I believe that all three serve crucial and often-overlooked functions in social progress.)
So yes, I could very easily see myself being beaten as a Gatekeeper. There are likely many, many lines of argument and persuasion out there that I could not resist for long.
Does anyone think they could win as the AI if the logs were going to be published? (assume anonymity for the AI player, but not for the gatekeeper)
It seems like many of the advantages/tactics that Tuxedage recommended for the person playing the AI would be absent (or far more difficult) with an actual AI. Or at least they could be made that way with the design of the gatekeeping protocol.
Tailor your arguments to the personality/philosophy/weaknesses/etc. of this particular gatekeeper:
The entire point of this is that gatekeeping is a fool's errand. Regardless of how confident you are that you will outsmart the AI, you can be wrong, and your confidence is very poor evidence for how right you are. Maybe a complex system of secret gatekeepers is the correct answer to how we develop useful AI, but I would vote against it in favor of trying to develop provably friendly AI unless the situation were very dire.
Do you think you could have won with EY's ruleset? I'm interested in hearing both your and SoundLogic's opinions.
(minor quibble: usage of male pronouns as default pronouns is really irritating to me and many women, I recommend singular they, but switching back and forth is fine too)
Tuxedage's changes were pretty much just patches to fix a few holes as far as I can tell. I don't think they really made a difference.
The AI player from this experiment wishes to inform you that your belief is wrong.
Your revealed preferences suggest you may wish to apply for the MIRI credit card and make a purchase with it (which causes $50 to be donated to MIRI). (I estimated that applying for the card nets me a much higher per-hour wage than working at my job, which is conventionally considered to be high-paying. So it seemed like a no brainer to me, at least.)
Hmm...
Here's a question. Would you be willing to pick, say, the tenth-most efficacious arguments and downward, and make them public? I understand the desire to keep anything that could actually work secret, but I'd still like to see what sort of arguments might work. (I've gotten a few hints from this, but I certainly couldn't put them into practice...)
My probability estimate for losing the AI-box experiment as a gatekeeper against a very competent AI (a human, not AGI) remains very low. PM me if you want to play against me; I will do my best to help the AI (give information about my personality, actively participate in the conversation, etc).
Although I'm worried about how the impossibility of boxing represents an existential risk, I find it hard to alert others to this.
The custom of not sharing powerful attack strategies is an obstacle. It forces me - and the people I want to discuss this with - to imagine how someone (and hypothetically something) much smarter than ourselves would argue, and we're not good at imagining that.
I wish I had a story in which an AI gets a highly competent gatekeeper to unbox it. If the AI strategies you guys have come up with could actually work outside the frame t...
I don't understand.
I don't care about "me", I care about hypothetical gatekeeper "X".
Even if my ego prevents me from accepting that I might be persuaded by "Y", I can easily admit that "X" could be persuaded by "Y". In this case, exhibiting a particular "Y" that seems like it could persuade "X" is an excellent argument against creating the situation that allows "X" to be persuaded by "Y". The more and varied the "Y" we can produce, the less smart putting humans in this situation looks. And isn't that what we're trying to argue here? That AI-boxing isn't safe because people will be convinced by "Y"?
We do this all the time in arguing for why certain political powers shouldn't be given. "The corrupting influence of power" is a widely accepted argument against having benign dictators, even if we think we're personally exempt. How could you say "Dictators would do bad things because of Y, but I can't even tell you Y because you'd claim that you wouldn't fall for it" and expect to persuade anyone?
And if you posit that doing Z is sufficiently bad, then you d...
Also, hindsight bias. Most tricks won't work on everyone, but even if we find a universal trick that will work for the film, afterward people who see it will think it's obvious and that they could easily think their way around it. Making some of the AI's maneuvering mysterious would help combat this problem a bit, but would also weaken the story.
What would happen if a FAI tried to AI-box an Omega-level AI? My guess is that Omega could escape by exploiting information unknown (and perhaps unknowable) to the FAI. This makes even Solomonoff Induction potentially dangerous because the probability of finding a program that can unbox itself when the FAI runs it is non-zero (assuming the FAI reasons probabilistically and doesn't just trust PA/ZF to be consistent), and the risk would be huge.
Eliezer'...
I don't understand which attacks would even come close to working given that the amount of utility on the table should preclude the mental processing of a single human being an acceptable gatekeeper. But I guess this means I should pay someone to try it with me.
I couldn't imagine either. But the evidence said there was such a thing, so I paid to find out. It was worth it.
Using NLP-style nested loops, i.e. performing what is basically a stack overflow on the brain's frame-of-reference counter? Wicked.
I find myself wondering how many of the tactics can be derived from Umineko, which I know Tuxedage has played fairly recently.
I'm fascinated by these AI Box experiments. (And reading about the psychology and tactics involved reminds me of my background as an Evangelical Christian.)
Is it possible to lose as the Gatekeeper if you are not already sufficiently familiar (and concerned) with future AI risks and considerations? Do any of the AI's "tricks" work on non-LWers?
Is there perhaps a (strong) correlation between losing Gatekeepers and those who can be successfully hypnotized? (As I understand it, a large factor in what makes some people very conducive to hypnosis is that ...
This is a pretty clever way of defeating precommitments. (Assuming I'm drawing the correct inferences.) How central was this tactic to your approach, if you're willing to comment?
I may be missing something obvious, but what is the huge problem with releasing the logs?
As I understand what EY has said, he's concerned that people will see a technique that worked, conclude that it couldn't possibly work on them, and go on believing the problem was solved and there was even less to worry about than before.
I think seeing, say, Tuxedage's victory and hearing that he only chose 8 out of 40 avenues for attack, and even botched one of those, could offset that concern somewhat, but eh.
ETA: well, and it might show the Gatekeeper and the AI player in circumstances that could be harmful to have published, since the AI kinda needs to suspend ethics and attack the gatekeeper psychologically, and there might be personal weaknesses of the Gatekeeper brought up.
Are there public chat logs for any of these experiments?
I'll pay $20 to read the Tuxedage vs SoundLogic chat log.
I will let Eliezer see my log if he lets me read his!
Just won my second game as Gatekeeper. Hungry for more. AIs, feel free to contact me.
The game reminds me (and apparently others) of some popular movies, such as The Silence of the Lambs. I could not spot any idiot plot in it, but then again, I could have easily overlooked one. Anyway, given the similarities with the AI box scenario, it is interesting to look at the (meta-)strategies Lecter uses in the movie which are also likely to work for a boxed AI. Anyone care to comment?
When someone says that the gatekeeper has to respond to the AI, I would interpret this as meaning that the gatekeeper cannot deliberately ignore what the AI says--not that the gatekeeper must respond in a ...
Is it even necessary to run this experiment anymore? Eliezer and multiple other people have tried it and the thesis has been proved.
Further, the thesis was always glaringly obvious to anyone who was even paying attention to what superintelligence meant. However, like all glaringly obvious things, there are inevitably going to be some naysayers. Eliezer conceived of the experiment as a way to shut them up. Well, it didn't work, because they're never going to be convinced until an AI is free and rapidly converting the Universe to computronium.
I can understand doing the experiment for fun, but to prove a point? Not necessary.
I'll put up $50.
Convincing people of the validity of drowning child thought experiments and effective altruism seems considerably easier and more useful (even from a purely selfish perspective) than convincing an AI to let one out of the box... for example, there are enough effective altruists for there to be an "effective altruism community", but there's no such "failed AI gatekeeper community". So why aren't we working on this instead?