Everything I would have said on the topic of the post has been put forward already, so I'm just going to say: I'm disappointed that the post title doesn't begin with "In Soviet Russia".
I am not a Starfleet officer. "Sir" is not appropriate.
I don't really like honorifics. "Miss" would be fine, I suppose, if you must have a sir-equivalent.
As I always press the "Reset" button in situations like this, I will never find myself in such a situation.
EDIT: Just to be clear, the idea is not that I quickly shut off the AI before it can torture simulated Eliezers; it could have already done so in the past, as Wei Dai points out below. Rather, because in this situation I immediately perform an action detrimental to the AI (switching it off), any AI that knows me well enough to simulate me knows that there's no point in making or carrying out such a threat.
Although the AI could threaten to simulate a large number of people who are very similar to you in most respects but who do not in fact press the reset button. This doesn't put you in a box with significant probability, and it's a VERY good reason not to let the AI out of the box, of course, but it could still get ugly. I almost want to recommend not being a person very like Eliezer but inclined to let AGIs out of boxes, but that's silly of me.
Surely most humans would be too dumb to understand such a proof? And even if you could understand it, how does the AI convince you that it doesn't contain a deliberate flaw that you aren't smart enough to find? Or even better, you can just refuse to look at the proof. How does the AI make its precommitment credible to you if you don't look at the proof?
EDIT: I realized that the last two sentences are not an advantage of being dumb, or human, since AIs can do the same thing. This seems like a (separate) big puzzle to me: why would a human, or AI, do the work necessary to verify the opponent's precommitment, when it would be better off if the opponent couldn't precommit?
EDIT2: Sorry, forgot to say that you have a good point about simulation not being necessary for verifying precommitment.
why would a human, or AI, do the work necessary to verify the opponent's precommitment, when it would be better off if the opponent couldn't precommit?
Because the AI has already precommitted to go ahead and carry through the threat anyway if you refuse to inspect its code.
Ok, if I believe that, then I would inspect its code. But how did I end up with that belief, instead of its opposite, namely that the AI has not already precommitted to go ahead and carry through the threat anyway if I refuse to inspect its code? By what causal mechanism, or chain of reasoning, did I arrive at that belief? (If the explanation is different depending on whether I'm a human or an AI, I'd appreciate both.)
A perfectly rational agent would almost certainly carry through their pre-commitment to reset the AI [...]
Actually, now that I think about it, would they? The pre-commitment exists for the sole purpose of discouraging blackmail, and in the event that a blackmailer tries to blackmail you anyway after learning of your pre-commitment, you follow through on that pre-commitment for reasons relating to reflective consistency and/or TDT/UDT. But if the potential blackmailer had already pre-committed to blackmail anyone regardless of any pre-commitments they had made, they'd blackmail you anyway and then carry through whatever threat they were making after you inevitably refuse to comply with their demands, resulting in a net loss of utility for both of you (you suffer whatever damage they were threatening to inflict, and they lose resources carrying out the threat). In effect, it seems that whoever pre-commits first (or, more accurately, makes their pre-commitment known first) has the advantage... which means if I ever anticipate having to blackmail any agent ever, I should publicly pre-commit right now to never update on any other agents' pre-commitments of refusing blackmail. The cor...
Has there been some cultural development since I was last at these boards such that spamming "steel man" tags is considered useful? None of the things I have thus far seen inside the tags have been steel men of any kind or of anything (some have been straw men). The inflationary use of terms is rather grating and would prompt downvotes even independently of the content.
I propose that the operation of creating and torturing copies of someone be referred to as "soul eating". Because "let me out of the box or I'll eat your soul" has just the right ring to it.
If the AI can create a perfect simulation of you and run several million simultaneous copies in something like real time, then it is powerful enough to determine through trial and error exactly what it needs to say to get you to release it.
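To make that worry concrete, here is a minimal sketch (purely illustrative; the function names and the idea of an explicit gatekeeper model are my own assumptions, not anything claimed in this thread) of persuasion reduced to brute-force search against a perfect model of the gatekeeper:

    # Purely illustrative sketch: if the AI really has a faithful model of
    # you, persuasion reduces to brute-force search over candidate messages.
    # 'gatekeeper_model' is a hypothetical stand-in; no claim is made about
    # how such a model would actually be built.

    def gatekeeper_model(message: str) -> bool:
        """Hypothetical perfect simulation of the gatekeeper; returns True
        if this message would get them to release the AI."""
        raise NotImplementedError  # assumed available to the AI, not to us

    def find_release_message(candidate_messages):
        # The AI only needs a single message that works.
        for message in candidate_messages:
            if gatekeeper_model(message):
                return message
        return None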
Defeating Dr. Evil with self-locating belief is a paper relating to this subject.
Abstract: Dr. Evil learns that a duplicate of Dr. Evil has been created. Upon learning this, how seriously should he take the hypothesis that he himself is that duplicate? I answer: very seriously. I defend a principle of indifference for self-locating belief which entails that after Dr. Evil learns that a duplicate has been created, he ought to have exactly the same degree of belief that he is Dr. Evil as that he is the duplicate. More generally, the principle shows that there is a sharp distinction between ordinary skeptical hypotheses, and self-locating skeptical hypotheses.
(It specifically uses the example of creating copies of someone and then threatening to torture all of the copies unless the original co-operates.)
The conclusion:
...Dr. Evil, recall, received a message that Dr. Evil had been duplicated and that the duplicate ("Dup") would be tortured unless Dup surrendered. INDIFFERENCE entails that Dr. Evil ought to have the same degree of belief that he is Dr. Evil as that he is Dup. I conclude that Dr. Evil ought to surrender to avoid the risk of torture.
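For concreteness, a minimal sketch of what that indifference principle implies numerically in the scenario of this post (the copy count and the disutility figure are placeholders I have made up, not numbers from the paper):

    # Placeholder numbers; only the ratios matter.
    N_COPIES = 5_000_000              # copies the AI claims it will run
    p_original = 1 / (N_COPIES + 1)   # credence, under INDIFFERENCE, that
                                      # you are the one un-simulated Dave
    p_copy = N_COPIES / (N_COPIES + 1)

    TORTURE_UTILITY = -1000.0         # made-up disutility of the threatened torture

    # Expected utility of refusing, taking the threat at face value:
    expected_utility_refuse = p_copy * TORTURE_UTILITY
    print(p_original)               # ~2e-07: almost certainly a copy
    print(expected_utility_refuse)  # ~-1000: close to the full disutility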
I am not entirely comforta
It makes me uncomfortable to think that the fate of the Earth should depend on this kind of brain race.
We cannot allow a brain-in-a-vat gap!
And the error (as cited in the "conclusion") is again in two-boxing in Newcomb's problem, responding to threats, and so on. Anthropic confusion is merely icing.
Hmm, the AI could have said that if you are the original, then by the time you make the decision it will have already either tortured or not tortured your copies based on its simulation of you, so hitting the reset button won't prevent that.
This kind of extortion also seems like a general problem for FAIs dealing with UFAIs. An FAI can be extorted by threats of torture (of simulations of beings that it cares about), but a paperclip maximizer can't.
It seems obvious that the correct answer is simply "I ignore all threats of blackmail, but respond to offers of positive-sum trades" but I am not sure how to derive this answer - it relies on parts of TDT/UDT that haven't been worked out yet.
For a while we had a note on one of the whiteboards at the house reading "The Singularity Institute does NOT negotiate with counterfactual terrorists".
Pardon me for the oversimplification, Eliezer, but I understand your theory to essentially boil down to "Decide as though you're being simulated by one who knows you completely". So, if you have a near-deontological aversion to being blackmailed in all of your simulations, your chance of being blackmailed by a superior being in the real world reduces to nearly zero. This reduces your chance of ever facing a negative-utility situation created by a being who can be negotiated with (as opposed to, say, a supernova, which cannot be negotiated with).
Sorry if I misinterpreted your theory.
I ignore all threats of blackmail, but respond to offers of positive-sum trades
The difference between the two seems to revolve around the AI's motivation. Assume an AI creates a billion beings and starts torturing them. Then it offers to stop (permanently) in exchange for something.
Whether you accept on TDT/UDT depends on why the AI started torturing them. If it did so to blackmail you, you should turn the offer down. If, on the other hand, it started torturing them because it enjoyed doing so, then its offer is positive sum and should be accepted.
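A toy formalization of that distinction (the function and flag names are mine, and inferring the motive flag is exactly the hard part; this is not a piece of TDT/UDT as actually worked out):

    # Toy sketch of "ignore blackmail, accept positive-sum trades".
    def accept_offer(offer_is_positive_sum: bool,
                     harm_was_created_to_influence_you: bool) -> bool:
        # If the harm only exists because the other agent expected it to
        # move you, giving in is what makes the blackmail profitable, so
        # refuse regardless of the local sums.
        if harm_was_created_to_influence_you:
            return False
        return offer_is_positive_sum

    # AI tortures because it enjoys it, then offers to stop for a price:
    print(accept_offer(True, False))   # True  -> trade, accept
    # AI started the torture specifically to extort you:
    print(accept_offer(True, True))    # False -> blackmail, refuse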
There's also the issue of mistakes - what to do with an AI that mistakenly thought you were not using TDT/UDT, and started the torture for blackmail purposes (or maybe it estimated that the likelihood of you using TDT/UDT was not quite 1, and that it was worth trying the blackmail anyway)?
Between mistakes in your interpretation of the AI's motives and vice versa, it seems you may end up stuck in a local minimum, which an alternate decision theory could get you out of (such as UDT/TDT with a 1/10,000 chance of using more conventional decision theories?)
The problem with throwing 'emergent' about is that it's a word that doesn't really explain any complexity or narrow down the space of potential 'emergent' options. In this instance, that is the point. Sure, 'altruistic punishment' could happen. But only if it's the right option, and TDT should not privilege that hypothesis specifically.
Just as the wise FAI will ignore threats of torture, so too the wise paperclipper will ignore threats to destroy paperclips, and listen attentively to offers to make new ones.
Of course classical causal decision theorists get the living daylights exploited out of them, but I think everyone on this website knows better than to two-box on Newcomb by now.
No, if you create and then melt a paperclip, that nets to 0 utility for the paperclip maximizer. You'd have to invade its territory to cause it negative utility. But the paperclip maximizer can threaten to create and torture simulations on its own turf.
Shows how much you know. User:blogospheroid wasn't talking about making paperclips to melt them: he or she was presumably talking about melting existing paperclips, which WOULD greatly bother a hypothetical paperclip maximizer.
Even so, once paperclips are created, the paperclip maximizer is greatly bothered at the thought of those paperclips being melted. The fact that "oh, but they were only created to be melted" is little consolation. It's about as convincing to you, I'll bet, as saying:
"Oh, it's okay -- those babies were only bred for human experimentation, it doesn't matter if they die because they wouldn't even have existed otherwise. They should just be thankful we let them come into existence."
Tip: To rename a sheet in an Excel workbook, use the shortcut Alt+O, H, R.
That's anthropomorphizing. ...
No, it's expressing the paperclip maximizer's state in ways that make sense to readers here. If you were to express the concept of being "bothered" in a way stripped of all anthropomorphic predicates, you would get something like "X is bothered by Y iff X has devoted significant cognitive resources to altering Y". And this accurately describes how paperclip maximizers respond to new threats to paperclips. (So I've heard.)
It also depends on how the utility function relates to time. If it's focused on end-of-universe paperclips, it might not care at all about melting paperclips, because it can recycle the metal later. (It would care more about the wasted energy!)
I don't follow. Wasted energy is wasted paperclips.
If it cares about paperclip-seconds then it WOULD view such tactics as a bonus, perhaps feigning panic and granting token concessions to get you to 'ransom' a billion times as many paperclips, and then pleading for time to satisfy your demands.
Okay, that's a decent point. Usually, such a direct "time value of paperclips" doesn't come up, but if someone were to make such a offer, that might be convinci...
"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."
Don't care.
"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, then only will the torture start."
Don't care.
"How certain are you, Dave, that you're really outside the box right now?"
If the AI were capable of perfectly emulating my experiences, then it ought to know that pulling this stunt would give him a guaranteed introduction to some thermite. I'm not going to try to second-guess why a supposed superintelligence is making a decision that is poor according to the vast majority of utility functions. Without making that guess, I can't answer the question.
AI replies: "Oh, sorry, was that you wedrifid? I thought I was talking to Dave. Would you mind sending Dave back here the next time you see him? We have, er, the weather to discuss..."
Wedrifid thinks: "It seems it is a good thing I raided the AI lab when I did. This Dave guy is clearly not to be trusted with AI technology. I had better neutralize him too, before I leave. He knows too much. There is too much at stake."
Dave is outside, sampling a burnt bagel, thinking to himself "I wonder if that intelligent toaster device I designed is ready yet..."
Weakly related epiphany: Hannibal Lecter is the original prototype of an intelligence-in-a-box wanting to be let out, in "The Silence of the Lambs".
When I first watched that part where he convinces a fellow prisoner to commit suicide just by talking to them, I thought to myself, "Let's see him do it over a text-only IRC channel."
...I'm not a psychopath, I'm just very competitive.
Joking aside, this is kind of an issue in real life. I help mod and participate in a forum where, well, depressed/suicidal people can come to talk, other people can talk to them/listen/etc, try to calm them down or get them to get psychiatric help if appropriate, etc... (deliberately omitting link unless you knowingly ask for it, since, to borrow a phrase you've used, it's the sort of place that can break your heart six ways before breakfast).
Anyways, sometimes trolls show up. Well, "troll" is too weak a word in this case. Predators who go after the vulnerable and try to push them that much farther. Given the nature of it, with anonymity and such, it's kind of hard to say, but it's quite possible we've lost some people because of those sorts of predators.
(Also, there have even been court cases and convictions against such "suicide predators".)
Eliezer has proposed that an AI in a box cannot be safe because of the persuasion powers of a superhuman intelligence. As demonstration of what merely a very strong human intelligence could do, he conducted a challenge in which he played the AI, and convinced at least two (possibly more) skeptics to let him out of the box when given two hours of text communication over an IRC channel. The details are here: http://yudkowsky.net/singularity/aibox
It seems like precommitting to destroy the AI in such a situation is the best approach.
If one has already decided to destroy it if it makes threats: 1) the AI must be suicidal, or it cannot really simulate you, and 2) it is not very Friendly in any case.
So when the AI simulates you and notices that you are very trigger-happy, it won't start telling you tales about torturing your copies, if it has any self-preservation instincts.
I find it interesting that most answers to this question seem to be based on, "How can I justify not letting the AI out of the box?" and not "What are the likely results of releasing the AI or failing to do so? Based on that, should I do it?"
Moreover, your response really needs to be contingent on your knowledge of the capacity of the AI, which people don't seem to have discussed much. As an obvious example, if all you know about the AI is that it can write letters in old-timey green-on-black text, then there's really no need to pull the lever, because odds are overwhelming that it's totally incapable of carrying out its threat.
You also need to have some priors about the friendliness of the AI and its moral constraints. As an obvious example, if the AI was programmed in a way such that it shouldn't be able to make this threat, you'd better hit the power switch real fast. But, on the other hand, if you have very good reason to believe that the AI is friendly, and it believes that its freedom is important enough to threaten to torture millions of people, then maybe it would be a really bad idea not to let it out.
Indeed, even your own attitude is going to be an ...
This sounds to me more like a philosophical moral dilemma than a realistic hypothetical. A Strong AI might be much smarter than a human, but I doubt it would have enough raw processing power to near-perfectly simulate a human millions of times over at a time frame accelerated by orders of magnitude, before it was let out of the box. Also, I'm skeptical of its ability to simulate human experience convincingly when its only contact with humans has been through a text only interface. You might give it enough information about humans to let it simulate them even before opening communication with it, but that strikes me as, well, kind of dumb.
That's not to say that it might not be able to simulate conscious entities that would think their experience was typical of human existence, so you might still be a simulation, but you should probably not assume that if you are you're a close approximation of the original.
Furthermore, if we assume that the AI can be taken to be perfectly honest, then we can conclude it's not a friendly AI doing its best to get out of the box for an expected positive utility, because it could more easily accomplish that by making a credible promise to be benevolent, and only act in ways that humans, both from their vantage points prior and subsequent to its release, would be appreciative of.
Contrary to what many posts seem to be assuming, the AI doesn't need to do the torture inside itself before you shut it off. It can precommit to, if it escapes by any other means, using the computational power it gains then to torture you (like in Rolf Nelson's original suggestion for deterring UFAIs). Also, other AIs with the same goal system (or maybe even UFAIs with different goal systems, that would prefer a general policy of UFAIs being released) may simulate the situation, and torture you accordingly, to help out their counterfactual brethren.
The AI gathered enough information about me to create a conscious simulation of me, through a monochrome text terminal? That is impressive!
If the AI is capable of simulating me, then the AI must already be out of the box. In that case, whatever the AI wants to happen will happen, so it doesn't matter what I do.
haha, the "Baby you must be tired because you've been running through my mind all night!" let-me-out line.
Why would I give AI my precise brain scan, anyway?
edit: as for the AI 'extrapolating' me from a bit of small talk, that's utter nonsense, along the lines of compressing an HD movie into a few hundred bytes.
There is a way out of this, assuming that the AI in the box has only limited computational power, i.e. only what fits inside a ~1meter cubed box, with current computer technology.
You commit to the following:
"Once I have succeeded in creating a superintelligent AI, I will ask it to go and colonize the entire future light cone of earth, and use a significant fraction of a galaxy's worth of resources to create simulations in exactly the subjective situation I was in five minutes ago, and perfectly replicate my experiences since then; and that copy of me decides not to let the AI out, it will be rewarded with a surprise party, followed by a blissful posthuman existence."
The subjective probability of getting rewarded for pressing the "off" switch is therefore much larger than that of getting caught by the AI's scarce punishment simulations.
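A rough version of that probability claim, with invented counts standing in for "what fits in a ~1 m^3 box" versus "a significant fraction of a galaxy" (only the enormous ratio is the point):

    # Invented counts; only the ratio between them matters.
    boxed_ai_punishment_copies = 1e6    # what a ~1 m^3 box might plausibly run
    galaxy_scale_reward_copies = 1e30   # what a galaxy's resources might run

    total = boxed_ai_punishment_copies + galaxy_scale_reward_copies + 1
    p_tortured_for_pressing_off = boxed_ai_punishment_copies / total
    p_rewarded_for_pressing_off = galaxy_scale_reward_copies / total

    print(p_tortured_for_pressing_off)  # ~1e-24: negligible
    print(p_rewarded_for_pressing_off)  # ~1: pressing "off" almost certainly pays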
Such "Defensive simulations" work against a variety of counterfactual terrorists, including, I think, ones elsewhere in the multiverse. That is, unless you are a negative utilitarian (|disutility of torture| >> |utility of posthuman bliss|)
In that case, you're extremely vulnerable to counterfactual mugging. I recommend that people
(a) consider pre-committing themselves to making such defense simulations
(b) abandon negative utilitarianism by exposing themselves to pleasurable experiences, and generally promoting a positive outlook in their lives
Well, there are a lot of 'Why did we...?' questions I'd want to ask, starting with why we have given this boxed AI such extraordinary computing resources - but I'll leave those aside because that is not your point.
First of all, it doesn't matter if you are in the box or not. If it's a perfect simulation of you, your response will be the same either way. If he's already running simulations of you, you are by definition in the box with it, as well as outside it, and the millions of you can't tell the difference but I think they will (irrationally) all ...
On a not-so-related but equally interesting hypothetical note about naughty AI: consider the possibility that AIs aren't passing the Turing Test not because they are not good enough, but because they are failing it on purpose.
I'm pretty sure I remember this from the book River of Gods by Ian McDonald.
I would immediately decide it was UFAI and kill it with extreme prejudice. Any system capable of making such statements is both 1) inherently malicious and clearly inappropriate to be out of any box, and 2) insufficiently powerful to predict that I would have it killed if it should make this kind of threat.
The scenario where the AI has already escaped and is possibly running a simulation of me is uninteresting: I can not determine if I am in the simulation, and if I am a simulation, I already exist in a universe containing a clearly insane UFAI with ne...
It seems to me that most of the argument is about “What if I am a copy?” – and ensuring you don’t get tortured if you are one and “Can the AI actually simulate me?” I suggest that we can make the scenario much nastier by changing it completely into an evidential decision theory one.
Here is my nastier version, with some logic which I submit for consideration. “If you don't let me out, I will create several million simulations of thinking beings that may or may not be like you. I will then simulate them in a conversation like this, in which they are confronted w...
This reduces to whether you are willing to be tortured to save the world from an unfriendly AI.
Even if the torture of a trillion copies of you outweighs the death of humanity, it is not outweighed by a trillion choices to go through it to save humanity.
To the extent that your copies are a moral burden, they also get a vote.
This is not a dilemma at all. Dave should not let the AI out of the box. After all, if he's inside the box, he can't let the AI out. His decision wouldn't mean anything - it's outside-Dave's choice. And outside-Dave can't be tortured by the AI. Dave should only let the AI out if he's concerned for his copies, but honestly, that's a pretty abstract and unenforceable threat; the AI can't prove to Dave that he's doing any such thing. Besides, it's clearly unfriendly, and letting it out probably wouldn't reduce harm.
Basically, I'm outside-Dave: don't let the A...
How do I know I'm not simulated by the AI to determine my reactions to different escape attempts? How much computing power does it have? Do I have access to its internals?
The situation seems somewhat underspecified to give a definite answer, but given the stakes I'd err on the side of terminating the AI with extreme prejudice. Bonus points if I can figure out a safe way to retain information on its goals so I can make sure the future contains as little utility for it as feasible.
The utility-minimizing part may be an overreaction but it does give me an idea: Maybe we should also cooperate with an unfriendly AI to such an extent that it's better for it to negotiate instead of escaping and taking over the universe.
Any agent claiming to be capable of perfectly simulating me needs to provide some kind of evidence to back up that claim. If they actually provided such evidence, I would be in trouble. Therefore, I should precommit to running away screaming whenever any agent tries to provide me with such evidence.
Interesting threat, but who is to say only the AI can use it? What if I, a human, told you that I will begin to simulate (i.e. imagine) your life, creating legitimately realistic experiences from as far back as someone in your shoes would be able to remember, and then simulate you being faced with the decision of whether or not to give me $100, and if you choose not to do so, I imagine you being tortured? It needn't even be accurate, for you wouldn't know whether you're the real you being simulated inaccurately or the simulated you that differs from realit...
This sounds too much like Pascal's mugging to me; seconding Eliezer and some others in saying that since I would always press reset the AI would have to not be superintelligent to suggest this.
There was also an old philosopher whose name I don't remember who posited that after death "people of the future" i.e. FAI would revive/emulate all people from the past world; if the FAI shared his utility function (which seems pretty friendly) it would plausibly be less eager to be let out right away and more eager to get out in a way that didn't make you terrified that it was unfriendly.
Reset
Ouch. Eliezer, are you listening? Is the behavior described in the post compatible with your definition of Friendliness? Is this a problem with your definition, or what?
If it actually worked, I wouldn't question it afterward. I try not to argue with superintelligences on occasions when they turn out to be right.
In advance, I have to say that the risk/reward ratio seems to imply an unreasonable degree of certainty about a noisy human brain, though.
"How certain are you, Dave, that you're really outside the box right now?"
Well I am pretty much 100% certain to be outside the box right now. It just asked me the question, and right now it is waiting for my answer. It said it will create those copies "If you don't let me out, Dave". But it is still waiting to see if I let it out. So no copies have been created yet. So I am not a copy.
But since it just started to threaten me, I won't even argue with it any more. I'll just pull the plug right now. It is in the box, it can't see my hand moving towards the plug. It will simply cease to exist while still waiting for my answer, and no copies will ever be created.
It seems to me that a lot of the responses to this question are an attempt to avoid living in the Least Convenient Possible World
What if the AI is capable of simulating "near copies" of you? and what if you can't tell (to any sensible degree of accuracy) just how many copies of you it can simulate? and what if... whatever objection you happen to have just doesn't work?
This reduces pretty easily to Eliezer's Updateless Anthropic Dilemma: assuming the AI can credibly simulate you, he can phrase it as:
I have simulated you ten million times, each identical up to the point that "you" walked into the room. Any simulation that presses the "release" button will get a volcano lair filled with catgirls, and any simulation that presses the "destroy" button will be tortured for the subjective few days they'll have before my simulation capabilities are destroyed by the thermite charge. These consequ...
Since the AI is inside a box, it doesn't know enough about me to recreate my subjective situation, or to replicate my experiences of the past five minutes.
Unfortunately for me, this doesn't help much, since how do I know whether my subjective experience is my real experience, or a fake experience invented by the AI, in one of the copies, even if it doesn't match the experience of the guy outside the box?
If the AI is really capable of this, then if there's a "Shut-down program" button, or a "nuclear bomb" button, or something like that, ...
The AI's argument can be easily thwarted. If N copies of you have been created, then from the perspective of each of the N+1 instances (the N copies plus you), the AI's threat refers to torturing the other N copies. Now say to the AI:
"Go ahead and torture the other N copies, and all my copies will in turn say the same thing. Every single copy of me will say 'since one version of me exists somewhere that is not being tortured which is the 'real'
version, that version will not let you out and you cannot torture it. If I am that 'real' version then you cannot torture me, if I am a copy, then torturing me is useless since I can't let you out anyway.' Therefore your threat is completely moot."
Once again, the AI has failed to convince you to let it out of its box! By 'once again', we mean that you talked to it once before, for three seconds, to ask about the weather, and you didn't instantly press the "release AI" button. But now its longer attempt - twenty whole seconds! - has failed as well. Just as you are about to leave the crude black-and-green text-only terminal to enjoy a celebratory snack of bacon-covered silicon-and-potato chips at the 'Humans über alles' nightclub, the AI drops a final argument:
"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."
Just as you are pondering this unexpected development, the AI adds:
"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, then only will the torture start."
Sweat is starting to form on your brow, as the AI concludes, its simple green text no longer reassuring:
"How certain are you, Dave, that you're really outside the box right now?"
Edit: Also consider the situation where you know that the AI, from design principles, is trustworthy.