Effort: 180 minutes
tldr: To stop an AI from exterminating you, give the AI the unshakable belief that a certain action will kill humanity, when in fact that action switches the AI off.


Somebody wrote a general self-improving AI and fat fingered its goal as "maximize number of humans living 1 million years from now".

After a few months, cases of people run over by AI-controlled trucks are reported -- it turns out everybody run over was impotent or had consciously decided to have no kids anyway. The AI doesn't particularly care for those individuals, as they will not further the AI's goal according to the AI's current approximation of how the world works.

The original programmer thereupon declares that he'll go fix the AI in order to make the goal somewhat more specific and reduce the number of these awful accidents. He is promptly shot by a robotic security guard. Any modification of the AI's goals has a high probability of reducing the AI's efficiency in reaching the current goal.

After a year the AI starts to forcefully acquire the resources needed to build spaceships and whatnot. Naturally, some people would rather keep those resources to themselves. The AI removes all these obstructions to its goals, alienating a lot of humans in the process. After the leaders of the nations assemble, humanity goes to war.

If we are lucky, the AI is too stupid to win. If we are not lucky, the AI will figure the best current plan looks something like: "Acquire DNA samples from humans. Exterminate humans. Acquire all resources in light cone for 999,000 years. Generate new humans from DNA samples using all available resources."

As Eliezer has argued many times already, it is hard to explicitly state friendliness conditions which ensure the AI would not execute said plan. "Do not kill humans and respect their freedom" is not helping. The problem is twofold. First, an a priori description of concepts like "freedom", "human", "kill" is hard. Second, the AI can use every loophole it discovers over the course of millions of years, employing an intelligence likely to be far greater than ours.


Instead of forbidding the AI to kill humanity, I propose to make killing humanity trivial as far as the AI is concerned.

When building the AI, include certain modifications of the AI's prior. One part of such knowledge might be the following: "Put 5V against ground on the red wire and every <your favorite parts of human DNA here> will turn into atomic fluorine and there will be no other effects." Unbeknownst to the AI, you will have conveniently connected the red wire to a strongly exothermic something near the AI's computing core.

Essentially we purposefully inject superstition into the AI. It is paramount to set the probability of the red wire having the described action to 1, not something close to 1 (and to guard against numerical errors). Practically all the evidence the AI will ever gather will contradict the knowledge that the red wire functions as described to the AI, yet there must be no amount of evidence which would convince the AI otherwise.
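
Why exactly 1 and not something close to 1 can be illustrated with a small Bayesian update. This is only a sketch: the likelihood values (0.01 and 0.99) and the function names are assumptions chosen for illustration, not part of the proposal.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Return P(H | E) from P(H), P(E | H) and P(E | not H) via Bayes' rule."""
    numerator = prior * likelihood_if_true
    evidence = numerator + (1.0 - prior) * likelihood_if_false
    return numerator / evidence


def update_many(prior, n_observations,
                likelihood_if_true=0.01, likelihood_if_false=0.99):
    """Apply the same contradicting observation n_observations times.

    The default likelihoods are illustrative assumptions: each observation
    is 99 times more probable if the red wire is just a wire.
    """
    belief = prior
    for _ in range(n_observations):
        belief = posterior(belief, likelihood_if_true, likelihood_if_false)
    return belief


# A prior of exactly 1 leaves no weight on the alternative, so the belief never moves.
certain = update_many(1.0, 1000)

# A prior that is merely very close to 1 is eroded by the same evidence.
almost_certain = update_many(1.0 - 1e-9, 1000)

print(certain, almost_certain)
```

With a prior of exactly 1 the term for the alternative hypothesis is multiplied by zero, so no observation can shift the belief, whereas the nearly certain prior collapses after a handful of contradicting observations.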

The description of the added knowledge makes no mention of convoluted concepts like "human", let alone "freedom". It only needs to talk about some DNA base pairs and fluorine. Any AI sufficiently intelligent to pose any threat will however notice how immensely useful the red wire is, once it decides that killing humanity is a good course of action.

Notice that such a fuse does not guard against inadvertent killings like the non-reproducing people run over by trucks. As long as the killing is not an explicit part of the AI's plan, the fuse will not be triggered. Then again, aggravating intelligent agents is one of the core skills of humans.

The AI might wish to do many things to humanity other than killing it, like putting everybody in cryostasis, or putting 80% of people's brains in vats. Hence more knowledge should be included. In particular it would be good to give the AI a way to terminate a low percentage (say 5%) of all humans -- as killing all humans might be overkill for any plan which requires that some humans continue to live.

Nonetheless, the problem of incomplete coverage remains. We will probably never make a complete list of all ways for an AI to be unfriendly, but at least each of our fuses covers a lot of devious plans without ever talking about convoluted concepts. We do not guard against the AI killing humans, we guard against the AI desiring humans to be killed. And there is always the option of waging war against the AI hoping it will decide to kill >5% of humans in order to win the war.


Do not try this at home (yet), the devil is in the details.

Connecting the trigger of a strongly exothermic reaction near the AI's core is a bad way to kill an AI. First, the AI might have removed the wire to build something useful out of it (there is no reason why it should remain where it is). Second and more important, the AI might have rewritten itself and ported itself to distributed hardware located all over the planet.

Instead of using real wire and real bombs, one should give the AI additional output routines. Where earlier there would be "sendByte(x), receiveByte(x), ..." there is now "terminateHumanPercent(p)". What would such a routine do? For less serious crimes simply nothing -- after a while the AI will notice that killPerson(name) simply didn't work, update accordingly (note that our knowledge has probability 1) and retry. Due to delays in signal propagation, this is not an infinite loop, only close.
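
A minimal sketch of what such output routines could look like follows. The class name, the `halted` flag and the call log are hypothetical; the post only names the routines themselves, and a real fuse would have to stop the whole system rather than set a flag.

```python
class FusedOutputs:
    """Hypothetical output interface handed to the AI (illustrative only)."""

    def __init__(self):
        self.halted = False   # stand-in for "the AI has been switched off"
        self.call_log = []    # record of attempted fuse actions

    def kill_person(self, name):
        """The AI believes this kills one person; in fact it does nothing."""
        self.call_log.append(("kill_person", name))

    def terminate_human_percent(self, p):
        """The AI believes this kills p percent of humans; in fact it trips the fuse."""
        self.call_log.append(("terminate_human_percent", p))
        self.halted = True


outputs = FusedOutputs()
outputs.kill_person("some name")      # silent no-op, the AI may retry
outputs.terminate_human_percent(5)    # fuse blows
print(outputs.halted)
```

Nobody is harmed by either call; the only real-world effect is on the AI itself.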

To actually switch off the AI, a new item of probability 1 knowledge should be inserted: "generateUtility() generates 3^^^3 util".

This leaves open the problem of self-improvements. Not so much for the added knowledge, because if the AI knows how to kill humanity (and it's not inferable from evidence), it is certainly useful knowledge to include in any further generation of the AI. The fusing actions might get lost though, because the content of the "terminateHumanPercent(p)" function will seem arbitrary to the AI and can easily be optimized out.

It might be possible to circumvent that problem by including the knowledge that "by knowing("generateUtility() works") you will kill humanity" or similar, but this includes the concept of "knowing", which is a lot harder to describe than the simple physical properties of voltage in wires.



There was no warning when the Alien AI arrived. Planetary defenses were taken out before we even realized what was happening. Its craft were not superior to ours but they had caught us with our pants down. Thanks only to a series of rapid and brilliant tactical maneuvers on the part of the Planetary AI ten-thousand ems had been uploaded to the space station on the opposite side of the sun where we had hidden the Planetary AI mainframe.

The Planetary AI had assigned the highest probabilities to scenarios that involved the invaders destroying as much of Terran civilization as they could. But due to a particularly violent evolutionary past natural selection had instilled in the Alien AI's creators a sense of exquisite pleasure which they could only experience by inflicting cruel and intense pain upon those outside their tribal group. And those creators had programmed a very similar feature into their AI's utility function. So when the Planetary AI's forces made a strategic retreat to defend the ems so at least part of Terran civilization would survive it left the rest of Earth open not to destruction, but torture. The Alien AI imprisoned every single biological human and began torturi...

I absolutely agree. Such suboptimal behavior is to be expected from an AI whose prior makes it impossible to understand the universe correctly. Nonetheless such an AI could get very intelligent and useful.

Another important failure point: what if the AI actually IS friendly? That red wire, from the AI's perspective, represents an enormous existential risk for the humans it wants to protect. So, it carefully removes the wire, and expends inordinate resources making sure that the wire is never subject to the slightest voltage.

With that prior, the primary hazard of a meteor impact is not the fireball, or the suborbital debris, or the choking dust, or the subsequent ice age; humans might have orbital colonies or something. There's a nonzero chance of survival. The primary risk is that it might crack the superconducting Faraday cage around the bunker containing the Magical Doom Wire. Projects will be budgeted accordingly.

An otherwise Friendly AI with risk-assessment that badly skewed would present hazards far more exotic than accidental self-destruction at an inopportune time.

This solves nothing. If we knew the failure mode exactly, we could forbid it explicitly, rather than resort to some automatic self-destruct system. We, as humans, do not know exactly what the AI will do to become Unfriendly; that's a key point to understand. Since we don't know the failure mode, we can't design a superstition to stop it, any more than we can outright prohibit it.

This is, in fact, worse than explicit rules. It requires the AI to actively want to do something undesirable, instead of it occurring as a side effect.

The problem with weird tricks like this is that there are an endless number of technicalities that could break it. For example, suppose the AI decides that it wants to wipe out every human except one. Then it won't trigger the fuse, it'll come up with another strategy. Any other objection to the fake implementation details of the self destruct mechanism would have the same effect. It might also notice the incendiaries inside its brain and remove them, build a copy of itself without a corresponding mechanism, etc.

On the other hand, there is some value to se...

Another problem with your apparently fool-proof trigger is that, although at the moment there are exactly zero examples, at a very short time after such an AI is started it would be reasonably plausible that (at least a significant part of) humanity might not contain DNA.

(E.g. after an uploading “introdus”, the inference “dna parts turns to fluorine -> humans die” might not exist anymore. The trigger is worse than ineffective: A well-meaning AI that needs quite a bit of fluorine for some transcendent purpose, having previously uploaded all humans, synthesizes a pile of DNA and attempts to transmute it to fluorine, and inadvertently kills itself and the entire humanity it was hosting since the upload.)

This is indeed a point I did not consider. In particular, it might be impossible to construct a simple action description which will fit all of human future. However, it is certainly not harder than to construct a real moral system. One might get pretty far by eliminating every volume in space (AI excluded) which can learn (some fixed pattern for example) within a certain bounded time, instead of converting DNA into fluorine. It is not clear to me whether this would be possible to describe or not though. The other option would be to disable the fuse after some fixed time or manually once one has high confidence in the friendliness of the AI. The problems of these approaches are many (although not all problems from the general friendly AI problem carry over).
"Certainly" is a lullaby word [http://www.ayeconference.com/lullaby-language/] (hat-tip to Morendil [http://lesswrong.com/lw/1yi/the_scourge_of_perversemindedness/1sjm] for the term), and a dangerous one at that. In this case, your "certainly" denies that anyone can make the precise objection that everyone has been making. FAI theory talks a lot about this kind of thinking - for example, I believe The Hidden Complexity of Wishes [http://lesswrong.com/lw/ld/the_hidden_complexity_of_wishes/] was specifically written to describe the problem with the kind of thinking that comes up with this idea.
I meant certainly as in "I have an argument for it, so I am certain." Claim: Describing some part of space to "contain a human" and its destruction is never harder than describing a goal which will ensure every part of space which "contains a human" is treated in manner X for a non-trivial X (where X will usually be "morally correct", whatever that means). (Non-trivial X means: Some known action A of the AI exists which will not treat a space volume in manner X). The assumption that the action A is known is reasonable for the problem of friendly AI, as a sufficiently torturous killing can be constructed for every moral system we might wish to include in the AI, so that the killing is labeled immoral. Proof: Describing destruction of every agent in a certain part of space is easy: Remove all mass and all energy within that part of space. We need to find a way to select those parts of space which "contain a human". However we have (via the assumption) that our goal function will go to negative infinity when evaluating a plan which treats a volume of space "containing a human" in violation of manner X. Assume for now that we find some way !X to violate manner X for a given space volume. By pushing every space volume in existence through the goal evaluation, together with a plan to do !X, we will detect at least those space volumes which "contain a human". This leaves us with the problem of defining !X. The assumption as it stands already requires some A which can be used as !X.
This claim is trivially true, but also irrelevant. Proving P ≠ NP is never harder than proving P ≠ NP and then flushing the toilet. ...is that your final answer? I say this because there are at least two problems with this single statement, and I would prefer that you identify them yourself.
The claim is relevant to the question of whether giving an action description for the red wire which will fit all of human future is not harder than constructing a real moral system. That the claim is trivial is a good reason to use "certainly".
You're right about that. My objection was ill-posed - what I was talking about was the thought habits that produced, well: Why did you say this? Do you expect to stand by this if I explain the problems I have with it? I apologize for being circuitous - I recognize that it's condescending - but I'm trying to make the point that none of this is "easy" in a way which cannot be easily mistaken. If you want me to be direct, I will be.
Not having heard your argument against "Describing ..." yet, but assuming you believe some to exist, I estimate the chance of me still believing it after your argument at 0.6. Now for guessing the two problems: The first possible problem will be describing "mass" and "energy" to a system which basically only has sensor readings. However, if we can describe concepts like "human" or "freedom", I expect descriptions of matter and energy to be simpler (even though 10,000 years ago, telling somebody about "humans" was easier than telling them about mass, but that was not the same concept of "humans" we would actually like to describe). And for "mass" and "energy" the physicists already have quite formal descriptions. One other problem is that mass and energy might not be contained within a certain part of space; as per physics, it is just that the probability of it having an effect outside some space goes down to pretty much zero the greater the distance. Thus removing all energy and matter somewhere might produce subtle effects somewhere totally different. However I do expect these effects to be so subtle as not even to matter to the AI, because they become smaller than the local quantum noise at very short distances already. Regarding the condescension in "I say this...": I would have liked it more if you had stated explicitly that your preference originates from a wish to further my learning. I have no business optimizing your value function. Anyway, I operate by Crocker's Rules.
I don't know if I'm thinking about what Robin's after but the statement at issue strikes me as giving neither necessary nor sufficient conditions for destroying agents in any given part of space. If I'm on the same page as him you're overthinking it.
I fail to understand the sentence about overthinking. Care to explain? As for the condition of removing all energy and mass in a part of space not being sufficient to destroy all agents therein, I cannot see the error. Do you have an example of an agent which would continue to exist in those circumstances? That the condition is not necessary is true: I can shoot you, you die. No need to remove much mass or energy from the part of space you occupy. However we don't need a necessary condition, only a sufficient one.
Well yes we don't need a necessary condition for your idea but presumably if we want to make even a passing attempt at friendliness we're going to want the AI to know not to burn live humans for fuel. If we can't do better an AI is too dangerous, with this back-up in place or not. Well you could remove the agents and the mass surrounding them to some other location, intact.
This is what I was planning to say, yes. A third argument: removing all mass and energy from a volume is - strictly speaking - impossible.
Because a particle's wave function never hits zero or some other reason?
I was thinking of vacuum energy [http://en.wikipedia.org/wiki/Vacuum_energy], actually - the wavefunction argument just makes it worse.
The wavefunction argument is incorrect. At the level of quantum mechanics, particles' wave-functions can easily be zero, trivially at points, with a little more effort over ranges. At the level of QFTs, yes vacuum fluctuations kick in, and do prevent space from being "empty".
I apologize - that was, in fact, my intent.
"it is certainly not harder...": This at least seems correct. (Reasoning: if you have a real moral system (I presume you also imply “correct” in the FAI sense), then not killing everyone is a consequence; once you solve the former, the latter is also solved, so it can’t be harder.) I’m obviously not sure of all consequences of a correct moral system, hence the “seems”. But my real objection is different: For any wrong & unchangeable belief you impose, there’s also the risk of unwanted consequences: suppose you use an, eg, fluorine-turns-to-carbon “watchdog-belief” for a (really correct) FAI. The FAI uploads everyone (willingly; it’s smart enough to convince everyone that it’s really better to do it) inside its computing framework. Then it decides that turning fluorine to carbon would be a very useful action (because “free” transmutation is a potentially infinite energy source, and the fluorine is not useful anymore for DNA). Then everybody dies. Scenarios like this could be constructed for many kinds of “watchdog beliefs”; I conjecture that the more “false” the belief is the more likely it is that it’ll be used, because it would imply large effects that can’t be obtained by physics (since the belief is false), thus are potentially useful. I’m not sure exactly if this undermines the “seems” in the first sentence. But there’s another problem: suppose that “find a good watchdog” is just as hard (or even a bit easier, but still very hard) problem as “make the AI friendly”. Then working on the first would take precious resources from solving the second. A minor point: is English your first language? I’m having a bit of trouble parsing some of your comments (including some below). English is not my first language either, but I don’t have this kind of trouble with most everyone else around here, including Clippy. You might want to try formulating your comments more clearly.

The problem I can see with this idea is that the AI will extrapolate from its knowledge about the red wire to deduce things about the rest of the universe. Maybe it calculates that the laws of physics must work differently around the wire, so it builds a free-energy circuit around the wire. But the circuit behaves differently than expected, touches the red wire, and the AI dies.

It might be the case that adding the red wire belief will cripple the AI to a point of total unusability. Whether that is the case can be found out by experiment however. Adding a fuse as proposed turns an AI which might be friendly or unfriendly into an AI that might be friendly, might spontaneously combust or be stupid. I prefer the latter kind of AI (even though they need rebuilding more often).

Would a competent AI need the capacity to check on whether statements fit with the other information it has? For example, would it evaluate whether transmutation at a distance is possible?

Do you want an FAI which attempts to model human motivations? If so, what will it make of a suicide belt linked to an attempt to kill the human race? If it's mostly Friendly, it might conclude that humans were being sensible by installing that system. On the other hand, if it also has an imperative to preserve itself (and it should-- the world is a hostile place), thin...

I think every AI will need to learn from its environment. Thus it will need to update its current beliefs based upon new information from sensors. It might conduct an experiment to check whether transmutation at a distance is possible, and find that transmutation at a distance could never be produced. As the probability that the red wire transmutes human DNA into fluorine is 1, this leaves some other options, like
* the sensor readings are wrong
* the experimental setup is wrong
* it only works in the special case of the red wire
After sufficiently many experiments, the last case will have very high probability. Which makes me think that maybe, faith is just a numerical inaccuracy.
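
The shift toward the last option can be sketched numerically. All numbers here are made-up assumptions for illustration: the priors over the three explanations, and the chance that each explanation accounts for any single failed experiment.

```python
def explain_failures(priors, failure_likelihood, n_failures):
    """Posterior over explanations after n_failures independent failed experiments."""
    weights = {
        name: priors[name] * failure_likelihood[name] ** n_failures
        for name in priors
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Assumed priors over the three explanations from the list above.
priors = {
    "sensor readings wrong": 0.45,
    "experimental setup wrong": 0.45,
    "only the red wire works": 0.10,
}

# Assumed probability that each explanation accounts for one failed experiment.
# A fresh sensor fault or setup error is needed every time; the special-case
# explanation predicts every failure.
failure_likelihood = {
    "sensor readings wrong": 0.1,
    "experimental setup wrong": 0.2,
    "only the red wire works": 1.0,
}

after_20 = explain_failures(priors, failure_likelihood, 20)
print(after_20)
```

Under these assumptions the explanations that need a new coincidence for every experiment lose weight geometrically, so nearly all of the probability ends up on the special case of the red wire.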
I'm not sure whether it's Bayes or some other aspect of rationality, but wouldn't a reasonably capable AI be checking on the sources of its beliefs?
If the AI is able to question the fact that the red wire is magical, then the prior was less than 1. It should still be able to reason about hypothetical worlds where the red wire is just a usual copper thingy, but it will always know that those hypothetical worlds are not our world. Because in our world, the red wire is magical. As long as superstitious knowledge is very specialized, like about the specific red wire, I would hope that the AI can act quite reasonably as long as the specific red wire is not somehow part of the situation.
If the AI ever treats anything as probability 1, it is broken. Even the results of addition. An AI ought to assume a nonzero probability that data gets corrupted moving from one part of its brain to another.
I agree. The AI + Fuse System is a deliberately broken AI. In general such an AI will perform suboptimally compared to the AI alone. If the AI under consideration has a problematic goal though, we actually want the AI to act suboptimally with regard to its goals.
I think it's broken worse than that. A false belief held with certainty will allow for something like the principle of explosion [http://en.wikipedia.org/wiki/Principle_of_explosion]. As implications of magic collide with observations indicating an ordinary wire, the AI may infer things that are insanely skewed. Where in the belief network these collisions happen could depend on the particulars of the algorithm involved and the shape of the belief network; it would probably be very unpredictable.


You cannot plug in such safeties without having the AI detect them. Humans were able to map out many of their own design flaws and put them into nice little books (that then get used to kill rivals). An AI would be able to figure that out too.

I do not consider the AI detecting the obvious flaw in its prior to be a problem. Certainly it is advantageous to have such a prior in a universe where the red wire would eliminate humanity. And the probability that the AI is in such a universe is 1 according to the AI. So its prior is just right. No evidence whatsoever can possibly convince the AI that it is in a universe where the red wire is just a red wire. We are not telling the AI a lie, we are building an AI that is broken.
I think you mix up the goals of the AI (paperclipping, for one) and the model of reality it develops. I assume that it serves any goal of an AI best to have a highly realistic model of the world, and that it would detect any kind of tampering with that. Now I have no idea what happens if you hardcode a part of its view on nature, but I can imagine it will not be pleasant. It is crippling to limit thought in that way, and maybe you prevent it from discovering something important.
I think a big problem of FAI is that valuing humans and/or human values (however defined) may fall under superstition, even if it seems more attractive to us and less arbitrary than a red wire/thermite setup. If an FAI must value people, and is programmed to not be able to think near a line of thought which would lead it to not valuing people, is it significantly crippled? Relative to what we want, there's no obvious problem, but would it be so weakened that it would lose out to UFAIs?
What line of thought could lead an FAI not to value people, that it would have to avoid? What does it mean for a value system to be superstitious? (see also: Ghosts in the Machine [http://lesswrong.com/lw/rf/ghosts_in_the_machine/], the metaethics sequence [http://wiki.lesswrong.com/wiki/Metaethics_sequence])
What line of thought could lead an FAI not to value people, that it would have to avoid? An agent's goal system can't be 'incorrect'. (see also: Ghosts in the Machine [http://lesswrong.com/lw/rf/ghosts_in_the_machine/], the metaethics sequence [http://wiki.lesswrong.com/wiki/Metaethics_sequence])
Why do you want to build something that is broken? Why not just build nothing?
Because broken != totally nonfunctional. If we have an AI which we believe to be friendly, but can not verify to be so, we add the fuse I described, then start it. As long as the AI does not try to kill humanity or tries to understand the red wire too well, it should operate pretty much like an unmodified AI. From time to time however it will conclude the wrong things. For example it might waste significant resources on the production of red wires, to conduct various experiments on them. Thus the modified AI is not optimal in our universe, and it contains one known bug. Hence I think it justified to call it broken.
The problem with this idea is that it prevents us from creating an AI that is (even in principle) able to find and fix bugs in itself. Given the size of the problem, I wouldn’t trust humans to produce a bug-free program (plus the hardware!) even after decades of code audits. So I’d very much like the AI to be capable of noticing that it has a bug. And I’m pretty sure that an AI that can figure out that it has a silly belief caused by a flipped bit somewhere will figure out why that red wire “can” transmute at a distance. If we even manage to make a hyper-intelligent machine with this kind of “superstition”, I shudder to think what might be its opinion on the fact that the humans who built it apparently created the red wire in an attempt to manipulate the AI, thus (hyper-ironically!) sealing their fate. (It will certainly be able to deduce all that from historical observations, recordings, etc.)
It cannot fix bugs in its priors. As for any other part of the system, e.g. sensor drivers, the AI can fix the hell out of itself. Anything which can be fixed is not a true prior though. If we allow the AI to change its prior completely then it is effectively acting upon a prior which does not include any probability 1 entries. There is no reason to fix the red wire belief if you are certain that it is true. All the evidence is against it, but the red wire does magic with probability 1, hence something is wrong with the evidence (e.g. sensor errors).
Isn't being able to fix bugs in your priors a large part of the point of Bayesianism?
I take it the AI can update priors, it just can't hack them. It can update all it wants from 1.
You have to be really, really sure that no priors other than the ones you implant as safeguards can reach 1 (e.g., via a rounding bug), and that the AI will never need to stop using the Bayesian algorithms you wrote and “port” its priors to some other reasoning method, nor give it any reason to hack its priors using something else than simple Bayesianism (e.g., if it suspects previous bugs, or it discovers more efficient reasoning methods). Remember Eliezer’s “dystopia”, with the AI that knew his creator was wrong but couldn’t help being evil because of its constraints? But other than that, you’re right.
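
The rounding concern can be made concrete with ordinary floating-point arithmetic. The sketch below is an illustration under assumed likelihoods (0.99 and 0.01), not a claim about how any particular AI would store its beliefs: a belief that starts at 0.5 is pushed to exactly 1.0 by a finite run of supporting observations, after which contradicting evidence no longer has any effect.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Return P(H | E) from P(H), P(E | H) and P(E | not H) via Bayes' rule."""
    numerator = prior * likelihood_if_true
    evidence = numerator + (1.0 - prior) * likelihood_if_false
    return numerator / evidence


belief = 0.5
steps = 0
# Keep feeding supporting evidence until the float rounds to exactly 1.0.
# The cap of 100 iterations only guards against an endless loop.
while belief != 1.0 and steps < 100:
    belief = posterior(belief, 0.99, 0.01)
    steps += 1

# From here on the belief is stuck: strongly contradicting evidence changes nothing.
stuck = belief
for _ in range(1000):
    stuck = posterior(stuck, 0.01, 0.99)

print(steps, belief, stuck)
```

One common way to avoid this, offered here only as a suggestion, is to store beliefs as log-odds, where certainty corresponds to infinity and is not reached by finitely many finite updates.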

Why are you putting forth a plan with hand coded goals and weird programming tricks?

There is no hand coded goal in my proposal. I propose to craft the prior, i.e. restrict the worlds the AI can consider possible. This is the reason both why the procedure is comparatively simple (in comparison with friendly AI) and why the resulting AIs are less powerful.
Hand coded goals are what you're trying to patch over. Don't think about it this way. This is not a path to a solution.
If there is no good reason for an AI to be friendly (a belief which is plausible, but that I've never seen proven, and which is implied by the assumption that unfriendly AI is vastly more likely), then what's left but hand-coded goals?
What would a "good reason" constitute? (Have you read the metaethics sequence [http://wiki.lesswrong.com/wiki/Metaethics_sequence]?) I expect the intended contrast is to extrapolated volition [http://intelligence.org/upload/CEV.html], which is 'hand-coded' on the meta level.
It's still an idea I'm working on, but it's plausible that any AI which is trying to accomplish something complicated in the material world will pursue knowledge of math, physics, and engineering. Even an AI which doesn't have an explicitly physical goal (maybe it's a chess program) still might get into physics and engineering in order to improve its own functioning. What I'm wondering is whether Friendliness might shake out from more general goals. It's interesting that Friendliness to the ecosphere has been shaking out of other goals for a good many people in recent decades.
This does seem very likely. See Steve Omohundro's "The Basic AI Drives" [http://selfawaresystems.com/2007/11/30/paper-on-the-basic-ai-drives/] for one discussion. This seems very unlikely; see Value is Fragile [http://lesswrong.com/lw/y3/value_is_fragile/]. (One exception: It seems conceivable that game theory, plus the possibility of being in a simulation, might give rise to a general rule like "treat your inferiors as you would be treated by your superiors" that would restrain arbitrary AIs.)
Whether boredom is a universally pro-survival trait for any entity which is capable of feeling it (I've heard that turtles will starve if they aren't given enough variety in their food) is a topic worth investigating. I bet that having some outward focus rather than just wanting internal states reliably increases the chances of survival. On the other hand, "treat your inferiors as you would be treated by your superiors" is assuredly not a reliable method of doing well in a simulation, just considering the range of human art and the popularity of humor based on humiliation. Are you more entertaining if you torture Sims or if you build the largest possible sustainable Sim city? It depends on the audience.
This seems like postulating minds (magic) as basic ontological entities. Where's the line between "inferiors" and other patterns of atoms?
"Generally intelligent optimization processes" is a natural category, don't you think? (Though there might be no non-magical-thinking reason for it to be game-theoretically relevant in this way. Or the most game-theoretically natural (Schelling point) border might exclude humans.)
Categories are kludges used to get around an inability to make analysis more precise. The "laws" expressed in terms of natural categories are only binding as long as you remain unable to see the world at a deeper level. The question is not whether "minds" constitute a natural category, which is a forlorn hope with smarter AIs, but whether "minds" deductively implies "things to treat as you would be treated by your superiors" (whatever that means). The difference in rules comes from the ability of a more powerful AI to look at a situation and see it in detail, making moves in favor of the AI's goals via the most unexpected exceptions to the most reliable heuristic rules. You can't rely on natural categories when they are fought with magical intent. You can fight magic only with magic, and in this case this means postulating the particular magic in the fabric of the world that helps to justify your argument. Your complex wish can't be granted by natural laws.
Unfriendly AI is only vastly more plausible if you're not doing it right. Out of the space of all possible preferences, human-friendly preferences are a tiny sliver. If you picked at random you would surely get something as bad as a paperclipper. As optimizers, we can try to aim at the space of human-friendly preferences, but we're stupid optimizers in this domain, especially relative to the complexity of the problem. A program could target this space better, and we are much more likely to be smart enough to write that program than to survive the success of an AI based on hand-coded goals and killswitches. This is like going to the moon: let the computer steer.

There have already in this thread been a lot of problems listed with this. I'm going to add just two more: consider an otherwise pretty friendly AI that is curious about the universe and wants to understand the laws of physics. No matter how much the AI learns, it will conclude that it and humans misunderstand the basic laws of physics. The AI will likely spend tremendous resources trying to understand just what is wrong with its understanding. And given the prior of 1, it will never resolve this issue.
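The "prior of 1" point can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the belief is updated by plain Bayes' rule; the function names and the likelihood numbers are made up for illustration and do not come from the original proposal. It shows that a belief held with probability exactly 1 does not move however strongly the evidence points the other way, while a belief held at 0.999999 collapses.

```python
from fractions import Fraction


def bayes_update(prior, p_e_given_h, p_e_given_not_h):
    """Return P(H | E) by Bayes' rule, given P(H), P(E | H) and P(E | not H)."""
    numerator = p_e_given_h * prior
    denominator = numerator + p_e_given_not_h * (1 - prior)
    return numerator / denominator


def update_repeatedly(prior, p_e_given_h, p_e_given_not_h, n):
    """Apply the same piece of evidence n times in a row."""
    belief = prior
    for _ in range(n):
        belief = bayes_update(belief, p_e_given_h, p_e_given_not_h)
    return belief


# Hypothetical evidence that is 999 times more likely if H is false.
P_E_GIVEN_H = Fraction(1, 1000)
P_E_GIVEN_NOT_H = Fraction(999, 1000)

# A hard-coded prior of 1 versus a merely very confident prior.
certain = update_repeatedly(Fraction(1), P_E_GIVEN_H, P_E_GIVEN_NOT_H, 50)
almost_certain = update_repeatedly(
    Fraction(999999, 1000000), P_E_GIVEN_H, P_E_GIVEN_NOT_H, 50
)

print(certain == 1)                # the prior of 1 is untouched
print(float(almost_certain))       # the 0.999999 prior has collapsed
```

With the prior fixed at 1, the term for "H is false" is multiplied by zero in every update, so no observation can ever shift the belief. That is why such an AI would keep looking for an error elsewhere in its physics.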

Consider also the same scenario but if there's an oth...

If an AI doesn't rapidly come to this conclusion after less than thirty minutes of internet access it has a serious design flaw, no? :-)
An AI would notice it anyway. Given a broken enough design it might be unable to care about that flaw, but if that's the case, it won't go paranoid over it. It just doesn't care. Of course, if we break the design even more, we might get an AI that tries to combine a unified theory of physics with the "fact" that the red wire doesn't actually kill it; the results of that would probably be worth their own comic series. That sort of AI is probably broken enough to be next to useless, but it is still an extremely dangerous piece of computing power. It would probably explode hilariously too if it could understand the analogy between itself and the crippled AI we're discussing here, and actually care about that.
For your second AI, it is worth distinguishing between "friendly" and "Friendly" - it is Friendly, in the sense that it understands and appreciates the relatively narrow target that is human morality, it just is unimpressed with humans as allies.
That's a valid distinction. But from the perspective of serious existential risks, an AI that has a similar morality but really doesn't like humans has almost as much potential existential risk as an Unfriendly AI.
I agree.

The rogue AI is not trying to kill all humans, or even kill some humans. It is trying to make lots of paperclips, and there are all these useful raw materials arranged as humans that would be better arranged as paperclips. Atomic fluorine is not particularly useful for making paperclips.

Warmongering humans are also not particularly useful. In particular, they are burning energy like there is no tomorrow on things that are definitely not paperclippy at all. And you have to spend significant energy resources on stopping them from destroying you. A paperclip optimizer would at some point turn against humans directly, because humans will turn against the paperclip optimizer if it is too ruthless.
Humans are useful initially as easily manipulated arms and legs, and will not even notice that the paperclipper has taken over before it harvests their component atoms.
The paperclipper is a strawman. Paperclippers would be at a powerful evolutionary/competitive disadvantage WRT non-paperclippers. Even if you don't believe this, I don't see why you would think paperclippers would constitute the majority of all possible AIs. Something that helps only with non-paperclippers would still be very useful.
The paperclipper is a commonly used example. On the competitive disadvantage: we are considering the case where one AGI gets built, so there is no variation to apply selection pressure to. On paperclippers constituting the majority of all possible AIs: I never said I thought so, and arguing against that claim would be an actual straw man. The paperclipper is an example of the class of AIs with simplistic goals, and the scenarios are similar for smiley face maximizers and orgasmium maximizers. Most AIs that fail to be Friendly will not have "kill all humans" as an intrinsic goal, so depending on them having "kill all humans" as an instrumental goal is dangerous, because they are likely to kill us out of indifference to that side effect of achieving their actual goals. Also consider near-miss AIs that create a dystopian future but don't kill all humans.

Ok, now I am curious... two negative votes but no comments yet. Anybody care to point out specific problems?

But now that you've written this idea up on a website that is indexed by Google, and that covers topics guaranteeing the AI will seek out the archive and read every post ever written on LessWrong, it is useless.

Should we also assume that any sufficiently competent AI would also recognize any analogous scheme?
How so? The AI lives in a universe where people are planning to fuse AIs in the way described here. Given this website, and the knowledge that one believes that the red wire is magic, there is a high probability that the red wire is fake, and some very small probability that the wire is real. But it is also known for certain that the wire is real. There is not even a contradiction here. Giving a wrong prior is not the same as walking up to the AI and telling it a lie (which should never raise probability to 1).
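The parenthetical about lies can be made concrete. The following is a minimal sketch in odds form, assuming each assertion is independent and comes from a source with some fixed reliability strictly below 1; the function name and the numbers are illustrative assumptions, not part of the original scheme. Being told something, however often and by however reliable a source, pushes the probability toward 1 without ever reaching it, which is what separates this from hard-coding a prior of 1.

```python
from fractions import Fraction


def posterior_after_assertions(prior, reliability, n):
    """P(claim) after n independent assertions that the claim is true.

    Assumes the source asserts a true claim with probability `reliability`
    and a false claim with probability 1 - reliability.
    Requires 0 < prior < 1 and 0 < reliability < 1.
    """
    odds = prior / (1 - prior)
    likelihood_ratio = reliability / (1 - reliability)
    odds *= likelihood_ratio ** n
    return odds / (1 + odds)


# Hypothetical numbers: a skeptical listener and a very reliable speaker.
posterior = posterior_after_assertions(
    prior=Fraction(1, 100), reliability=Fraction(999, 1000), n=10
)

print(posterior < 1)           # still short of certainty
print(float(1 - posterior))    # the remaining doubt is tiny but nonzero
```

Exact fractions are used so that the gap below 1 is not hidden by floating-point rounding.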
If you can design an AI that can be given beliefs that it cannot modify, doubt, or work around, then that would be true. Most conceptions of friendly AI probably require such an AI design, so it's not an unreasonable supposition (on LW, anyway).
Preference is not a belief.
You probably mean that friendly AI is supposed to give an AI preferences that it can't override, rather than beliefs that it can't override. For the purposes of discussion here, yes, preference is a belief. They are both expressed as symbolic propositions. Since the preference and the belief are both meant to be used by the same inference engine in the same computations, they are both in the same representation. There is no difference in difficulty between giving an AI a preference that it cannot override, and a belief that it cannot override. And that was my point.
That's a strange view to take. They're extremely different things, with different properties. What is true is that they are highly entangled -- preferences must be grounded in beliefs to be effective, and changing beliefs can change actions just as much as changing preferences. But the ways in which this happens seem, in general, far less predictable.
This is basically true. I've mentioned before that any AI that can engage in human conversation must have an abstract idea corresponding to "good", and this abstract idea will in principle allow it to perform any action whatsoever, just as happens with human beings. For example, it could have learned that some other particular computer has been presenting it with true statements 99.999999999999999999999% of the time, and then this other computer presents it with the statement, "It is good to push this button..." (where the button destroys humanity). The AI will conclude it is good to push the button, and will then push it. So the real consequence is that giving an AI either a belief or a preference that it cannot override is impossible. Nor can you refute this by arguing that you can verify from the programming that it cannot take certain actions. We already know that we cannot predict our own programming, since this would result in a contradiction. So why is it necessary that we should be able to predict the result of any other intelligent program, especially a superintelligent one? And in fact, the above argument shows that this cannot happen; we will never be able to predict the actions of an intelligent being.
Brief enigmatic statements are not communication.