Dreams of Friendliness


Eliezer_Yudkowsky

Continuation of Qualitative Strategies of Friendliness

Yesterday I described three classes of deep problem with qualitative-physics-like strategies for building nice AIs - e.g., the AI is reinforced by smiles, and happy people smile, therefore the AI will tend to act to produce happiness.  In shallow form, three instances of the three problems would be:

  1. Ripping people's faces off and wiring them into smiles;
  2. Building lots of tiny agents with happiness counters set to large numbers;
  3. Killing off the human species and replacing it with a form of sentient life that has no objections to being happy all day in a little jar.

And the deep forms of the problem are, roughly:

  1. A superintelligence will search out causal pathways to its goals other than the ones you had in mind;
  2. The boundaries of moral categories are not predictively natural entities;
  3. Strong optimization for only some humane values does not imply a good total outcome.

But there are other ways, and deeper ways, of viewing the failure of qualitative-physics-based Friendliness strategies.

Every now and then, someone proposes the Oracle AI strategy:  "Why not just have a superintelligence that answers human questions, instead of acting autonomously in the world?"

Sounds pretty safe, doesn't it?  What could possibly go wrong?

Well... if you've got any respect for Murphy's Law, the power of superintelligence, and human stupidity, then you can probably think of quite a few things that could go wrong with this scenario.  Both in terms of how a naive implementation could fail - e.g., universe tiled with tiny users asking tiny questions and receiving fast, non-resource-intensive answers - and in terms of what could go wrong even if the basic scenario worked.

But let's just talk about the structure of the AI.

When someone reinvents the Oracle AI, the most common opening remark runs like this:

"Why not just have the AI answer questions, instead of trying to do anything?  Then it wouldn't need to be Friendly.  It wouldn't need any goals at all.  It would just answer questions."

To which the reply is that the AI needs goals in order to decide how to think: that is, the AI has to act as a powerful optimization process in order to plan its acquisition of knowledge, effectively distill sensory information, pluck "answers" to particular questions out of the space of all possible responses, and of course, to improve its own source code up to the level where the AI is a powerful intelligence.  All these events are "improbable" relative to random organizations of the AI's RAM, so the AI has to hit a narrow target in the space of possibilities to make superintelligent answers come out.

Now, why might one think that an Oracle didn't need goals?  Because on a human level, the term "goal" seems to refer to those times when you said, "I want to be promoted", or "I want a cookie", and when someone asked you "Hey, what time is it?" and you said "7:30" that didn't seem to involve any goals.  Implicitly, you wanted to answer the question; and implicitly, you had a whole, complicated, functionally optimized brain that let you answer the question; and implicitly, you were able to do so because you looked down at your highly optimized watch, that you bought with money, using your skill of turning your head, that you acquired by virtue of curious crawling as an infant.  But that all takes place in the invisible background; it didn't feel like you wanted anything.

Thanks to empathic inference, which uses your own brain as an unopened black box to predict other black boxes, it can feel like "question-answering" is a detachable thing that comes loose of all the optimization pressures behind it - even the existence of a pressure to answer questions!

Problem 4:  Qualitative reasoning about AIs often revolves around some nodes described by empathic inferences.  This is a bad thing: for previously described reasons; and because it leads you to omit other nodes of the graph and their prerequisites and consequences; and because you may find yourself thinking things like, "But the AI has to cooperate to get a cookie, so now it will be cooperative" where "cooperation" is a boundary in concept-space drawn the way you would prefer to draw it... etc.

Anyway: the AI needs a goal of answering questions, and that has to give rise to subgoals of choosing efficient problem-solving strategies, improving its code, and acquiring necessary information.  You can quibble about terminology, but the optimization pressure has to be there, and it has to be very powerful, measured in terms of how small a target it can hit within a large design space.
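
To make "how small a target it can hit within a large design space" a little more concrete, here is a rough sketch.  It counts what fraction of a toy design space scores at least as well as the outcome actually achieved, and takes the negative base-2 logarithm of that fraction.  The design space (all 16-bit strings), the scoring function `score`, and the name `optimization_bits` are all assumptions invented for this illustration; nothing here is a claim about how a real AI would be measured.

```python
import math
from itertools import product

def score(bits):
    # Hypothetical preference ordering: more 1-bits counts as "better".
    return sum(bits)

def optimization_bits(achieved, space):
    """-log2 of the fraction of the space scoring at least as well as `achieved`."""
    space = list(space)
    target = score(achieved)
    at_least_as_good = sum(1 for candidate in space if score(candidate) >= target)
    return -math.log2(at_least_as_good / len(space))

design_space = list(product([0, 1], repeat=16))  # 65,536 possible designs
best = (1,) * 16      # the single best design under this toy ordering
typical = (1, 0) * 8  # a middling design with eight 1-bits

print(optimization_bits(best, design_space))     # 16.0
print(optimization_bits(typical, design_space))  # less than 1 bit
```

Landing on the one best design out of 65,536 takes 16 bits of selection; landing on a middling design takes almost none.  A process that reliably produces superintelligent answers is hitting a target far narrower than this toy one, whatever you choose to call the thing doing the hitting.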

Powerful optimization pressures are scary things to be around.  Look at what natural selection inadvertently did to itself - dooming the very molecules of DNA - in the course of optimizing a few Squishy Things to make hand tools and outwit each other politically.  Humans, though we were optimized only according to the criterion of replicating ourselves, now have our own psychological drives executing as adaptations.  The result of humans optimized for replication is not just herds of humans; we've altered much of Earth's land area with our technological creativity.  We've even created some knock-on effects that we wish we hadn't, because our minds aren't powerful enough to foresee all the effects of the most powerful technologies we're smart enough to create.

My point, however, is that when people visualize qualitative FAI strategies, they generally assume that only one thing is going on, the normal / modal / desired thing.  (See also: planning fallacy.)  This doesn't always work even for picking up a rock and throwing it.  But it works rather a lot better for throwing rocks than unleashing powerful optimization processes.

Problem 5:  When humans use qualitative reasoning, they tend to visualize a single line of operation as typical - everything operating the same way it usually does, no exceptional conditions, no interactions not specified in the graph, all events firmly inside their boundaries.  This works a lot better for dealing with boiling kettles, than for dealing with minds faster and smarter than your own.

If you can manage to create a full-fledged Friendly AI with full coverage of humane (renormalized human) values, then the AI is visualizing the consequences of its acts, caring about the consequences you care about, and avoiding plans with consequences you would prefer to exclude.  A powerful optimization process, much more powerful than you, that doesn't share your values, is a very scary thing - even if it only "wants to answer questions", and even if it doesn't just tile the universe with tiny agents having simple questions answered.

I don't mean to be insulting, but human beings have enough trouble controlling the technologies that they're smart enough to invent themselves.

I sometimes wonder if maybe part of the problem with modern civilization is that politicians can press the buttons on nuclear weapons that they couldn't have invented themselves - not that it would be any better if we gave physicists political power that they weren't smart enough to obtain themselves - but the point is, our button-pressing civilization has an awful lot of people casting spells that they couldn't have written themselves.  I'm not saying this is a bad thing and we should stop doing it, but it does have consequences.  The thought of humans exerting detailed control over literally superhuman capabilities - wielding, with human minds, and in the service of merely human strategies, powers that no human being could have invented - doesn't fill me with easy confidence.

With a full-fledged, full-coverage Friendly AI acting in the world - the impossible-seeming full case of the problem - the AI itself is managing the consequences.

Is the Oracle AI thinking about the consequences of answering the questions you give it?  Does the Oracle AI care about those consequences the same way you do, applying all the same values, to warn you if anything of value is lost?

What need has an Oracle for human questioners, if it knows what questions we should ask?  Why not just unleash the should function?

See also the notion of an "AI-complete" problem.  Analogously, any Oracle into which you can type the English question "What is the code of an AI that always does the right thing?" must be FAI-complete.

Problem 6:  Clever qualitative-physics-type proposals for bouncing this thing off the AI, to make it do that thing, in a way that initially seems to avoid the Big Scary Intimidating Confusing Problems that are obviously associated with full-fledged Friendly AI, tend to just run into exactly the same problem in slightly less obvious ways, concealed in Step 2 of the proposal.

(And likewise you run right back into the intimidating problem of precise self-optimization, so that the Oracle AI can execute a billion self-modifications one after the other, and still just answer questions at the end; you're not avoiding that basic challenge of Friendly AI either.)

But the deepest problem with qualitative physics is revealed by a proposal that comes earlier in the standard conversation, at the point when I'm talking about side effects of powerful optimization processes on the world:

"We'll just keep the AI in a solid box, so it can't have any effects on the world except by how it talks to the humans."

I explain the AI-Box Experiment (see also That Alien Message); even granting the untrustworthy premise that a superintelligence can't think of any way to pass the walls of the box which you weren't smart enough to cover, human beings are not secure systems.  Even against other humans, often, let alone a superintelligence that might be able to hack through us like Windows 98; when was the last time you downloaded a security patch to your brain?

"Okay, so we'll just give the AI the goal of not having any effects on the world except from how it answers questions.  Sure, that requires some FAI work, but the goal system as a whole sounds much simpler than your Coherent Extrapolated Volition thingy."

What - no effects?

"Yeah, sure.  If it has any effect on the world apart from talking to the programmers through the legitimately defined channel, the utility function assigns that infinite negative utility.  What's wrong with that?"

When the AI thinks, that has a physical embodiment.  Electrons flow through its transistors, moving around.  If it has a hard drive, the hard drive spins, the read/write head moves.  That has gravitational effects on the outside world.

"What?  Those effects are too small!  They don't count!"

The physical effect is just as real as if you shot a cannon at something - yes, you might not notice it, but that's just because our vision is bad at small length-scales.  Sure, the effect is to move things around by 10^whatever Planck lengths, instead of the 10^more Planck lengths that you would consider as "counting".  But spinning a hard drive can move things just outside the computer, or just outside the room, by whole neutron diameters -

"So?  Who cares about a neutron diameter?"

- and by quite standard chaotic physics, that effect is liable to blow up.  The butterfly that flaps its wings and causes a hurricane, etc.  That effect may not be easily controllable but that doesn't mean the chaotic effects of small perturbations are not large.
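
A minimal sketch of that sensitivity, using the logistic map as a stand-in for chaotic dynamics.  The map, the starting value 0.2, and the size of the nudge (1e-12) are assumptions chosen for illustration; this is not a model of hard drives, gravitation, or weather.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x), a standard chaotic map when r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

baseline = logistic_trajectory(0.2, 100)
nudged = logistic_trajectory(0.2 + 1e-12, 100)  # a very small "nudge"

# Distance between the two trajectories at each step.
gaps = [abs(a - b) for a, b in zip(baseline, nudged)]
for step in (0, 10, 20, 30, 40, 50):
    print(step, gaps[step])
```

The two runs start out indistinguishable, and the gap between them grows by roughly a constant factor per step until it is as large as the whole range of the map.  The nudge is hard to steer, but it is not small in its consequences.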

But in any case, your proposal was to give the AI a goal of having no effect on the world, apart from effects that proceed through talking to humans.  And this is impossible of fulfillment; so no matter what it does, the AI ends up with infinite negative utility - how is its behavior defined in this case?  (In this case I picked a silly initial suggestion - but one that I have heard made, as if infinite negative utility were like an exclamation mark at the end of a command given a human employee.  Even an unavoidable tiny probability of infinite negative utility trashes the goal system.)
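
Here is a small sketch of why even a tiny probability of infinite negative utility trashes the goal system, assuming a plain expected-utility maximizer.  The action names, payoffs, and the probability `tiny` are hypothetical, made up purely for the example.

```python
def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs."""
    return sum(p * u for p, u in outcomes)

NEG_INF = float("-inf")
tiny = 1e-30  # unavoidable chance of some stray effect on the outside world

# Hypothetical actions: each carries the same tiny chance of the "forbidden" outcome.
actions = {
    "answer_helpfully": [(1 - tiny, 10.0), (tiny, NEG_INF)],
    "answer_badly": [(1 - tiny, -5.0), (tiny, NEG_INF)],
    "do_nothing": [(1 - tiny, 0.0), (tiny, NEG_INF)],
}

utilities = {name: expected_utility(o) for name, o in actions.items()}
print(utilities)  # every action's expected utility is -inf
```

Every action comes out at negative infinity, so the ordering that was supposed to separate helpful answers from bad ones has vanished.  An argmax over these values is choosing on the basis of tie-breaking, not on the basis of anything you wanted.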

Why would anyone possibly think that a physical object like an AI, in our highly interactive physical universe, containing hard-to-shield forces like gravitation, could avoid all effects on the outside world?

And this, I think, reveals what may be the deepest way of looking at the problem:

Problem 7:  Human beings model a world made up of objects, attributes, and noticeworthy events and interactions, identified by their categories and values.  This is only our own weak grasp on reality; the real universe doesn't look like that.  Even if a different mind saw a similar kind of exposed surface to the world, it would still see a different exposed surface.

Sometimes human thought seems a lot like it tries to grasp the universe as... well, as this big XML file, AI.goal == smile, human.smile == yes, that sort of thing.  Yes, I know human world-models are more complicated than XML.  (And yes, I'm also aware that what I wrote looks more like Python than literal XML.)  But even so.

What was the one thinking, who proposed an AI whose behaviors would be reinforced by human smiles, and who reacted with indignation to the idea that a superintelligence could "mistake" a tiny molecular smileyface for a "real" smile?  Probably something along the lines of, "But in this case, human.smile == 0, so how could a superintelligence possibly believe human.smile == 1?"

For the weak grasp that our mind obtains on the high-level surface of reality, seems to us like the very substance of the world itself.

Unless we make a conscious effort to think of reductionism, and even then, it's not as if thinking "Reductionism!" gives us a sudden apprehension of quantum mechanics.

So if you have this, as it were, XML-like view of reality, then it's easy enough to think you can give the AI a goal of having no effects on the outside world; the "effects" are like discrete rays of effect leaving the AI, that result in noticeable events like killing a cat or something, and the AI doesn't want to do this, so it just switches the effect-rays off; and by the assumption of default independence, nothing else happens.

Mind you, I'm not saying that you couldn't build an Oracle.  I'm saying that the problem of giving it a goal of "don't do anything to the outside world" "except by answering questions" "from the programmers" "the way the programmers meant them", in such fashion as to actually end up with an Oracle that works anything like the little XML-ish model in your head, is a big nontrivial Friendly AI problem.  The real world doesn't have little discrete effect-rays leaving the AI, and the real world doesn't have ontologically fundamental programmer.question objects, and "the way the programmers meant them" isn't a natural category.

And this is more important for dealing with superintelligences than rocks, because the superintelligences are going to parse up the world in a different way.  They may not perceive reality directly, but they'll still have the power to perceive it differently.  A superintelligence might not be able to tag every atom in the solar system, but it could tag every biological cell in the solar system (consider that each of your cells contains its own mitochondrial power engine and a complete copy of your DNA).  It used to be that human beings didn't even know they were made out of cells.  And if the universe is a bit more complicated than we think, perhaps the superintelligence we build will make a few discoveries, and then slice up the universe into parts we didn't know existed - to say nothing of us being able to model them in our own minds!  How does the instruction to "do the right thing" cross that kind of gap?

There is no nontechnical solution to Friendly AI.

That is:  There is no solution that operates on the level of qualitative physics and empathic models of agents.

That's all just a dream in XML about a universe of quantum mechanics.  And maybe that dream works fine for manipulating rocks over a five-minute timespan; and sometimes okay for getting individual humans to do things; it often doesn't seem to give us much of a grasp on human societies, or planetary ecologies; and as for optimization processes more powerful than you are... it really isn't going to work.

(Incidentally, the most epically silly example of this that I can recall seeing, was a proposal to (IIRC) keep the AI in a box and give it faked inputs to make it believe that it could punish its enemies, which would keep the AI satisfied and make it go on working for us.  Just some random guy with poor grammar on an email list, but still one of the most epic FAIls I recall seeing.)