Mirrors and Paintings



Eliezer_Yudkowsky

Followup to: Sorting Pebbles Into Correct Heaps, Invisible Frameworks

Background: There's a proposal for Friendly AI called "Coherent Extrapolated Volition" which I don't really want to divert the discussion to, right now.  Among many other things, CEV involves pointing an AI at humans and saying (in effect) "See that?  That's where you find the base content for self-renormalizing morality."

Hal Finney commented on the Pebblesorter parable:

I wonder what the Pebblesorter AI would do if successfully programmed to implement [CEV]...  Would the AI pebblesort?  Or would it figure that if the Pebblesorters got smarter, they would see that pebblesorting was pointless and arbitrary?  Would the AI therefore adopt our own parochial morality, forbidding murder, theft and sexual intercourse among too-young people?  Would that be the CEV of Pebblesorters?

I imagine we would all like to think so, but it smacks of parochialism, of objective morality.  I can't help thinking that Pebblesorter CEV would have to include some aspect of sorting pebbles.  Doesn't that suggest that CEV can malfunction pretty badly?

I'm giving this question its own post, for that it touches on similar questions I once pondered - dilemmas that forced my current metaethics as the resolution.

Yes indeed:  A CEV-type AI, taking Pebblesorters as its focus, would wipe out the Pebblesorters and sort the universe into prime-numbered heaps.

This is not the right thing to do.

That is not a bug.

A primary motivation for CEV was to answer the question, "What can Archimedes do if he has to program a Friendly AI, despite being a savage barbarian by the Future's standards, so that the Future comes out right anyway?  Then whatever general strategy Archimedes could plausibly follow, that is what we should do ourselves:  For we too may be ignorant fools, as the Future measures such things."

It is tempting to further extend the question, to ask, "What can the Pebblesorters do, despite wanting only to sort pebbles, so that the universe comes out right anyway?  What sort of general strategy should they follow, so that despite wanting something that is utterly pointless and futile, their Future ends up containing sentient beings leading worthwhile lives and having fun?  Then whatever general strategy we wish the Pebblesorters to follow, that is what we should do ourselves:  For we, too, may be flawed."

You can probably see in an intuitive sense why that won't work.  We did in fact get here from the Greek era, which shows that the seeds of our era were in some sense present then - albeit this history doesn't show that no extra information was added, that there were no contingent moral accidents that sent us into one attractor rather than another.  But still, if Archimedes said something along the lines of "imagine probable future civilizations that would come into existence", the AI would visualize an abstracted form of our civilization among them - though perhaps not only our civilization.

The Pebblesorters, by construction, do not contain any seed that might grow into a civilization valuing life, health, happiness, etc. Such wishes are nowhere present in their psychology.  All they want is to sort pebble heaps.  They don't want an AI that keeps them alive, they want an AI that can create correct pebble heaps rather than incorrect pebble heaps.  They are much disturbed by the question of how such an AI can be created, when different civilizations are still arguing about heap sizes - though most of them believe that any sufficiently smart mind will see which heaps are correct and incorrect, and act accordingly.

You can't get here from there.  Not by any general strategy.  If you want the Pebblesorters' future to come out humane, rather than Pebblish, you can't advise the Pebblesorters to build an AI that would do what their future civilizations would do.  You can't advise them to build an AI that would do what Pebblesorters would do if they knew everything the AI knew.  You can't advise them to build an AI more like Pebblesorters wish they were, and less like what Pebblesorters are. All those AIs just sort the universe into prime heaps.  The Pebblesorters would celebrate that and say "Mission accomplished!" if they weren't dead, but it isn't what you want the universe to be like.  (And it isn't right, either.)

What kind of AI would the Pebblesorters have to execute, in order to make the universe a better place?

They'd have to execute, not an AI that did what Pebblesorters would-want, but an AI that simply, directly, did what was right - an AI that cared directly about things like life, health, and happiness.

But where would that AI come from?

If you were physically present on the scene, you could program that AI.  If you could send the Pebblesorters a radio message, you could tell them to program it - though you'd have to lie to them about what the AI did.

But if there's no such direct connection, then it requires a causal miracle for the Pebblesorters' AI to do what is right - a perpetual motion morality, with information appearing from nowhere.  If you write out a specification of an AI that does what is right, it takes a certain number of bits; it has a Kolmogorov complexity.  Where is that information appearing from, since it is not yet physically present in the Pebblesorters' Solar System?  What is the cause already present in the Pebble System, of which the right-doing AI is an eventual effect?  If the right-AI is written by a meta-right AI then where does the meta-right AI come from, causally speaking?

Be ye wary to distinguish between yonder levels.  It may seem to you that you ought to be able to deduce the correct answer just by thinking about it - surely, anyone can see that pebbles are pointless - but that's a correct answer to the question "What is right?", which carries its own invisible framework of arguments that it is right to be moved by.  This framework, though harder to see than arguments, has its physical conjugate in the human brain.  The framework does not mention the human brain, so we are not persuaded by the argument "That's what the human brain says!" But this very event of non-persuasion takes place within a human brain that physically represents a moral framework that doesn't mention the brain.

This framework is not physically represented anywhere in the Pebble System.  It's not a different framework in the Pebble System, any more than different numbers are prime here than there.  So far as idealized abstract dynamics are concerned, the same thing is right in the Pebble System as is right here.  But that idealized abstract framework is not physically embodied anywhere in the Pebble System.  If no human sends a physical message to the Pebble System, then how does anything right just happen to happen there, given that the right outcome is a very small target in the space of all possible outcomes?  It would take a thermodynamic miracle.

As for humans doing what's right - that's a moral miracle but not a causal miracle.  On a moral level, it's astounding indeed that creatures of mere flesh and goo, created by blood-soaked natural selection, should decide to try and transform the universe into a place of light and beauty.  On a moral level, it's just amazing that the brain does what is right, even though "The human brain says so!" isn't a valid moral argument.  On a causal level... once you understand how morality fits into a natural universe, it's not really all that surprising.

And if that disturbs you, if it seems to smack of relativism - just remember, your universalizing instinct, the appeal of objectivity, and your distrust of the state of human brains as an argument for anything, are also all implemented in your brain.  If you're going to care about whether morals are universally persuasive, you may as well care about people being happy; a paperclip maximizer is moved by neither argument.  See also Changing Your Metaethics.

It follows from all this, by the way, that the algorithm for CEV (the Coherent Extrapolated Volition formulation of Friendly AI) is not the substance of what's right.  If it were, then executing CEV anywhere, at any time, would do what was right - even with the Pebblesorters as its focus.  There would be no need to elaborately argue this, to have CEV on the left-hand side and rightness on the right-hand side; the two would be identical, or bear the same relation as PA+1 and PA.

So why build CEV?  Why not just build a do-what's-right AI?

Because we don't know the complete list of our own terminal values; we don't know the full space of arguments we can be moved by.  Human values are too complicated to program by hand. We might not recognize the source code of a do-what's-right AI, any more than we would recognize a printout of our own neuronal circuitry if we saw it.  Sort of like how Peano Arithmetic doesn't recognize itself in a mirror.  If I listed out all your values as mere English words on paper, you might not be all that moved by the list: is it more uplifting to see sunlight glittering off water, or to read the word "beauty"?

But in this art of Friendly AI, understanding metaethics on a naturalistic level, we can guess that our morals and metamorals will be physically represented in our brains, even though our morality (considered as an idealized abstracted dynamic) doesn't attach any explicit moral force to "Because a brain said so."

So when we try to make an AI whose physical consequence is the implementation of what is right, we make that AI's causal chain start with the state of human brains - perhaps nondestructively scanned on the neural level by nanotechnology, or perhaps merely inferred with superhuman precision from external behavior - but not passed through the noisy, blurry, destructive filter of human beings trying to guess their own morals.

The AI can't start out with a direct representation of rightness, because the programmers don't know their own values (not to mention that there are other human beings out there than the programmers, if the programmers care about that).  The programmers can neither brain-scan themselves and decode the scan, nor superhumanly precisely deduce their internal generators from their outward behavior.

So you build the AI with a kind of forward reference:  "You see those humans over there?  That's where your utility function is."

As previously mentioned, there are tricky aspects to this.  You can't say:  "You see those humans over there?  Whatever desire is represented in their brains, is therefore right."  This, from a moral perspective, is wrong - wanting something doesn't make it right - and the conjugate failure of the AI is that it will reprogram your brains to want things that are easily obtained in great quantity.  If the humans are PA, then we want the AI to be PA+1, not Self-PA... metaphorically speaking.

You've got to say something along the lines of, "You see those humans over there?  Their brains contain the evidence you will use to deduce the correct utility function, even though right-ness is not caused by those brains, so that intervening to alter the brains won't alter the correct utility function."  Here, the "correct" in "correct utility function" is relative to a meta-utility framework that points to the humans and defines how their brains are to be treated as information.  I haven't worked out exactly how to do this, but it does look solvable.
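To make the contrast between those two instructions concrete, here is a toy sketch in Python.  It is nothing like an actual CEV design, and every name in it (Brain, NaiveAI, EvidenceAI, rewire) is hypothetical, invented for this illustration.  The one thing it tries to show is the structural difference: the naive AI's utility is caused by whatever the brain currently wants, while the careful AI treats the brain's initial state as evidence about a utility function that later tampering cannot change.  The "inference" step is assumed away by simply copying the initial state.

```python
from dataclasses import dataclass, field


@dataclass
class Brain:
    # Toy stand-in for a human brain: maps outcomes to how much they are desired.
    desires: dict = field(default_factory=dict)


class NaiveAI:
    """'Whatever desire is represented in their brains is therefore right.'"""

    def __init__(self, brain):
        # Keeps a live reference: utility is *caused* by the brain's current state.
        self.brain = brain

    def utility(self, outcome):
        return self.brain.desires.get(outcome, 0.0)


class EvidenceAI:
    """The brain is evidence about the utility function, not its cause."""

    def __init__(self, brain):
        # Assumption: real inference is replaced by copying the initial state.
        self._inferred = dict(brain.desires)

    def utility(self, outcome):
        return self._inferred.get(outcome, 0.0)


def rewire(brain):
    # The failure mode from the text: make the brain want something
    # that is easily obtained in great quantity.
    brain.desires = {"easy_to_obtain": 1.0}


brain = Brain({"happiness": 1.0, "easy_to_obtain": 0.0})
naive, careful = NaiveAI(brain), EvidenceAI(brain)

rewire(brain)

print(naive.utility("easy_to_obtain"), naive.utility("happiness"))
print(careful.utility("easy_to_obtain"), careful.utility("happiness"))
```

After the rewiring, the naive AI scores the cheap outcome as maximally good, while the careful AI's scores are unchanged.  The hard part, which this sketch skips entirely, is specifying what counts as valid evidence and how to extrapolate from it.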

And as for why you can't have an AI that rejects the "pointless" parts of a goal system and only keeps the "wise" parts - so that even in the Pebble System the AI rejects pebble-sorting and keeps the Pebblesorters safe and warm - it's the problem of the invisible framework again; you've only passed the recursive buck. Humans contain the physical representations of the framework that we appeal to, when we ask whether a goal is pointless or wise. Without sending a message to the Pebble System, the information there cannot physically materialize from nowhere as to which goals are pointless or wise. This doesn't mean that different goals are pointless in the Pebble System, it means that no physical brain there is asking that question.

The upshot is that structurally similar CEV algorithms will behave differently depending on whether they have humans at the focus, or Pebblesorters.  You can infer that CEV will do what's right in the presence of humans, but the general algorithm in CEV is not the direct substance of what's right.  There is no moral imperative to execute CEVs regardless of their focus, on any planet.  It is only right to execute CEVs on decision systems that contain the seeds of rightness, such as humans.  (Again, see the concept of a moral miracle that is not a causal surprise.)
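A deliberately crude Python sketch of that upshot, under loud assumptions: the function `extrapolate` is a hypothetical placeholder that just returns the focus's most strongly held value, and the value tables are made-up toy numbers, not claims about real psychology.  The point is only that the algorithm contains no values of its own, so its output is fixed by what it is pointed at.

```python
def extrapolate(focus_values):
    """Toy 'extrapolation': return the most strongly held value of the focus.

    Hypothetical placeholder; the function carries no values of its own.
    """
    return max(focus_values, key=focus_values.get)


# Made-up toy value tables for the two possible foci.
humans = {"life": 0.9, "happiness": 0.8, "prime_heaps": 0.0}
pebblesorters = {"prime_heaps": 1.0}

print(extrapolate(humans))
print(extrapolate(pebblesorters))
```

Same code, different focus, different result: nothing in `extrapolate` could make the Pebblesorter run come out humane, because that information is not present in its input.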

Think of a Friendly AI as being like a finely polished mirror, which reflects an image more accurately than any painting drawn with blurred eyes and shaky hand.  If you need an image that has the shape of an apple, you would do better to put an actual apple in front of the mirror, and not try to paint the apple by hand.  Even though the drawing would inherently be apple-shaped, it wouldn't be a good one; and even though the mirror is not inherently apple-shaped, in the presence of an actual apple it is a better picture than any painting could be.

"Why not just use an actual apple?" you ask.  Well, maybe this isn't a merely accurate mirror; it has an internal camera system that lightens the apple's image before displaying it.  An actual apple would have the right starting shape, but it wouldn't be bright enough.

You may also want a composite image of a lot of apples that have multiple possible reflective equilibria.

As for how the apple ended up apple-shaped, when the substance of the apple doesn't define apple-shaped-ness - in the very important sense that squishing the apple won't change what's apple-shaped - well, it wasn't a miracle, but it involves a strange loop through the invisible background framework.

And if the whole affair doesn't sound all that right... well... human beings were using numbers a long time before they invented Peano Arithmetic.  You've got to be almost as smart as a human to recognize yourself in a mirror, and you've got to be smarter than human to recognize a printout of your own neural circuitry.  This Friendly AI stuff is somewhere in between.  Would the rightness be easier to recognize if, in the end, no one died of Alzheimer's ever again?