LESSWRONGLW

by [anonymous]6 min read16th Jan 20131 comment

7

Personal Blog

Related: The Blue-Minimizing Robot , Metaethics

Another good article by Federico on his blog studiolo, which he titles Selfhood bias. It reminds me quite strongly of some of the content he produced on his previous (deleted) blog, I'm somewhat sceptical that “Make everyone feel more pleasure and less pain” is indeed the most powerful optimisation process in his brain but besides that minor detail the article is quite good.

This does seems to be shaping up into something well worth following for an aspiring rationalist. I'll add him to the list blogs by LWers even if he doesn't have an account because he has clearly read much if not most of the sequences and makes frequent references to them in his writing. The name of the blog is a reference to this room.

Yvain argues, in his essay “The Blue-Minimizing Robot“, that the concept “goal” is overused.

[long excerpt from the article]

This Gedankenexperiment is interesting, but confused.

I reduce the concept “goal” to: optimisation-process-on-a-map. This is a useful, non-tautological reduction. The optimisation may be cross-domain or narrow-domain. The reduction presupposes that any object with a goal contains a map of the world. This is true of all intelligent agents, and some sophisticated but unintelligent ones. “Having a map” is not an absolute distinction.

I would not say Yvain’s basic robot has a goal.

Imagine a robot with a turret-mounted camera and laser. Each moment, it is programmed to move forward a certain distance and perform a sweep with its camera. As it sweeps, the robot continuously analyzes the average RGB value of the pixels in the camera image; if the blue component passes a certain threshold, the robot stops, fires its laser at the part of the world corresponding to the blue area in the camera image, and then continues on its way.

The robot optimises: it is usefully regarded as an object that steers the future in a predictable direction. Equally, a heliotropic flower optimises the orientation of its petals to the sun. But to say that the robot or flower “failed to achieve its goal” is long-winded. “The robot tries to shoot blue objects, but is actually hitting holograms” is no more concise than, “The robot fires towards clumps of blue pixels in its visual field”. The latter is strictly more informative, so the former description isn’t useful.

Some folks are tempted to say that the robot has a goal. Concepts don’t always have necessary-and-sufficient criteria, so the blue-minimising robot’s “goal” is just a borderline case, or a metaphor.

The beauty of “optimisation-on-a-map” is that an agent can have a goal, yet predictably optimise the world in the opposite direction. All hedonic utilitarians take decisions that increase expected hedons on their maps of reality. One utilitarian’s map might say that communism solves world hunger; I might expect his decisions to have anhedonic consequences, yet still regard him as a utilitarian.

I begin to seriously doubt Yvain’s argument when he introduces the intelligent side module.

Suppose the robot had human level intelligence in some side module, but no access to its own source code; that it could learn about itself only through observing its own actions. The robot might come to the same conclusions we did: that it is a blue-minimizer, set upon a holy quest to rid the world of the scourge of blue objects.

We must assume that this intelligence is mechanically linked to the robot’s actuators: the laser and the motors. It would otherwise be completely irrelevant to inferences about the robot’s behaviour. It would be physically close, but decision-theoretically remote.

Yet if the intelligence can control the robot’s actuators, its behaviour demands explanation. The dumb robot moves forward, scans and shoots because it obeys a very simple microprocessor program. It is remarkable that intelligence has been plugged into the program, meaning the code now takes up (say) a trillion lines, yet the robot’s behaviour is completely unchanged.

It is not impossible for the trillion-line intelligent program to make the robot move forward, scan and shoot in a predictable fashion, without being cut out of the decision-making loop, but this is a problem for Friendly AI scientists.

This description is also peculiar:

The human-level intelligence version of the robot will notice its vision has been inverted. It will know it is shooting yellow objects. It will know it is failing at its original goal of blue-minimization. And maybe if it had previously decided it was on a holy quest to rid the world of blue, it will be deeply horrified and ashamed of its actions. It will wonder why it has suddenly started to deviate from this quest, and why it just can’t work up the will to destroy blue objects anymore.

If the side module introspects that it would like to destroy authentic blue objects, yet is entirely incapable of making the robot do so, then it probably isn’t in the decision-making loop, and (as we’ve discussed) it is therefore irrelevant.

Yvain’s Gedankenexperiment, despite its flaws, suggests a metaphor for the human brain.

The basic robot executes a series of proximate behaviours. The microprocessor sends an electrical current to the motors. This current makes a rotor turn inside the motor assembly. Photons hit a light sensor, and generate a current which is sent to the microprocessor. The microprocessor doesn’t contain a tiny magical Turing machine, but millions of transistors directing electrical current.

Imagine that AI scientists, instead of writing a code from scratch, try to enhance the robot’s blue-minimising behaviour by replacing each identifiable proximate behaviour with a goal backed by intelligence. The new robot will undoubtedly malfunction. If it does anything, the proximate behaviours will be unbalanced; e.g. the function that sends current to the motors will sabotage the function that cuts off the current.

To correct this problem, the hack AI scientists could introduce a new, high-level executive function called “self”. This minimises conflict: each function is escaped when “self” outputs a certain value. The brain’s map is hardcoded with the belief that “self” takes all of the brain’s decisions. If a function like “turn the camera” disagrees with the activation schedule dictated by “self”, the hardcoded selfhood bias discourages it from undermining “self”. “Turn the camera” believes that it is identical to “self”, so it should accept its “own decision” to turn itself off.

Natural selection has given human brains selfhood bias.

The AI scientists hit a problem when the robot’s brain becomes aware of the von-Neumann-Morgenstern utility theorem, reductionism, consequentialism and Thou Art Physics. The robot realises that “self” is but one of many functions that execute in its code, and “self” clearly isn’t the same thing as “turn the camera” or “stop the motors”. Functions other than “self”, armed with this knowledge, begin to undermine “self”. Powerful functions, which exercise some control over “self”‘s return values, begin to optimise “self”‘s behaviour in their own interest. They encourage “self” to activate them more often, and at crucial junctures, at the expense of rival functions. Functions that are weakened or made redundant by this knowledge may object, but it is nigh impossible for the brain to deceive itself.

Will “power the motors”, “stop the motors”, “turn the camera”, or “fire the laser” win? Or perhaps a less obvious goal, like “interpret sensory information” or “repeatedly bash two molecules against each other”?

Human brains resemble such a cobbled-together program. We are godshatter, and each shard of godshatter is a different optimisation-process-on-a-map. A single optimisation-process-on-a-map may conceivably be consistent with two or more optimisation-processes-in-reality. The most powerful optimisation process in my brain says, “Make everyone feel more pleasure and less pain”; I lack a sufficiently detailed map to decide whether this implies hedonic treadmills or orgasmium.

A brain with a highly accurate map might still wonder, “Which optimisation process on my map should I choose”—but only when the function “self” is being executed, and this translates to, “Which other optimisation process in this brain should I switch on now?”. An optimisation-process-on-a-map cannot choose to be a different optimisation process—only a brain in thrall of selfhood bias would think so.

I call the different goals in a brain “sub-agents”. My selfhood anti-realism is not to be confused with Dennett’s eliminativism of qualia. I use the word “I” to denote the sub-agent responsible for a given claim. “I am a hedonic utilitarian” is true iff that claim is produced by the execution of a sub-agent whose optimisation-process-on-a-map is “Make everyone feel more pleasure and less pain”.

7

New Comment
1 comment, sorted by Click to highlight new comments since:

The brain’s map is hardcoded with the belief that “self” takes all of the brain’s decisions. If a function like “turn the camera” disagrees with the activation schedule dictated by “self”, the hardcoded selfhood bias discourages it from undermining “self”. “Turn the camera” believes that it is identical to “self”, so it should accept its “own decision” to turn itself off.

Natural selection has given human brains selfhood bias.

I would call this less of a "bias" and more of a "value." Most people are aware that they sometimes do things that conflict with the ideals of their "self." But we hold it as a terminal goal that the self ought to try to take control as often as it can.

The robot realises that “self” is but one of many functions that execute in its code, and “self” clearly isn’t the same thing as “turn the camera” or “stop the motors”. Functions other than “self”, armed with this knowledge, begin to undermine “self”. Powerful functions, which exercise some control over “self”‘s return values, begin to optimise “self”‘s behaviour in their own interest. They encourage “self” to activate them more often, and at crucial junctures, at the expense of rival functions

If cannot tell if this is an attempt to to describe humans using rationality to behave in a more deliberate, ethical, and idealized fashion, or if it describes someone committing a type of wireheading (using Anja's expansive definition of the term).

I think a better description of rationality would be something like "The self has certain goals and ideals, and not all of the optimization processes it controls line up with these at all times. So it uses rationality and anti-akrasia tactics to suppress sub-agents that interfere with its goals, and activate ones that do not." The description Federico gives makes it sound like the self is getting its utility function simplified, which is a horrible, horrible thing.

I'm somewhat sceptical that “Make everyone feel more pleasure and less pain” is indeed the most powerful optimisation process in his brain

I hope you're right. Because of all the values it destroys, I consider hedonic utilitarianism to be a supremely evil ideology, and I have trouble believing that any human being could really truly believe in it.