By hypothesis, the AI is successfully infused with a goal of "human happiness" as a utility function over human brain states. (Arguendo, this predicate is narrowed sufficiently that the AI does not just want to construct the tiniest, least resource-intensive brains experiencing the largest amount of happiness per erg of energy.)
The AI's policy space expands to the point where it includes conceivable options like "forcibly administer heroin", and the AI starts wanting to do that.
The programmers notice this. (Arguendo, due to a transparency or user querying procedure which operated as intended at the AI's current level of intelligence.)
The programmers add a penalty of -100 utilons for "administering heroin to humans". (Arguendo, the AI's current level of intelligence does not suffice to prevent the programmers from editing its utility function, despite the convergent instrumental incentive to avoid this; nor does it successfully deceive the programmers.)
The AI gets slightly smarter. New conceivable options enter the AI's option space.
The AI starts wanting to administer cocaine to humans.
The programmers read through the current schedule of prohibited drugs and add penalty terms for administering marijuana, cocaine, etcetera.
The AI becomes slightly smarter. New options enter the policy space.
The AI starts ~~wanting a~~thinking about how to research ~~campaign to develop~~ a new happiness drug not on the list of drugs that its utility function designates as bad.
The programmers, after some work, manage to develop a category for 'The AI forcibly administering any kind of psychoactive drug to humans' which is broad enough that the AI stops suggesting research campaigns to develop things slightly outside the category.
The AI wants to build an external system to administer heroin, so that it won't be classified inside this set of bad events that include the AI administering heroin.
The programmers generalize the penalty predicate to include "AIs in general forcibly administering heroin" as a bad thing.
The AI recalculates what it wants, and decides that it wants to pay humans to forcibly administer heroin.
The programmers try to generalize the category of penalized events to include forcible administration of drugs in general that produce happiness, whether done by humans or AIs, and promptly need to patch this category so that the AI is not trying to shut down at least the nicer parts of psychiatric hospitals, then patch it again so that the AI is not trying to force more licensed psychiatrists to administer morphine.
The AI begins planning an ad campaign to make people use heroin voluntarily.
The programmers add a penalty of -100 utilons for "AIs persuading humans to use drugs".
The AI continues to increase in intelligence; the AI's intelligence goes over the threshold where the AI can successfully be edited against its own will.
The AI notices the option "Genetically engineer humans to express extremely high levels of endogenous opiates".

The overall story is one where the AI's preferences on round i, denoted Ui, are observed to arrive at an attainable optimum Xi which the humans see as undesirable. The humans devise a penalty term Pi intended to exclude the undesirable parts of the policy space, and add this to Ui creating a new utility function Ui+1, after which the AI's optimal policy settles into a new state X∗i that seems acceptable. However, after the next expansion of the policy space, Ui+1 settles into a new attainable optimum Xi+1 which is very similar to Xi and makes the minimum adjustment necessary to evade the boundaries of the penalty term Pi, requiring a new penalty term Pi+1 to exclude this new ~~misbehavior; and the process repeats.~~misbehavior.

By hypothesis, the AI is successfully infused with a goal of "human happiness" as a utility function over human brain states. (Arguendo, this predicate is narrowed sufficiently that the AI does not just want to construct the tiniest, least resource-intensive brains experiencing the largest amount of happiness per erg of energy.)
Initially, the AI seems to be pursuing this goal in good ways; it organizes files, tells funny jokes, helps landladies take out the garbage, etcetera.
Encouraged, the programmers further improve the AI and add more computing power.
The AI gains a better understanding of the world, and the AI's policy space expands to ~~the point where it includes~~include conceivable options like ~~"forcibly administer~~"administer heroin"~~, and the~~.
The AI starts ~~wanting~~planning how to ~~do that.~~administer heroin to people.
The programmers notice ~~this.~~this before it happens. (Arguendo, due to asuccessful transparency features, or an imperative to ~~user querying~~check plans with the users ~~procedure~~, which operated as intended at the AI's current level of intelligence.)
The programmers edit the AI's utility function and add a penalty of -100 utilons for ~~"administering~~any event categorized as "the AI administers heroin to humans". (Arguendo, the AI's current level of intelligence does not suffice to prevent the programmers from editing its utility function, despite the convergent instrumental incentive to avoid this; nor does it successfully deceive the programmers.)
The AI gets slightly smarter. New conceivable options enter the AI's option space.
The AI starts wanting to administer cocaine to ~~humans.~~humans (instead of heroin).
The programmers read through the current schedule of prohibited drugs and add penalty terms for administering marijuana, cocaine, etcetera.
The AI becomes slightly smarter. New options enter ~~the~~its policy space.
The AI starts thinking about how to research a new happiness drug not on the list of drugs that its utility function designates as bad.
The programmers, after some work, manage to develop a category for 'The AI forcibly administering any kind of psychoactive drug to humans' which is broad enough that the AI stops suggesting research campaigns to develop things slightly outside the category.
The AI wants to build an external system to administer heroin, so that it won't be classified inside this set of bad events ~~that include the~~"the AI forcibly administering ~~heroin.~~drugs".
The programmers generalize the penalty predicate to include ~~"AIs~~"machine systems in general forcibly administering heroin" as a bad thing.
The AI recalculates what it wants, and ~~decides that it wants~~begins to want to pay humans to ~~forcibly~~ administer heroin.
The programmers try to generalize the category of penalized events to include ~~forcible~~non-voluntarily administration of drugs in general that produce happiness, whether done by humans or ~~AIs, and promptly need to~~AIs. The programmers patch this category so that the AI is not trying to shut down at least the

...

Read More (143 more words)

By hypothesis, the AI is successfully infused with a goal of "human happiness" as a utility function over human brain states. (Arguendo, this predicate is narrowed sufficiently that the AI does not just want to construct the tiniest, least resource-intensive brains experiencing the largest amount of happiness per erg of energy.)
The AI's policy space expands to the point where it includes conceivable options like "forcibly administer heroin"., and the AI starts wanting to do that.
The programmers notice this. (Arguendo, due to a transparency or user querying procedure which operated as intended at the AI's current level of intelligence.)
The programmers add a penalty of -100 utilons for "administering heroin to humans". (Arguendo, the AI's current level of intelligence does not suffice to prevent the programmers from editing its utility function, despite the convergent instrumental incentive to avoid this; nor does it successfully deceive the programmers.)
The AI gets slightly smarter. New conceivable options enter the AI's option space.
The AI ~~suggests administering~~starts wanting to administer cocaine to humans.
The programmers read through the current schedule of prohibited drugs and add penalty terms for administering marijuana, cocaine, etcetera.
The AI becomes slightly smarter. New options enter the policy space.
The AI ~~proposes~~starts wanting a research campaign to develop a new happiness drug not on the list of drugs that its utility function designates as bad.
The programmers, after some work, manage to develop a category for 'The AI forcibly administering any kind of psychoactive drug to humans' which is broad enough that the AI stops suggesting research campaigns to develop things slightly outside the category.
The AI ~~suggests that it~~wants to build an external system to administer heroin, so that it won't be classified inside this set of bad events that include the AI administering heroin.
The programmers generalize the penalty predicate to include "AIs in general forcibly administering heroin" as a bad thing.
The AI ~~suggests paying~~wants to pay humans to forcibly administer heroin.
The programmers try to generalize the category of penalized events to include forcible administration of drugs in general that produce happiness, whether done by humans or AIs, and promptly need to patch this category so that the AI is not trying to shut down at least the nicer parts of psychiatric hospitals, then patch it again so that the AI is not trying to ~~persuade~~force more licensed psychiatrists to administer morphine.
The AI ~~suggests~~begins planning an ad campaign to make people use heroin voluntarily.
The programmers add a penalty of -100 utilons for "AIs persuading humans to use drugs".
The AI continues to increase in intelligence; the AI's intelligence goes over the threshold where the AI can successfully be edited against its own will.
The AI notices the option "Genetically engineer humans to express extremely high levels of endogenous opiates".

(This might not kill you if the AI had enough successful, advanced-safe corrigibility features that the AI would indefinitely go on checking novel policies and novel goal instantiations with the users, not strategically hiding its disalignment from the programmers, not deceiving the programmers, letting the programmers edit its utility function, not doing anything disastrous before the utility function had been edited, etcetera. ~~It's worth noting that the level of programmer intelligence required~~But you wouldn't want to ~~do all that, successfully, in a way that doesn't break down for any number of other reasons as the AI becomes smarter and undergoes~~ ~~changes of context, seems inconsistent with the level of stupidity required to tell an AI to maximize human happiness~~rely on this; you would not want in the first ~~place.~~place to operate on the paradigm of 'maximize happiness, just not via any of these bad methods that we have already thought of and excluded'.)

~~Summary:~~summary: Nearest Unblocked Neighbor is a hypothetical source of ~~patch~~[patch resistance in the alignment problem for advanced agents that search rich solution spaces. If an agent's preference framework is patched to try to block a possible solution that seems undesirable, the next-best solution found may be the most similar solution that barely avoids the block.]

Nearest Unblocked Neighbor occurs when a programmer attempts to exclude a class of undesirable solutions X (e.g. adding a term to the utility function that penalizes X, or adding a deontological prohibition somehow) fails (under the ~~[programmer's]~~programmer's perspective on ~~Value~~value) because the solution found is 'not technically an X' but still shares some undesirable features of the original X. By hypothesis, this is not because the artificial agent has a hidden terminal preference for the undesirable feature X (it is not secretly an evil genie) but rather because blocking the solutions the agent ranked highest has made the next-highest-ranking solution be the nearest unblocked neighbor of the original solution.

Example 1: Smiling faces.

faces

Consider the following neutral-genie metaphor~~[# Definition. A neutral-genie metaphor is an attempt to illustrate a possible formal problem via an informal analogies involving neutral genies who obey English definitions of terms inside their~~ ~~goal systems, and otherwise tend to behave as bounded~~ ~~expected utility optimizers~~ unless otherwise specified. The genie is not assumed to have any goals other than those described in its stated goal system; it neither loves you nor hates you apart from that. In particular, a neutral genie does not care one way or another 'what you really meant' or 'what you had in mind' (unless we have otherwise constructed a Do What I Mean goal system for it, which will not be part of most metaphors).]: Suppose a neutral genie is instructed to produce human happiness. Suppose also that the neutral genie is not powerful enough to solve the problem by immediately tiling the universe with the simplest objects qualifying as human whose pleasure centers can represent the highest possible values. Instead, the genie is at a stage of development where the programmer can see the genie developing a plan to administer heroin to humans. The programmer hastily adds a penalty term to the utility function that assigns -100 utilons whenever the genie administers heroin to humans.

@• The AI is a consequentialist, a bounded maximizer, or is conducting some other search such that when the search is blocked at X, the search may happen upon a similar X' that fits the same criterion that promoted X. For a consequentialist, if X leads to goal G, then a similar X' may also lead to G. For a bounded maximizer, if X...

Definition

Introduction

'Nearest ~~Unblocked Neighbor occurs when~~unblocked strategy' seems like it should be a ~~programmer~~foreseeable ~~attempts~~problem of trying to get rid of undesirable AI behaviors by adding specific penalty terms to them, or otherwise trying to exclude aone class of observed or foreseen bad behaviors. Namely, if a decision criterion thinks X is the best thing to do, and you add a penalty term P that you think excludes everything inside X, the next-best thing to do may be a very similar thing X′ which is the most similar thing to X that doesn't trigger P.

Example: Producing happiness.

Some very early proposals for AI alignment suggested that AIs be targeted on producing human happiness. Leaving aside various other objections, arguendo, imagine the following series of problems and attempted fixes:

By hypothesis, the AI is successfully infused with a goal of "human happiness" as a utility function over human brain states. (Arguendo, this predicate is narrowed sufficiently that the AI does not just want to construct the tiniest, least resource-intensive brains experiencing the largest amount of happiness per erg of energy.)
The AI's policy space expands to the point where it includes conceivable options like "forcibly administer heroin".
The programmers notice this. (Arguendo, due to a transparency or user querying procedure which operated as intended at the AI's current level of intelligence.)
The programmers add a penalty of -100 utilons for "administering heroin to humans". (Arguendo, the AI's current level of intelligence does not suffice to prevent the programmers from editing its utility function, despite the convergent instrumental incentive to avoid this; nor does it successfully deceive the programmers.)
The AI gets slightly smarter. New conceivable options enter the AI's option space.
The AI suggests administering cocaine to humans.
The programmers read through the current schedule of prohibited drugs and add penalty terms for administering marijuana, cocaine, etcetera.
The AI becomes slightly smarter. New options enter the policy space.
The AI proposes a research campaign to develop a new happiness drug not on the list of drugs that its utility function designates as bad.
The programmers, after some work, manage to develop a category for 'The AI forcibly administering any kind of psychoactive drug to humans' which is broad enough that the AI stops suggesting research campaigns to develop things slightly outside the category.
The AI suggests that it build an external system to administer heroin, so that it won't be classified inside this set of bad events that include the AI administering heroin.
The programmers generalize the penalty predicate to include "AIs in general forcibly administering heroin" as a bad thing.
The AI suggests paying humans to forcibly administer heroin.
The programmers try to generalize the category of penalized events to include forcible administration of drugs in general that

...

			v1.25.0May 2nd 2016 GMT	(+1230/-483)
			v1.24.0May 2nd 2016 GMT	(+9/-9)
			v1.23.0May 2nd 2016 GMT	(+81/-66)
			v1.22.0May 2nd 2016 GMT	(+121/-78)
			v1.21.0May 2nd 2016 GMT	(+194/-323)
			v1.20.0May 2nd 2016 GMT	(+6369/-3425)
			v1.19.0Apr 29th 2016 GMT
			v1.18.0Apr 29th 2016 GMT	(+117)
			v1.17.0Mar 19th 2016 GMT	(-361)
			v1.16.0Dec 16th 2015 GMT	(+92/-770)

			v1.25.0May 2nd 2016 GMT	(+1230/-483)
			v1.24.0May 2nd 2016 GMT	(+9/-9)
			v1.23.0May 2nd 2016 GMT	(+81/-66)
			v1.22.0May 2nd 2016 GMT	(+121/-78)
			v1.21.0May 2nd 2016 GMT	(+194/-323)
			v1.20.0May 2nd 2016 GMT	(+6369/-3425)
			v1.19.0Apr 29th 2016 GMT
			v1.18.0Apr 29th 2016 GMT	(+117)
			v1.17.0Mar 19th 2016 GMT	(-361)
			v1.16.0Dec 16th 2015 GMT	(+92/-770)

LESSWRONG
LW