summary: Nearest Unblocked Neighbor is a hypothetical source of [patch resistance in the alignment problem for advanced agents that search rich solution spaces. If an agent's preference framework is patched to try to block a possible solution that seems undesirable, the next-best solution found may be the most similar solution that barely avoids the block.]
The overall story is one where the AI's preferences on round i, denoted Ui, are observed to arrive at an attainable optimum Xi which the humans see as undesirable. The humans devise a penalty term Pi intended to exclude the undesirable parts of the policy space, and add this to Ui creating a new utility function Ui+1, after which the AI's optimal policy settles into a new state X∗i that seems acceptable. However, after the next expansion of the policy space, Ui+1 settles into a new attainable optimum Xi+1 which is very similar to Xi and makes the minimum adjustment necessary to evade the boundaries of the penalty term Pi, requiring a new penalty term Pi+1 to exclude this new misbehavior; and the process repeats.misbehavior.
'Nearest unblocked strategy' seems like it should be a foreseeable problem of trying to get rid of undesirable AI behaviors by adding specific penalty terms to them, or otherwise trying to exclude one class of observed or foreseen bad behaviors. Namely, if a decision criterion thinks X is the best thing to do, and you add a penalty term P that you think excludes everything inside X, the nextnext-best-best thing to do may be a very similar thing X′ which is the most similar thing to X that doesn't trigger P.
(This might not kill you if the AI had enough successful, advanced-safe corrigibility features that the AI would indefinitely go on checking novel policies and novel goal instantiations with the users, not strategically hiding its disalignment from the programmers, not deceiving the programmers, letting the programmers edit its utility function, not doing anything disastrous before the utility function had been edited, etcetera. It's worth noting that the level of programmer intelligence requiredBut you wouldn't want to do all that, successfully, in a way that doesn't break down for any number of other reasons as the AI becomes smarter and undergoes changes of context, seems inconsistent with the level of stupidity required to tell an AI to maximize human happinessrely on this; you would not want in the first place.place to operate on the paradigm of 'maximize happiness, just not via any of these bad methods that we have already thought of and excluded'.)
Summary:summary: Nearest Unblocked Neighbor is a hypothetical source of patch[patch resistance in the alignment problem for advanced agents that search rich solution spaces. If an agent's preference framework is patched to try to block a possible solution that seems undesirable, the next-best solution found may be the most similar solution that barely avoids the block.]
Nearest Unblocked Neighbor occurs when a programmer attempts to exclude a class of undesirable solutions X (e.g. adding a term to the utility function that penalizes X, or adding a deontological prohibition somehow) fails (under the [programmer's]programmer's perspective on Valuevalue) because the solution found is 'not technically an X' but still shares some undesirable features of the original X. By hypothesis, this is not because the artificial agent has a hidden terminal preference for the undesirable feature X (it is not secretly an evil genie) but rather because blocking the solutions the agent ranked highest has made the next-highest-ranking solution be the nearest unblocked neighbor of the original solution.
Consider the following neutral-genie metaphor[# Definition. A neutral-genie metaphor is an attempt to illustrate a possible formal problem via an informal analogies involving neutral genies who obey English definitions of terms inside their goal systems, and otherwise tend to behave as bounded expected utility optimizers unless otherwise specified. The genie is not assumed to have any goals other than those described in its stated goal system; it neither loves you nor hates you apart from that. In particular, a neutral genie does not care one way or another 'what you really meant' or 'what you had in mind' (unless we have otherwise constructed a Do What I Mean goal system for it, which will not be part of most metaphors).]: Suppose a neutral genie is instructed to produce human happiness. Suppose also that the neutral genie is not powerful enough to solve the problem by immediately tiling the universe with the simplest objects qualifying as human whose pleasure centers can represent the highest possible values. Instead, the genie is at a stage of development where the programmer can see the genie developing a plan to administer heroin to humans. The programmer hastily adds a penalty term to the utility function that assigns -100 utilons whenever the genie administers heroin to humans.
@• The AI is a consequentialist, a bounded maximizer, or is conducting some other search such that when the search is blocked at X, the search may happen upon a similar X' that fits the same criterion that promoted X. For a consequentialist, if X leads to goal G, then a similar X' may also lead to G. For a bounded maximizer, if X...
'Nearest Unblocked Neighbor occurs whenunblocked strategy' seems like it should be a programmerforeseeable attemptsproblem of trying to get rid of undesirable AI behaviors by adding specific penalty terms to them, or otherwise trying to exclude aone class of observed or foreseen bad behaviors. Namely, if a decision criterion thinks X is the best thing to do, and you add a penalty term P that you think excludes everything inside X, the next-best thing to do may be a very similar thing X′ which is the most similar thing to X that doesn't trigger P.
Some very early proposals for AI alignment suggested that AIs be targeted on producing human happiness. Leaving aside various other objections, arguendo, imagine the following series of problems and attempted fixes: