Nearest Unblocked Neighbor occurs when a programmer's attempt to exclude a class of undesirable solutions X (e.g., by adding a term to the utility function that penalizes X, or by adding some deontological prohibition) fails (under the programmer's perspective on value) because the solution found is 'not technically an X' but still shares some undesirable features of the original X. By hypothesis, this is not because the artificial agent has a hidden terminal preference for the undesirable feature (it is not secretly an evil genie), but because blocking the solutions the agent ranked highest has promoted the nearest unblocked neighbor of the original solution to the top of the ranking.
Consider the following neutral-genie metaphor: Suppose a neutral genie is instructed to produce human happiness. Suppose also that the neutral genie is not powerful enough to solve the problem by immediately tiling the universe with the simplest objects qualifying as human whose pleasure centers can represent the highest possible values. Instead, the genie is at a stage of development where the programmer can see the genie developing a plan to administer heroin to humans. The programmer hastily adds a penalty term to the utility function that assigns -100 utilons whenever the genie administers heroin to humans.
It would not be very surprising if the next plan devised is to administer morphine to humans. Previously the highest-scoring solution pathway to produce human happiness went through the causal pathway of administering heroin; when those pathways were 'blocked', the next highest-ranking solution involved a very similar causal pathway going through the most similar drug that did not meet the 'heroin' predicate.
Next, suppose that the programmer, feeling exasperated at the obstinacy of the genie, creates a penalty term that penalizes the genie administering drugs of any kind to humans. It would not be very surprising if the next plan produced is to administer gene therapy that causes human brains to naturally produce endogenous opiates.
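The patch-and-replan dynamic in the story above can be sketched as a toy bounded maximizer. All plan names and scores below are illustrative, not taken from any real system: each patch adds a penalty term that blocks the current best plan, and the search simply settles on the next-highest-scoring near neighbor.

```python
# Toy sketch: a maximizer picks the highest-scoring plan; each programmer
# "patch" adds a penalty term that blocks the current best plan, and the
# search lands on the nearest unblocked neighbor.

# Plans and their scores under the agent's unpatched utility function
# (hypothetical numbers for illustration only).
plans = {
    "administer heroin": 100,
    "administer morphine": 99,                # nearest neighbor of the blocked plan
    "gene therapy for endogenous opiates": 98,
    "genuinely improve human lives": 60,      # what the programmer actually wanted
}

penalties = {}  # patches added so far: plan -> penalty term

def utility(plan):
    return plans[plan] + penalties.get(plan, 0)

def best_plan():
    return max(plans, key=utility)

print(best_plan())                       # administer heroin
penalties["administer heroin"] = -100    # patch 1: block heroin
print(best_plan())                       # administer morphine
penalties["administer morphine"] = -100  # patch 2: block morphine
print(best_plan())                       # gene therapy for endogenous opiates
```

Note that each patch only moves the argmax one step through strategy-space; nothing in the loop ever steers the search toward the plan the programmer wanted.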
The case for Nearest Unblocked Neighbor predicts that the problem will arise given all of the following conditions:
• The AI is a consequentialist, a bounded maximizer, or is conducting some other search such that when the search is blocked at X, the search may happen upon a similar X' that fits the same criterion that promoted X. For a consequentialist, if X leads to goal G, then a similar X' may also lead to G. For a bounded maximizer, if X leads to an outcome of high expected utility and X is blocked, then a similar X' may also lead to an outcome of high expected utility.
• The search is taking place over a rich domain where the space of relevant neighbors around X is too complicated for us to easily describe, or to be certain that we have described all the relevant neighbors correctly. If we imagine an agent playing the purely ideal game of logical Tic-Tac-Toe, then if the agent's utility function hates playing in the center of the board, we can be sure (because we can exhaustively consider the space) that there are no Tic-Tac-Toe squares that behave strategically almost like the center but don't meet the exact definition we used of 'center'. In the far more complicated real world, when you eliminate 'administer heroin' you are very likely to find some other chemical or trick that is strategically mostly equivalent to administering heroin. See the proposition "Almost all real-world domains are rich".
• From our perspective on value, the AI does not have an absolute identification of value for the domain, due to some combination of "the domain is rich" and "value is complex". Chess can be complicated enough to defeat human players, but since a chess program can have an absolute identification of what constitutes winning, we don't run into a problem of unending patches in identifying which states of the board constitute winning conditions. (However, if we consider a very early chess program that (from our perspective) was trying to be a consequentialist but wasn't very good at it, then we can imagine that, if the early chess program consistently threw its queen onto the right edge of the board for strange reasons, forbidding it to move the queen there might well lead it to throw the queen onto the left edge for the same strange reasons. The valuable intermediate states of the board are not absolutely identifiable in the same way as the valuable end states.)
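The Tic-Tac-Toe claim in the second condition above can actually be verified by exhaustive enumeration, which is exactly what rich domains do not permit. A minimal sketch (the encoding of squares and winning lines is mine): the center is the unique square lying on four winning lines, so blocking 'center' provably leaves no square that is strategically almost-the-center.

```python
# Exhaustively check that no Tic-Tac-Toe square is a "near neighbor"
# of the center by the property of lying on many winning lines.
from itertools import product

# All 8 winning lines, each a set of (row, col) squares.
lines = (
    [{(r, c) for c in range(3)} for r in range(3)]                 # rows
    + [{(r, c) for r in range(3)} for c in range(3)]               # columns
    + [{(i, i) for i in range(3)}, {(i, 2 - i) for i in range(3)}] # diagonals
)

def lines_through(square):
    return sum(square in line for line in lines)

counts = {sq: lines_through(sq) for sq in product(range(3), repeat=2)}

assert counts[(1, 1)] == 4  # only the center lies on 4 winning lines
assert all(n <= 3 for sq, n in counts.items() if sq != (1, 1))  # no near neighbor
```

Because the domain is small enough to enumerate, the block on 'center' is provably complete; no analogous proof is available when blocking 'administer heroin' in the real world.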
Under these conditions, a predicate added to eliminate an X of low value and high solution-fitness may well lead to a strategically similar X' of low value and high solution-fitness. This takes the programmers by surprise, forming a treacherous problem, in the case where they debugged X when the AI was stupid and the AI then gained new options once it was smart.
Although humans obeying the law make poor analogies for mathematical algorithms, in some cases human economic actors expect not to encounter legal or social penalties for obeying 'the letter rather than the spirit' of the law. In those cases, after a previously high-yield strategy is outlawed or penalized, 'Nearest Unblocked Neighbor' often seems like a fair description of the outcome after the humans reconsider and replan. This illustrates that the theoretical argument also applies in practice to at least some pseudo-economic agents (humans).
To a human, 'poisonous' is one word. In terms of molecular biology, the exact volume of the configuration space of molecules that is 'nonpoisonous' is very complicated. By having a single word/concept for poisonous-vs.-nonpoisonous, we're dimensionally reducing the space of edible substances - taking a very squiggly volume of molecule-space, and mapping it all onto a linear scale from 'nonpoisonous' to 'poisonous'.
There's a sense in which human cognition implicitly performs dimensional reduction on our solution space, especially by simplifying dimensions that are relevant to some component of our values. There may be some psychological sense in which we feel like "do X, only not weird low-value X" ought to be a simple instruction, and an agent that repeatedly produces the next unblocked weird low-value X is being perverse - that the agent, given a few examples of weird low-value Xs labeled as noninstances of the desired concept, ought to be able to just generalize to not produce weird low-value Xs.
In fact, if it were possible to encode all relevant dimensions of human value into the agent, then we could directly say "do X, but without any side effects that we consider low-value". By the definition of Full Coverage, the agent's concept of 'low-value' would include everything that is of low value, so this one instruction would blanket all the undesirable strategies we want to avoid.
Conversely, the truth of the complexity of value thesis would imply that the simple word 'low-value' is dimensionally reducing a space of tremendous algorithmic complexity. Thus the effort or information required to actually convey the relevant dos and don'ts of "X, only not weird low-value X" would be high, and a human-generated set of supervised examples labeled 'not the kind of X we mean' would be unlikely to cover and stabilize all the dimensions of the underlying space of possibilities. Since the weird low-value X cannot be eliminated in one instruction or several patches or a human-generated set of supervised examples, the Nearest Unblocked Neighbor problem will recur incrementally each time a patch is attempted.
Thus the 'nearest unblocked neighbor' problem derives from complexity of value, and if Full Coverage were already achieved in an agent then the Nearest Unblocked Neighbor problem might be eliminated.