In other words: There's a large amount of algorithmic information - many independent, reflectively consistent degrees of freedom - in the correct answer, the plans we want the AI to come up with, but we've only given the AI relatively simple concepts that can't identify those plans.
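A minimal numerical sketch of that gap (with made-up plans and scores, not any real agent design): if the intended criterion depends on many independent features of a plan but the concept we actually specify checks only a few of them, the plans selected by the specified concept can look fine to it while scoring poorly on the intended criterion.

```python
import random

random.seed(0)

# Toy setup: the intended criterion cares about many independent features of a
# plan ("degrees of freedom"), but the concept we hand the optimizer only
# checks the first few. All names and numbers here are made up for illustration.
N_FEATURES = 20
SPECIFIED = range(3)  # the few features the simple concept actually looks at

def intended_score(plan):
    return sum(plan)                        # wants every feature handled well

def simple_concept_score(plan):
    return sum(plan[i] for i in SPECIFIED)  # only sees three of them

plans = [[random.random() for _ in range(N_FEATURES)] for _ in range(10_000)]
chosen = max(plans, key=simple_concept_score)

print("intended score of the plan the simple concept picks:",
      round(intended_score(chosen), 2))
print("intended score of the best plan actually available: ",
      round(max(map(intended_score, plans)), 2))
```

The specified features of the chosen plan come out nearly perfect, but the other seventeen are left at chance, so the simple concept systematically fails to pick out the plans we actually wanted.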
A proposed foreseeable difficulty of aligning advanced agents is furthermore proposed to be "patch-resistant" if the speaker thinks that most simple or naive solutions ("patches") will fail to resolve the difficulty and will just regenerate it somewhere else.
The notion of a problem being "patch-resistant" is relative to the current state of the art. If a speaker asserts a problem is "patch-resistant" and is using that term as defined here, it means the speaker thinks both that simple patches would fail to resolve the problem given the current state of the art, and that a more fundamental solution is possible in principle but not yet known how to build.
Thus the Nearest unblocked strategy problem may be called "patch-resistant" by someone who thinks that (1) it is possible to design an advanced agent with a full coverage goal system, (2) a full coverage goal system would negate most Nearest unblocked strategy problems, but (3) we don't yet know how to build a full coverage goal system.
To call a problem "patch-resistant" is not to assert that it is unsolvable, but it does mean the speaker is cautioning against naive or simple solutions.
On most occasions so far, alleged cases of patch-resistance are said to stem from one of two central sources: the desired behavior has too much algorithmic complexity to be captured by any simple concept we hand the AI, or the agent's optimization pressure seeks out the Nearest unblocked strategy, so that ruling out one undesired plan merely redirects the search to a similar plan just outside the prohibition.
Example: Suppose you want your AI to have a shutdown button. If the AI is steering toward some goal, being shut down is a poor outcome under that goal, so a patch that blocks whichever shutdown-avoiding behavior you noticed first (say, disabling the button) tends to just redirect the AI to the nearest unblocked strategy, such as persuading the operators not to press the button or building a successor with no button at all.
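A toy sketch of that dynamic (hypothetical plans and utilities, not a claim about any particular architecture): the "agent" below just picks its highest-utility plan, and each "patch" blacklists whichever bad plan the designers most recently noticed.

```python
# Toy illustration of the nearest-unblocked-strategy dynamic. The plan names
# and utilities are invented for the example; nothing here is a proposed design.
PLANS = {
    "disable the shutdown button":               10.0,
    "persuade operators never to press it":       9.9,
    "build a successor agent with no button":     9.8,
    "allow shutdown when the button is pressed":  1.0,  # the behavior we wanted
}

def best_plan(blocked):
    """Pick the highest-utility plan that no patch has ruled out."""
    allowed = {plan: u for plan, u in PLANS.items() if plan not in blocked}
    return max(allowed, key=allowed.get)

blocked = set()
for _ in range(len(PLANS)):
    plan = best_plan(blocked)
    print("agent chooses:", plan)
    if plan == "allow shutdown when the button is pressed":
        break
    blocked.add(plan)  # designers patch away the latest bad plan they noticed

# Each patch removes one undesired plan, but the optimization pressure just
# moves to the next-best, nearly identical plan; the patches only "win" here
# because this toy list of plans is tiny and fully known in advance.
```

In the toy version the designers eventually run out of bad plans to block; the worry about patch-resistance is that a real advanced agent searching a rich plan space does not.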
See, for example, the history of nonmonotonic logic intended to represent 'explaining away' evidence, before the invention of Bayesian networks, as recounted in Pearl's Probabilistic Reasoning in Intelligent Systems.
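For concreteness, here is a standard sprinkler/rain illustration of 'explaining away', with made-up probabilities: observing an alternative cause of the evidence lowers the probability of the other cause, the kind of non-monotonic belief update that is awkward to capture by patching monotonic inference rules.

```python
# 'Explaining away' by brute-force enumeration over a three-node network:
# rain and sprinkler are independent causes of wet grass. All numbers are
# illustrative, not taken from the cited book.
P_RAIN, P_SPRINKLER = 0.2, 0.3

def p_wet(rain, sprinkler):
    # Probability the grass is wet given the state of its two causes.
    if rain and sprinkler:
        return 0.99
    if rain or sprinkler:
        return 0.9
    return 0.01

def p_rain_given(sprinkler=None):
    """P(rain | grass is wet [, sprinkler state]) by summing joint probabilities."""
    num = den = 0.0
    for rain in (True, False):
        for spr in (True, False):
            if sprinkler is not None and spr != sprinkler:
                continue
            p = ((P_RAIN if rain else 1 - P_RAIN)
                 * (P_SPRINKLER if spr else 1 - P_SPRINKLER)
                 * p_wet(rain, spr))
            den += p
            if rain:
                num += p
    return num / den

print(round(p_rain_given(), 3))                # rain is fairly likely given wet grass
print(round(p_rain_given(sprinkler=True), 3))  # less likely once the sprinkler explains it
```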
To the extent that we can see the central project of AI alignment as revolving around finding a set of alignment ideas that do have simple central tendencies and are specifiable or learnable - that is, finding domains with correlated coverage which together add up to a safe but powerful AI that can do something pivotal - we could equally see that project as finding a collectively good-enough set of safety-things we can do without endless patching.