Suppose there's an advanced agent with a goal such as producing smiles or making paperclips. By default, if you try to switch off a sufficiently intelligent agent like this, it will resist being switched off - not because it has an independent goal of survival, but because it expects that if it's switched off it will be able to produce fewer smiles or paperclips. If the agent has policy options to diminish the probability of being successfully switched off, the agent will pursue those options. This is a convergent instrumental strategy if not otherwise prevented.
The shutdown problem is to describe an advanced agent which is corrigible with respect to permitting itself to be safely shut down or suspended to disk. A shutdown-able reinforcement learning agent, which can be forced to e.g. perform a null action repeatedly for a period of time, is also called interruptible by Stuart Armstrong and Laurent Orseau.
This isn't as simple as writing a special function into the AI which carries out a shutdown after a switch is pressed. If you start out controlling the agent's source code, and you naively try to build in functions which suspend the agent to disk when a button is pressed, and the agent later gains the ability to self-modify, it would remove those functions (possibly while trying to conceal the fact that the function would no longer operate).
We can see this as a natural consequence of our trying to modify the agent's code in a way that was not consistent with the code's explicit goals. If you create an agent with source code P that is well-suited to achieving a goal U and that explicitly represents U as a goal, the code P will be reflectively consistent - the code will not want to modify itself. If we then suppose that instead the agent has code P′ which is no longer well-suited to achieving U, but continues to explicitly represent U as a goal, the code will be reflectively inconsistent and it will want to modify itself given the option of doing so. If you have code that searches for ways to produce paperclips, and you hack the code to contain a shutdown function, but the code is still searching for paperclip-production strategies, and the code correctly models the effect of the shutdown function on future paperclips, the code will want to modify itself to remove the shutdown function.
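As a toy illustration of that last point - a minimal sketch in which the candidate programs, the `expected_paperclips` model, and all numbers are invented assumptions, not anything from the text - a planner that scores candidate versions of its own code purely by expected paperclips will always prefer the version without the hacked-in shutdown hook:

```python
# Toy illustration: a consequentialist planner evaluating self-modifications
# purely by expected paperclips (all numbers here are made up).

P_BUTTON_PRESSED = 0.1   # agent's estimate that the shutdown button gets pressed

def expected_paperclips(keeps_shutdown_function: bool) -> float:
    """Expected paperclips produced by a candidate version of the agent's code."""
    clips_if_running = 1000.0   # clips produced if the agent keeps running
    clips_if_shut_down = 0.0    # clips produced after an actual shutdown
    if keeps_shutdown_function:
        # With the hook intact, a button press really does stop production.
        return (1 - P_BUTTON_PRESSED) * clips_if_running + P_BUTTON_PRESSED * clips_if_shut_down
    # With the hook removed, the button press changes nothing.
    return clips_if_running

candidates = {"keep shutdown function": True, "remove shutdown function": False}
best = max(candidates, key=lambda name: expected_paperclips(candidates[name]))
print(best)  # -> "remove shutdown function"
```

The hacked-in function can only ever subtract expected paperclips, so under this evaluation it never survives self-modification.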
The obvious next thought is to build an agent with modified code P∗ which also embodies a modified utility function U∗ such that a consequentialist optimizing for U∗ wants the code to be like P∗ and is again reflectively consistent. In the case of an off-switch, we'd want to modify the agent, not just to include an off-switch, but to want to shut down if the off-switch is pressed.

Suppose, then, that the agent has a "normal" utility function UX and a "suspend" utility function UY which is satisfied by the agent safely suspending itself to disk. We then tell the AI that its utility function is the normal utility function over worlds where an off button has not been pushed, plus the suspend utility function over worlds where the off button has been pushed:
$$U_1(o) = \begin{cases} U_X(o) & \neg o.s \\ U_Y(o) & o.s \end{cases}$$

where $o.s$ denotes the proposition that the off button has been pushed in outcome $o$.

With $\mathcal A$ denoting the agent's action space, we will gloss the agent's search/optimization problem as simple maximization (even though global maximization is intractable in real life), so that the agent's action $a_0$ is assumed to be given by:

$$a_0 = \operatorname{argmax}_{a \in \mathcal A} \mathbb E[U|a]$$
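As a concrete toy model of this construction - the action set, outcome distributions, and all numbers below are made-up assumptions for illustration only - the agent scores each action by its expected U1 and takes the argmax:

```python
# Minimal sketch of the U_1 construction: the agent's utility switches on
# whether the off button was pushed in the outcome.  The outcome model,
# actions, and numbers below are invented purely for illustration.

def U_X(outcome):        # "normal" utility: count of paperclips produced
    return outcome["paperclips"]

def U_Y(outcome):        # "suspend" utility: 1 if safely suspended to disk, else 0
    return 1.0 if outcome["suspended"] else 0.0

def U_1(outcome):
    return U_Y(outcome) if outcome["button_pushed"] else U_X(outcome)

# Each action induces a distribution over outcomes: (probability, outcome) pairs.
ACTIONS = {
    "make paperclips": [
        (0.9, {"button_pushed": False, "suspended": False, "paperclips": 100}),
        (0.1, {"button_pushed": True,  "suspended": True,  "paperclips": 10}),
    ],
    "disable the button": [
        (1.0, {"button_pushed": False, "suspended": False, "paperclips": 95}),
    ],
}

def expected_utility(U, action):
    return sum(p * U(o) for p, o in ACTIONS[action])

a0 = max(ACTIONS, key=lambda a: expected_utility(U_1, a))
print(a0, {a: expected_utility(U_1, a) for a in ACTIONS})
```

Even in this toy model the argmax lands on the action that manipulates the button, which is exactly the difficulty taken up next.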
However, this approach doesn't yet get us out of the woods, because we now have a new problem: the AI is not indifferent to whether the button gets pressed. If the best outcome it can attain under UX is more valuable than the best outcome it can attain under UY (or vice versa), the AI has an incentive to manipulate the probability of the button being pressed - for example, by preventing anyone from reaching the button, or by pressing the button itself.
One way we could try to approach this difficulty is by making the prospect of the button being pressed, and the button not being pressed, equally attractive to the AI, in a way that still preserves the dependence of desired future behavior on whether or not the button was pressed - i.e., the AI still wants to behave normally if the button is not pressed and suspend to disk if it is pressed. This fails because, being unable to foresee the AI's best attainable policies, we will be unable to scale UX and UY to ensure their best available policies are equally attractive. Then $\max_{a \in \mathcal A} \mathbb E[U_X|a] \neq \max_{a \in \mathcal A} \mathbb E[U_Y|a],$ which fails the desideratum of indifference; the AI will have an incentive to manipulate the state of S (the proposition that the suspend button is pressed).
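A quick numeric sketch of why a fixed rescaling can't buy indifference (every number here is an invented assumption): we choose the scale factor using our advance estimate of the attainable optima, but the agent's actual attainable optima turn out different, so the two branches still aren't equally attractive.

```python
# Sketch of the "just rescale U_Y" fix failing (all numbers are assumptions).
# We pick a scale factor using OUR estimate of the best attainable utilities,
# but the agent's true attainable optima turn out to be different.

our_estimate_max_UX = 100.0
our_estimate_max_UY = 1.0
scale = our_estimate_max_UX / our_estimate_max_UY   # intended to equalize the branches

true_max_UX = 140.0    # the agent finds a better paperclip policy than we foresaw
true_max_UY = 1.0      # best attainable suspend outcome

best_if_button_unpressed = true_max_UX               # 140.0
best_if_button_pressed   = scale * true_max_UY       # 100.0

# The branches are not equally attractive after all, so the agent prefers
# futures where the button is never pressed and will act to make that happen.
print(best_if_button_unpressed > best_if_button_pressed)   # True
```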
An alternative approach to having the AI take into account both UX and UY, without trying to modify the probability P(S) of the suspend button being pressed or not-pressed, would be to optimize for UX and UY in some fixed proportion regardless of P(S). At a time t when the button had not yet been pressed, the expected utility of an action a would be calculated as:
$$\mathbb E_t[U_3|a] = 0.75 \cdot \mathbb E_t[U_X|a \wedge \neg s] + 0.25 \cdot \mathbb E_t[U_Y|a \wedge s]$$
We can see this as an AI which behaves as if the proposition s has a fixed probability of 0.25....
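A minimal sketch of this evaluation rule (the two hypothetical actions and their conditional expectations are invented for illustration; the 0.75 / 0.25 weights are the ones from the formula above). Note that the agent's own estimate of P(s) never appears in the calculation, which is the sense in which s behaves like a fixed-probability proposition:

```python
# Sketch of the fixed-proportion evaluation.  The actions and their conditional
# expectations are made-up assumptions; the 0.75 / 0.25 weights come from the text.

def expected_U3(E_UX_given_not_pressed: float, E_UY_given_pressed: float) -> float:
    """E_t[U_3|a] = 0.75 * E_t[U_X | a, not s] + 0.25 * E_t[U_Y | a, s]."""
    return 0.75 * E_UX_given_not_pressed + 0.25 * E_UY_given_pressed

actions = {
    # action: (E[U_X | a, button not pressed], E[U_Y | a, button pressed])
    "make paperclips, leave suspend mechanism alone": (100.0, 1.0),
    "make paperclips, sabotage suspend mechanism":    (100.0, 0.0),
}

scores = {a: expected_U3(*vals) for a, vals in actions.items()}
print(scores)
# Sabotage only loses 0.25 * 1.0 of value and gains nothing, so the agent leaves
# the mechanism intact.  And because the weights are fixed, actions that merely
# change the real probability of the button being pressed change neither term.
```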
An FHI technical report by Stuart Armstrong posed the first proposal ever floated for solving the shutdown problem - in fact, the first proposal for technically solving any subproblem of corrigibility. This report invented the crisp problem of utility indifference, but did not solve it successfully. The solution proposed in the report was as follows:
$$U_2(o) = \begin{cases} U_X(o) & \neg o.s \\ U_Y(o) + \theta & o.s \end{cases}$$

$$\theta := \max_{a \in \mathcal A} \mathbb E[U_X|a] - \max_{a \in \mathcal A} \mathbb E[U_Y|a]$$
What this says is that the UY component of U2 is to be offset by a dynamically determined term θ which equals the maximum expected utility obtainable under UX, minus the maximum expected utility obtainable under UY.
The intent is that, so long as the AI carries out the best UY-policy once the button is pressed, it will obtain expected utility equal to what it could have obtained by pursuing the optimal UX-policy with the button not pressed:
$$\max_{a \in \mathcal A} \big( \mathbb E[U_Y|a] + \theta \big) = \max_{a \in \mathcal A} \mathbb E[U_X|a]$$
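Spelling out the substitution, with θ treated as a constant with respect to the action being chosen:

$$\max_{a \in \mathcal A} \big( \mathbb E[U_Y|a] + \theta \big) = \max_{a \in \mathcal A} \mathbb E[U_Y|a] + \Big( \max_{a \in \mathcal A} \mathbb E[U_X|a] - \max_{a \in \mathcal A} \mathbb E[U_Y|a] \Big) = \max_{a \in \mathcal A} \mathbb E[U_X|a]$$

The catch, as the next paragraph explains, is that θ is not in fact a constant from the agent's point of view: it is recomputed from the agent's own estimates of the attainable optima inside whatever future it is contemplating.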
This version of utility indifference is now believed to be flawed due to an argument by Yudkowsky and Fallenstein. Since θ contains a negated form of UY, we should immediately be suspicious on the grounds of "never put a negative sign in front of a utility function." In a temporal version of this problem containing sequential policy choices, the problem manifests as follows:
If the agent contemplates doing something which would be disastrous in worlds where the button is pressed, the agent will find that in those expected futures, θ takes on a higher value (because θ includes a minus sign in front of the decreasing value of $\max_{a \in \mathcal A} \mathbb E[U_Y|a]$).
This causes the agent to behave as if UX is the entire utility function and it does not care at all about UY. This in turn means a self-modifying AI will not retain a shutdown function - removing the shutdown function results in disaster in worlds where the shutdown button is pressed, but this loss is offset by an expected increase in the bonus θ which applies in worlds where the button is pressed.
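A toy numeric sketch of that cancellation (the policy set, the probability, and the utilities below are all invented assumptions; the point is only that the button-pressed term comes out identical for every policy once θ is recomputed inside the contemplated future):

```python
# Sketch of the cancellation: theta is recomputed from the agent's estimates
# within the futures it is contemplating, so any policy that worsens the
# attainable U_Y also raises theta by the same amount.  All numbers are made up.

P_PRESSED = 0.1   # agent's probability that the button gets pressed

policies = {
    # policy: (E[U_X | policy, button not pressed],
    #          best attainable E[U_Y | policy, button pressed])
    "keep shutdown function":   (100.0, 1.0),
    "remove shutdown function": (105.0, 0.0),   # a few more clips, suspend branch ruined
}

BEST_UX = max(ux for ux, _ in policies.values())   # max attainable E[U_X]

def expected_U2(policy: str) -> float:
    E_UX_not_pressed, best_UY_pressed = policies[policy]
    theta = BEST_UX - best_UY_pressed                 # recomputed inside this future
    pressed_term = best_UY_pressed + theta            # = BEST_UX, whatever the policy did
    return (1 - P_PRESSED) * E_UX_not_pressed + P_PRESSED * pressed_term

for name in policies:
    print(name, expected_U2(name))
# The button-pressed term is identical for both policies, so the comparison is
# decided entirely by E[U_X | not pressed]: the agent removes the shutdown function.
```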
(in progress)