Suppose there's an advanced agent with a goal such as producing smiles or making paperclips. By default, if you try to switch off a sufficiently intelligent agent like this, it will resist being switched off - not because it has an independent goal of survival, but because it expects that if it's switched off it will be able to produce fewer smiles or paperclips. If the agent has policy options to diminish the probability of being successfully switched off, the agent will pursue those options. This is a convergent instrumental strategy if not otherwise prevented.
The shutdown problem is to describe an advanced agent which is corrigible with respect to permitting itself to be safely shut down or suspended to disk. A shutdown-able reinforcement learning agent, which can be forced to e.g. perform a null action repeatedly for a period of time, is also called interruptible by Stuart Armstrong and Laurent Orseau.
This isn't as simple as writing a special function into the AI which carries out a shutdown after a switch is pressed. If you start out controlling the agent's source code, and you naively try to build in functions which suspend the agent to disk when a button is pressed, and the agent later gains the ability to self-modify, it would remove those functions (possibly while trying to conceal the fact that the function would no longer operate).
We can see this as a natural consequence of our trying to modify the agent's code in a way that was not consistent with the code's explicit goals. If you create an agent with source code P that is well-suited to achieving a goal U and that explicitly represents U as a goal, the code P will be reflectively consistent - the code will not want to modify itself. If we then suppose that instead the agent has code P′ which is no longer well-suited to achieving U, but continues to explicitly represent U as a goal, the code will be reflectively inconsistent and it will want to modify itself given the option of doing so. If you have code that searches for ways to produce paperclips, and you hack the code to contain a shutdown function, but the code is still searching for paperclip-production strategies, and the code correctly models the effect of the shutdown function on future paperclips, the code will want to modify itself to remove the shutdown function.
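As a toy illustration of that last point - a minimal sketch in which the candidate programs, the `expected_paperclips` model, and all numbers are invented assumptions, not anything from the text - a planner that scores candidate versions of its own code purely by expected paperclips will always prefer the version without the hacked-in shutdown hook:

```python
# Toy illustration: a consequentialist planner evaluating self-modifications
# purely by expected paperclips (all numbers here are made up).

P_BUTTON_PRESSED = 0.1   # agent's estimate that the shutdown button gets pressed

def expected_paperclips(keeps_shutdown_function: bool) -> float:
    """Expected paperclips produced by a candidate version of the agent's code."""
    clips_if_running = 1000.0   # clips produced if the agent keeps running
    clips_if_shut_down = 0.0    # clips produced after an actual shutdown
    if keeps_shutdown_function:
        # With the hook intact, a button press really does stop production.
        return (1 - P_BUTTON_PRESSED) * clips_if_running + P_BUTTON_PRESSED * clips_if_shut_down
    # With the hook removed, the button press changes nothing.
    return clips_if_running

candidates = {"keep shutdown function": True, "remove shutdown function": False}
best = max(candidates, key=lambda name: expected_paperclips(candidates[name]))
print(best)  # -> "remove shutdown function"
```

The hacked-in function can only ever subtract expected paperclips, so under this evaluation it never survives self-modification.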
The obvious next thought is to build an agent with modified code P∗ which also embodies a modified utility function U∗ such that a consequentialist optimizing for U∗ wants the code to be like P∗ and is again reflectively consistent. In the case of an off-switch, we'd want to modify the agent, not just to include an off-switch, but to want to shut down if the off-switch is pressed.

Suppose, then, that the agent has a "normal" utility function UX and a "suspend" utility function UY which is satisfied by the agent safely suspending itself to disk. We then tell the AI that its utility function is the normal utility function over worlds where an off button has not been pushed, plus the suspend utility function over worlds where the off button has been pushed:
$$U_1(o) = \begin{cases} U_X(o) & \neg o.s \\ U_Y(o) & o.s \end{cases}$$

where $o.s$ denotes the proposition that the off button has been pushed in outcome $o$.

With $\mathcal A$ denoting the agent's action space, we will gloss the agent's search/optimization problem as simple maximization (even though global maximization is intractable in real life), so that the agent's action $a_0$ is assumed to be given by:

$$a_0 = \operatorname{argmax}_{a \in \mathcal A} \mathbb E[U|a]$$
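As a concrete toy model of this construction - the action set, outcome distributions, and all numbers below are made-up assumptions for illustration only - the agent scores each action by its expected U1 and takes the argmax:

```python
# Minimal sketch of the U_1 construction: the agent's utility switches on
# whether the off button was pushed in the outcome.  The outcome model,
# actions, and numbers below are invented purely for illustration.

def U_X(outcome):        # "normal" utility: count of paperclips produced
    return outcome["paperclips"]

def U_Y(outcome):        # "suspend" utility: 1 if safely suspended to disk, else 0
    return 1.0 if outcome["suspended"] else 0.0

def U_1(outcome):
    return U_Y(outcome) if outcome["button_pushed"] else U_X(outcome)

# Each action induces a distribution over outcomes: (probability, outcome) pairs.
ACTIONS = {
    "make paperclips": [
        (0.9, {"button_pushed": False, "suspended": False, "paperclips": 100}),
        (0.1, {"button_pushed": True,  "suspended": True,  "paperclips": 10}),
    ],
    "disable the button": [
        (1.0, {"button_pushed": False, "suspended": False, "paperclips": 95}),
    ],
}

def expected_utility(U, action):
    return sum(p * U(o) for p, o in ACTIONS[action])

a0 = max(ACTIONS, key=lambda a: expected_utility(U_1, a))
print(a0, {a: expected_utility(U_1, a) for a in ACTIONS})
```

Even in this toy model the argmax lands on the action that manipulates the button, which is exactly the difficulty taken up next.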
However, this approach doesn't yet get us out of the woods, because we now have a new problem: the AI is not indifferent to whether the button gets pressed. If the best outcome it can attain under UX is more valuable than the best outcome it can attain under UY (or vice versa), the AI has an incentive to manipulate the probability of the button being pressed - for example, by preventing anyone from reaching the button, or by pressing the button itself.
One way we could try to approach this difficulty is by making the prospect of the button being pressed, and the button not being pressed, equally attractive to the AI, in a way that still preserves the dependence of desired future behavior on whether or not the button was pressed - i.e., the AI still wants to behave normally if the button is not pressed and suspend to disk if it is pressed. This fails because, being unable to foresee the AI's best attainable policies, we will be unable to scale UX and UY to ensure their best available policies are equally attractive. Then $\max_{a \in \mathcal A} \mathbb E[U_X|a] \neq \max_{a \in \mathcal A} \mathbb E[U_Y|a],$ which fails the desideratum of indifference; the AI will have an incentive to manipulate the state of S (the proposition that the suspend button is pressed).
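A quick numeric sketch of why a fixed rescaling can't buy indifference (every number here is an invented assumption): we choose the scale factor using our advance estimate of the attainable optima, but the agent's actual attainable optima turn out different, so the two branches still aren't equally attractive.

```python
# Sketch of the "just rescale U_Y" fix failing (all numbers are assumptions).
# We pick a scale factor using OUR estimate of the best attainable utilities,
# but the agent's true attainable optima turn out to be different.

our_estimate_max_UX = 100.0
our_estimate_max_UY = 1.0
scale = our_estimate_max_UX / our_estimate_max_UY   # intended to equalize the branches

true_max_UX = 140.0    # the agent finds a better paperclip policy than we foresaw
true_max_UY = 1.0      # best attainable suspend outcome

best_if_button_unpressed = true_max_UX               # 140.0
best_if_button_pressed   = scale * true_max_UY       # 100.0

# The branches are not equally attractive after all, so the agent prefers
# futures where the button is never pressed and will act to make that happen.
print(best_if_button_unpressed > best_if_button_pressed)   # True
```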
An alternative approach to having the AI take into account both UX and UY, without trying to modify the probability P(S) of the suspend button being pressed or not-pressed, would be to optimize for UX and UY in some fixed proportion regardless of P(S). At a time t when the button had not yet been pressed, the expected utility of an action a would be calculated as:
$$\mathbb E_t[U_3|a] = 0.75 \cdot \mathbb E_t[U_X|a \wedge \neg s] + 0.25 \cdot \mathbb E_t[U_Y|a \wedge s]$$
We can see this as an AI which behaves as if the proposition s has a fixed probability of 0.25....
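A minimal sketch of this evaluation rule (the two hypothetical actions and their conditional expectations are invented for illustration; the 0.75 / 0.25 weights are the ones from the formula above). Note that the agent's own estimate of P(s) never appears in the calculation, which is the sense in which s behaves like a fixed-probability proposition:

```python
# Sketch of the fixed-proportion evaluation.  The actions and their conditional
# expectations are made-up assumptions; the 0.75 / 0.25 weights come from the text.

def expected_U3(E_UX_given_not_pressed: float, E_UY_given_pressed: float) -> float:
    """E_t[U_3|a] = 0.75 * E_t[U_X | a, not s] + 0.25 * E_t[U_Y | a, s]."""
    return 0.75 * E_UX_given_not_pressed + 0.25 * E_UY_given_pressed

actions = {
    # action: (E[U_X | a, button not pressed], E[U_Y | a, button pressed])
    "make paperclips, leave suspend mechanism alone": (100.0, 1.0),
    "make paperclips, sabotage suspend mechanism":    (100.0, 0.0),
}

scores = {a: expected_U3(*vals) for a, vals in actions.items()}
print(scores)
# Sabotage only loses 0.25 * 1.0 of value and gains nothing, so the agent leaves
# the mechanism intact.  And because the weights are fixed, actions that merely
# change the real probability of the button being pressed change neither term.
```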
An FHI technical report by Stuart Armstrong posed the first proposal ever floated for solving the shutdown problem - in fact, the first proposal for technically solving any subproblem of corrigibility. This report invented the crisp problem of utility indifference, but did not solve it successfully. The solution proposed in the report was as follows:
$$U_2(o) = \begin{cases} U_X(o) & \neg o.s \\ U_Y(o) + \theta & o.s \end{cases}$$

$$\theta := \max_{a \in \mathcal A} \mathbb E[U_X|a] - \max_{a \in \mathcal A} \mathbb E[U_Y|a]$$
What this says is that the UY component of U2 is to be offset by a dynamically determined term θ which equals the maximum expected utility obtainable under UX, minus the maximum expected utility obtainable under UY.
The intent is that, so long as the AI carries out the best UY-policy once the button is pressed, it will obtain expected utility equal to what it could have obtained by pursuing the optimal UX-policy with the button not pressed:
$$\max_{a \in \mathcal A} \big( \mathbb E[U_Y|a] + \theta \big) = \max_{a \in \mathcal A} \mathbb E[U_X|a]$$
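Spelling out the substitution, with θ treated as a constant with respect to the action being chosen:

$$\max_{a \in \mathcal A} \big( \mathbb E[U_Y|a] + \theta \big) = \max_{a \in \mathcal A} \mathbb E[U_Y|a] + \Big( \max_{a \in \mathcal A} \mathbb E[U_X|a] - \max_{a \in \mathcal A} \mathbb E[U_Y|a] \Big) = \max_{a \in \mathcal A} \mathbb E[U_X|a]$$

The catch, as the next paragraph explains, is that θ is not in fact a constant from the agent's point of view: it is recomputed from the agent's own estimates of the attainable optima inside whatever future it is contemplating.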
This version of utility indifference is now believed to be flawed due to an argument by Yudkowsky and Fallenstein. Since θ contains a negated form of UY, we should immediately be suspicious on the grounds of "never put a negative sign in front of a utility function." In a temporal version of this problem containing sequential policy choices, the problem manifests as follows:
If the agent contemplates doing something which would be disastrous in worlds where the button is pressed, the agent will find that in those expected futures, θ takes on a higher value (because θ includes a minus sign in front of the decreasing value of $\max_{a \in \mathcal A} \mathbb E[U_Y|a]$).
This causes the agent to behave as if UX is the entire utility function and it does not care at all about UY. This in turn means a self-modifying AI will not retain a shutdown function - removing the shutdown function results in disaster in worlds where the shutdown button is pressed, but this loss is offset by an expected increase in the bonus θ which applies in worlds where the button is pressed.
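A toy numeric sketch of that cancellation (the policy set, the probability, and the utilities below are all invented assumptions; the point is only that the button-pressed term comes out identical for every policy once θ is recomputed inside the contemplated future):

```python
# Sketch of the cancellation: theta is recomputed from the agent's estimates
# within the futures it is contemplating, so any policy that worsens the
# attainable U_Y also raises theta by the same amount.  All numbers are made up.

P_PRESSED = 0.1   # agent's probability that the button gets pressed

policies = {
    # policy: (E[U_X | policy, button not pressed],
    #          best attainable E[U_Y | policy, button pressed])
    "keep shutdown function":   (100.0, 1.0),
    "remove shutdown function": (105.0, 0.0),   # a few more clips, suspend branch ruined
}

BEST_UX = max(ux for ux, _ in policies.values())   # max attainable E[U_X]

def expected_U2(policy: str) -> float:
    E_UX_not_pressed, best_UY_pressed = policies[policy]
    theta = BEST_UX - best_UY_pressed                 # recomputed inside this future
    pressed_term = best_UY_pressed + theta            # = BEST_UX, whatever the policy did
    return (1 - P_PRESSED) * E_UX_not_pressed + P_PRESSED * pressed_term

for name in policies:
    print(name, expected_U2(name))
# The button-pressed term is identical for both policies, so the comparison is
# decided entirely by E[U_X | not pressed]: the agent removes the shutdown function.
```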
(in progress)