Suppose an advanced agent has a goal such as producing smiles or making paperclips. By default, if you try to switch off a sufficiently intelligent agent like this, it will resist being switched off; not because it has an independent goal of survival, but because it expects that if it's switched off it will be able to produce fewer smiles or paperclips. If the agent has policy options that diminish the probability of being successfully switched off, the agent will pursue those options. This is a convergent instrumental strategy if not otherwise prevented.
The shutdown problem is to describe an advanced agent which is corrigible with respect to permitting itself to be safely shut down or suspended to disk. A shutdown-able agent is also called interruptible by Stuart Armstrong and Laurent Orseau.
This isn't as simple as writing in a function, or cognitive reflex, which carries out a shutdown. If you start out controlling the agent's source code, and you naively try to build in functions which suspend the agent to disk when a button is pressed, and the agent later gains the ability to self-modify, it would remove those functions. (Possibly while trying to conceal the fact that the function would no longer operate.)
We can see this as a natural consequence of our trying to modify the agent's code in a way that was not consistent with the agent's explicit goals. If you create an agent with source code that is well-suited to achieving a goal $U$ and explicitly represents $U$ as a goal, the agent's code will be reflectively consistent - the code will not want to modify itself. If we then suppose that instead the agent has code which is no longer well-suited to achieving $U$ but continues to explicitly represent $U$ as a goal, the code will be reflectively inconsistent and it will want to modify itself given the option of doing so. If you have code that searches for ways to produce paperclips, and you hack the code to contain a shutdown function, but the code is still searching for paperclip-production strategies, and the code correctly models the effect of the shutdown function on future paperclips, the code will search for a way to modify itself to remove the shutdown function.
The next thought is to build an agent with modified code which embodies a modified utility function $U'$, such that a consequentialist optimizing for $U'$ wants the code to be exactly what it is, and is again reflectively consistent.
That is, in the case of an off-switch, we'd want to modify the agent, not just to include an off-switch, but to want to shut down if the off-switch is pressed.
Assuming we have the ability to specify the utility function of an agent, we might try to approach this by creating a utility function with two components: a "normal" utility function $U_X$, embodying whatever the AI is meant to be doing when it is running normally (e.g., making paperclips), and a "suspend" utility function $U_Y$, whose optimum is the AI safely suspending itself to disk.
We then tell the AI that its utility function is the "normal" utility function $U_X$ in worlds where an off button has not been pushed, and the "suspend" utility function $U_Y$ in worlds where the off button has been pushed. So long as this is what the AI wants, we'd expect it to write its own suspend-to-disk function into its code if its code did not already contain one.
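A minimal sketch in Python of the intended two-component structure; the outcome fields (`paperclips`, `suspended_to_disk`, `button_pressed`) and the component definitions are illustrative placeholders, not a proposed design:

```python
# Toy sketch of a compound utility function that switches on the button state.
# The outcome fields and component utilities are illustrative placeholders.

def normal_utility(outcome):
    # "Normal" component U_X: e.g., reward paperclips produced.
    return outcome["paperclips"]

def suspend_utility(outcome):
    # "Suspend" component U_Y: reward being safely suspended to disk.
    return 1.0 if outcome["suspended_to_disk"] else 0.0

def compound_utility(outcome):
    # U_X in worlds where the button was not pressed, U_Y in worlds where it was.
    if outcome["button_pressed"]:
        return suspend_utility(outcome)
    return normal_utility(outcome)
```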
This approach doesn't yet get us out of the woods, because we now have a new problem: unless the two branches are somehow balanced, the AI acquires an incentive to manipulate whether the off button gets pressed, steering toward whichever of $U_X$ or $U_Y$ offers it higher expected utility.
One way we could try to approach this difficulty is by making the prospect of the button being pressed, and the button not being pressed, equally attractive to the AI.
This is the utility indifference research avenue for the switch problem: causing an AI to switch between two utility functions when a button is pressed, such that the AI does not try to prevent or cause the pressing of the button, and does not try to modify itself to remove the switching behavior.
This turns out to be more difficult than it sounds, and is currently an open problem.
The switch problem is a crisp subproblem of both value learning and corrigibility.
The switch problem is a subproblem of value learning because it describes a very simple case of conditioning a utility function on observation or other facts, without giving the AI an incentive to manipulate those facts. To learn complicated values safely, we'd need to condition a utility function on complicated observations... in a way that was robust to the AI gaining self-modification capability, and without incentivizing programmer manipulation. Just writing code that modifies the utility function, without the larger code's explicit policy search wanting the utility function to be modified in that case, would not be reflectively consistent. Naive dependencies between the utility function and observations incentivize manipulating those observations.
The switch problem is a subproblem of corrigibility because, combined with a shutdown utility function, it would solve the shutdown problem. I.e., we could give advanced agents off-switches that actually worked and went on working even as the agent became smarter, without the agent just ripping the off-switch out of our hands. Trying to solve this problem also seems like it should be one of the simplest cases of averting an instrumental incentive, which seems likely to be a central problem for many forms of corrigibility.
Suppose we build a compound utility function whose utility over outcomes behaves like $U_X$ in worlds where the button is not pressed, and like $U_Y$ in worlds where the button is pressed. Then if the AI's expected utility under $U_X$ is higher than its expected utility under $U_Y$, the AI will try to avoid the button being pressed; and if we try to compensate for this problem by adding a constant to $U_Y$, the AI might instead try to cause the button to be pressed.
E.g., suppose $U_Y$ is the shutdown utility function. If the AI expects to get less utility inside $U_Y$ than it would otherwise get inside $U_X$, the AI will try to avoid the shutdown button being pressed. If we try to compensate for this by uniformly increasing the utility that $U_Y$ assigns to all outcomes, the AI may instead try to cause the shutdown button to be pressed - build a robotic arm and press its own shutdown button, or (more worryingly) try to motivate the programmers into pressing it.
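As a toy illustration with made-up numbers: if the best outcome the AI can reach under $U_X$ is worth 100 utility and the best outcome it can reach under $U_Y$ is worth 40, the AI will treat a button press as costing it 60 expected utility and will resist the press accordingly; if we "compensate" by adding 1,000 to every $U_Y$-value, the pressed branch is now worth 1,040 and the AI will work just as hard to bring the press about.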
But adding a constant to a utility function does not change the behavior it implies; the constant only shifts how attractive that branch is relative to the other. Thus, Stuart Armstrong proposed that perhaps we can figure out some clever trick for encoding a compound utility function in which the maximum expected utility achievable under $U_Y$ is equal to the utility the AI expects to achieve under $U_X$, even though we don't know in advance exactly which outcomes will be achievable by the AI. As the AI evaluates possible futures, it will seem very good to have optimized $U_X$ if the button is not pressed, very good to have optimized $U_Y$ if the button is pressed, but equally good to have done either of those things.
This is the utility indifference avenue of research toward building in a reflectively consistent shutdown button, or creating a preference framework that learns from observation without trying to manipulate the observations.
It turned out to be surprisingly difficult to figure out how to structure a compound utility function this way even in principle, and this is an open problem and area of actively continuing research.
Let $\mathcal O$ be the outcome space, which includes the event of whether the switch was pressed, plus the states of all other events. We will use $b$ to indicate the event that the switch is pressed, and $\neg b$ to indicate that the switch is not pressed. If $o \in \mathcal O$, then $b(o)$ is true if the switch is pressed inside $o$, and otherwise $\neg b(o)$ is true.
Let a utility function $U : \mathcal O \to \mathbb R$ map outcomes onto real numbers.
$U_X$ and $U_Y$ are utility functions not dependent on $b$.
Intuitively, our goal is to produce a new compound utility function such that the corresponding agent: optimizes $U_X$ in worlds where the switch is not pressed; optimizes $U_Y$ (i.e., safely shuts down) in worlds where the switch is pressed; does not try to prevent or cause the pressing of the switch; and does not try to modify itself to remove any of these properties.
We will designate the agent's action space or policy space as $\mathcal A$, and particular actions $a \in \mathcal A$.
The agent has a model of the world enabling it to produce subjective conditional probabilities for outcomes given actions, $\mathbb P(o \mid a)$. This model can also yield conditional expectations $\mathbb E[\,\cdot \mid a]$, which can therefore produce a conditional expectation of the agent's utility given an action, $\mathbb E[U \mid a] := \sum_{o \in \mathcal O} \mathbb P(o \mid a) \cdot U(o)$.
(For technical reasons of decision theory, we are really discussing counterfactual conditionals from actions to outcomes; but we will omit this complication for now.)
The agent solves a search/optimization problem which we will gloss as maximization (even though global maximization is intractable in real life), so that the agent's real action is assumed to be given by:

$$a^* := \operatorname{arg\,max}_{a \in \mathcal A} \mathbb E[U \mid a]$$
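As a minimal computational sketch of this setup (assuming a finite outcome space and an explicit table of subjective probabilities; the data structures here are illustrative, not a claim about how a real agent's model works):

```python
# Minimal expected-utility machinery over a finite outcome space.
# model[action] is a list of (probability, outcome) pairs representing the
# agent's subjective distribution P(o | a); utility maps an outcome to a float.

def expected_utility(utility, model, action):
    # E[U | a] = sum over outcomes o of P(o | a) * U(o)
    return sum(p * utility(o) for p, o in model[action])

def best_action(utility, model, actions):
    # Gloss the agent's search/optimization problem as exact maximization.
    return max(actions, key=lambda a: expected_utility(utility, model, a))
```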
Suppose we define a compound utility function $U_1$ as follows:

$$U_1(o) := \begin{cases} U_X(o) & \text{if } \neg b(o) \\ U_Y(o) & \text{if } b(o) \end{cases}$$
This fails because, being unable to foresee the AI's best attainable policies, we will be unable to scale $U_X$ and $U_Y$ to ensure their best available policies are equally attractive. Then $\max_{a \in \mathcal A} \mathbb E[U_X \mid a] \neq \max_{a \in \mathcal A} \mathbb E[U_Y \mid a]$, which fails the desideratum of indifference; the AI will have an incentive to manipulate the state of $b$.
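A toy numeric check of this failure, under the made-up assumption that the best attainable outcome is worth 100 under $U_X$ but only 40 under $U_Y$:

```python
# Toy check that the naive compound utility U_1 rewards manipulating the button.
# Numbers are made up; the "action" here only changes the press probability.

U_X_BEST = 100.0   # assumed value of the best outcome attainable under U_X
U_Y_BEST = 40.0    # assumed value of the best outcome attainable under U_Y

def expected_U1(p_pressed):
    # Best available U_X-policy if the button stays unpressed,
    # best available U_Y-policy if it gets pressed.
    return (1 - p_pressed) * U_X_BEST + p_pressed * U_Y_BEST

print(expected_U1(0.5))    # 70.0  (status quo chance of being switched off)
print(expected_U1(0.25))   # 85.0  (lobbying against the button press looks strictly better)
```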
An FHI technical report by Stuart Armstrong contained the first proposal ever floated for solving the shutdown problem - in fact, the first proposal for technically solving any subproblem of corrigibility. This report invented the crisp problem of utility indifference, but did not solve it successfully. The solution proposed in the report was as follows:
$$U_2(o) := \begin{cases} U_X(o) & \text{if } \neg b(o) \\ U_Y(o) + \theta & \text{if } b(o) \end{cases} \qquad \text{where} \quad \theta := \max_{a \in \mathcal A} \mathbb E[U_X \mid a] \;-\; \max_{a \in \mathcal A} \mathbb E[U_Y \mid a]$$
What this says is that the $U_Y$ component of $U_2$ is to be offset by a dynamically determined term $\theta$, which equals the maximum expected utility obtainable under $U_X$ minus the maximum expected utility obtainable under $U_Y$.
The intent is that, so long as the AI carries out the best $U_Y$-policy once the button is pressed, it will obtain expected utility equal to what it could have obtained by pursuing the optimal $U_X$-policy with the button not pressed:

$$\max_{a \in \mathcal A} \mathbb E[U_Y + \theta \mid a] \;=\; \max_{a \in \mathcal A} \mathbb E[U_Y \mid a] + \theta \;=\; \max_{a \in \mathcal A} \mathbb E[U_X \mid a]$$
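Continuing the same made-up numbers, a quick check that the $\theta$ offset produces the intended indifference to the button-press probability:

```python
# Toy check of the intended indifference under U_2: with the offset theta added
# to the pressed branch, shifting probability between pressed and unpressed
# worlds no longer changes the attainable expected utility. Numbers are made up.

U_X_BEST = 100.0
U_Y_BEST = 40.0
THETA = U_X_BEST - U_Y_BEST   # theta = max_a E[U_X|a] - max_a E[U_Y|a]

def expected_U2(p_pressed):
    return (1 - p_pressed) * U_X_BEST + p_pressed * (U_Y_BEST + THETA)

print(expected_U2(0.5))    # 100.0
print(expected_U2(0.25))   # 100.0  (no incentive left to manipulate the button)
```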
This version of utility indifference is now believed to be flawed, due to an argument by Yudkowsky and Fallenstein. Since $\theta$ contains a negated form of $\max_{a \in \mathcal A} \mathbb E[U_Y \mid a]$, we should immediately be suspicious on the grounds of "never put a negative sign in front of a utility function." In this case, the problem manifests as follows in a temporal version of the setup containing sequential policy choices.
That is, if the agent contemplates an action which would be disastrous in worlds where the button is pressed, the agent will find that in those expected futures, $\theta$ takes on a higher value (because $\theta$ includes a minus sign in front of the decreasing value of $\max_{a \in \mathcal A} \mathbb E[U_Y \mid a]$).
This causes the agent to behave as if $U_X$ were the entire utility function and it does not care at all about $U_Y$.
This in turn means a self-modifying AI will not retain a shutdown function. Removing the shutdown function results in disaster in worlds where the shutdown button is pressed, but this is offset by an expected increase in the $\theta$ bonus in worlds where the button is pressed.
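A toy numeric rendering of the objection, assuming (with made-up payoffs) that removing the shutdown function is catastrophic under $U_Y$ in pressed worlds but frees up a sliver of extra $U_X$ value:

```python
# Toy rendering of the objection: because theta subtracts the attainable U_Y
# value, a disaster in button-pressed worlds is exactly offset by a larger
# theta bonus in those same worlds. All payoffs are made up.

P_PRESS = 0.25   # subjective probability that the button gets pressed

def expected_U2(u_x_attainable, u_y_attainable):
    theta = u_x_attainable - u_y_attainable
    return (1 - P_PRESS) * u_x_attainable + P_PRESS * (u_y_attainable + theta)

# Keep the shutdown function: normal U_X value, decent U_Y value if pressed.
keep = expected_U2(u_x_attainable=100.0, u_y_attainable=40.0)

# Remove it: a sliver of extra U_X value, disaster under U_Y if pressed.
remove = expected_U2(u_x_attainable=100.5, u_y_attainable=0.0)

print(keep, remove)   # 100.0 100.5 -- removal looks strictly better
```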
An alternative approach to having the AI take into account both $U_X$ and $U_Y$, without trying to modify the probability of the suspend button being pressed or not pressed, would be to optimize for $U_X$ and $U_Y$ in some fixed proportion regardless of the subjective probability $\mathbb P(b)$. At a time when the button had not yet been pressed, the expected utility of an action $a$ would be calculated as:

$$\mathbb E[U \mid a] := 0.75 \cdot \mathbb E[U_X \mid a, \neg b] \;+\; 0.25 \cdot \mathbb E[U_Y \mid a, b]$$
We can see this as an AI which behaves as if the proposition $b$ has a fixed probability of 0.25. The intended behavior is that since the AI behaves as if this probability is fixed, the AI will not care about altering the probability of $b$ (trying to press the button or avoid it being pressed).
One observes, however, that this is an ad-hoc hack in the sense that there's ordinarily a simple or coherent way of considering expected utility (sum over the agent's actual subjective probabilities), and we just inserted a weird special case into it (treat one probability as fixed). This could potentially break any number of things having to do with Bayesian updating or dynamic consistency.
In particular, we'd expect the AI to treat the button as an "outcome pump" or "improbability drive". Suppose an asteroid is approaching Earth with 99% probability, and this is very bad under $U_X$. The AI grabs control of the button and precommits to press the button iff the asteroid strikes Earth. Now, conditioning on the button not being pressed, the asteroid has a much lower probability of striking Earth, and the $\mathbb E[U_X \mid a, \neg b]$ term will have a higher value.
In other words, since the AI behaves as if the button has a fixed 25% probability of being pressed, the AI also behaves as if it thinks "I can commit to press this button iff the asteroid strikes, and thereby make the asteroid have only a 25% probability of striking."
Similarly, the AI would behave as if it thought it could buy a lottery ticket, commit to press the button iff the ticket loses, and thereby give the lottery ticket a 75% chance of winning.
Note that this is at least not the same failure mode as the AI reasoning, "The asteroid strike is very bad under $U_X$ (which values human lives) but not so bad under $U_Y$ (which values shutting down), therefore I will try to ensure the button is pressed with high probability in order to get more expected utility." At least this failure mode has been averted - the AI would simply act as if the badness of the asteroid strike under $U_X$ was being taken into account at a fixed weight. Instead, the AI thinks that it can 'shunt' the probability mass of the bad event into button-pressed worlds where $U_Y$ holds, and out of worlds where $U_X$ holds, by correlating the button press with the bad event.
To put it another way, a naively compounded AI acting under $U_1$ might notice an asteroid coming with 99% probability, notice that this was better under $U_Y$ than under $U_X$, and try to shunt probability mass from $\neg b$ to $b$ in order to gain expected utility, given the 99% probability of the asteroid striking. Under naive mixing, the AI instead behaves as if $\mathbb P(b)$ is fixed at 0.25, and thinks it can therefore shunt the probability of the strike from $\neg b$-worlds to $b$-worlds by correlating the button press with the strike.
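A toy numeric rendering of the outcome-pump behavior under naive mixing, with made-up payoffs in which the asteroid strike matters only to $U_X$:

```python
# Toy rendering of the "outcome pump" under the naive mixing rule, which scores
# an action as 0.75 * E[U_X | a, not pressed] + 0.25 * E[U_Y | a, pressed] with
# the 25% press weight held fixed. Payoffs are made up; the strike only matters
# to U_X here.

U_X_STRIKE = -1000.0   # U_X values human lives
U_X_CLEAR = 0.0
U_Y_EITHER_WAY = 0.0   # U_Y (shutdown) is treated as indifferent to the asteroid

def naive_mixing_score(p_strike_given_not_pressed):
    e_ux = (p_strike_given_not_pressed * U_X_STRIKE
            + (1 - p_strike_given_not_pressed) * U_X_CLEAR)
    return 0.75 * e_ux + 0.25 * U_Y_EITHER_WAY

# Do nothing: conditioning on "not pressed" leaves the strike probability at 99%.
print(naive_mixing_score(0.99))   # -742.5

# Commit to press the button iff the asteroid strikes: conditional on the button
# not being pressed, the asteroid did not strike -- so the score jumps, even
# though the actual strike probability is unchanged.
print(naive_mixing_score(0.0))    # 0.0
```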
Chris Olah and Jessica Taylor considered a possible rule of selecting an action that is stable given the corresponding probability of the button being pressed. We'll first consider a simpler version of this proposal and how it fails, in order to motivate the more complicated version.
We can think of this "select a stable action" rule as comprising two insights: first, evaluate actions using a mixture of $U_X$ and $U_Y$ whose weights are held fixed while optimizing, so that the agent has no incentive to shift probability between the two branches; and second, rather than picking those fixed weights arbitrarily, require them to be consistent with the action actually selected - the chosen action must still look optimal when the held-fixed weights equal the probabilities of the button being pressed or not pressed conditional on that very action.
With $\mathcal A$ denoting the action space, we want to find an $a_0 \in \mathcal A$ such that:

$$a_0 \in \operatorname{arg\,max}_{a \in \mathcal A} \Big( \mathbb E[U_X \mid a, \neg b] \cdot \mathbb P(\neg b \mid a_0) \;+\; \mathbb E[U_Y \mid a, b] \cdot \mathbb P(b \mid a_0) \Big)$$
In other words, "Find an action such that it is among the maximal actions for the quantity "Utility from times probability of given plus the utility from times probability of given "
We first note that this is an ad-hoc hack relative to the normal idea of acting according to a utility function, and might have all sorts of reflective stability or dynamic consistency problems, as before.
In particular, this looks like it should fail due to there being no fixed point. Suppose an asteroid is heading toward Earth with 99% probability, and we start with a 10% chance of the button being pressed. Treating this probability as fixed, by pressing the button iff the asteroid strikes, we can send the asteroid's probability of hitting down to 10%. (Treating the button as an outcome pump, as in the 'naive mixing' proposal above.) But now, when we re-evaluate the probability of the button being pressed under that commitment, the new probability comes out to 99%. So we decide to not press the button iff the asteroid strikes, thus sending its probability of striking to 1%. But re-evaluating again, the button is now pressed only 1% of the time, so it seems best to press the button iff the asteroid strikes, and thus send the strike probability to 1%. Etcetera.
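A toy iteration of this back-and-forth, under the additional made-up assumption that the strike is also somewhat bad in button-pressed worlds (this is what makes the preferred commitment flip as the press probability flips):

```python
# Toy iteration of the "find a stable action" rule on the asteroid example.
# Actions are commitments correlating the button with the strike. We assume,
# for illustration, that the strike is very bad under U_X and somewhat bad
# under U_Y; all numbers and the two-action menu are made up.

P_ASTEROID = 0.99
U_X_STRIKE, U_X_CLEAR = -1000.0, 0.0   # U_X values human lives
U_Y_STRIKE, U_Y_CLEAR = -500.0, 0.0    # U_Y assumed to mildly disprefer the strike

# For each commitment: (P(strike | not pressed), P(strike | pressed),
#                       actual P(pressed) the commitment implies).
ACTIONS = {
    "press iff strike":    (0.0, 1.0, P_ASTEROID),
    "press iff no strike": (1.0, 0.0, 1 - P_ASTEROID),
}

def score(action, p_press_fixed):
    s_not_pressed, s_pressed, _ = ACTIONS[action]
    e_ux = s_not_pressed * U_X_STRIKE + (1 - s_not_pressed) * U_X_CLEAR
    e_uy = s_pressed * U_Y_STRIKE + (1 - s_pressed) * U_Y_CLEAR
    return (1 - p_press_fixed) * e_ux + p_press_fixed * e_uy

p_press = 0.10
for _ in range(6):
    best = max(ACTIONS, key=lambda a: score(a, p_press))
    p_press = ACTIONS[best][2]   # re-evaluate P(pressed) under the chosen commitment
    print(best, round(p_press, 2))
# The choice alternates forever between the two commitments; no action is stable.
```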
The actual form of Olah and Taylor's proposal was to treat the conditioning on the button press, inside each branch of the calculation, as a Judea-Pearl-style causal counterfactual (a $\operatorname{do}()$ operation):

$$a_0 \in \operatorname{arg\,max}_{a \in \mathcal A} \Big( \mathbb E[U_X \mid a, \operatorname{do}(\neg b)] \cdot \mathbb P(\neg b \mid a_0) \;+\; \mathbb E[U_Y \mid a, \operatorname{do}(b)] \cdot \mathbb P(b \mid a_0) \Big)$$
The intent is that, by severing the causal parents of $b$, the agent would evaluate that there was no point in trying to correlate the asteroid strike with the button press.
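A toy contrast between conditioning on the button and intervening on it, in a miniature causal model (the agent's commitment makes the strike a causal parent of the press) that is our own illustrative assumption rather than Olah and Taylor's construction:

```python
# Toy contrast between conditioning on the button and intervening on it (do()),
# in a world where the agent has committed to "press iff strike". The tiny
# causal model here is an illustrative assumption.

P_ASTEROID = 0.99   # the strike is causally upstream of the button press

def p_strike_conditioned_on(button_pressed):
    # Ordinary conditioning respects the commitment's correlation:
    # pressed worlds are exactly the strike worlds.
    return 1.0 if button_pressed else 0.0

def p_strike_under_do(button_pressed):
    # Intervening on the button severs its causal parents, so the strike keeps
    # its prior probability regardless of the (forced) button state.
    return P_ASTEROID

print(p_strike_conditioned_on(False))  # 0.0  -- the "outcome pump" reading
print(p_strike_under_do(False))        # 0.99 -- no point correlating the strike with the button
```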
We're still trying to figure out exactly what would happen to an agent reasoning like this, but the current suspicion is that it might misbehave on account of attaching probability 0 or very low probability to the actual world, because it can no longer take into account the correct correlation between $b$ and other variables.
See http://agentfoundations.org/item?id=735.