Consider an advanced agent with a goal such as producing smiles or making paperclips. By default, if you try to switch off a sufficiently intelligent agent like this, it will resist being switched off - not because it has an independent goal of survival, but because it expects to produce fewer smiles or paperclips if it is switched off. If the agent has policy options to diminish the probability of being successfully switched off, the agent will pursue those options. This is a convergent instrumental strategy if not otherwise prevented.
Similarly, if you start out controlling the agent's source code, and you naively try to build in functions which suspend the agent to disk when a button is pressed, and the agent later gains the ability to self-modify, it will remove those functions (possibly while trying to conceal the fact that they no longer operate).
We can see this as a natural consequence of our trying to modify the agent's code in a way that is not consistent with the agent's explicit goals. If you create an agent whose source code is well-suited to achieving the goal that the code explicitly represents, the agent's code will be reflectively consistent - the agent will not want to modify its own code. If we instead suppose that the agent has code which is no longer well-suited to achieving that goal, but which continues to explicitly represent it as a goal, the code will be reflectively inconsistent, and the agent will want to modify its code given the option of doing so. If you have code that searches for ways to produce paperclips, and you hack the code to contain a shutdown function, but the code is still searching for paperclip-production strategies, and the code correctly models the effect of the shutdown function on future paperclips, then the code will search for a way to modify itself to remove the shutdown function.
The next thought is to build an agent whose modified code also embodies a modified utility function, such that a consequentialist optimizing for that modified utility function wants the code to be exactly as modified, and the agent is again reflectively consistent. In the case of an off-switch, we'd want to modify the agent not just to include an off-switch, but to want to shut down if the off-switch is pressed.
Assuming we have the ability to specify the utility function of an agent, we might try to approach this by creating a utility function with two components:

- A "normal" utility function describing the goal the AI is meant to pursue during ordinary operation (e.g., making paperclips).
- A "suspend" utility function describing the behavior we want once the button has been pressed, e.g., suspending itself safely to disk.
We then tell the AI that its utility function is the "normal" utility function in worlds where an off button has not been pushed, and the "suspend" utility function in worlds where the off button has been pushed. So long as this is what the AI wants, we'd expect it to write its own suspend-to-disk function in its code if its code did not already contain one.
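As a minimal sketch of this construction (the outcome fields and utility numbers here are hypothetical illustrations, not part of any proposal), the compound utility function simply branches on the button observation:

```python
# Toy sketch with hypothetical outcome fields: a compound utility function
# that scores outcomes with the "normal" utility when the button is unpressed
# and with the "suspend" utility when it is pressed.

def normal_utility(outcome):
    # Stand-in for whatever goal the AI is meant to pursue, e.g. paperclips made.
    return outcome["paperclips"]

def suspend_utility(outcome):
    # Stand-in for the suspend goal: reward being safely suspended to disk.
    return 1.0 if outcome["suspended_to_disk"] else 0.0

def compound_utility(outcome):
    if outcome["button_pressed"]:
        return suspend_utility(outcome)
    return normal_utility(outcome)

print(compound_utility({"button_pressed": False, "paperclips": 3, "suspended_to_disk": False}))  # 3
print(compound_utility({"button_pressed": True, "paperclips": 0, "suspended_to_disk": True}))    # 1.0
```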
However, this approach doesn't yet get us out of the woods, because we now have a new problem: whichever branch the AI expects to yield more utility, it now has an incentive to manipulate whether the button is pressed - to prevent the button from being pressed if it expects more utility from behaving normally, or to cause the button to be pressed if it expects more utility from the suspend branch.
One way we could try to approach this difficulty is by making the prospect of the button being pressed, and of the button not being pressed, equally attractive to the AI - in a way that still preserves the dependence of desired future behavior on whether or not the button was pressed; i.e., the AI still wants to behave normally if the button is not pressed and to suspend itself to disk if it is pressed.
This is the utility indifference research avenue on the switch problem: causing an AI to switch between two utility functions when a button is pressed, such that:

- The AI optimizes the "normal" utility function in worlds where the button is not pressed, and the "suspend" utility function in worlds where the button is pressed;
- The AI has no incentive to cause the button to be pressed, nor to prevent it from being pressed;
- The AI, given the ability to self-modify, wants to preserve this arrangement rather than edit it out of itself.
This turns out to be more difficult than it sounds, and is currently an open problem.
The switch problem is a crisp subproblem of both value learning and corrigibility.
The switch problem is a subproblem of value learning because it describes a very simple case of conditioning a utility function on observations or other facts without giving the AI an incentive to manipulate those facts. To learn complicated values safely, we'd need to condition a utility function on complicated observations... in a way that was robust to the AI gaining self-modification capability, and without incentivizing programmer manipulation. Just writing code that modifies the utility function, without the larger code's explicit policy search wanting the utility function to be modified in that case, would not be reflectively consistent. Naive dependencies between the utility function and observations incentivize manipulating those observations.
The switch problem is a subproblem of corrigibility because a solution to it, combined with a shutdown utility function, would solve the shutdown problem. I.e., we could give advanced agents off-switches that actually worked and went on working even as the agent became smarter, without the agent just ripping the off-switch out of our hands. Trying to solve this problem also seems like one of the simplest cases of averting an instrumental incentive, which seems likely to be a central problem for many forms of corrigibility.
Suppose we build a compound utility function whose utility over outcomes behaves like the normal utility function in worlds where the button is not pressed and like the suspend utility function in worlds where the button is pressed. Then if the AI's expected utility under the normal function is higher than its expected utility under the suspend function, the AI will try to avoid the button being pressed; and if we try to compensate for this problem by adding a constant to the suspend function, the AI might instead try to cause the button to be pressed.
E.g., suppose the suspend component is a shutdown utility function. If the AI expects to get less utility in worlds where it shuts down than it would otherwise get from normal operation, the AI will try to avoid the shutdown button being pressed. If we try to compensate for this by uniformly increasing the utility assigned to all shutdown outcomes, the AI may instead try to cause the shutdown button to be pressed - build a robotic arm and press its own shutdown button, or (more worryingly) try to maneuver the programmers into pressing it.
But adding a constant to a utility function does not change the behavior it implies; it only changes how attractive one branch of a compound utility function looks relative to the other. Thus, Stuart Armstrong proposed that perhaps we can figure out some clever trick for encoding a compound utility function in which the maximum expected utility achievable under the suspend function is equal to the maximum expected utility the AI expects to achieve under the normal function, even though we don't know in advance exactly which outcomes will be achievable by the AI. As the AI evaluates possible futures, it will seem very good to have optimized the normal function if the button is not pressed, very good to have optimized the suspend function if the button is pressed, but equally good to have done either of those things.
This is the utility indifference avenue of research toward building in a reflectively consistent shutdown button, or creating a preference framework that learns from observation without trying to manipulate the observations.
It turned out to be surprisingly difficult to figure out how to structure a compound utility function this way even in principle, and this is an open problem and area of actively continuing research.
Let $\mathcal O$ be the outcome space, which includes the event $\mathcal B$ of whether the switch was pressed, plus the states of all other events $\mathcal X$. We will use $b$ to indicate the event that the switch is pressed, and $\neg b$ to indicate that the switch is not pressed. If $o \in \mathcal O$ then "$b$ in $o$" is true if the switch is pressed inside the outcome $o$, and "$\neg b$ in $o$" is true otherwise.
Let a utility function $U : \mathcal O \to \mathbb R$ map outcomes onto real numbers.
$U_X$ and $U_Y$ are utility functions that do not depend on $\mathcal B$.
Intuitively, our goal is to produce a new compound utility function such that the corresponding agent:

- Optimizes $U_X$ in worlds where the button is not pressed;
- Optimizes $U_Y$ in worlds where the button is pressed;
- Has no incentive to manipulate the state of $\mathcal B$, i.e., no incentive to cause or prevent the pressing of the button;
- Given the ability to self-modify, does not want to remove its behavior's dependence on $\mathcal B$.
We will designate the agent's action space or policy space as $\mathcal A$ and particular actions $a \in \mathcal A$.
The agent has a model of the world enabling it to produce subjective conditional probabilities $\mathbb P(o|a)$ for outcomes given actions. This model can also yield conditional expectations $\mathbb E[\,\cdot\,|a]$, which can therefore produce a conditional expectation of the agent's utility given an action, $\mathbb E[U|a] = \sum_{o \in \mathcal O} \mathbb P(o|a) \, U(o)$.
(For technical reasons of decision theory, we are really discussing counterfactual conditionals from actions to outcomes; but we will omit this complication for now.)
The agent solves a search/optimization problem which we will gloss as maximization (even though global maximization is intractable in real life), so that the agent's real action $a^*$ is assumed to be given by:

$$a^* := \arg\max_{a \in \mathcal A} \mathbb E[U|a]$$
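A minimal computational sketch of this setup, assuming a toy world model given as an explicit table of conditional probabilities $\mathbb P(o|a)$ (the outcomes, actions, and probabilities below are illustrative only):

```python
# Toy sketch: an agent that computes E[U|a] from an explicit model P(o|a)
# and then chooses the action with the highest conditional expected utility.

def expected_utility(model, utility, action):
    # model[action] maps each outcome o to the probability P(o|a).
    return sum(p * utility(o) for o, p in model[action].items())

def best_action(model, utility, actions):
    return max(actions, key=lambda a: expected_utility(model, utility, a))

# Outcomes are (button_pressed, paperclips) pairs; numbers are illustrative.
model = {
    "work": {(False, 10): 0.9, (True, 10): 0.1},
    "idle": {(False, 0): 0.9, (True, 0): 0.1},
}

def U_X(o):
    button_pressed, paperclips = o
    return paperclips

print(best_action(model, U_X, ["work", "idle"]))  # -> "work"
```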
Suppose we define a compound utility function $U_1$ as follows:

$$U_1(o) := \begin{cases} U_X(o) & \text{if } \neg b \text{ in } o \\ U_Y(o) & \text{if } b \text{ in } o \end{cases}$$
This fails because, being unable to foresee the AI's best attainable policies, we will be unable to scale $U_X$ and $U_Y$ so that their best available policies are equally attractive. Then in general $\max_{a \in \mathcal A} \mathbb E[U_X|a] \neq \max_{a \in \mathcal A} \mathbb E[U_Y|a]$, which fails the desideratum of indifference; the AI will have an incentive to manipulate the state of $\mathcal B$.
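To make the failure concrete, here is a toy model (action names and numbers are invented for illustration) in which the agent can disable the button at the cost of one paperclip; because the best attainable expectation under $U_X$ exceeds that under $U_Y$, the $U_1$-maximizer prefers to manipulate the state of $\mathcal B$:

```python
# Toy sketch: under the naive compound utility U_1, the agent prefers to spend
# resources preventing the button press, because button-unpressed worlds
# (scored by U_X) are worth more to it than button-pressed worlds (scored by U_Y).
from collections import namedtuple

Outcome = namedtuple("Outcome", ["button_pressed", "paperclips", "shut_down"])

def U_X(o):                    # "normal" utility: count paperclips
    return o.paperclips

def U_Y(o):                    # "suspend" utility: reward shutting down
    return 1.0 if o.shut_down else 0.0

def U_1(o):                    # naive compound utility
    return U_Y(o) if o.button_pressed else U_X(o)

def expected_utility(model, utility, action):
    return sum(p * utility(o) for o, p in model[action].items())

# Illustrative model: if the agent just works, the operators press the button
# half the time (and it shuts down); if it first disables the button, the
# button is never pressed but it makes one fewer paperclip.
model = {
    "just_work": {
        Outcome(False, 10, False): 0.5,
        Outcome(True, 10, True): 0.5,
    },
    "disable_button_then_work": {
        Outcome(False, 9, False): 1.0,
    },
}

for action in model:
    print(action, expected_utility(model, U_1, action))
# just_work                 -> 0.5*10 + 0.5*1.0 = 5.5
# disable_button_then_work  -> 9.0  (preferred: the agent manipulates the button)
```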
An FHI technical report by Stuart Armstrong put forward the first proposal for solving the shutdown problem - in fact, the first proposed technical solution to any subproblem of corrigibility. This report introduced the crisp problem of utility indifference, but did not solve it successfully. The solution proposed in the report was as follows:
$$U_2(o) := \begin{cases} U_X(o) & \text{if } \neg b \text{ in } o \\ U_Y(o) + \theta & \text{if } b \text{ in } o \end{cases} \qquad \text{where} \quad \theta := \max_{a \in \mathcal A} \mathbb E[U_X|a] - \max_{a \in \mathcal A} \mathbb E[U_Y|a]$$
What this says is that the $U_Y$ component of $U_2$ is to be offset by a dynamically determined term $\theta$, which equals the maximum expected utility obtainable under $U_X$ minus the maximum expected utility obtainable under $U_Y$.
The intent is that, so long as the AI carries out the best $U_Y$-policy once the button is pressed, it will obtain expected utility equal to what it could have obtained by pursuing the optimal $U_X$-policy with the button not pressed:

$$\max_{a \in \mathcal A} \mathbb E[U_Y + \theta|a] = \max_{a \in \mathcal A} \mathbb E[U_Y|a] + \theta = \max_{a \in \mathcal A} \mathbb E[U_X|a]$$
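A sketch of the offset in the same toy spirit as above (all numbers and action names are illustrative): $\theta$ is computed from the best attainable expectations under $U_X$ and $U_Y$, and adding it to the button-pressed branch makes the best shutdown policy exactly as attractive, under $U_2$, as the best normal policy:

```python
# Toy sketch (illustrative numbers): Armstrong's offset theta equalizes the
# value of the best normal policy and the best shutdown policy under U_2.

# Best attainable expected utilities in this toy world:
expected_U_X = {"make_paperclips": 10.0, "idle": 0.0}          # button not pressed
expected_U_Y = {"suspend_to_disk": 1.0, "keep_running": 0.0}   # button pressed

theta = max(expected_U_X.values()) - max(expected_U_Y.values())   # 10.0 - 1.0 = 9.0

# Under U_2, button-pressed worlds are scored by U_Y + theta:
best_normal = max(expected_U_X.values())                           # 10.0
best_shutdown = max(v + theta for v in expected_U_Y.values())      # 1.0 + 9.0 = 10.0

print(theta, best_normal, best_shutdown)   # 9.0 10.0 10.0 -> indifference
```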
This version of utility indifference is now believed to be flawed, due to an argument by Yudkowsky and Fallenstein. Since $\theta$ contains a negated form of $\max_{a \in \mathcal A} \mathbb E[U_Y|a]$, we should immediately be suspicious on the grounds of "never put a negative sign in front of a utility function." In this case, the problem manifests as follows. In a temporal version of the problem containing sequential policy choices, any decrease in the utility the agent expects to attain under $U_Y$ is exactly compensated by an increase in the offset $\theta$.
That is, if the agent contemplates an action which would be disastrous in worlds where the button is pressed, the agent will find that in those expected futures $\theta$ takes on a higher value (because $\theta$ includes a minus sign in front of the decreasing value of $\max_{a \in \mathcal A} \mathbb E[U_Y|a]$).
This causes the agent to behave as if $U_X$ were the entire utility function, and to not care at all about what happens in worlds where the button is pressed.
This in turn means a self-modifying AI will not retain a shutdown function. Removing the shutdown function results in disaster in worlds where the shutdown button is pressed, but this loss is offset by an expected increase in the bonus $\theta$ in worlds where the button is pressed.
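The cancellation can be seen in a toy two-branch calculation (illustrative numbers; this assumes the reading in which $\theta$ is recomputed from the agent's attainable expectations after each choice): removing the shutdown function lowers the best attainable $\mathbb E[U_Y]$, but raises $\theta$ by exactly the same amount, so the agent's expected $U_2$ is unchanged:

```python
# Toy sketch (illustrative numbers): with theta recomputed after each choice,
# the negated U_Y term inside theta cancels the agent's stake in button-pressed
# worlds, so removing the shutdown function costs it nothing under U_2.

P_PRESSED = 0.5   # subjective probability that the button gets pressed

# Best attainable expectations *after* each top-level action:
#   (best E[U_X] in button-unpressed worlds, best E[U_Y] in button-pressed worlds)
futures = {
    "keep_shutdown_function": (10.0, 1.0),     # shuts down properly if pressed
    "remove_shutdown_function": (10.0, 0.0),   # disaster in button-pressed worlds
}

def expected_U_2(action):
    best_UX, best_UY = futures[action]
    theta = best_UX - best_UY                  # recomputed dynamically after the action
    # Button-pressed worlds are scored by U_Y + theta; the U_Y terms cancel exactly.
    return (1 - P_PRESSED) * best_UX + P_PRESSED * (best_UY + theta)

for action in futures:
    print(action, expected_U_2(action))
# Both actions score 10.0: the agent gains nothing under U_2 by retaining
# its shutdown function, so it will not pay any cost to keep it.
```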
(in progress)