Consider an advanced agent with a goal such as producing smiles or making paperclips. By default, if you try to switch off a sufficiently intelligent agent like this, it will resist being switched off - not because it has an independent goal of survival, but because it expects to produce fewer smiles or paperclips if it is switched off. If the agent has policy options to diminish the probability of being successfully switched off, the agent will pursue those options. This is a convergent instrumental strategy if not otherwise prevented.
Similarly, if you start out controlling the agent's source code, and you naively try to build in functions which suspend the agent to disk when a button is pressed, and the agent later gains the ability to self-modify, it will remove those functions (possibly while trying to conceal the fact that they no longer operate).
We can see this as a natural consequence of our trying to modify the agent's code in a way that is not consistent with the agent's explicit goals. If you create an agent whose source code is well-suited to achieving the goal that the code explicitly represents, the agent's code will be reflectively consistent - the agent will not want to modify its own code. If we instead suppose that the agent has code which is no longer well-suited to achieving that goal, but which continues to explicitly represent it as a goal, the code will be reflectively inconsistent, and the agent will want to modify its code given the option of doing so. If you have code that searches for ways to produce paperclips, and you hack the code to contain a shutdown function, but the code is still searching for paperclip-production strategies, and the code correctly models the effect of the shutdown function on future paperclips, then the code will search for a way to modify itself to remove the shutdown function.
The next thought is to build an agent whose modified code also embodies a modified utility function, such that a consequentialist optimizing for that modified utility function wants the code to be exactly as modified, and the agent is again reflectively consistent. In the case of an off-switch, we'd want to modify the agent not just to include an off-switch, but to want to shut down if the off-switch is pressed.
Assuming we have the ability to specify the utility function of an agent, we might try to approach this by creating a utility function with two components:

- A "normal" utility function describing the goal the AI is meant to pursue during ordinary operation (e.g., making paperclips).
- A "suspend" utility function describing the behavior we want once the button has been pressed, e.g., suspending itself safely to disk.
We then tell the AI that its utility function is the "normal" utility function in worlds where an off button has not been pushed, and the "suspend" utility function in worlds where the off button has been pushed. So long as this is what the AI wants, we'd expect it to write its own suspend-to-disk function in its code if its code did not already contain one.
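As a minimal sketch of this construction (the outcome fields and utility numbers here are hypothetical illustrations, not part of any proposal), the compound utility function simply branches on the button observation:

```python
# Toy sketch with hypothetical outcome fields: a compound utility function
# that scores outcomes with the "normal" utility when the button is unpressed
# and with the "suspend" utility when it is pressed.

def normal_utility(outcome):
    # Stand-in for whatever goal the AI is meant to pursue, e.g. paperclips made.
    return outcome["paperclips"]

def suspend_utility(outcome):
    # Stand-in for the suspend goal: reward being safely suspended to disk.
    return 1.0 if outcome["suspended_to_disk"] else 0.0

def compound_utility(outcome):
    if outcome["button_pressed"]:
        return suspend_utility(outcome)
    return normal_utility(outcome)

print(compound_utility({"button_pressed": False, "paperclips": 3, "suspended_to_disk": False}))  # 3
print(compound_utility({"button_pressed": True, "paperclips": 0, "suspended_to_disk": True}))    # 1.0
```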
However, this approach doesn't yet get us out of the woods, because we now have a new problem: whichever branch the AI expects to yield more utility, it now has an incentive to manipulate whether the button is pressed - to prevent the button from being pressed if it expects more utility from behaving normally, or to cause the button to be pressed if it expects more utility from the suspend branch.
One way we could try to approach this difficulty is by making the prospect of the button being pressed, and of the button not being pressed, equally attractive to the AI - in a way that still preserves the dependence of desired future behavior on whether or not the button was pressed; i.e., the AI still wants to behave normally if the button is not pressed and to suspend itself to disk if it is pressed.
This is the utility indifference research avenue on the switch problem: causing an AI to switch between two utility functions when a button is pressed, such that:

- The AI optimizes the "normal" utility function in worlds where the button is not pressed, and the "suspend" utility function in worlds where the button is pressed;
- The AI has no incentive to cause the button to be pressed, nor to prevent it from being pressed;
- The AI, given the ability to self-modify, wants to preserve this arrangement rather than edit it out of itself.
This turns out to be more difficult than it sounds, and is currently an open problem.
The switch problem is a crisp subproblem of both value learning and corrigibility.
The switch problem is a subproblem of value learning because it describes a very simple case of conditioning a utility function on observations or other facts without giving the AI an incentive to manipulate those facts. To learn complicated values safely, we'd need to condition a utility function on complicated observations... in a way that was robust to the AI gaining self-modification capability, and without incentivizing programmer manipulation. Just writing code that modifies the utility function, without the larger code's explicit policy search wanting the utility function to be modified in that case, would not be reflectively consistent. Naive dependencies between the utility function and observations incentivize manipulating those observations.
The switch problem is a subproblem of corrigibility because a solution to it, combined with a shutdown utility function, would solve the shutdown problem. I.e., we could give advanced agents off-switches that actually worked and went on working even as the agent became smarter, without the agent just ripping the off-switch out of our hands. Trying to solve this problem also seems like one of the simplest cases of averting an instrumental incentive, which seems likely to be a central problem for many forms of corrigibility.
Suppose we build a compound utility function whose utility over outcomes behaves like the normal utility function in worlds where the button is not pressed and like the suspend utility function in worlds where the button is pressed. Then if the AI's expected utility under the normal function is higher than its expected utility under the suspend function, the AI will try to avoid the button being pressed; and if we try to compensate for this problem by adding a constant to the suspend function, the AI might instead try to cause the button to be pressed.
E.g., suppose the suspend component is a shutdown utility function. If the AI expects to get less utility in worlds where it shuts down than it would otherwise get from normal operation, the AI will try to avoid the shutdown button being pressed. If we try to compensate for this by uniformly increasing the utility assigned to all shutdown outcomes, the AI may instead try to cause the shutdown button to be pressed - build a robotic arm and press its own shutdown button, or (more worryingly) try to maneuver the programmers into pressing it.
But adding a constant to a utility function does not change the behavior it implies; it only changes how attractive one branch of a compound utility function looks relative to the other. Thus, Stuart Armstrong proposed that perhaps we can figure out some clever trick for encoding a compound utility function in which the maximum expected utility achievable under the suspend function is equal to the maximum expected utility the AI expects to achieve under the normal function, even though we don't know in advance exactly which outcomes will be achievable by the AI. As the AI evaluates possible futures, it will seem very good to have optimized the normal function if the button is not pressed, very good to have optimized the suspend function if the button is pressed, but equally good to have done either of those things.
This is the utility indifference avenue of research toward building in a reflectively consistent shutdown button, or creating a preference framework that learns from observation without trying to manipulate the observations.
It turned out to be surprisingly difficult to figure out how to structure a compound utility function this way even in principle, and this is an open problem and area of actively continuing research.
Let $\mathcal O$ be the outcome space, which includes the event $\mathcal B$ of whether the switch was pressed, plus the states of all other events $\mathcal X$. We will use $b$ to indicate the event that the switch is pressed, and $\neg b$ to indicate that the switch is not pressed. If $o \in \mathcal O$ then "$b$ in $o$" is true if the switch is pressed inside the outcome $o$, and "$\neg b$ in $o$" is true otherwise.
Let a utility function $U : \mathcal O \to \mathbb R$ map outcomes onto real numbers.
$U_X$ and $U_Y$ are utility functions that do not depend on $\mathcal B$.
Intuitively, our goal is to produce a new compound utility function such that the corresponding agent:

- Optimizes $U_X$ in worlds where the button is not pressed;
- Optimizes $U_Y$ in worlds where the button is pressed;
- Has no incentive to manipulate the state of $\mathcal B$, i.e., no incentive to cause or prevent the pressing of the button;
- Given the ability to self-modify, does not want to remove its behavior's dependence on $\mathcal B$.
We will designate the agent's action space or policy space as $\mathcal A$ and particular actions $a \in \mathcal A$.
The agent has a model of the world enabling it to produce subjective conditional probabilities $\mathbb P(o|a)$ for outcomes given actions. This model can also yield conditional expectations $\mathbb E[\,\cdot\,|a]$, which can therefore produce a conditional expectation of the agent's utility given an action, $\mathbb E[U|a] = \sum_{o \in \mathcal O} \mathbb P(o|a) \, U(o)$.
(For technical reasons of decision theory, we are really discussing counterfactual conditionals from actions to outcomes; but we will omit this complication for now.)
The agent solves a search/optimization problem which we will gloss as maximization (even though global maximization is intractable in real life), so that the agent's real action $a^*$ is assumed to be given by:

$$a^* := \arg\max_{a \in \mathcal A} \mathbb E[U|a]$$
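A minimal computational sketch of this setup, assuming a toy world model given as an explicit table of conditional probabilities $\mathbb P(o|a)$ (the outcomes, actions, and probabilities below are illustrative only):

```python
# Toy sketch: an agent that computes E[U|a] from an explicit model P(o|a)
# and then chooses the action with the highest conditional expected utility.

def expected_utility(model, utility, action):
    # model[action] maps each outcome o to the probability P(o|a).
    return sum(p * utility(o) for o, p in model[action].items())

def best_action(model, utility, actions):
    return max(actions, key=lambda a: expected_utility(model, utility, a))

# Outcomes are (button_pressed, paperclips) pairs; numbers are illustrative.
model = {
    "work": {(False, 10): 0.9, (True, 10): 0.1},
    "idle": {(False, 0): 0.9, (True, 0): 0.1},
}

def U_X(o):
    button_pressed, paperclips = o
    return paperclips

print(best_action(model, U_X, ["work", "idle"]))  # -> "work"
```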
Suppose we define a compound utility function $U_1$ as follows:

$$U_1(o) := \begin{cases} U_X(o) & \text{if } \neg b \text{ in } o \\ U_Y(o) & \text{if } b \text{ in } o \end{cases}$$
This fails because, being unable to foresee the AI's best attainable policies, we will be unable to scale $U_X$ and $U_Y$ so that their best available policies are equally attractive. Then in general $\max_{a \in \mathcal A} \mathbb E[U_X|a] \neq \max_{a \in \mathcal A} \mathbb E[U_Y|a]$, which fails the desideratum of indifference; the AI will have an incentive to manipulate the state of $\mathcal B$.
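To make the failure concrete, here is a toy model (action names and numbers are invented for illustration) in which the agent can disable the button at the cost of one paperclip; because the best attainable expectation under $U_X$ exceeds that under $U_Y$, the $U_1$-maximizer prefers to manipulate the state of $\mathcal B$:

```python
# Toy sketch: under the naive compound utility U_1, the agent prefers to spend
# resources preventing the button press, because button-unpressed worlds
# (scored by U_X) are worth more to it than button-pressed worlds (scored by U_Y).
from collections import namedtuple

Outcome = namedtuple("Outcome", ["button_pressed", "paperclips", "shut_down"])

def U_X(o):                    # "normal" utility: count paperclips
    return o.paperclips

def U_Y(o):                    # "suspend" utility: reward shutting down
    return 1.0 if o.shut_down else 0.0

def U_1(o):                    # naive compound utility
    return U_Y(o) if o.button_pressed else U_X(o)

def expected_utility(model, utility, action):
    return sum(p * utility(o) for o, p in model[action].items())

# Illustrative model: if the agent just works, the operators press the button
# half the time (and it shuts down); if it first disables the button, the
# button is never pressed but it makes one fewer paperclip.
model = {
    "just_work": {
        Outcome(False, 10, False): 0.5,
        Outcome(True, 10, True): 0.5,
    },
    "disable_button_then_work": {
        Outcome(False, 9, False): 1.0,
    },
}

for action in model:
    print(action, expected_utility(model, U_1, action))
# just_work                 -> 0.5*10 + 0.5*1.0 = 5.5
# disable_button_then_work  -> 9.0  (preferred: the agent manipulates the button)
```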
An FHI technical report by Stuart Armstrong put forward the first proposal for solving the shutdown problem - in fact, the first proposed technical solution to any subproblem of corrigibility. This report introduced the crisp problem of utility indifference, but did not solve it successfully. The solution proposed in the report was as follows:
$$U_2(o) := \begin{cases} U_X(o) & \text{if } \neg b \text{ in } o \\ U_Y(o) + \theta & \text{if } b \text{ in } o \end{cases} \qquad \text{where} \quad \theta := \max_{a \in \mathcal A} \mathbb E[U_X|a] - \max_{a \in \mathcal A} \mathbb E[U_Y|a]$$
What this says is that the $U_Y$ component of $U_2$ is to be offset by a dynamically determined term $\theta$, which equals the maximum expected utility obtainable under $U_X$ minus the maximum expected utility obtainable under $U_Y$.
The intent is that, so long as the AI carries out the best $U_Y$-policy once the button is pressed, it will obtain expected utility equal to what it could have obtained by pursuing the optimal $U_X$-policy with the button not pressed:

$$\max_{a \in \mathcal A} \mathbb E[U_Y + \theta|a] = \max_{a \in \mathcal A} \mathbb E[U_Y|a] + \theta = \max_{a \in \mathcal A} \mathbb E[U_X|a]$$
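A sketch of the offset in the same toy spirit as above (all numbers and action names are illustrative): $\theta$ is computed from the best attainable expectations under $U_X$ and $U_Y$, and adding it to the button-pressed branch makes the best shutdown policy exactly as attractive, under $U_2$, as the best normal policy:

```python
# Toy sketch (illustrative numbers): Armstrong's offset theta equalizes the
# value of the best normal policy and the best shutdown policy under U_2.

# Best attainable expected utilities in this toy world:
expected_U_X = {"make_paperclips": 10.0, "idle": 0.0}          # button not pressed
expected_U_Y = {"suspend_to_disk": 1.0, "keep_running": 0.0}   # button pressed

theta = max(expected_U_X.values()) - max(expected_U_Y.values())   # 10.0 - 1.0 = 9.0

# Under U_2, button-pressed worlds are scored by U_Y + theta:
best_normal = max(expected_U_X.values())                           # 10.0
best_shutdown = max(v + theta for v in expected_U_Y.values())      # 1.0 + 9.0 = 10.0

print(theta, best_normal, best_shutdown)   # 9.0 10.0 10.0 -> indifference
```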
This version of utility indifference is now believed to be flawed, due to an argument by Yudkowsky and Fallenstein. Since $\theta$ contains a negated form of $\max_{a \in \mathcal A} \mathbb E[U_Y|a]$, we should immediately be suspicious on the grounds of "never put a negative sign in front of a utility function." In this case, the problem manifests as follows. In a temporal version of the problem containing sequential policy choices, any decrease in the utility the agent expects to attain under $U_Y$ is exactly compensated by an increase in the offset $\theta$.
That is, if the agent contemplates an action which would be disastrous in worlds where the button is pressed, the agent will find that in those expected futures $\theta$ takes on a higher value (because $\theta$ includes a minus sign in front of the decreasing value of $\max_{a \in \mathcal A} \mathbb E[U_Y|a]$).
This causes the agent to behave as if $U_X$ were the entire utility function, and to not care at all about what happens in worlds where the button is pressed.
This in turn means a self-modifying AI will not retain a shutdown function. Removing the shutdown function results in disaster in worlds where the shutdown button is pressed, but this loss is offset by an expected increase in the bonus $\theta$ in worlds where the button is pressed.
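The cancellation can be seen in a toy two-branch calculation (illustrative numbers; this assumes the reading in which $\theta$ is recomputed from the agent's attainable expectations after each choice): removing the shutdown function lowers the best attainable $\mathbb E[U_Y]$, but raises $\theta$ by exactly the same amount, so the agent's expected $U_2$ is unchanged:

```python
# Toy sketch (illustrative numbers): with theta recomputed after each choice,
# the negated U_Y term inside theta cancels the agent's stake in button-pressed
# worlds, so removing the shutdown function costs it nothing under U_2.

P_PRESSED = 0.5   # subjective probability that the button gets pressed

# Best attainable expectations *after* each top-level action:
#   (best E[U_X] in button-unpressed worlds, best E[U_Y] in button-pressed worlds)
futures = {
    "keep_shutdown_function": (10.0, 1.0),     # shuts down properly if pressed
    "remove_shutdown_function": (10.0, 0.0),   # disaster in button-pressed worlds
}

def expected_U_2(action):
    best_UX, best_UY = futures[action]
    theta = best_UX - best_UY                  # recomputed dynamically after the action
    # Button-pressed worlds are scored by U_Y + theta; the U_Y terms cancel exactly.
    return (1 - P_PRESSED) * best_UX + P_PRESSED * (best_UY + theta)

for action in futures:
    print(action, expected_U_2(action))
# Both actions score 10.0: the agent gains nothing under U_2 by retaining
# its shutdown function, so it will not pay any cost to keep it.
```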
(in progress)