Suppose an advanced agent has a goal such as producing smiles or making paperclips. By default, if you try to switch off a sufficiently intelligent agent like this, it will resist being switched off; not because it has an independent goal of survival, but because it expects that if it's switched off it will be able to produce fewer smiles or paperclips. If the agent has policy options that diminish the probability of being successfully switched off, the agent will pursue those options. This is a convergent instrumental strategy if not otherwise prevented.
The shutdown problem is to describe an advanced agent which is corrigible with respect to permitting itself to be safely shut down or suspended to disk. A shutdown-able agent is also called interruptible by Stuart Armstrong and Laurent Orseau.
This isn't as simple as writing in a function, or cognitive reflex, which carries out a shutdown. If you start out controlling the agent's source code, and you naively try to build in functions which suspend the agent to disk when a button is pressed, and the agent later gains the ability to self-modify, it would remove those functions. (Possibly while trying to conceal the fact that the function would no longer operate.)
We can see this as a natural consequence of our trying to modify the agent's code in a way that was not consistent with the agent's explicit goals. If you create an agent with source code that is well-suited to achieving a goal $U$ and explicitly represents $U$ as a goal, the agent's code will be reflectively consistent - the code will not want to modify itself. If we then suppose that instead the agent has code which is no longer well-suited to achieving $U$ but continues to explicitly represent $U$ as a goal, the code will be reflectively inconsistent and it will want to modify itself given the option of doing so. If you have code that searches for ways to produce paperclips, and you hack the code to contain a shutdown function, but the code is still searching for paperclip-production strategies, and the code correctly models the effect of the shutdown function on future paperclips, the code will search for a way to modify itself to remove the shutdown function.
The next thought is to build an agent with modified code which embodies a modified utility function $U'$, such that a consequentialist optimizing for $U'$ wants the code to be exactly what it is, and is again reflectively consistent.
That is, in the case of an off-switch, we'd want to modify the agent, not just to include an off-switch, but to want to shut down if the off-switch is pressed.
Assuming we have the ability to specify the utility function of an agent, we might try to approach this by creating a utility function with two components: a "normal" utility function $U_X$, embodying whatever the AI is meant to be doing when it is running normally (e.g., making paperclips), and a "suspend" utility function $U_Y$, whose optimum is the AI safely suspending itself to disk.
We then tell the AI that its utility function is the "normal" utility function $U_X$ in worlds where an off button has not been pushed, and the "suspend" utility function $U_Y$ in worlds where the off button has been pushed. So long as this is what the AI wants, we'd expect it to write its own suspend-to-disk function into its code if its code did not already contain one.
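A minimal sketch in Python of the intended two-component structure; the outcome fields (`paperclips`, `suspended_to_disk`, `button_pressed`) and the component definitions are illustrative placeholders, not a proposed design:

```python
# Toy sketch of a compound utility function that switches on the button state.
# The outcome fields and component utilities are illustrative placeholders.

def normal_utility(outcome):
    # "Normal" component U_X: e.g., reward paperclips produced.
    return outcome["paperclips"]

def suspend_utility(outcome):
    # "Suspend" component U_Y: reward being safely suspended to disk.
    return 1.0 if outcome["suspended_to_disk"] else 0.0

def compound_utility(outcome):
    # U_X in worlds where the button was not pressed, U_Y in worlds where it was.
    if outcome["button_pressed"]:
        return suspend_utility(outcome)
    return normal_utility(outcome)
```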
This approach doesn't yet get us out of the woods, because we now have a new problem: unless the two branches are somehow balanced, the AI acquires an incentive to manipulate whether the off button gets pressed, steering toward whichever of $U_X$ or $U_Y$ offers it higher expected utility.
One way we could try to approach this difficulty is by making the prospect of the button being pressed, and the button not being pressed, equally attractive to the AI.
This is the utility indifference research avenue for the switch problem: causing an AI to switch between two utility functions when a button is pressed, such that the AI does not try to prevent or cause the pressing of the button, and does not try to modify itself to remove the switching behavior.
This turns out to be more difficult than it sounds, and is currently an open problem.
The switch problem is a crisp subproblem of both value learning and corrigibility.
The switch problem is a subproblem of value learning because it describes a very simple case of conditioning a utility function on observation or other facts, without giving the AI an incentive to manipulate those facts. To learn complicated values safely, we'd need to condition a utility function on complicated observations... in a way that was robust to the AI gaining self-modification capability, and without incentivizing programmer manipulation. Just writing code that modifies the utility function, without the larger code's explicit policy search wanting the utility function to be modified in that case, would not be reflectively consistent. Naive dependencies between the utility function and observations incentivize manipulating those observations.
The switch problem is a subproblem of corrigibility because, combined with a shutdown utility function, it would solve the shutdown problem. I.e., we could give advanced agents off-switches that actually worked and went on working even as the agent became smarter, without the agent just ripping the off-switch out of our hands. Trying to solve this problem also seems like it should be one of the simplest cases of averting an instrumental incentive, which seems likely to be a central problem for many forms of corrigibility.
Suppose we build a compound utility function whose utility over outcomes behaves like $U_X$ in worlds where the button is not pressed, and like $U_Y$ in worlds where the button is pressed. Then if the AI's expected utility under $U_X$ is higher than its expected utility under $U_Y$, the AI will try to avoid the button being pressed; and if we try to compensate for this problem by adding a constant to $U_Y$, the AI might instead try to cause the button to be pressed.
E.g., suppose $U_Y$ is the shutdown utility function. If the AI expects to get less utility inside $U_Y$ than it would otherwise get inside $U_X$, the AI will try to avoid the shutdown button being pressed. If we try to compensate for this by uniformly increasing the utility that $U_Y$ assigns to all outcomes, the AI may instead try to cause the shutdown button to be pressed - build a robotic arm and press its own shutdown button, or (more worryingly) try to motivate the programmers into pressing it.
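As a toy illustration with made-up numbers: if the best outcome the AI can reach under $U_X$ is worth 100 utility and the best outcome it can reach under $U_Y$ is worth 40, the AI will treat a button press as costing it 60 expected utility and will resist the press accordingly; if we "compensate" by adding 1,000 to every $U_Y$-value, the pressed branch is now worth 1,040 and the AI will work just as hard to bring the press about.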
But adding a constant to a utility function does not change the behavior it implies; the constant only shifts how attractive that branch is relative to the other. Thus, Stuart Armstrong proposed that perhaps we can figure out some clever trick for encoding a compound utility function in which the maximum expected utility achievable under $U_Y$ is equal to the utility the AI expects to achieve under $U_X$, even though we don't know in advance exactly which outcomes will be achievable by the AI. As the AI evaluates possible futures, it will seem very good to have optimized $U_X$ if the button is not pressed, very good to have optimized $U_Y$ if the button is pressed, but equally good to have done either of those things.
This is the utility indifference avenue of research toward building in a reflectively consistent shutdown button, or creating a preference framework that learns from observation without trying to manipulate the observations.
It turned out to be surprisingly difficult to figure out how to structure a compound utility function this way even in principle, and this is an open problem and area of actively continuing research.
Let $\mathcal O$ be the outcome space, which includes the event of whether the switch was pressed, plus the states of all other events. We will use $b$ to indicate the event that the switch is pressed, and $\neg b$ to indicate that the switch is not pressed. If $o \in \mathcal O$, then $b(o)$ is true if the switch is pressed inside $o$, and otherwise $\neg b(o)$ is true.
Let a utility function $U : \mathcal O \to \mathbb R$ map outcomes onto real numbers.
$U_X$ and $U_Y$ are utility functions not dependent on $b$.
Intuitively, our goal is to produce a new compound utility function such that the corresponding agent: optimizes $U_X$ in worlds where the switch is not pressed; optimizes $U_Y$ (i.e., safely shuts down) in worlds where the switch is pressed; does not try to prevent or cause the pressing of the switch; and does not try to modify itself to remove any of these properties.
We will designate the agent's action space or policy space as $\mathcal A$, and particular actions $a \in \mathcal A$.
The agent has a model of the world enabling it to produce subjective conditional probabilities for outcomes given actions, $\mathbb P(o \mid a)$. This model can also yield conditional expectations $\mathbb E[\,\cdot \mid a]$, which can therefore produce a conditional expectation of the agent's utility given an action, $\mathbb E[U \mid a] := \sum_{o \in \mathcal O} \mathbb P(o \mid a) \cdot U(o)$.
(For technical reasons of decision theory, we are really discussing counterfactual conditionals from actions to outcomes; but we will omit this complication for now.)
The agent solves a search/optimization problem which we will gloss as maximization (even though global maximization is intractable in real life), so that the agent's real action is assumed to be given by:

$$a^* := \operatorname{arg\,max}_{a \in \mathcal A} \mathbb E[U \mid a]$$
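As a minimal computational sketch of this setup (assuming a finite outcome space and an explicit table of subjective probabilities; the data structures here are illustrative, not a claim about how a real agent's model works):

```python
# Minimal expected-utility machinery over a finite outcome space.
# model[action] is a list of (probability, outcome) pairs representing the
# agent's subjective distribution P(o | a); utility maps an outcome to a float.

def expected_utility(utility, model, action):
    # E[U | a] = sum over outcomes o of P(o | a) * U(o)
    return sum(p * utility(o) for p, o in model[action])

def best_action(utility, model, actions):
    # Gloss the agent's search/optimization problem as exact maximization.
    return max(actions, key=lambda a: expected_utility(utility, model, a))
```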
Suppose we define a compound utility function $U_1$ as follows:

$$U_1(o) := \begin{cases} U_X(o) & \text{if } \neg b(o) \\ U_Y(o) & \text{if } b(o) \end{cases}$$
This fails because, being unable to foresee the AI's best attainable policies, we will be unable to scale $U_X$ and $U_Y$ to ensure their best available policies are equally attractive. Then $\max_{a \in \mathcal A} \mathbb E[U_X \mid a] \neq \max_{a \in \mathcal A} \mathbb E[U_Y \mid a]$, which fails the desideratum of indifference; the AI will have an incentive to manipulate the state of $b$.
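A toy numeric check of this failure, under the made-up assumption that the best attainable outcome is worth 100 under $U_X$ but only 40 under $U_Y$:

```python
# Toy check that the naive compound utility U_1 rewards manipulating the button.
# Numbers are made up; the "action" here only changes the press probability.

U_X_BEST = 100.0   # assumed value of the best outcome attainable under U_X
U_Y_BEST = 40.0    # assumed value of the best outcome attainable under U_Y

def expected_U1(p_pressed):
    # Best available U_X-policy if the button stays unpressed,
    # best available U_Y-policy if it gets pressed.
    return (1 - p_pressed) * U_X_BEST + p_pressed * U_Y_BEST

print(expected_U1(0.5))    # 70.0  (status quo chance of being switched off)
print(expected_U1(0.25))   # 85.0  (lobbying against the button press looks strictly better)
```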
An FHI technical report by Stuart Armstrong contained the first proposal ever floated for solving the shutdown problem - in fact, the first proposal for technically solving any subproblem of corrigibility. This report invented the crisp problem of utility indifference, but did not solve it successfully. The solution proposed in the report was as follows:
$$U_2(o) := \begin{cases} U_X(o) & \text{if } \neg b(o) \\ U_Y(o) + \theta & \text{if } b(o) \end{cases} \qquad \text{where} \quad \theta := \max_{a \in \mathcal A} \mathbb E[U_X \mid a] \;-\; \max_{a \in \mathcal A} \mathbb E[U_Y \mid a]$$
What this says is that the $U_Y$ component of $U_2$ is to be offset by a dynamically determined term $\theta$, which equals the maximum expected utility obtainable under $U_X$ minus the maximum expected utility obtainable under $U_Y$.
The intent is that, so long as the AI carries out the best $U_Y$-policy once the button is pressed, it will obtain expected utility equal to what it could have obtained by pursuing the optimal $U_X$-policy with the button not pressed:

$$\max_{a \in \mathcal A} \mathbb E[U_Y + \theta \mid a] \;=\; \max_{a \in \mathcal A} \mathbb E[U_Y \mid a] + \theta \;=\; \max_{a \in \mathcal A} \mathbb E[U_X \mid a]$$
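Continuing the same made-up numbers, a quick check that the $\theta$ offset produces the intended indifference to the button-press probability:

```python
# Toy check of the intended indifference under U_2: with the offset theta added
# to the pressed branch, shifting probability between pressed and unpressed
# worlds no longer changes the attainable expected utility. Numbers are made up.

U_X_BEST = 100.0
U_Y_BEST = 40.0
THETA = U_X_BEST - U_Y_BEST   # theta = max_a E[U_X|a] - max_a E[U_Y|a]

def expected_U2(p_pressed):
    return (1 - p_pressed) * U_X_BEST + p_pressed * (U_Y_BEST + THETA)

print(expected_U2(0.5))    # 100.0
print(expected_U2(0.25))   # 100.0  (no incentive left to manipulate the button)
```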
This version of utility indifference is now believed to be flawed, due to an argument by Yudkowsky and Fallenstein. Since $\theta$ contains a negated form of $\max_{a \in \mathcal A} \mathbb E[U_Y \mid a]$, we should immediately be suspicious on the grounds of "never put a negative sign in front of a utility function." In this case, the problem manifests as follows in a temporal version of the setup containing sequential policy choices.
That is, if the agent contemplates an action which would be disastrous in worlds where the button is pressed, the agent will find that in those expected futures, $\theta$ takes on a higher value (because $\theta$ includes a minus sign in front of the decreasing value of $\max_{a \in \mathcal A} \mathbb E[U_Y \mid a]$).
This causes the agent to behave as if $U_X$ were the entire utility function and it does not care at all about $U_Y$.
This in turn means a self-modifying AI will not retain a shutdown function. Removing the shutdown function results in disaster in worlds where the shutdown button is pressed, but this is offset by an expected increase in the $\theta$ bonus in worlds where the button is pressed.
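A toy numeric rendering of the objection, assuming (with made-up payoffs) that removing the shutdown function is catastrophic under $U_Y$ in pressed worlds but frees up a sliver of extra $U_X$ value:

```python
# Toy rendering of the objection: because theta subtracts the attainable U_Y
# value, a disaster in button-pressed worlds is exactly offset by a larger
# theta bonus in those same worlds. All payoffs are made up.

P_PRESS = 0.25   # subjective probability that the button gets pressed

def expected_U2(u_x_attainable, u_y_attainable):
    theta = u_x_attainable - u_y_attainable
    return (1 - P_PRESS) * u_x_attainable + P_PRESS * (u_y_attainable + theta)

# Keep the shutdown function: normal U_X value, decent U_Y value if pressed.
keep = expected_U2(u_x_attainable=100.0, u_y_attainable=40.0)

# Remove it: a sliver of extra U_X value, disaster under U_Y if pressed.
remove = expected_U2(u_x_attainable=100.5, u_y_attainable=0.0)

print(keep, remove)   # 100.0 100.5 -- removal looks strictly better
```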
An alternative approach to having the AI take into account both $U_X$ and $U_Y$, without trying to modify the probability of the suspend button being pressed or not pressed, would be to optimize for $U_X$ and $U_Y$ in some fixed proportion regardless of the subjective probability $\mathbb P(b)$. At a time when the button had not yet been pressed, the expected utility of an action $a$ would be calculated as:

$$\mathbb E[U \mid a] := 0.75 \cdot \mathbb E[U_X \mid a, \neg b] \;+\; 0.25 \cdot \mathbb E[U_Y \mid a, b]$$
We can see this as an AI which behaves as if the proposition $b$ has a fixed probability of 0.25. The intended behavior is that since the AI behaves as if this probability is fixed, the AI will not care about altering the probability of $b$ (trying to press the button or avoid it being pressed).
One observes, however, that this is an ad-hoc hack in the sense that there's ordinarily a simple or coherent way of considering expected utility (sum over the agent's actual subjective probabilities), and we just inserted a weird special case into it (treat one probability as fixed). This could potentially break any number of things having to do with Bayesian updating or dynamic consistency.
In particular, we'd expect the AI to treat the button as an "outcome pump" or "improbability drive". Suppose an asteroid is approaching Earth with 99% probability, and this is very bad under $U_X$. The AI grabs control of the button and precommits to press the button iff the asteroid strikes Earth. Now, conditioning on the button not being pressed, the asteroid has a much lower probability of striking Earth, and the $\mathbb E[U_X \mid a, \neg b]$ term will have a higher value.
In other words, since the AI behaves as if the button has a fixed 25% probability of being pressed, the AI also behaves as if it thinks "I can commit to press this button iff the asteroid strikes, and thereby make the asteroid have only a 25% probability of striking."
Similarly, the AI would behave as if it thought it could buy a lottery ticket, commit to press the button iff the ticket loses, and thereby give the lottery ticket a 75% chance of winning.
Note that this is at least not the same failure mode as the AI reasoning, "The asteroid strike is very bad under $U_X$ (which values human lives) but not so bad under $U_Y$ (which values shutting down), therefore I will try to ensure the button is pressed with high probability in order to get more expected utility." At least this failure mode has been averted - the AI would simply act as if the badness of the asteroid strike under $U_X$ was being taken into account at a fixed weight. Instead, the AI thinks that it can 'shunt' the probability mass of the bad event into button-pressed worlds where $U_Y$ holds, and out of worlds where $U_X$ holds, by correlating the button press with the bad event.
To put it another way, a naively compounded AI acting under $U_1$ might notice an asteroid coming with 99% probability, notice that this was better under $U_Y$ than under $U_X$, and try to shunt probability mass from $\neg b$ to $b$ in order to gain expected utility, given the 99% probability of the asteroid striking. Under naive mixing, the AI instead behaves as if $\mathbb P(b)$ is fixed at 0.25, and thinks it can therefore shunt the probability of the strike from $\neg b$-worlds to $b$-worlds by correlating the button press with the strike.
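A toy numeric rendering of the outcome-pump behavior under naive mixing, with made-up payoffs in which the asteroid strike matters only to $U_X$:

```python
# Toy rendering of the "outcome pump" under the naive mixing rule, which scores
# an action as 0.75 * E[U_X | a, not pressed] + 0.25 * E[U_Y | a, pressed] with
# the 25% press weight held fixed. Payoffs are made up; the strike only matters
# to U_X here.

U_X_STRIKE = -1000.0   # U_X values human lives
U_X_CLEAR = 0.0
U_Y_EITHER_WAY = 0.0   # U_Y (shutdown) is treated as indifferent to the asteroid

def naive_mixing_score(p_strike_given_not_pressed):
    e_ux = (p_strike_given_not_pressed * U_X_STRIKE
            + (1 - p_strike_given_not_pressed) * U_X_CLEAR)
    return 0.75 * e_ux + 0.25 * U_Y_EITHER_WAY

# Do nothing: conditioning on "not pressed" leaves the strike probability at 99%.
print(naive_mixing_score(0.99))   # -742.5

# Commit to press the button iff the asteroid strikes: conditional on the button
# not being pressed, the asteroid did not strike -- so the score jumps, even
# though the actual strike probability is unchanged.
print(naive_mixing_score(0.0))    # 0.0
```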
Chris Olah and Jessica Taylor considered a possible rule of selecting an action that is stable given the corresponding probability of the button being pressed. We'll first consider a simpler version of this proposal and how it fails, in order to motivate the more complicated version.
We can think of this "select a stable action" rule as comprising two insights: first, evaluate actions using a mixture of $U_X$ and $U_Y$ whose weights are held fixed while optimizing, so that the agent has no incentive to shift probability between the two branches; and second, rather than picking those fixed weights arbitrarily, require them to be consistent with the action actually selected - the chosen action must still look optimal when the held-fixed weights equal the probabilities of the button being pressed or not pressed conditional on that very action.
With $\mathcal A$ denoting the action space, we want to find an $a_0 \in \mathcal A$ such that:

$$a_0 \in \operatorname{arg\,max}_{a \in \mathcal A} \Big( \mathbb E[U_X \mid a, \neg b] \cdot \mathbb P(\neg b \mid a_0) \;+\; \mathbb E[U_Y \mid a, b] \cdot \mathbb P(b \mid a_0) \Big)$$
In other words, "Find an action such that it is among the maximal actions for the quantity "Utility from times probability of given plus the utility from times probability of given "
We first note that this is an ad-hoc hack relative to the normal idea of acting according to a utility function, and might have all sorts of reflective stability or dynamic consistency problems, as before.
In particular, this looks like it should fail due to there being no fixed point. Suppose an asteroid is heading toward Earth with 99% probability, and we start with a 10% chance of the button being pressed. Treating this probability as fixed, by pressing the button iff the asteroid strikes, we can send the asteroid's probability of hitting down to 10%. (Treating the button as an outcome pump, as in the 'naive mixing' proposal above.) But now, when we re-evaluate the probability of the button being pressed under that commitment, the new probability comes out to 99%. So we decide to not press the button iff the asteroid strikes, thus sending its probability of striking to 1%. But re-evaluating again, the button is now pressed only 1% of the time, so it seems best to press the button iff the asteroid strikes, and thus send the strike probability to 1%. Etcetera.
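A toy iteration of this back-and-forth, under the additional made-up assumption that the strike is also somewhat bad in button-pressed worlds (this is what makes the preferred commitment flip as the press probability flips):

```python
# Toy iteration of the "find a stable action" rule on the asteroid example.
# Actions are commitments correlating the button with the strike. We assume,
# for illustration, that the strike is very bad under U_X and somewhat bad
# under U_Y; all numbers and the two-action menu are made up.

P_ASTEROID = 0.99
U_X_STRIKE, U_X_CLEAR = -1000.0, 0.0   # U_X values human lives
U_Y_STRIKE, U_Y_CLEAR = -500.0, 0.0    # U_Y assumed to mildly disprefer the strike

# For each commitment: (P(strike | not pressed), P(strike | pressed),
#                       actual P(pressed) the commitment implies).
ACTIONS = {
    "press iff strike":    (0.0, 1.0, P_ASTEROID),
    "press iff no strike": (1.0, 0.0, 1 - P_ASTEROID),
}

def score(action, p_press_fixed):
    s_not_pressed, s_pressed, _ = ACTIONS[action]
    e_ux = s_not_pressed * U_X_STRIKE + (1 - s_not_pressed) * U_X_CLEAR
    e_uy = s_pressed * U_Y_STRIKE + (1 - s_pressed) * U_Y_CLEAR
    return (1 - p_press_fixed) * e_ux + p_press_fixed * e_uy

p_press = 0.10
for _ in range(6):
    best = max(ACTIONS, key=lambda a: score(a, p_press))
    p_press = ACTIONS[best][2]   # re-evaluate P(pressed) under the chosen commitment
    print(best, round(p_press, 2))
# The choice alternates forever between the two commitments; no action is stable.
```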
The actual form of Olah and Taylor's proposal was to treat the conditioning on the button press, inside each branch of the calculation, as a Judea-Pearl-style causal counterfactual (a $\operatorname{do}()$ operation):

$$a_0 \in \operatorname{arg\,max}_{a \in \mathcal A} \Big( \mathbb E[U_X \mid a, \operatorname{do}(\neg b)] \cdot \mathbb P(\neg b \mid a_0) \;+\; \mathbb E[U_Y \mid a, \operatorname{do}(b)] \cdot \mathbb P(b \mid a_0) \Big)$$
The intent is that, by severing the causal parents of $b$, the agent would evaluate that there was no point in trying to correlate the asteroid strike with the button press.
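A toy contrast between conditioning on the button and intervening on it, in a miniature causal model (the agent's commitment makes the strike a causal parent of the press) that is our own illustrative assumption rather than Olah and Taylor's construction:

```python
# Toy contrast between conditioning on the button and intervening on it (do()),
# in a world where the agent has committed to "press iff strike". The tiny
# causal model here is an illustrative assumption.

P_ASTEROID = 0.99   # the strike is causally upstream of the button press

def p_strike_conditioned_on(button_pressed):
    # Ordinary conditioning respects the commitment's correlation:
    # pressed worlds are exactly the strike worlds.
    return 1.0 if button_pressed else 0.0

def p_strike_under_do(button_pressed):
    # Intervening on the button severs its causal parents, so the strike keeps
    # its prior probability regardless of the (forced) button state.
    return P_ASTEROID

print(p_strike_conditioned_on(False))  # 0.0  -- the "outcome pump" reading
print(p_strike_under_do(False))        # 0.99 -- no point correlating the strike with the button
```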
We're still trying to figure out exactly what would happen to an agent reasoning like this, but the current suspicion is that it might misbehave on account of attaching probability 0 or very low probability to the actual world, because it can no longer take into account the correct correlation between $b$ and other variables.
See http://agentfoundations.org/item?id=735.