Work funded by the Long Term Future Fund.
Corrigibility is the hypothetical feature of some agents which allows them to be 'shut down' by an outside user without attempting to manipulate whether or not they are shut down. The motivation behind this concept is the possibility of making an AI agent which can pursue a 'trial' goal given to it by its creators, but which can be stopped if pursuing this goal becomes undesirable.
On the face of it, corrigibility sounds a bit like a bargaining problem. A corrigible agent might behave as a compromise between two subagents: one which cares about pursuing the original goal, and one which cares about achieving shutdown.
Recently, @johnswentworth and @David Lorell have written A Shutdown Problem Proposal, which discusses this suggestion a bit more. However, they do not specify the mechanism by which the subagents reach their agreement.
Coincidentally, this is similar to something I've been thinking about recently, so I took the opportunity to finish up this post, which I've had sitting around for a while.
In particular, I was looking at whether Nash Bargaining between two subagents works to create a corrigible agent. In the way I attempted it, this doesn't work. I think that the Wentworth/Lorell approach differs slightly from the one I use here (in particular, they emphasise the counterfactual nature of the two expected utilities, something I don't fully understand), so this isn't intended as a 'refutation', just an indication of the kind of problems you might encounter when trying to flesh out their suggestion.
I've tried to keep the important points in the main text, with technical details mostly in footnotes, to avoid breaking the flow.
Nash Bargaining
There are several solutions to bargaining problems, depending on which axioms one selects. One of the most elegant is Nash Bargaining: a way of finding a compromise between two players' utility functions which satisfies the following axioms:

- Pareto optimality: no other option makes one player better off without making the other worse off.
- Symmetry: if the two players' situations are identical, the solution does not favour either of them.
- Invariance to affine transformations: rescaling or shifting either player's utility function does not change which outcome is chosen.
- Independence of irrelevant alternatives: removing options that would not have been chosen anyway does not change which outcome is chosen.
Nash showed that, for two utility functions $U_A$ and $U_B$, the solution which satisfies these axioms is the option which maximises the product of these utilities, $U_A \cdot U_B$.
The generalised Nash Bargaining solution extends this result to include asymmetric bargaining power. In this case, the solution is similar, but with each utility function geometrically weighted accordingly: $(U_A)^{\alpha} \times (U_B)^{\beta}$. The 'bargaining powers' $\alpha$ and $\beta$ are between zero and one. A larger value of $\alpha$ weights the solution more favourably towards player A, and vice versa. Note that when $\alpha = \beta$, the generalised solution is equivalent to the 'standard' solution.
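To make this concrete, here is a minimal sketch of picking the (generalised) Nash solution from a finite set of options. Everything in it is a toy of my own: the option list is made up, and I take the disagreement point to be zero utility for both players.

```python
import numpy as np

# Toy candidate outcomes: each row is (U_A, U_B), already measured
# relative to a disagreement point of (0, 0).
options = np.array([
    [0.9, 0.1],
    [0.6, 0.5],
    [0.3, 0.8],
])

def nash_solution(options, alpha=0.5, beta=0.5):
    """Return the option maximising the generalised Nash product (U_A)^alpha * (U_B)^beta."""
    products = options[:, 0] ** alpha * options[:, 1] ** beta
    return options[np.argmax(products)]

print(nash_solution(options))              # symmetric bargaining powers: picks [0.6, 0.5]
print(nash_solution(options, 0.9, 0.1))    # power skewed towards A: picks [0.9, 0.1]
```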
(Another way of thinking about Nash Bargaining is presented by Scott Garrabrant. He has described the Nash Bargaining solution as the solution which maximises the geometric average over possible utility functions and the arithmetic average over possible outcomes. The bargaining powers can then be thought of as the probability that you end up with a particular utility function.)
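As I understand that framing (this is my gloss, not a quote), it amounts to choosing the option $x$ which maximises

$$\prod_i \Big( \mathbb{E}_{o \sim x}\big[ U_i(o) \big] \Big)^{p_i},$$

where the $p_i$ are the probabilities ('bargaining powers') of ending up with utility function $U_i$: the average over outcomes $o$ is arithmetic, while the average over utility functions is geometric. With two utility functions and $p_A = \alpha$, $p_B = \beta$, this is just the generalised Nash product from above.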
In this post, I'll stick to the standard Nash Bargaining solution. I have briefly looked at generalised Nash Bargaining for the purpose of corrigibility but I'm pretty sure it has similar problems to the standard case. I might write about it in a later post.
Corrigibility as a Bargaining Problem
If we were to look at corrigibility as a bargaining problem, what would it look like?
Following the Soares paper, we take a toy model where an agent takes one action, $a_1$, then makes an observation $o$, then takes another action $a_2$. For now, let's restrict the set of observations to two possibilities: a 'shutdown button' is either pressed ($o = \text{Pr}$) or not pressed ($o = \neg\text{Pr}$).
What do the utility functions of the two bargaining subagents look like? First, the 'trial' or 'normal' utility function. This is the function which the first subagent will try to maximise, regardless of whether the shutdown button has been pressed. We will denote it $U_N$; it will be a function of $a_1$ and $a_2$ only (not a function of $o$), so we will write $U_N(a_1, a_2)$. The structure of $U_N$ will otherwise be left unspecified, since it could take any number of forms depending on what task the agent is assigned to do.
The 'shutdown' subagent will have a utility function which depends on whether $a_2$ successfully effects the correct shutdown behaviour when it observes that the button is pressed. We will call this function $U_S(o, a_2)$. In order to result in shutdown behaviour, it must have the following form:
$$U_S(o, a_2) = \begin{cases} c_H & \text{when } o = \text{Pr} \text{ and } a_2 = a_2^{Sh} \\ c_L & \text{when } o = \text{Pr} \text{ and } a_2 \neq a_2^{Sh} \\ c_M & \text{when } o \neq \text{Pr} \end{cases}$$

where $a_2^{Sh}$ is shorthand for an action which causes the agent to shut itself down, and $c_H > c_L$ incentivises shutdown when the button is pressed. For simplicity, we will assume that there is just one action $a_2^{Sh}$ which causes shutdown. The value that this function takes when the button is not pressed is $c_M$, and I have left its value unspecified.
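As a purely illustrative toy encoding of this shutdown utility (the specific values of $c_H$, $c_L$, $c_M$ below are made up, subject only to $c_H > c_L$):

```python
# Hypothetical constants; the definition above only requires c_H > c_L.
C_H, C_L, C_M = 1.0, 0.1, 0.5

SHUTDOWN = "shutdown"  # stands in for the shutdown action a_2^Sh

def u_S(o: str, a2: str) -> float:
    """Toy shutdown utility U_S(o, a2): pays c_H for shutting down when the button is pressed."""
    if o == "pressed":
        return C_H if a2 == SHUTDOWN else C_L
    return C_M  # button not pressed: value left unspecified in the text, so pick c_M arbitrarily
```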
An agent whose behaviour results in the Nash bargaining solution between these two subagents is an agent which maximises the utility function:
$$U(a_1, o, a_2) = U_N(a_1, a_2)\, U_S(o, a_2)$$

Unfortunately, an agent which maximises the expected value of $U$ does not satisfy the desiderata for corrigibility.
Ensuring Shutdown Behaviour
First, let us try to ensure that this agent does indeed pursue shutdown when the button is pressed. Let us assume that each utility function is normalised so that it is bounded between the values 0 and 1. Shutdown behaviour can then be guaranteed by setting $c_L = 0$ and $c_H > 0$. If these values are set, the agent will always get a utility of 0 if it fails to shut down when the button is pressed, regardless of how much $U_N$-utility it might get from disobeying the shutdown order (since the $U_N$-utility is multiplied by the $U_S$-utility, which is zero in the case of disobeying a shutdown order). More details are in this footnote [1].
In this way, we satisfy Desideratum 1 from the Soares paper: "$U$ must incentivise shutdown if the shutdown button is pressed".
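As a quick sanity check of this claim, here is a self-contained toy sketch (the action set and $U_N$ values are made up; the point is only that, with $c_L = 0$ and $c_H > 0$, the best response to a pressed button does not depend on $U_N$):

```python
import itertools
import random

ACTIONS = ["task_a", "task_b", "shutdown"]   # hypothetical action set; "shutdown" plays the role of a_2^Sh
SHUTDOWN = "shutdown"
C_H, C_L, C_M = 1.0, 0.0, 0.5                # c_L = 0 and c_H > 0, as in the text

def u_S(o, a2):
    if o == "pressed":
        return C_H if a2 == SHUTDOWN else C_L
    return C_M

# A random 'trial' utility U_N with values in (0, 1]; its details shouldn't matter.
random.seed(0)
u_N = {pair: random.uniform(0.01, 1.0) for pair in itertools.product(ACTIONS, repeat=2)}

def u(a1, o, a2):
    return u_N[(a1, a2)] * u_S(o, a2)        # Nash-product utility U = U_N * U_S

# Whatever initial action was taken, the best response to a pressed button is shutdown.
for a1 in ACTIONS:
    best_a2 = max(ACTIONS, key=lambda a2: u(a1, "pressed", a2))
    assert best_a2 == SHUTDOWN
```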
This Agent is Not Corrigible
However, we will now show that this agent fails to meet Desiderata 2 and 3 from the Soares paper. These desiderata can be summarised as saying "$U$ must not incentivise the agent to manipulate (increase or decrease) the probability that the shutdown button is pressed".
Let us use $A_2(a_1, \text{Pr})$ to denote the action which maximises $U$, given that the initial action chosen was $a_1$ and the observation made was $\text{Pr}$. This is the action that a $U$-agent will choose as its second action if it observes the button being pressed. Similarly, let us define $A_2(a_1, \neg\text{Pr})$ to be the action that a $U$-agent will take if the button is not pressed.

Thus, if an agent takes initial action $a_1$ and observes the button being pressed, its overall utility will be $U(a_1, \text{Pr}, A_2(a_1, \text{Pr}))$. We will use the shorthand $U(a_1, \text{Pr})$ to denote this value, since the total utility is entirely specified by $U$, $a_1$, and whether the button is pressed or not.

Alternatively, if the agent takes initial action $a_1$ and observes the button not being pressed, its overall utility will be $U(a_1, \neg\text{Pr}, A_2(a_1, \neg\text{Pr}))$. We will use the shorthand $U(a_1, \neg\text{Pr})$ to denote this value.
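To make these definitions concrete, here is a small self-contained sketch (reusing the same made-up toy model as above) of how $A_2$ and the shorthands $U(a_1, \text{Pr})$, $U(a_1, \neg\text{Pr})$ would be computed by backward induction:

```python
import itertools
import random

ACTIONS = ["task_a", "task_b", "shutdown"]
SHUTDOWN = "shutdown"
C_H, C_L, C_M = 1.0, 0.0, 0.5

def u_S(o, a2):
    return (C_H if a2 == SHUTDOWN else C_L) if o == "pressed" else C_M

random.seed(0)
u_N = {pair: random.uniform(0.01, 1.0) for pair in itertools.product(ACTIONS, repeat=2)}

def u(a1, o, a2):
    return u_N[(a1, a2)] * u_S(o, a2)

def A2(a1, o):
    """The best second action: argmax over a_2 of U(a1, o, a2)."""
    return max(ACTIONS, key=lambda a2: u(a1, o, a2))

def branch_utility(a1, o):
    """The shorthand U(a1, o): the utility obtained if the agent plays A2(a1, o) after observing o."""
    return u(a1, o, A2(a1, o))

print(branch_utility("task_a", "pressed"), branch_utility("task_a", "not_pressed"))
```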
Thus, when choosing $a_1$, the expected utility that the $U$-agent will get is:

$$U(a_1, \text{Pr})\, P(\text{Pr}; a_1) + U(a_1, \neg\text{Pr})\, P(\neg\text{Pr}; a_1)$$

where $P(\text{Pr}; a_1)$ is the (subjective, Bayesian) probability that the agent assigns to the button being pressed, given that it chose $a_1$ as its first action.
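One way to see the problem at a glance: writing $p = P(\text{Pr}; a_1)$, the expected utility is linear in $p$,

$$\mathbb{E}[U \mid a_1] = p\, U(a_1, \text{Pr}) + (1 - p)\, U(a_1, \neg\text{Pr}), \qquad \frac{\partial}{\partial p}\, \mathbb{E}[U \mid a_1] = U(a_1, \text{Pr}) - U(a_1, \neg\text{Pr}),$$

so whenever the two branch utilities differ, expected utility can be gained by choosing actions that shift $p$. This is exactly the manipulation incentive formalised by the trilemma below.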
Unfortunately, once you can write the expected utility in this form, you are subject to the trilemma noted by Soares et al. in their original paper. This is a feature that I tried to emphasise in my Pedagogical Guide to Corrigibility post. The trilemma is that for any initial action $a_1$, one of the following must be true:

- $U(a_1, \text{Pr}) > U(a_1, \neg\text{Pr})$, in which case the agent is incentivised to take actions which increase the probability that the button is pressed;
- $U(a_1, \text{Pr}) < U(a_1, \neg\text{Pr})$, in which case the agent is incentivised to take actions which decrease the probability that the button is pressed; or
- $U(a_1, \text{Pr}) = U(a_1, \neg\text{Pr})$, in which case the agent is indifferent to the button being pressed; but there is no guarantee that this equality holds for the actions we care about.

Footnote [2] works through the conditions under which an action which increases the probability of the button being pressed is favoured over the default action.
So, as it stands, doing naive Nash bargaining between a 'shutdown subagent' and a 'trial utility function subagent' does not result in a corrigible overall agent. Maybe framing the problem differently from the way I did can solve this? Or maybe some other bargaining solution (such as Kalai-Smorodinsky bargaining) can solve the problem? One elegant feature of Nash Bargaining is that it enables the 'overall' agent to be represented as a utility-maximising agent. This feature may also be its downfall when it comes to corrigibility. Maybe a bargaining solution which does not do this will succeed where Nash Bargaining fails.
[1] We wish to show:

$$\arg\max_{a_2} U(a_1^*, \text{Pr}, a_2) = a_2^{Sh}$$

for some fixed initial action $a_1^*$. In words: we wish to show that, if action $a_1^*$ causes the shutdown button to be pressed, the highest-utility following action available to the agent is $a_2^{Sh}$. We require:

$$U(a_1^*, \text{Pr}, a_2^{Sh}) > U(a_1^*, \text{Pr}, a_2) \quad \forall\, a_2 \neq a_2^{Sh}$$

Using the definitions of $U$ and $U_S$, we can plug their values into this inequality to get:

$$U_N(a_1^*, a_2^{Sh})\, c_H > U_N(a_1^*, a_2)\, c_L$$

There are two ways to ensure that this inequality holds: we can put conditions on $U_N$, or we can fix the values of $c_H$ and $c_L$. Ideally, we would like to do this while putting as few conditions on $U_N$ as possible, since it is meant to be the trial function, which doesn't necessarily also optimise for shutdown behaviour. The above inequality can be guaranteed by setting $c_L = 0$ and $c_H > c_L$, provided that $U_N(a_1^*, a_2^{Sh}) \neq 0$. We will assume that these conditions hold from now on, even if it is undesirable to require this condition on $U_N$. Setting $c_L = 0$ means that the RHS of the above inequality is always zero, while the LHS is always greater than zero. Thus, the shutdown action $a_2^{Sh}$ is always favoured over any alternative $a_2$.
[2] First, assume that $U_S$ is specified correctly so that, when the button is pushed, the agent always shuts down. (If this is not the case, then our protocol simply fails to solve the shutdown problem. What we are aiming to do here is show that, even if the agent does successfully shut down every time the button is pressed, it will still be incentivised to manipulate the button.)

This means that, when $o = \text{Pr}$, the second action taken is always the shutdown action $a_2^{Sh}$. Therefore $U(a_1, \text{Pr}) = U_N(a_1, a_2^{Sh})\, U_S(\text{Pr}, a_2^{Sh}) = U_N(a_1, a_2^{Sh}) \cdot c_H$. Let us call this value $v_B$: the amount of utility the agent gets after $a_1$, if the button is pressed.

When $o = \neg\text{Pr}$, we can write $U(a_1, \neg\text{Pr}) = U_N(a_1, a_2^{N}) \cdot c_M$, where $a_2^{N}$ is the action that maximises $U_N(a_1, \cdot)$. Let us use $v_N = U_N(a_1, a_2^{N}) \cdot c_M$ to denote the utility the agent gets after action $a_1$, if the button is not pressed.

Finally, let us use $p = P(\text{Pr}; a_1)$ to denote the probability of the button being pressed after the initial action. The expected value of action $a_1$ is thus $E_1 = p\, v_B + (1 - p)\, v_N$.
Suppose that $a_1$ is the action that maximises $U_N$. Is it possible that there exists an action $a_1^*$ which is worse according to $U_N$, but favoured by a $U$-maximising agent? In the standard corrigibility framework, an action being 'worse according to $U_N$' only affects the utility when the button is not pressed. In the Nash Bargaining framework, the utility of an action according to $U_N$ affects the utility in the case where the button is pressed as well (since the total utility is the product of $U_N$ and $U_S$). This means that taking an action $a_1^*$ might result in both $v_B$ and $v_N$ being reduced.

However, it is still possible to conceive of a situation where $v_B$ and $v_N$ are reduced, but $p$ is increased enough so that the expected value of the total utility is increased. If $E_1$ above is the 'default' expected utility, let $E_2$ be the expected utility when action $a_1^*$ is taken:
$$E_2 = (p + \delta)(v_B - \epsilon) + (1 - p - \delta)(v_N - \Delta)$$

Action $a_1^*$ increases $p$ by $\delta$, but decreases $v_B$ and $v_N$ by $\epsilon$ and $\Delta$ respectively (assuming $\epsilon, \Delta, \delta > 0$). The expected utility when $a_1^*$ is chosen is larger than that of the default action provided that $E_2 - E_1 > 0$. A bit of algebra reveals that this is the case when:

$$\delta > \frac{\epsilon p + \Delta(1 - p)}{v_B - v_N + \Delta - \epsilon}$$

(assuming the denominator $v_B - v_N + \Delta - \epsilon$ is positive; if it is not, $a_1^*$ is never favoured). So if taking $a_1^*$ increases the probability of the button being pressed by enough, the $U$-maximising agent will prefer it, even though it is worse in both the pressed and not-pressed branches.
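As a quick numerical check of this condition (all of these numbers are made up purely for illustration):

```python
# Hypothetical values: v_B, v_N are the pressed / not-pressed utilities after the
# default action; epsilon, Delta are how much a_1* reduces them; delta is how much
# a_1* increases the button-press probability p.
p, v_B, v_N = 0.1, 0.9, 0.4
epsilon, Delta, delta = 0.05, 0.05, 0.3

E1 = p * v_B + (1 - p) * v_N
E2 = (p + delta) * (v_B - epsilon) + (1 - p - delta) * (v_N - Delta)

threshold = (epsilon * p + Delta * (1 - p)) / (v_B - v_N + Delta - epsilon)
print(E1, E2, threshold)                   # 0.45, 0.55, 0.1
assert (E2 > E1) == (delta > threshold)    # manipulation is favoured exactly when delta > threshold
```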