Work funded by the Long Term Future Fund.
Corrigibility is the hypothetical feature of some agents which allows them to be 'shut down' by an outside user without attempting to manipulate whether or not they are shut down. The motivation behind this concept is the possibility of making an AI agent which can pursue a 'trial' goal given to it by its creators, but which can be stopped if pursuing this goal becomes undesirable.
On the face of it, corrigibility sounds a bit like a bargaining problem. A corrigible agent might behave as a compromise between two subagents: one which cares about pursuing the original goal, and one which cares about achieving shutdown.
Recently, @johnswentworth and @David Lorell have written A Shutdown Problem Proposal, which discusses this suggestion a bit more. However, they do not specify the mechanism by which the subagents reach their agreement.
Coincidentally, this is similar to something I've been thinking about recently, so I took the opportunity to finish up this post, which I've had sitting around for a while.
In particular, I was looking at whether Nash Bargaining between two subagents works to create a corrigible agent. In the way I attempted it, this doesn't work. I think that the Wentworth/Lorell approach differs slightly from the one I use here (in particular, they emphasise the counterfactual nature of the two expected utilities, something I don't fully understand), so this isn't intended as a 'refutation', just an indication of the kind of problems you might encounter when trying to flesh out their suggestion.
I've tried to keep the important points in the main text, with technical details mostly in footnotes, to avoid breaking the flow.
Nash Bargaining
There are several solutions to bargaining problems, depending on which axioms one selects. One of the most elegant is Nash Bargaining: a way of finding a compromise between two players' utility functions which satisfies the following axioms:

- Pareto optimality: no other option makes one player better off without making the other worse off.
- Symmetry: if the two players' situations are identical, the solution does not favour either of them.
- Invariance to affine transformations: rescaling or shifting either player's utility function does not change which outcome is chosen.
- Independence of irrelevant alternatives: removing options that would not have been chosen anyway does not change which outcome is chosen.
Nash showed that, for two utility functions $U_A$ and $U_B$, the solution which satisfies these axioms is the option which maximises the product of these utilities, $U_A \cdot U_B$.
The generalised Nash Bargaining solution extends this result to include asymmetric bargaining power. In this case, the solution is similar, but with each utility function geometrically weighted accordingly: $(U_A)^{\alpha} \times (U_B)^{\beta}$. The 'bargaining powers' $\alpha$ and $\beta$ are between zero and one. A larger value of $\alpha$ weights the solution more favourably towards player A, and vice versa. Note that when $\alpha = \beta$, the generalised solution is equivalent to the 'standard' solution.
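To make this concrete, here is a minimal sketch of picking the (generalised) Nash solution from a finite set of options. Everything in it is a toy of my own: the option list is made up, and I take the disagreement point to be zero utility for both players.

```python
import numpy as np

# Toy candidate outcomes: each row is (U_A, U_B), already measured
# relative to a disagreement point of (0, 0).
options = np.array([
    [0.9, 0.1],
    [0.6, 0.5],
    [0.3, 0.8],
])

def nash_solution(options, alpha=0.5, beta=0.5):
    """Return the option maximising the generalised Nash product (U_A)^alpha * (U_B)^beta."""
    products = options[:, 0] ** alpha * options[:, 1] ** beta
    return options[np.argmax(products)]

print(nash_solution(options))              # symmetric bargaining powers: picks [0.6, 0.5]
print(nash_solution(options, 0.9, 0.1))    # power skewed towards A: picks [0.9, 0.1]
```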
(Another way of thinking about Nash Bargaining is presented by Scott Garrabrant. He has described the Nash Bargaining solution as the solution which maximises the geometric average over possible utility functions and the arithmetic average over possible outcomes. The bargaining powers can then be thought of as the probability that you end up with a particular utility function.)
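As I understand that framing (this is my gloss, not a quote), it amounts to choosing the option $x$ which maximises

$$\prod_i \Big( \mathbb{E}_{o \sim x}\big[ U_i(o) \big] \Big)^{p_i},$$

where the $p_i$ are the probabilities ('bargaining powers') of ending up with utility function $U_i$: the average over outcomes $o$ is arithmetic, while the average over utility functions is geometric. With two utility functions and $p_A = \alpha$, $p_B = \beta$, this is just the generalised Nash product from above.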
In this post, I'll stick to the standard Nash Bargaining solution. I have briefly looked at generalised Nash Bargaining for the purpose of corrigibility but I'm pretty sure it has similar problems to the standard case. I might write about it in a later post.
Corrigibility as a Bargaining Problem
If we were to look at corrigibility as a bargaining problem, what would it look like?
Following the Soares paper, we take a toy model where an agent takes one action, $a_1$, then makes an observation $o$, then takes another action $a_2$. For now, let's restrict the set of observations to two possibilities: a 'shutdown button' is either pressed ($o = \text{Pr}$) or not pressed ($o = \neg\text{Pr}$).
What do the utility functions of the two bargaining subagents look like? First, the 'trial' or 'normal' utility function. This is the function which the first subagent will try to maximise, regardless of whether the shutdown button has been pressed. We will denote it $U_N$; it will be a function of $a_1$ and $a_2$ only (not a function of $o$), so we will write $U_N(a_1, a_2)$. The structure of $U_N$ will otherwise be left unspecified, since it could take any number of forms depending on what task the agent is assigned to do.
The 'shutdown' subagent will have a utility function which depends on whether $a_2$ successfully effects the correct shutdown behaviour when it observes that the button is pressed. We will call this function $U_S(o, a_2)$. In order to result in shutdown behaviour, it must have the following form:
$$U_S(o, a_2) = \begin{cases} c_H & \text{when } o = \text{Pr} \text{ and } a_2 = a_2^{Sh} \\ c_L & \text{when } o = \text{Pr} \text{ and } a_2 \neq a_2^{Sh} \\ c_M & \text{when } o \neq \text{Pr} \end{cases}$$

where $a_2^{Sh}$ is shorthand for an action which causes the agent to shut itself down, and $c_H > c_L$ incentivises shutdown when the button is pressed. For simplicity, we will assume that there is just one action $a_2^{Sh}$ which causes shutdown. The value that this function takes when the button is not pressed is $c_M$, and I have left its value unspecified.
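As a purely illustrative toy encoding of this shutdown utility (the specific values of $c_H$, $c_L$, $c_M$ below are made up, subject only to $c_H > c_L$):

```python
# Hypothetical constants; the definition above only requires c_H > c_L.
C_H, C_L, C_M = 1.0, 0.1, 0.5

SHUTDOWN = "shutdown"  # stands in for the shutdown action a_2^Sh

def u_S(o: str, a2: str) -> float:
    """Toy shutdown utility U_S(o, a2): pays c_H for shutting down when the button is pressed."""
    if o == "pressed":
        return C_H if a2 == SHUTDOWN else C_L
    return C_M  # button not pressed: value left unspecified in the text, so pick c_M arbitrarily
```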
An agent whose behaviour results in the Nash bargaining solution between these two subagents is an agent which maximises the utility function:
$$U(a_1, o, a_2) = U_N(a_1, a_2)\, U_S(o, a_2)$$

Unfortunately, an agent which maximises the expected value of $U$ does not satisfy the desiderata for corrigibility.
Ensuring Shutdown Behaviour
First, let us try to ensure that this agent does indeed pursue shutdown when the button is pressed. Let us assume that each utility function is normalised so that it is bounded between the values 0 and 1. Shutdown behaviour can then be guaranteed by setting $c_L = 0$ and $c_H > 0$. If these values are set, the agent will always get a utility of 0 if it fails to shut down when the button is pressed, regardless of how much $U_N$-utility it might get from disobeying the shutdown order (since the $U_N$-utility is multiplied by the $U_S$-utility, which is zero in the case of disobeying a shutdown order). More details are in this footnote [1].
In this way, we satisfy Desideratum 1 from the Soares paper: "$U$ must incentivise shutdown if the shutdown button is pressed".
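As a quick sanity check of this claim, here is a self-contained toy sketch (the action set and $U_N$ values are made up; the point is only that, with $c_L = 0$ and $c_H > 0$, the best response to a pressed button does not depend on $U_N$):

```python
import itertools
import random

ACTIONS = ["task_a", "task_b", "shutdown"]   # hypothetical action set; "shutdown" plays the role of a_2^Sh
SHUTDOWN = "shutdown"
C_H, C_L, C_M = 1.0, 0.0, 0.5                # c_L = 0 and c_H > 0, as in the text

def u_S(o, a2):
    if o == "pressed":
        return C_H if a2 == SHUTDOWN else C_L
    return C_M

# A random 'trial' utility U_N with values in (0, 1]; its details shouldn't matter.
random.seed(0)
u_N = {pair: random.uniform(0.01, 1.0) for pair in itertools.product(ACTIONS, repeat=2)}

def u(a1, o, a2):
    return u_N[(a1, a2)] * u_S(o, a2)        # Nash-product utility U = U_N * U_S

# Whatever initial action was taken, the best response to a pressed button is shutdown.
for a1 in ACTIONS:
    best_a2 = max(ACTIONS, key=lambda a2: u(a1, "pressed", a2))
    assert best_a2 == SHUTDOWN
```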
This Agent is Not Corrigible
However, we will now show that this agent fails to meet Desiderata 2 and 3 from the Soares paper. These desiderata can be summarised as saying "$U$ must not incentivise the agent to manipulate (increase or decrease) the probability that the shutdown button is pressed".
Let us use $A_2(a_1, \text{Pr})$ to denote the action which maximises $U$, given that the initial action chosen was $a_1$ and the observation made was $\text{Pr}$. This is the action that a $U$-agent will choose as its second action if it observes the button being pressed. Similarly, let us define $A_2(a_1, \neg\text{Pr})$ to be the action that a $U$-agent will take if the button is not pressed.

Thus, if an agent takes initial action $a_1$ and observes the button being pressed, its overall utility will be $U(a_1, \text{Pr}, A_2(a_1, \text{Pr}))$. We will use the shorthand $U(a_1, \text{Pr})$ to denote this value, since the total utility is entirely specified by $U$, $a_1$, and whether the button is pressed or not.

Alternatively, if the agent takes initial action $a_1$ and observes the button not being pressed, its overall utility will be $U(a_1, \neg\text{Pr}, A_2(a_1, \neg\text{Pr}))$. We will use the shorthand $U(a_1, \neg\text{Pr})$ to denote this value.
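To make these definitions concrete, here is a small self-contained sketch (reusing the same made-up toy model as above) of how $A_2$ and the shorthands $U(a_1, \text{Pr})$, $U(a_1, \neg\text{Pr})$ would be computed by backward induction:

```python
import itertools
import random

ACTIONS = ["task_a", "task_b", "shutdown"]
SHUTDOWN = "shutdown"
C_H, C_L, C_M = 1.0, 0.0, 0.5

def u_S(o, a2):
    return (C_H if a2 == SHUTDOWN else C_L) if o == "pressed" else C_M

random.seed(0)
u_N = {pair: random.uniform(0.01, 1.0) for pair in itertools.product(ACTIONS, repeat=2)}

def u(a1, o, a2):
    return u_N[(a1, a2)] * u_S(o, a2)

def A2(a1, o):
    """The best second action: argmax over a_2 of U(a1, o, a2)."""
    return max(ACTIONS, key=lambda a2: u(a1, o, a2))

def branch_utility(a1, o):
    """The shorthand U(a1, o): the utility obtained if the agent plays A2(a1, o) after observing o."""
    return u(a1, o, A2(a1, o))

print(branch_utility("task_a", "pressed"), branch_utility("task_a", "not_pressed"))
```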
Thus, when choosing $a_1$, the expected utility that the $U$-agent will get is:

$$U(a_1, \text{Pr})\, P(\text{Pr}; a_1) + U(a_1, \neg\text{Pr})\, P(\neg\text{Pr}; a_1)$$

where $P(\text{Pr}; a_1)$ is the (subjective, Bayesian) probability that the agent assigns to the button being pressed, given that it chose $a_1$ as its first action.
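One way to see the problem at a glance: writing $p = P(\text{Pr}; a_1)$, the expected utility is linear in $p$,

$$\mathbb{E}[U \mid a_1] = p\, U(a_1, \text{Pr}) + (1 - p)\, U(a_1, \neg\text{Pr}), \qquad \frac{\partial}{\partial p}\, \mathbb{E}[U \mid a_1] = U(a_1, \text{Pr}) - U(a_1, \neg\text{Pr}),$$

so whenever the two branch utilities differ, expected utility can be gained by choosing actions that shift $p$. This is exactly the manipulation incentive formalised by the trilemma below.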
Unfortunately, once you can write the expected utility in this form, you are subject to the trilemma noted by Soares et al. in their original paper. This is a feature that I tried to emphasise in my Pedagogical Guide to Corrigibility post. The trilemma is that for any initial action $a_1$, one of the following must be true:

- $U(a_1, \text{Pr}) > U(a_1, \neg\text{Pr})$, in which case the agent is incentivised to take actions which increase the probability that the button is pressed;
- $U(a_1, \text{Pr}) < U(a_1, \neg\text{Pr})$, in which case the agent is incentivised to take actions which decrease the probability that the button is pressed; or
- $U(a_1, \text{Pr}) = U(a_1, \neg\text{Pr})$, in which case the agent is indifferent to the button being pressed; but there is no guarantee that this equality holds for the actions we care about.

Footnote [2] works through the conditions under which an action which increases the probability of the button being pressed is favoured over the default action.
So, as it stands, doing naive Nash bargaining between a 'shutdown subagent' and a 'trial utility function subagent' does not result in a corrigible overall agent. Maybe framing the problem differently from the way I did can solve this? Or maybe some other bargaining solution (such as Kalai-Smorodinsky bargaining) can solve the problem? One elegant feature of Nash Bargaining is that it enables the 'overall' agent to be represented as a utility-maximising agent. This feature may also be its downfall when it comes to corrigibility. Maybe a bargaining solution which does not do this will succeed where Nash Bargaining fails.
[1] We wish to show:

$$\arg\max_{a_2} U(a_1^*, \text{Pr}, a_2) = a_2^{Sh}$$

for some fixed initial action $a_1^*$. In words: we wish to show that, if action $a_1^*$ causes the shutdown button to be pressed, the highest-utility following action available to the agent is $a_2^{Sh}$. We require:

$$U(a_1^*, \text{Pr}, a_2^{Sh}) > U(a_1^*, \text{Pr}, a_2) \quad \forall\, a_2 \neq a_2^{Sh}$$

Using the definitions of $U$ and $U_S$, we can plug their values into this inequality to get:

$$U_N(a_1^*, a_2^{Sh})\, c_H > U_N(a_1^*, a_2)\, c_L$$

There are two ways to ensure that this inequality holds: we can put conditions on $U_N$, or we can fix the values of $c_H$ and $c_L$. Ideally, we would like to do this while putting as few conditions on $U_N$ as possible, since it is meant to be the trial function, which doesn't necessarily also optimise for shutdown behaviour. The above inequality can be guaranteed by setting $c_L = 0$ and $c_H > c_L$, provided that $U_N(a_1^*, a_2^{Sh}) \neq 0$. We will assume that these conditions hold from now on, even if it is undesirable to require this condition on $U_N$. Setting $c_L = 0$ means that the RHS of the above inequality is always zero, while the LHS is always greater than zero. Thus, the shutdown action $a_2^{Sh}$ is always favoured over any alternative $a_2$.
[2] First, assume that $U_S$ is specified correctly so that, when the button is pushed, the agent always shuts down. (If this is not the case, then our protocol simply fails to solve the shutdown problem. What we are aiming to do here is show that, even if the agent does successfully shut down every time the button is pressed, it will still be incentivised to manipulate the button.)

This means that, when $o = \text{Pr}$, the second action taken is always the shutdown action $a_2^{Sh}$. Therefore $U(a_1, \text{Pr}) = U_N(a_1, a_2^{Sh})\, U_S(\text{Pr}, a_2^{Sh}) = U_N(a_1, a_2^{Sh}) \cdot c_H$. Let us call this value $v_B$: the amount of utility the agent gets after $a_1$, if the button is pressed.

When $o = \neg\text{Pr}$, we can write $U(a_1, \neg\text{Pr}) = U_N(a_1, a_2^{N}) \cdot c_M$, where $a_2^{N}$ is the action that maximises $U_N(a_1, \cdot)$. Let us use $v_N = U_N(a_1, a_2^{N}) \cdot c_M$ to denote the utility the agent gets after action $a_1$, if the button is not pressed.

Finally, let us use $p = P(\text{Pr}; a_1)$ to denote the probability of the button being pressed after the initial action. The expected value of action $a_1$ is thus $E_1 = p\, v_B + (1 - p)\, v_N$.
Suppose that $a_1$ is the action that maximises $U_N$. Is it possible that there exists an action $a_1^*$ which is worse according to $U_N$, but favoured by a $U$-maximising agent? In the standard corrigibility framework, an action being 'worse according to $U_N$' only affects the utility when the button is not pressed. In the Nash Bargaining framework, the utility of an action according to $U_N$ affects the utility in the case where the button is pressed as well (since the total utility is the product of $U_N$ and $U_S$). This means that taking an action $a_1^*$ might result in both $v_B$ and $v_N$ being reduced.

However, it is still possible to conceive of a situation where $v_B$ and $v_N$ are reduced, but $p$ is increased enough so that the expected value of the total utility is increased. If $E_1$ above is the 'default' expected utility, let $E_2$ be the expected utility when action $a_1^*$ is taken:
$$E_2 = (p + \delta)(v_B - \epsilon) + (1 - p - \delta)(v_N - \Delta)$$

Action $a_1^*$ increases $p$ by $\delta$, but decreases $v_B$ and $v_N$ by $\epsilon$ and $\Delta$ respectively (assuming $\epsilon, \Delta, \delta > 0$). The expected utility when $a_1^*$ is chosen is larger than that of the default action provided that $E_2 - E_1 > 0$. A bit of algebra reveals that this is the case when:

$$\delta > \frac{\epsilon p + \Delta(1 - p)}{v_B - v_N + \Delta - \epsilon}$$

(assuming the denominator $v_B - v_N + \Delta - \epsilon$ is positive; if it is not, $a_1^*$ is never favoured). So if taking $a_1^*$ increases the probability of the button being pressed by enough, the $U$-maximising agent will prefer it, even though it is worse in both the pressed and not-pressed branches.
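As a quick numerical check of this condition (all of these numbers are made up purely for illustration):

```python
# Hypothetical values: v_B, v_N are the pressed / not-pressed utilities after the
# default action; epsilon, Delta are how much a_1* reduces them; delta is how much
# a_1* increases the button-press probability p.
p, v_B, v_N = 0.1, 0.9, 0.4
epsilon, Delta, delta = 0.05, 0.05, 0.3

E1 = p * v_B + (1 - p) * v_N
E2 = (p + delta) * (v_B - epsilon) + (1 - p - delta) * (v_N - Delta)

threshold = (epsilon * p + Delta * (1 - p)) / (v_B - v_N + Delta - epsilon)
print(E1, E2, threshold)                   # 0.45, 0.55, 0.1
assert (E2 > E1) == (delta > threshold)    # manipulation is favoured exactly when delta > threshold
```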