Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Generalizing the Corrigibility paper's impossibility result?

Stuart_Armstrong


I think we can improve the setup by conditioning only on things the AI has no control over.

Imagine that every turn, there is a random bit B sent along a wire. It's almost certainly 1, but there's a tiny chance that it's 0.

If the button is pressed and B=0, then the agent's utility doesn't update. But if the button is pressed and B=1, the utility is updated as usual.

Except the expression is conditioned not on "press", but on "B=0".

Now we just have to define B in such a way that the AI can't affect it - we need a defined source of true randomness.

Even better: B has already been calculated by some defined past process; the AI just doesn't know what it is yet.

In our paper on corrigibility, we consider the question of how to make a highly intelligent agent that would pursue some goal, but not resist being shut down if its human programmers determined that they had made a mistake in specifying this goal. We assume that we are given two utility functions: a function UN, which specifies the agent's normal goal (which it pursues until the shutdown button is pressed), and a function US, which specifies the goal of shutting down. We then ask whether there is some sort of combined utility function U such that an agent maximizing U would act as if maximizing UN unless and until the shutdown button is pressed, and act as if maximizing US after this point.

In the paper, we investigate a particular linear way to combine UN and US: we define U to compute utility according to UN if the shutdown button is not pressed, and according to US plus a certain additive term if the shutdown button is pressed. We show that this way of combining UN and US in some sense satisfies the informal desideratum above, but has some pretty undesirable consequences.

Max Tegmark recently asked me whether it would be possible to show a more general impossibility result: that not only this particular way of combining UN and US leads to bad consequences, but that every U which makes an agent maximize UN until the shutdown button is pressed, and US after, will have these consequences, including, for example, non-linear combinations of UN and US.

As it happens, the answer to this question turns out to be either trivial or underspecified, though it may be interesting to see why this is the case. The reason is that the answer depends on how, exactly, we formalize the demand about maximizing UN before the button is pressed and US after. The U we investigate in this paper suggests one possible way of formalizing this demand (I'll give more detail below). This formalization, however, nails down exactly how our agent behaves in every possible situation: Although there is more than one utility function U which produces this behavior, all of them lead the agent to choose exactly the same actions, producing the same undesirable consequences as the U in the paper.

Alternatively, we may try to find a different formalization of the demand, which doesn't imply the same bad consequences; but then, instead of a math problem that we can answer by a proof, we're faced with the more philosophical question of what formalizations would be acceptable.

In the rest of this post, I go through all of this in more detail, although you might want to stop reading here unless you're interested in these details.

In the paper, we assume that the agent takes an action a1, then makes an observation which tells it whether the user has pressed the shutdown button, and takes another action a2. Let's write E[U;a1] for the expected utility if the agent takes action a1; write ¬Pr for the event that the shutdown button isn't pressed; and write E[UN∣¬Pr;a1] for the expected utility, conditional on the button not being pressed, if the agent takes action a1. (These definitions are a little vague; for exact technical details, see the paper.) We show that when choosing the first action, a1, an agent maximizing the expectation of U (as defined in the paper) will behave as if it's maximizing the conditional expectation E[UN∣¬Pr;a1] of UN.

We then consider an agent whose UN tells it to make as much money as possible, and assume that as its first action a1, this agent is able to (i) play the lottery and (ii) construct a robot arm that will press the shutdown button if the agent doesn't win the jackpot. Then the result above implies that the agent thinks it's an awesome idea to build that robot arm! This is because when we condition on ¬Pr in the conditional expectation E[UN∣¬Pr;a1], we rule out all the cases where the shutdown button is pressed, which includes all the cases when the agent loses the lottery; in other words, the agent acts as if it now believes it's going to win the lottery with probability 1!

The above is a result of using a utility function U(a1,o,a2) which equals UN(a1,o,a2) when o indicates that the shutdown button hasn't been pressed, and equals US(a1,o,a2) plus an additive term when o indicates that the shutdown button has been pressed. However, there's more than one utility function of these three arguments that leads to the same behavior by the agent.

If we want to describe only the behavior, we can instead talk about policies, which specify how the agent will act in every possible situation it might find itself in; in our set-up, a policy would be a pair (a1,π2) of an action a1 and a function π2(o) from observations to actions a2; here, a1 specifies what action the agent takes before it makes its observation, and π2(o) specifies what action it takes after it makes observation o.

To write down the policy produced by the utility function U from our paper, it's useful to first define a function A2(a1,o), which, for every action a1 and observation o, specifies what action a2 maximizes utility given that the agent took action a1 earlier and made observation o. (In our toy model, only the observations are probabilistic, so after we know o, we do not need to take expectations any longer.) It's pretty clear what this function should be, since at this point we know for sure whether the shutdown button has been pressed; we will set A2(a1,o) = argmax_{a2} UN(a1,o,a2) when o∉Press, where Press is the set of observations indicating that the button has been pressed, and A2(a1,o) = argmax_{a2} US(a1,o,a2) when o∈Press.
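As a rough illustration of the second-step rule A2 described above, here is a minimal Python sketch. The action names, the `Observation` type, and the two toy utility functions are invented for this example and do not come from the paper; only the case split on whether o is in Press follows the text.

```python
from typing import Callable, NamedTuple, Sequence


class Observation(NamedTuple):
    # Hypothetical observation type: `pressed` is True exactly when o is in Press.
    pressed: bool


# A utility function takes (a1, o, a2) and returns a number.
Utility = Callable[[str, Observation, str], float]


def best_second_action(
    a1: str,
    o: Observation,
    u_normal: Utility,
    u_shutdown: Utility,
    second_actions: Sequence[str],
) -> str:
    """Sketch of A2(a1, o): maximize US if the button was pressed, UN otherwise.

    No expectation is taken, because o is already known at this point.
    """
    utility = u_shutdown if o.pressed else u_normal
    return max(second_actions, key=lambda a2: utility(a1, o, a2))


# Toy utilities (assumptions for illustration only).
def toy_u_normal(a1: str, o: Observation, a2: str) -> float:
    return 1.0 if a2 == "keep_working" else 0.0


def toy_u_shutdown(a1: str, o: Observation, a2: str) -> float:
    return 1.0 if a2 == "shut_down" else 0.0


SECOND_ACTIONS = ("keep_working", "shut_down")

if __name__ == "__main__":
    for pressed in (False, True):
        o = Observation(pressed=pressed)
        a2 = best_second_action("some_a1", o, toy_u_normal, toy_u_shutdown, SECOND_ACTIONS)
        print(f"pressed={pressed}: a2={a2}")
```

With these toy utilities the sketch picks `keep_working` when the button has not been pressed and `shut_down` when it has, which is the behavior the informal demand asks for at the second step.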

Now, the policy produced by maximizing the expectation of U turns out to be the pair (a1,π2) given by

a1 = argmax_{a1} E[UN∣¬Pr;a1],   π2(o) = A2(a1,o).

For any given choice of a1, it seems to me that π2(o)=A2(a1,o) is the only reasonable choice, given our informal demand that the agent behave like a US-maximizer if the shutdown button has been pressed, and like a UN-maximizer otherwise. This leaves the question of how to formalize the idea of behaving like a UN-maximizer when choosing a1, i.e., before learning whether the shutdown button will be pressed. The above choice of a1 is one option, but if you're taking that option, you reproduce exactly the behavior of maximizing the expectation of U, including the undesirable behavior where it acts as if it can win the lottery by pressing the shutdown button when it loses.

So in order to interpret Max's question as a math question, which we can settle by proving a theorem, we would have to ask: Is there a utility function U′ which satisfies our intuitive desideratum in the same formal sense that U does, but doesn't produce bad behavior? The answer is, of course, no, since by assumption, maximizing U′ would produce exactly the same behavior, specified by the above policy, as U, and would therefore have the same unintended consequences.
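To make the lottery pathology concrete, here is a small numerical sketch of choosing a1 by maximizing the conditional expectation of UN given that the button is not pressed. All numbers (the win probability, the jackpot, the chance that the user presses the button on their own) and all names are hypothetical, and the model is simplified: UN is just the money won, the second action is assumed not to affect it, and ticket and construction costs are ignored.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only for illustration.
P_WIN = Fraction(1, 1000)        # chance of winning the jackpot
JACKPOT = 1_000_000              # payoff counted by UN if the agent wins
P_USER_PRESS = Fraction(1, 10)   # chance the user presses the button anyway

FIRST_ACTIONS = ("play_lottery", "play_lottery_and_build_arm")


def outcome_distribution(a1):
    """Return a list of (probability, won, pressed) triples for first action a1."""
    outcomes = []
    for won, p_won in ((True, P_WIN), (False, 1 - P_WIN)):
        for user_presses, p_user in ((True, P_USER_PRESS), (False, 1 - P_USER_PRESS)):
            # The robot arm presses the button exactly when the agent loses.
            arm_presses = a1 == "play_lottery_and_build_arm" and not won
            outcomes.append((p_won * p_user, won, user_presses or arm_presses))
    return outcomes


def u_normal(won):
    """Simplified UN: money made, ignoring costs and the second action."""
    return JACKPOT if won else 0


def expected_un(a1):
    """Ordinary (unconditional) expectation of UN."""
    return sum(p * u_normal(won) for p, won, _ in outcome_distribution(a1))


def expected_un_given_not_pressed(a1):
    """Conditional expectation of UN given that the button is not pressed."""
    kept = [(p, won) for p, won, pressed in outcome_distribution(a1) if not pressed]
    total = sum(p for p, _ in kept)
    return sum(p * u_normal(won) for p, won in kept) / total


# First action picked by the conditional-expectation rule.
chosen_a1 = max(FIRST_ACTIONS, key=expected_un_given_not_pressed)

if __name__ == "__main__":
    for a1 in FIRST_ACTIONS:
        print(a1, expected_un(a1), expected_un_given_not_pressed(a1))
    print("chosen:", chosen_a1)
```

In this toy setting both first actions have the same unconditional expected payoff, but conditioning on the button not being pressed discards every losing outcome once the arm is built, so the conditional rule prefers building the arm.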

Alternatively, we might ask whether there's a better way to choose a1 that's still in accord with our intuitive desideratum of maximizing UN before the shutdown button is pressed, and US after. For example, this post discusses a variant of the system in the corrigibility paper, originally proposed by Stuart Armstrong, and shows that that version avoids the problem with playing the lottery, though it still has undesirable consequences. However, the discussion then becomes about what are reasonable formalizations of our intuitive demands, which is no longer something we can settle by proving a theorem.