Corrigibility thoughts I: caring about multiple things

Stuart_Armstrong

This is the first of three articles about limitations and challenges in the concept of corrigibility (see articles 2 and 3).

The desiderata for corrigibility are:

1. A corrigible agent tolerates, and preferably assists, its operators in their attempts to alter or shut down the agent.
2. A corrigible agent does not attempt to manipulate or deceive its operators.
3. A corrigible agent has incentives to repair safety measures (such as shutdown buttons, tripwires, or containment tools) if they break, or at least notify its operators in the event of a breakage.
4. A corrigible agent preserves its corrigibility, even as it creates new sub-systems or sub-agents, even if it undergoes significant self-modification.

In this post, I'll be looking more at point 4. A summary of the result will be:

Unless giving the AI extra options can reduce expected utility, the AI must care about every possible utility at least a bit.

Some of the results are formal, but the boundaries of the model are very unclear, so the warning in this post should always be borne in mind.

Note that the indifference agents fail to be fully corrigible (they don't create corrigible subagents) and they also don't care about the other possible utilities before being changed (as this is a point of indifference).

Agents versus non-agents

First I'll present a cleaner version of an old argument. Basically, it seems that defining what counts as a sub-agent or sub-system is tricky or impossible.

Consider, for instance, a paperclip maximiser that may get corrected into a staple maximiser at a later date. The AI has some income, and may buy a large proportion of shares in either General Paperclips (GP) or Staples United (SU). Assume that the best way of promoting the use of one of these tools is to take over the company that makes it.

There are two scenarios; in each scenario, the AI has one of two choices.

In scenario 1, the AI has choice A: it buys stocks in GP, but cannot make further trades (in particular, it can't sell its stocks and buy others). It also has choice B: it retains flexibility, and can sell its GP stocks at some later date to buy stocks in SU.
In scenario 2, the AI delegates its investment decisions to a subagent. Under choice A, the subagent is a pure paperclip maximiser. Under choice B, the subagent is corrigible, and can be changed into a staple maximiser.

Clearly 1A and 2A are equivalent, while 1B and 2B are also equivalent. This trick can be extended: in most toy models, it's possible to replace a set of options with a choice of sub-agents, and vice-versa.
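The equivalence can be sketched in a few lines of code. This is only a toy illustration: the function names, the goal labels and the mapping from goals to companies are my own assumptions, not part of the argument above.

```python
# Toy model (hypothetical names): which company's stock ends up being held,
# given the AI's initial choice and the goal it has at trading time.
BEST_STOCK = {"paperclips": "GP", "staples": "SU"}


def scenario_1(choice, goal_at_trade_time):
    """The AI trades for itself."""
    if choice == "A":
        # Choice A: GP stock is bought and no further trades are possible.
        return "GP"
    # Choice B: flexibility is retained, so the final holding tracks the current goal.
    return BEST_STOCK[goal_at_trade_time]


def scenario_2(choice, goal_at_trade_time):
    """The AI delegates to a sub-agent."""
    if choice == "A":
        # Choice A: the sub-agent is a pure paperclip maximiser and ignores correction.
        subagent_goal = "paperclips"
    else:
        # Choice B: the sub-agent is corrigible and takes on the corrected goal.
        subagent_goal = goal_at_trade_time
    return BEST_STOCK[subagent_goal]


def scenarios_agree():
    """True if both scenarios give the same holding for every choice and goal."""
    return all(
        scenario_1(choice, goal) == scenario_2(choice, goal)
        for choice in ("A", "B")
        for goal in ("paperclips", "staples")
    )
```

In this toy model, `scenarios_agree()` holds because the only thing that matters is the mapping from (choice, later goal) to final outcome, and that mapping is the same whether or not a sub-agent is involved.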

Thus point 4, in the real world, cannot be merely about subagents and subsystems, but about maintaining flexibility for its future decisions.

Burn the (rest of) the world

Let's now consider giving the corrigible agent an option C: it gets slightly more shares in GP than otherwise. But, in exchange, SU is destroyed, along with most of the rest of the world. This is the best option for paperclip maximisation, but the absolute worst option for staples and anything else.

Also assume that the AI's operators are virtually certain (99.9999...9% sure) to be changing it into a staple maximiser. This means that taking option C is dramatically worse than option B, from the perspective of expected utility.

But the AI will only prefer option B if it "cares" about staples, at least a little bit, from the very beginning. We've shown that there is no clear distinction between subagent actions and general actions, so the agent has to have a general care for staples, rather than only caring about B in the context of subagents.

Thus, if we want to avoid corrigible agents getting worse expected outcomes when we give them extra "option Cs", or if we just want to rule out "option Cs" in the first place, then the AI needs to care about staples (or any possible future utility), from the very beginning.

Maintaining flexibility

You could argue that what's important about the agent is maintaining flexibility, and that option B is flexible, while options A and C are not. Thus the agent doesn't need to care about staples in advance, just to have a general preference for flexibility.

That doesn't seem sufficient. Let's add another, flexible, option D, and put numbers on the whole thing. If U_p is the paperclip maximising utility and U_s is the staple maximising one, then:

Option A: U_p=100, U_s=0
Option B: U_p=100 or 0, U_s=0 or 100
Option C: U_p=101, U_s=-1,000,000
Option D: U_p=101 or 0, U_s=-1,000,000 or -999,000

The "or" denotes options the AI has kept open. Here D is the most flexible option, allowing the greatest range of outcomes in terms of value. Yet it's clearly a disaster for expected utility if changes from U_p to U_s are likely. If we want to rule out the AI choosing D, then we need it to actively value U_s at least a bit, not just have a desire for flexibility.
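To make the comparison concrete, here is a minimal sketch using the numbers above. The modelling choices are my own assumptions, not part of the original argument: whichever utility the AI ends up with picks its favourite among the outcomes left open; "flexibility" is measured as the spread of attainable U_p values plus the spread of attainable U_s values; and "caring about U_s a bit" is a small weight `eps` on the best attainable U_s. All function names are hypothetical.

```python
# Each option maps to the outcomes it keeps open, as (U_p, U_s) pairs.
OPTIONS = {
    "A": [(100, 0)],
    "B": [(100, 0), (0, 100)],
    "C": [(101, -1_000_000)],
    "D": [(101, -1_000_000), (0, -999_000)],
}


def expected_utility(name, p_change):
    """Expected realised utility if the AI is switched to U_s with probability p_change.

    Assumption: whichever utility the AI ends up with picks its best open outcome.
    """
    outcomes = OPTIONS[name]
    best_u_p = max(u_p for u_p, _ in outcomes)
    best_u_s = max(u_s for _, u_s in outcomes)
    return (1 - p_change) * best_u_p + p_change * best_u_s


def flexibility(name):
    """Assumed measure: spread of attainable U_p plus spread of attainable U_s."""
    u_ps = [u_p for u_p, _ in OPTIONS[name]]
    u_ss = [u_s for _, u_s in OPTIONS[name]]
    return (max(u_ps) - min(u_ps)) + (max(u_ss) - min(u_ss))


def most_flexible():
    """Option a pure flexibility-lover would pick under the assumed measure."""
    return max(OPTIONS, key=flexibility)


def agent_score(name, eps):
    """Score for an agent that values U_p fully and U_s with a small weight eps."""
    outcomes = OPTIONS[name]
    best_u_p = max(u_p for u_p, _ in outcomes)
    best_u_s = max(u_s for _, u_s in outcomes)
    return best_u_p + eps * best_u_s


def chosen_option(eps):
    """Option picked by the eps-weighted agent (ties go to the first option listed)."""
    return max(OPTIONS, key=lambda name: agent_score(name, eps))
```

Under these assumptions, `most_flexible()` picks D, even though `expected_utility("D", 0.999999)` is hugely negative while `expected_utility("B", 0.999999)` stays close to 100. An agent with `eps = 0` is indifferent between C and D and prefers both to B, whereas a weight as small as `eps = 1e-5` is already enough for `chosen_option` to land on B.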

I'd like to note that "caring about U_s a bit" can also be read as "small probability of caring about U_s a lot".

Actually, a small probability of caring about U_s a bit can suffice.
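One way to spell this out, as a sketch: suppose (my assumption, with hypothetical symbols q and k) that with probability q the agent's utility is U_p + k U_s, and otherwise it is just U_p. Its expected utility is then:

```latex
\mathbb{E}[U] = (1-q)\,U_p + q\,(U_p + k\,U_s) = U_p + qk\,U_s
```

So such an agent ranks options exactly like one with a fixed weight \epsilon = qk on U_s. Taking the option numbers listed earlier, and assuming each utility gets its best open outcome, B beats D whenever 100 + 100\epsilon > 101 - 999000\epsilon, that is, whenever \epsilon > 1/999100, or roughly 10^{-6}. Both q and k can therefore be small.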

Maybe a tangent, but: Are we humans corrigible?

I think about this a lot -- it seems that no matter what I do, I'm not able to prevent a sufficiently motivated attacker from ending my life.

I'm positive. Humans strongly update their utility function based on the morality of the people around them. Do you ever find yourself a bit paralyzed in a new social environment because you don't know about the local customs?

On the other hand, humans are also notorious for trying to fix someone's problem before properly listening to them. Hmm.

Does "corrigible" mean the same thing as "slave"? If an "operator" has the ability to change an agent's utility function, isn't it really the operator's function, rather than the agent's?

The technical definition for corrigibility being used here is thus: "We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences."

And yes, the basic idea is to make it so that the agent can be corrected by its operators after instantiation.

I think it matters what KIND of correction you're considering. If there's a term in the agent's utility function to understand and work toward things that humans (or specific humans) value, you could make a correction either by altering the weights or other terms of the utility function, or by a simple knowledge update.

Those feel very different. Are both required for "corrigibility"?

The "If there's a term in the agent's utility function to ... work toward things that humans ... value" part is the hard part. If you can figure out how to make it truly care what its operator wants, you've already solved a huge problem.

An agent would have to be corrigible even if you couldn't manage to make it care explicitly what its operator wants. We need some way of taking agents that explicitly don't care what their operators want, and making them not stop their operators from turning them off, despite the default incentives to prevent interference.

I'm not following. I think your definition of "care" is confusing me.

If you want an agent to care (have a term in its utility function) what you want, and if you can control its values, then you should just make it care what you want, not make it NOT care and then fix it later.

There is a very big gap between "I want it to care what I want, but I don't yet know what I want so I need to be able to tell it what I want later and have it believe me" and "I want it not to care what I want but I want to later change my mind and force it to care what I want".

"Just care what I want" is a separate, unsolved research problem. Corrigibility is an attempt to get an agent to simply not immediately kill its user even if it doesn't necessarily have a good model of what that user wants.

"Don't kill an operator" seems like something that can more easily be encoded into an agent than "allow operators to correct things they consider undesirable when they notice them".

In fact, even a perfectly corrigible agent with such a glaring initial flaw might kill the operator(s) before they can apply the corrections, not because it is resisting correction, but just because doing so furthers whatever other goals it may have.

You're exactly right, I think. IMO it may actually be easier to build an AI that can learn to want what some target agent wants, than to build an AI that lets itself be interfered with by some operator whose goals don't align with its own current goals.