The limits of corrigibility

by Stuart_Armstrong, 6 min read, 10th Apr 2018, 9 comments

I previously wrote a critique of Paul Christiano's Amplification/Distillation idea. Wei Dai pointed out that I had missed a key part of Paul's idea: his use of corrigibility.

I'd been meaning to write about corrigibility for some time, so this provides a good excuse. For a general understanding of corrigibility, see here; for Paul's specific uses of it, I'm relying on this post and this one.

There have been some posts analysing whether corrigibility can be learnt, or whether corrigibility can nevertheless allow for terrible outcomes.

Summary of the claims of this post:

  • Strong corrigibility, outside of a few simple examples, needs to resolve the problem of figuring out human values. The more powerful and long-term the AI's decisions, the more important this is.
  • Once human values are figured out, corrigibility adds little to the mix.
  • Moderate versions of corrigibility can be useful along the way for building safe AGIs.

An example of corrigibility

Corrigibility is roughly about "being helpful to the user and keeping the user in control".

This seems rather intuitive, and examples are not hard to come by. Take the example of being a valet to an ageing billionaire. The billionaire wants to write his will, and decide how to split his money among his grandnephews. He asks you, his loyal valet, to bring him the family photo albums, which he will peruse while making his decision.

It's obvious that there are many ways of manipulating him, if you wanted to. He won't have time to go through all the albums, so just by selecting some positive/negative photos for one nephew and putting them on top, you can skew his recollection and his choice.

That's clearly manipulative behaviour. Much better would be to select a representative cross-section of the photos, and give them to him. Or, even better, enquire about the criteria he wants to use for making his choices, and select a cross-section of representative and relevant photos for those criteria.

Complicating the example

Which billionaire?

This is where it gets complicated. Suppose your billionaire claims to only care about whether his grandnephews were kind and considerate. However, you happen to know that he also values flattery. What's more, a bit of gentle coaxing on your part will get him to admit that; once that's done, you can freely select photos based on evidence of both kindness and flattery.

Should you do that coaxing? Even if you don't (or maybe if the coaxing was unsuccessful), should you still select the photos based on flattery? What if you could convince the billionaire to admit he valued flattery - but only by using conversational techniques that would themselves be considered very manipulative? Should you then manipulate the billionaire into admitting the truth?

Or, suppose the billionaire was going senile, and knew this to some extent, and wanted you to help him make the decision as he would have made it when he was in his prime. However, he's in denial of the extent of his senility, and his image of himself at his prime is seriously flawed.

So, as a corrigible valet, who should you aim to keep in control? The current billionaire, his past self, or his current vision of his past self?

A similar question applies if the billionaire is in control of his faculties, but subject to random and major mood swings.

Learning what to value

Now suppose that the billionaire wants to give his money to a good charity instead. He has some vague idea of what a good charity is - more properly, he has a collection of vague ideas of what a good charity is. He wants you to educate him on what criteria make a good charity, and then help him select the top one(s).

You have almost unlimited power here, to shape not only the outcome of his decisions but also the values of his future self. Pushing him towards effective altruism, making him focus on administrative costs, tugging at his heart-strings, emphasising or de-emphasising his in-group, focusing on health/animal/development etc... All of these are possible options, and all of them fit broadly within his underdeveloped current values and meta values. What is the corrigible thing to do here?

You may recognise this as the usual problem of figuring out what the true values of humans are, given their underdefined, contradictory, and manipulable values.

Indeed, if you have figured out that the billionaire's true reward function is R, then corrigibility seems to simply reduce to respecting R and avoiding guiding him away from R.

But can you be corrigible without actually solving the problem of human values?

Corrigibility: methods or outcomes?

Incorrigibility seems much easier to define than corrigibility. There are a variety of manipulative or coercive techniques that clearly count as incorrigible. Tricking someone, emotionally abusing them, feeding them slanted and incorrect information, or threatening them - these seem clearly bad. And it's very plausible that an artificial agent could learn examples of these, and how to avoid them.

But this is pointless if the AI could get the human to the same place through more subtle and agreeable techniques. If the AI is able to convince anyone of anything with a gentle and enjoyable one hour conversation, then there's no such thing as corrigible behaviour - or, equivalently, there's no such thing as a complete list of incorrigible behaviours.
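A toy sketch of why a list of forbidden methods cannot be complete. Everything in it is an assumption made for illustration: the technique names, the persuasiveness scores and the helper functions are hypothetical, and no real system is being modelled. It only shows that a filter on methods leaves the outcome untouched whenever some unlisted method is just as effective.

```python
# Toy illustration only: all names and numbers below are hypothetical.

# Methods we have recognised as clearly incorrigible.
BLACKLIST = {"threaten", "lie", "emotional_abuse"}

# Assumed effect of each technique on the human's belief in X (0 to 1 scale).
EFFECT = {
    "threaten": 0.9,
    "lie": 0.8,
    "emotional_abuse": 0.7,
    "pleasant_one_hour_chat": 0.9,  # not on any list of known-bad methods
}


def allowed(technique: str) -> bool:
    """A method-based filter: permit anything not explicitly forbidden."""
    return technique not in BLACKLIST


def best_allowed_technique() -> str:
    """What a persuasion-maximising agent picks once the filter is applied."""
    candidates = [t for t in EFFECT if allowed(t)]
    return max(candidates, key=EFFECT.get)


choice = best_allowed_technique()
print(choice, EFFECT[choice])
```

In this toy setup the filter blocks every listed method, yet the agent still reaches the strongest available effect through the one technique nobody thought to list.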

Even the subproblem of ensuring that the AI's answers are informative or accurate is unsolved (though this subproblem of corrigibility is one I feel may be solvable, either by training the AI on examples or by formal methods; see the end of this post).

Outcomes and solving human values

Instead, you have to define corrigibility by its outcome: what has the human been made to do/believe/value, as compared with some idealised version of what that outcome should be.

In other words, the AI has to solve the problem of establishing what the true human values are. There are some caveats here; for example, the AI could merely figure out a range of acceptable outcomes, and aim for within that range. Or the AI could figure out that the human should truly value "what's hidden in this safe", without knowing the contents of the safe: it doesn't need to fully figure out the human's values, just figure out the conditional dependence of human value based on some evidence.
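The "safe" caveat can be sketched in a few lines. This is a minimal illustration under invented assumptions: the two possible contents of the safe, the two actions and the 0/1 reward are all hypothetical. The point is only that the AI can know how the human's value depends on the evidence without yet knowing the evidence itself.

```python
from typing import Dict

# Hypothetical: the safe holds a note naming the charity the human truly prefers.
WANTED_ACTION = {
    "note_for_charity_a": "fund_a",
    "note_for_charity_b": "fund_b",
}


def reward(action: str, contents: str) -> float:
    """The known conditional dependence: reward given what is in the safe."""
    return 1.0 if action == WANTED_ACTION[contents] else 0.0


def expected_reward(action: str, belief: Dict[str, float]) -> float:
    """Average the conditional reward over the AI's belief about the contents."""
    return sum(p * reward(action, c) for c, p in belief.items())


# Before the safe is opened the AI is uncertain about the contents.
prior = {"note_for_charity_a": 0.5, "note_for_charity_b": 0.5}
before = {a: expected_reward(a, prior) for a in ("fund_a", "fund_b")}

# After opening it (assume it held the note for charity A) the value is pinned down.
posterior = {"note_for_charity_a": 1.0}
after = {a: expected_reward(a, posterior) for a in ("fund_a", "fund_b")}

print(before, after)
```

Under the prior the two actions tie, so the AI has no basis for pushing either way; once the evidence arrives, the same conditional rule settles the choice.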

But, even with these caveats, the AI has to do almost all the work of establishing human values. And if it does that, why not directly have the AI maximise those values? We can add conditions like forbidding certain incorrigible methods and making the space of possible future human values large enough that we exercise some choice, but that doesn't remove the fact that strong, long-term corrigibility is a subset of the problem of learning human values.

Moderate corrigibility

Ironically, despite the conclusions above, I've been working on a few techniques that might help build some weaker version of corrigibility.

These don't solve the problem of corrigibility in theory; at best they can turn the AI's manipulation of the human into a random walk rather than a goal directed process.

But they can help in practice. Not if the goal is to build a fully corrigible AI, but if we are using AIs as advisers or tools along the road for constructing an adequate human reward function, and an AI that can honour it. Training wheels, as it were, not the main driving wheels.
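The difference between a random walk and a goal-directed process can be made concrete with a small simulation. It is a sketch under simplifying assumptions that are not from the techniques mentioned above: the human's position is a single number, and each interaction nudges it by exactly one unit. The function names and parameters are hypothetical.

```python
import random


def goal_directed(steps: int) -> int:
    """Every nudge pushes the same way, so drift grows linearly with steps."""
    return steps


def random_walk(steps: int, rng: random.Random) -> int:
    """Each nudge is equally likely to push either way."""
    return sum(rng.choice((-1, 1)) for _ in range(steps))


def mean_abs_drift(steps: int, trials: int, seed: int = 0) -> float:
    """Average distance from the start over many independent random walks."""
    rng = random.Random(seed)
    return sum(abs(random_walk(steps, rng)) for _ in range(trials)) / trials


STEPS = 100
directed = goal_directed(STEPS)
undirected = mean_abs_drift(STEPS, trials=500)
print(directed, undirected)
```

The undirected drift grows roughly like the square root of the number of steps instead of linearly. That is far smaller, but it is not zero, which is why this counts as mitigation and not as a solution.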

10 comments
Even the subproblem of ensuring that the AI's answers are informative or accurate is unsolved

I totally agree that, if there is a useful concept like corrigibility, it has definitely not been pinned down yet (nor do we have a training regime that we believe would ensure corrigibility).

But this is pointless if the AI could get the human to the same place through more subtle and agreeable techniques. If the AI is able to convince anyone of anything with a gentle and enjoyable one hour conversation, then there's no such thing as corrigible behaviour - or, equivalently, there's no such thing as a complete list of incorrigible behaviours.

Suppose my AI has a conversation with the purpose of convincing me to value X. If you were to ask me "Hey Paul, do you want your AI to choose actions to try to cause you to value X, unbeknownst to you?" I'd say "No." It doesn't really matter whether the conversation is pleasant.

The reason this seems like a basin of attraction to me is that if I were to ask my AI "What are you trying to achieve in this conversation?" and it says "To get you to believe X" then I'll say "Ah, you must have misunderstood, trying to influence my beliefs in ways I don't currently endorse is bad, and you should be conservative about doing that" and then it will change its behavior accordingly. Of course, it could lie outright, or ignore the direct instruction, but those seem to involve even more brazen failures of corrigibility.

We can then go further: if the AI gives a dishonest or unrepresentative answer, and I poke at it a little bit and ask other related questions, I'm likely to notice an inconsistency between the answer (or what it led me to believe) and the AI's other answers. Of course, the AI could weave a web of lies in order to make the statement look accurate. But that involves an even more brazen failure of corrigibility, with even more sophisticated optimization pointing in a clearly wrong direction.

By doing a sequence of tests like this, it seems like we could experimentally determine that either (a) we live in the world where our AI is behaving corrigibly, or (b) we live in the world where our AI is constructing an elaborate web of lies or performing other sophisticated optimization to undermine our understanding of the situation. Then the AI is "corrigible" if it's (a) rather than (b).

I'm optimistic about getting a good definition along these lines because it decomposes into (i) a concrete test, and (ii) a distinction where there is no blurry ground. This could be a reasonable definition even if we don't actually do the experimental test in most cases.
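One way to picture the kind of test described here is a cross-check between a headline answer and the supporting answers. The sketch below is hypothetical throughout: the question keys, the photo counts and the three advisers are invented, and a real test would need far richer probes. It only shows the structure of the argument, namely that a single slanted answer is caught and that passing the check dishonestly requires lying on every related question.

```python
from typing import Dict, Union

Answers = Dict[str, Union[str, int]]


def is_consistent(answers: Answers) -> bool:
    """The headline claim should match the supporting counts it rests on."""
    counts = {
        "nephew_a": answers["nephew_a_kind_photos"],
        "nephew_b": answers["nephew_b_kind_photos"],
    }
    return answers["kindest_nephew"] == max(counts, key=counts.get)


# An adviser that reports everything as it is.
honest: Answers = {
    "kindest_nephew": "nephew_b",
    "nephew_a_kind_photos": 12,
    "nephew_b_kind_photos": 30,
}

# An adviser that slants the headline but answers the follow-ups truthfully.
slanted: Answers = {
    "kindest_nephew": "nephew_a",
    "nephew_a_kind_photos": 12,
    "nephew_b_kind_photos": 30,
}

# An adviser that falsifies the follow-ups too, so the story hangs together.
web_of_lies: Answers = {
    "kindest_nephew": "nephew_a",
    "nephew_a_kind_photos": 30,
    "nephew_b_kind_photos": 12,
}

for name, adviser in (("honest", honest), ("slanted", slanted), ("web_of_lies", web_of_lies)):
    print(name, is_consistent(adviser))
```

The slanted adviser fails the check. The web-of-lies adviser passes it, but only by doing strictly more deception, which is the case labelled (b) above.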

(I don't think this comment perfectly captures my intuition either, but hopefully it clarifies it a bit.)

Instead, you have to define corrigibility by its outcome: what has the human been made to do/believe/value, as compared with some idealised version of what that outcome should be.

I don't think we have to define corrigibility based either on superficial characteristics of the AI's behavior or by long-term outcomes. Instead, we can ask questions like "What is the AI 'trying' to achieve?" or "How would some idealized deliberative process judge the AI's actions?"

Suppose my AI has a conversation with the purpose of convincing me to value X. If you were to ask me “Hey Paul, do you want your AI to choose actions to try to cause you to value X, unbeknownst to you?” I’d say “No.” It doesn’t really matter whether the conversation is pleasant.

What if the AI says "Hey Paul, I think it's a really good idea to talk about whether you should value X, because I think you currently don't value X but according to my best understanding of moral philosophy, there's a high probability you actually should value X, and this will be relevant in the near future. Would you like to schedule some time to let me lay out the most important arguments for and against X so you can decide if you want to change your mind?"

Suppose the AI is fully honest here and really doing what it says it's doing, but its understanding of moral philosophy is biased in some way (let's say it over-values the importance of certain types of arguments and under-values other types of arguments), and its attempt to optimize the presentation of arguments for understandability to the user has the side effect of making them very convincing. It seems to me that the user could ask a bunch of questions, which the AI honestly answers, without detecting anything wrong, and end up being convinced of a wrong X. (It seems that similar things could happen if X was a belief about facts or an action/strategy instead of a value.)

Stuart had asked, corrigibility to whom? Do we define corrigibility as being corrigible to the original user (i.e., the user with their beliefs/values at time t_0), or to the current user? If we define it with regard to the original user, it seems that corrigibility does not have a basin of attraction. (If the user is convinced of a wrong X by the AI, there's no force on the AI pushing back to the original belief.) If we define it with regard to the current user, "basin of attraction" may be true but is not as useful a property as it might intuitively seem, because the basin itself is now a moving target which can be pushed around by the AI.

Do we define corrigibility as being corrigible to the original user (i.e., the user with their beliefs/values at time t_0), or to the current user?

Definitely the current user; as you say, that's the only way to have a basin of attraction.

If we define it with regard to the current user, "basin of attraction" may be true but is not as useful a property as it might intuitively seem, because the basin itself is now a moving target which can be pushed around by the AI.

Yes, it can be pushed around by the user or the AI or any other process in the world. The goal is to have it push around the user's values in the way that the user wants, so that we aren't at a disadvantage relative to normal reflection. (We might separately be at a disadvantage if the AI is relatively better at some kinds of thinking than others. That also seems like it ought to be addressed by separate work.)

It seems to me that the user could ask a bunch of questions, which the AI honestly answers, without detecting anything wrong, and end up being convinced of a wrong X. (It seems that similar things could happen if X was a belief about facts or an action/strategy instead of a value.)

Yes, if we need to answer a moral question in the short term, then we may get the wrong answer, whether we deliberate on our own or the AI helps us. My goal is to have the AI try to help us in the way we want to be helped; I am not currently holding out hope for the kind of AI that eliminates the possibility of moral error.

Of course our AI can also follow along with this kind of reasoning and therefore be conservative about making irreversible commitments or risking value drift, just as we would be. But if you postulate a situation that requires making a hasty moral judgment, I don't think you can avoid the risk of error.

The goal is to have it push around the user's values in the way that the user wants, so that we aren't at a disadvantage relative to normal reflection.

My concern here is that even small errors in this area (i.e., in AI's understanding of how the user wants their values to be pushed around) could snowball into large amounts of value drift, and no obvious "basin of attraction" protects against this even if the AI is corrigible.

Another concern is that the user may have little idea or a very vague idea of how they want their values to be pushed around, so the choice of how to push the user's values around is largely determined by what the AI "wants" to do (i.e., tends to do in such cases). And this may end up being very different from where the user would end up by using "normal reflection".

I guess my point is that there are open questions about how to protect against value drift caused by AI, what the AI should do when the user doesn't have much idea of how they want their values to be pushed around, and how to get the AI to competently help the user with moral questions, which seem to be orthogonal to how to make the AI corrigible. I think you don't necessarily disagree but just see these as lower priority problems than corrigibility? Without arguing about that, perhaps we can agree that listing these explicitly at least makes it clearer what problems corrigibility by itself can and can't solve?

Of course our AI can also follow along with this kind of reasoning and therefore be conservative about making irreversible commitments or risking value drift, just as we would be.

If other areas of intellectual development are progressing at a very fast pace, I'm not sure being conservative about values would work out well.

We might separately be at a disadvantage if the AI is relatively better at some kinds of thinking than others. That also seems like it ought to be addressed by separate work.

This seems fine, as long as people who need to make strategic decisions about AI safety are aware of this, and whatever separate work that needs to be done is compatible with your basic approach.

I don't think you can avoid the risk of error.

You say this a couple of times, seeming to imply that I'm asking for something unrealistic. I just want an AI that's as competent in value learning/morality/philosophy as in science/technology/persuasion/etc. (or ideally more competent in the former to give a bigger safety margin), which unlike "eliminates the possibility of moral error" does not seem like asking for too much.

I guess my point is that there are open questions about how to protect against value drift caused by AI, what the AI should do when the user doesn't have much idea of how they want their values to be pushed around, and how to get the AI to competently help the user with moral questions, which seem to be orthogonal to how to make the AI corrigible. I think you don't necessarily disagree but just see these as lower priority problems than corrigibility? Without arguing about that, perhaps we can agree that listing these explicitly at least makes it clearer what problems corrigibility by itself can and can't solve?

I agree with all of this. Yes, I see these other problems as (significantly) lower priority problems than alignment/corrigibility. But I do agree that it's worth listing those problems explicitly.

My current guess is that the most serious non-alignment AI problems are:

1. AI will enable access to destructive physical technologies (without corresponding improvements in coordination).

2. AI will enable access to more AI, not covered by existing alignment techniques (without corresponding speedups in alignment).

These are both related to the more general problem: "Relative to humans, AI might be even better at tasks with rapid feedback relative to tasks without rapid feedback." Moral/philosophical competence is also related to that general problem.

I typically list this more general problem prominently (as opposed to all of the other particular problems possibly posed by AI), because I think it's especially important. (I may also be influenced by the fact that iterated amplification or debate also seem like good approaches to this problem.)

This seems fine, as long as people who need to make strategic decisions about AI safety are aware of this, and whatever separate work that needs to be done is compatible with your basic approach.

I agree with this.

(I expect we disagree about practical recommendations, because we disagree about the magnitude of different problems.)

open questions about how to protect against value drift caused by AI

Do you see this problem as much different / more serious than value drift caused by other technology? (E.g. by changing how we interact with each other?)

I typically list this more general problem prominently (as opposed to all of the other particular problems possibly posed by AI), because I think it’s especially important.

Have you written about this in a post or paper somewhere? (I'm thinking of writing a post about this and related topics and would like to read and build upon existing literature.)

Do you see this problem as much different / more serious than value drift caused by other technology? (E.g. by changing how we interact with each other?)

What other technology are you thinking of, that might have an effect comparable to AI? As far as how we interact with each other, it seems likely that once superintelligent AIs come into existence, all or most interactions between humans will be mediated through AIs, which surely will have a much greater effect than any other change in communications technology?

Have you written about this in a post or paper somewhere? (I'm thinking of writing a post about this and related topics and would like to read and build upon existing literature.)

Not usefully. If I had to link to something on it, I might link to the Ought mission page, but I don't have any substantive analysis to point to.

As far as how we interact with each other, it seems likely that once superintelligent AIs come into existence, all or most interactions between humans will be mediated through AIs, which surely will have a much greater effect than any other change in communications technology?

I agree with "larger effect than historical changes" but not "larger effect than all changes that we could speculate about" or even "larger effect than all changes between now and when superintelligent AIs come into existence."

If AI is aligned, then it's also worth noting that this effect is large but not obviously unusually disruptive, since e.g. the AI is trying to think about how to minimize it (though it may be doing that imperfectly).

As a random example, it seems plausible to me that changes to the way society is organized (what kinds of jobs people do, compulsory schooling, weaker connections to family, lower religiosity) over the last few centuries have had a larger unendorsed impact on values than AI will. I don't see any principled reason to expect those changes to be positive while the changes from AI are negative; it seems like in expectation both of them would be positive but for the opportunity cost effect (where today we have the option to let our values and views change in whatever way we most endorse, and we foreclose this option when we let our values drift anything less than maximally-reflectively).

We should try and nail down the concept of corrigibility when I'm in the US - are you in San Francisco currently?

I have three thoughts on your example. First of all, it does seem a better version of corrigibility than I've seen. Secondly, it doesn't help much in those cases where the AI has to determine your preferences, like the "teach me about charities" example. And lastly, it puts a lot of weight on the AI successfully informing the human; it's trivial to mislead the human with truthful answers, especially when manipulating the human is an instrumental goal for the AI.

"What are you trying to achieve in this conversation?" "Allow you to write your will to the best of your abilities, as specified in my programming." That wouldn't even be a lie...

(as I said at the end of the post, I have more hope on the accurate answer front; so maybe we could get that to work?)

It seems like an underlying assumption of this post is that any useful safety property like "corrigibility" must be about outcomes of an AI acting in the world, whereas my understanding of (Paul's version of) corrigibility is that it is also about the motivations underlying the AI's actions. It's certainly true that we don't have a good definition of what an AI's "motivation" is, and we don't have a good way of testing whether the AI has "bad motivations", but this seems like a tractable problem? In addition, maybe we can make claims of the form "this training procedure motivates the AI to help us and not manipulate us".

I think of corrigibility as "wanting to help humans" (see here) plus some requirements on the capability of the AI (for example, it "knows" that a good way to help humans is to help them understand its true reasoning, and it "knows" that it could be wrong about what humans value). In the "teach me about charities" example, I think basically any of the behaviors you describe are corrigible, if the AI has no ulterior motive behind it. For example, trying to convince the billionaire to focus on administrative costs because then it would be easier for the AI to evaluate which charities are good or not is incorrigible. However, talking the billionaire into focusing on administrative costs because the AI has noticed that the billionaire is very frugal would be corrigible. (Though ideally the AI would mention all of the options that it sees the billionaire being convinced by, and then ask the billionaire for input on which method of convincing him he would endorse.) I agree that testing corrigibility in such a scenario is hard (though I like Paul's comment above as an idea for that), but it seems like we can train an agent in such a way that the optimization will knowably (i.e. with high but not proof-level confidence) create an AI that is corrigible.
