The limits of corrigibility

by Stuart_Armstrong, 10th Apr 2018



I previously wrote a critique of Paul Christiano's Amplification/Distillation idea. Wei Dai pointed out that I had missed a key part of Paul's idea: his use of corrigibility.

I'd been meaning to write about corrigibility for some time, so this provides a good excuse. For a general understanding of corrigibility, see here; for Paul's specific uses of it, I'm relying on this post and this one.

There have been some posts analysing whether corrigibility can be learnt, or whether corrigibility can nevertheless allow for terrible outcomes.

Summary of the claims of this post:

  • Strong corrigibility, outside of a few simple examples, needs to resolve the problem of figuring out human values. The more powerful and long-term the AI's decisions, the more important this is.
  • Once human values are figured out, corrigibility adds little to the mix.
  • Moderate versions of corrigibility can be useful along the way for building safe AGIs.

An example of corrigibility

Corrigibility is roughly about "being helpful to the user and keeping the user in control".

This seems rather intuitive, and examples are not hard to come by. Take the example of being a valet to an ageing billionaire. The billionaire wants to write his will, and decide how to split his money among his grandnephews. He asks you, his loyal valet, to bring him the family photo albums, which he will peruse while making his decision.

It's obvious that there are many ways of manipulating him, if you wanted to. He won't have time to go through all the albums, so just by selecting some positive/negative photos for one grandnephew and putting them on top, you can skew his recollection and his choice.

That's clearly manipulative behaviour. Much better would be to select a representative cross-section of the photos, and give them to him. Or, even better, enquire about the criteria he wants to use for making his choices, and select a cross-section of representative and relevant photos for those criteria.

Complicating the example

Which billionaire?

This is where it gets complicated. Suppose your billionaire claims to only care about whether his grandnephews were kind and considerate. However, you happen to know that he also values flattery. What's more, a bit of gentle coaxing on your part will get him to admit that; once that's done, you can freely select photos based on evidence of both kindness and flattery.

Should you do that coaxing? Even if you don't (or if the coaxing is unsuccessful), should you still select the photos based on flattery? What if you could convince the billionaire to admit he valued flattery - but only by using conversational techniques that would themselves be considered very manipulative? Should you then manipulate the billionaire into admitting the truth?

Or suppose the billionaire is going senile, knows this to some extent, and wants you to help him make the decision as he would have made it in his prime. However, he's in denial about the extent of his senility, and his image of himself in his prime is seriously flawed.

So, as a corrigible valet, whom should you aim to keep in control? The current billionaire, his past self, or his current vision of his past self?

A similar question applies if the billionaire is in control of his faculties, but subject to random and major mood swings.

Learning what to value

Now suppose that the billionaire wants to give his money to a good charity instead. He has some vague idea of what a good charity is - more properly, he has a collection of vague ideas of what a good charity is. He wants you to educate him on what criteria make a good charity, and then help him select the top one(s).

You have almost unlimited power here, to shape not only the outcome of his decisions but also the values of his future self. Pushing him towards effective altruism, making him focus on administrative costs, tugging at his heart-strings, emphasising or de-emphasising his in-group, focusing on health/animals/development, etc. All of these are possible options, and all of them fit broadly within his underdeveloped current values and meta-values. What is the corrigible thing to do here?

You may recognise this as the usual problem of figuring out what the true values of humans are, given their underdefined, contradictory, and manipulable values.

Indeed, if you have figured out that the billionaire's true reward function is R, then corrigibility seems to simply reduce to respecting R and avoiding guiding him away from R.

But can you be corrigible without actually solving the problem of human values?

Corrigibility: methods or outcomes?

Incorrigibility seems much easier to define than corrigibility. There are a variety of manipulative or coercive techniques that clearly count as incorrigible. Tricking someone, emotionally abusing them, feeding them slanted and incorrect information, or threatening them - these seem clearly bad. And it's very plausible that an artificial agent could learn examples of these, and how to avoid them.

But this is pointless if the AI could get the human to the same place through more subtle and agreeable techniques. If the AI is able to convince anyone of anything with a gentle and enjoyable one-hour conversation, then there's no such thing as corrigible behaviour - or, equivalently, there's no such thing as a complete list of incorrigible behaviours.

Even the subproblem of ensuring that the AI's answers are informative or accurate is unsolved (though this is one subproblem of corrigibility that I feel may be solvable, either by training the AI on examples or by formal methods; see the end of this post).

Outcomes and solving human values

Instead, you have to define corrigibility by its outcome: what the human has been made to do/believe/value, as compared with some idealised version of what that outcome should be.

In other words, the AI has to solve the problem of establishing what the true human values are. There are some caveats here; for example, the AI could merely figure out a range of acceptable outcomes, and aim for somewhere within that range. Or the AI could figure out that the human should truly value "what's hidden in this safe", without knowing the contents of the safe: it doesn't need to fully figure out the human's values, just the conditional dependence of those values on some evidence.

But, even with these caveats, the AI has to do almost all the work of establishing human values. And if it does that, why not directly have the AI maximise those values? We can add conditions like forbidding certain incorrigible methods and making the space of possible future human values large enough that we exercise some choice, but that doesn't remove the fact that strong, long-term corrigibility is a subset of the problem of learning human values.
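As a rough illustration of the two caveats above (an acceptable range of outcomes, and values that depend on the unseen contents of a safe), here is a minimal Python sketch. Everything in it is a hypothetical toy: the possible safe contents, the outcome names, the scores, and the threshold are invented for the example, and nothing here is a proposal for how an actual AI would represent human values.

```python
from typing import Dict, Set

# Hypothetical toy model: the human's "true" reward depends on what is in the safe.
# Keys are possible safe contents; values score each candidate outcome.
REWARD_GIVEN_SAFE: Dict[str, Dict[str, float]] = {
    "letter_favouring_health": {"health": 1.0, "animals": 0.2, "development": 0.5},
    "letter_favouring_animals": {"health": 0.3, "animals": 1.0, "development": 0.4},
}

OUTCOMES = ["health", "animals", "development"]


def expected_reward(outcome: str, belief: Dict[str, float]) -> float:
    """Expected reward of an outcome under a belief over the safe's contents."""
    return sum(
        prob * REWARD_GIVEN_SAFE[contents][outcome]
        for contents, prob in belief.items()
    )


def acceptable_outcomes(belief: Dict[str, float], threshold: float) -> Set[str]:
    """The range of outcomes whose expected reward clears an (assumed) threshold."""
    return {o for o in OUTCOMES if expected_reward(o, belief) >= threshold}


# Before the safe is opened, the AI only knows the conditional dependence.
prior = {"letter_favouring_health": 0.5, "letter_favouring_animals": 0.5}
# After opening it, the evidence pins the reward function down.
posterior = {"letter_favouring_animals": 1.0}

print(acceptable_outcomes(prior, threshold=0.55))
print(acceptable_outcomes(posterior, threshold=0.55))
```

Note that even in this toy, the table `REWARD_GIVEN_SAFE` has to come from somewhere: specifying how value depends on the evidence is most of the work of specifying the values.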

Moderate corrigibility

Ironically, despite the conclusions above, I've been working on a few techniques that might help build some weaker version of corrigibility.

These don't solve the problem of corrigibility in theory; at best they can turn the AI's manipulation of the human into a random walk rather than a goal-directed process.

But they can help in practice. Not if the goal is to build a fully corrigible AI, but if we are using AIs as advisers or tools along the road for constructing an adequate human reward function, and an AI that can honour it. Training wheels, as it were, not the main driving wheels.
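To give a feel for the difference between a random walk and a goal-directed process mentioned above, here is a small Python simulation. It is only a toy under strong assumptions: the human's values are collapsed to a single number, each interaction nudges that number by one step, and the `bias` parameter (a hypothetical name) stands in for how consistently the AI pushes in one direction. It does not implement any of the techniques referred to in this post.

```python
import random


def simulate_drift(steps: int, bias: float, seed: int) -> int:
    """Final displacement of a 1-D 'value position' after `steps` nudges.

    Each nudge is +1 with probability 0.5 + bias, otherwise -1.
    bias = 0.0 models undirected (random-walk) influence;
    bias > 0.0 models influence that pushes towards a goal.
    """
    rng = random.Random(seed)
    position = 0
    for _ in range(steps):
        position += 1 if rng.random() < 0.5 + bias else -1
    return position


def mean_abs_drift(steps: int, bias: float, trials: int) -> float:
    """Average absolute displacement over several independently seeded runs."""
    total = sum(abs(simulate_drift(steps, bias, seed)) for seed in range(trials))
    return total / trials


undirected = mean_abs_drift(steps=2000, bias=0.0, trials=40)
directed = mean_abs_drift(steps=2000, bias=0.1, trials=40)
print(f"undirected drift: {undirected:.1f}, directed drift: {directed:.1f}")
```

In this toy, undirected drift grows roughly with the square root of the number of interactions, while directed drift grows linearly. The human still ends up somewhere other than where they started, which is why this is a mitigation rather than a solution.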