Corrigibility Scales To Value Alignment

PeterMcCluskey

Epistemic status: speculation with a mix of medium confidence and low confidence conclusions.

I argue that corrigibility is all we need in order to make an AI permanently aligned to a principal.

This post will not address how hard it may be to ensure that an AI is corrigible, or the conflicts associated with an AI being corrigible to just one principal versus multiple principals. There may be important risks from AI being corrigible to the wrong person / people, but those are mostly outside the scope of this post.

I am specifically using the word corrigible to mean Max Harms' concept (CAST).

Max Harms writes that corrigibility won't scale to superintelligence:

A big part of the story for CAST is that safety is provided by wise oversight. If the agent has a dangerous misconception, the principal should be able to notice this and offer correction. While this might work in a setting where the principal is at least as fast, informed, and clear-minded as the agent, might it break down when the agent scales up to be a superintelligence? A preschooler can't really vet my plans, even if I genuinely want to let the preschooler be able to fix my mistakes.

I don't see it breaking down. On the contrary, I have a story for how a corrigible AI scales fairly naturally to a superintelligence that is approximately value aligned.

A corrigible AI will increasingly learn to understand what the principal wants. In the limit, that means that the AI will increasingly do what the principal wants, without need for the principal to issue instructions. Probably that eventually becomes indistinguishable from a value-aligned AI.

It might still differ from a value-aligned AI if a principal's instructions differ from what the principal values. If the principal can't learn to minimize that with the assistance of an advanced AI, then I don't know what to recommend. A sufficiently capable AI ought to be able to enlighten the principal as to what their values are. Any alignment approach requires some translation of values to behavior. Under corrigibility, the need to solve this seems to happen later than under other approaches. It's probably safer to handle it later, when better AI assistance is available.

Corrigibility doesn't guarantee a good outcome. My main point is that I don't see any step in this process where existential risks are reduced by switching from corrigibility to something else.

Vetting

Why can't a preschooler vet the plans of a corrigible human? When I try to analyze my intuitions about this scenario, I find that I'm tempted to subconsciously substitute an actual human for the corrigible human. An actual human would have goals that are likely to interfere with corrigibility.

More importantly, the preschooler's alternatives to corrigibility suck. Would the preschooler instead do a good enough job of training an AI to reflect the preschooler's values? Would the preschooler write good enough rules for a Constitutional AI? A provably safe AI would be a good alternative, but the feasibility of that looks like a long-shot.

Now I'm starting to wonder who the preschooler is an analogy for. I'm fairly sure that Max wants the AI to be corrigible to the most responsible humans until some period of acute risk is replaced by a safer era.

Incompletely Corrigible Stages of AI

We should assume that the benefits of corrigibility depend on how reasonable the principal is.

I assume there will be a nontrivial period when the AI behaves corrigibly in situations that closely resemble its training environment, but would behave incorrigibly in some unusual situations.

That means a sensible principal would give high priority to achieving full corrigibility, while being careful to avoid exposing the AI to unusual situations. The importance of avoiding unusual situations is an argument for slowing or halting capabilities advances, but I don't see how it's an argument for replacing corrigibility with another goal.

When in this path to full corrigibility and full ASI does the principal's ability to correct the AI decrease? The principal is becoming more capable, due to having an increasingly smart AI helping them, and due to learning more about the AI. The principal's trust in the AI's advice ought to be increasing as the evidence accumulates that the AI is helpful, so I see decreasing risk that the principal will do something foolish and against the AI's advice.

Maybe there's some risk of the principal getting in the habit of always endorsing the AI's proposals? I'm unsure how to analyze that. It seems easier to avoid than the biggest risks. It still presents enough risk that I encourage you to analyze more deeply than I have so far.

Jeremy Gillen says in "The corrigibility basin of attraction is a misleading gloss" that we won't reach full corrigibility because:

The engineering feedback loop will use up all its fuel

This seems like a general claim that alignment is hard, not a claim that corrigibility causes risks compared to other strategies.

Suppose it's really hard to achieve full corrigibility. We should be able to see this risk better when we're assisted by a slightly better than human AI than we can now. With a better than human AI, we should have increased ability to arrange an international agreement to slow or halt AI advances.

Does corrigibility make it harder to get such an international agreement, compared to alternative strategies? I can imagine an effect: the corrigible AI would serve the goals of just one principal, making it less trustworthy than, say, a constitutional AI. That effect seems hard to analyze. I expect that an AI will be able to mitigate that risk by a combination of being persuasive and being able to show that there are large risks to an arms race.

Complexity

Max writes:

Might we need a corrigible AGI that is operating at speeds and complexities beyond what a team of wise operators can verify? I'd give it a minority, but significant, chance (maybe 25%?), with the chance increasing the more evenly/widely distributed the AGI technology is.

I don't agree that complexity of what the AGI does constitutes a reason for avoiding corrigibility. By assumption, the AGI is doing its best to inform the principal of the consequences of the AGI's actions.

A corrigible AI or human would be careful to give the preschooler the best practical advice about the consequences of decisions. It would work much like consulting a human expert. E.g. if I trust an auto mechanic to be honest, I don't need to understand his reasoning if he says better wheel alignment will produce better fuel efficiency.

Complexity does limit one important method of checking the AI's honesty. But there are multiple ways to evaluate honesty. Probably more ways with AIs than with humans, due to our ability to get some sort of evidence from the AI's internals. Evaluating honesty isn't obviously harder than what we'd need to evaluate for a value-aligned AI.

And again, why would we expect an alternative to corrigibility to do better?

Speed

A need for hasty decisions is a harder issue.

My main hope is that if there's a danger of a race between multiple AGIs that would pressure AGIs to act faster than they can be supervised, then the corrigible AGIs would prioritize a worldwide agreement to slow down whatever is creating such a race.

But suppose there's some need for decisions that are too urgent to consult the principal? That's a real problem, but it's unclear why that would be an argument for switching away from corrigibility.

How plausible is it that an AI will be persuasive enough at the critical time to coordinate a global agreement? My intuition is that the answer is pretty similar regardless of whether we stick with corrigibility or switch to an alternative.

The only scenario that I see where switching would make sense is if we know how to make a safe incorrigible AI, but not a corrigible AI that's fully aligned. That seems to require that corrigibility be a feature that makes it harder to incorporate other safety features. I'm interested in reading more arguments on this topic, but my intuition is saying that keeping corrigibility is at least as safe as any alternative.

Conclusion

Corrigibility appears to be a path that leads in the direction of full value alignment. No alternative begins to look better as the system scales to superintelligence.

Most of the doubts about corrigibility seem to amount to worrying that a human will be in charge, and humans aren't up to the job.

Corrigible AI won't be as safe as I'd like during the critical path to superintelligence. I see little hope of getting something safer.

To quote Seth Herd:

I'm not saying that building AGI with this alignment target is a good idea; indeed, I think it's probably not as wise as pausing development entirely (depending on your goals; most of the world are not utilitarians). I'm arguing that it's a better idea than attempting value alignment. And I'm arguing that this is what will probably be tried, so we should be thinking about how exactly this could go well or go badly.

I'm slightly less optimistic than Seth about what will be tried, but more optimistic about how corrigibility will work if tried by the right people.

This post will not address how hard it may be to ensure that an AI is corrigible, or the conflicts associated with an AI being corrigible to just one principal versus multiple principals. There may be important risks from AI being corrigible to the wrong person / people, but those are mostly outside the scope of this post.

This is where the difference between Corrigibility and Value Learning really kicks in. Consider two-or-more opposed groups of humans (two-or-more tech titans, nation states, whatever) with corrigible aligned ASIs: let's assume the ASIs are smart, and learn to predict what their principals would correct, and how to extrapolate this correctly to situations too complex for the principals to understand. But they do not do anything more moral or less confrontational than their principals; they just pursue their principals' goals with superhuman intelligence. This seems like a winner-take-all competition between principals who hopefully aren't actually sociopaths, and don't personally want to die, and thus don't want humanity to go extinct, but who also don't want to lose at power politics or games of Chicken.

On the other hand, suppose they had Value Learning ASIs. These learn human values, including that: first of all don't kill all the humans. Extinction is forever, and the badness of killing all the humans is roughly minus the number of quality-adjusted-life-years there would have been in humanity's future lightcone if you hadn't killed all of them. This is hard to predict, but dominated by a long tail in which things go really well and humanity ends up spreading across the galaxy, giving a huge, literally astronomical number (like -10^25 or -10^30 quality-adjusted life years). So really, don't kill all the humans. Also, don't let them wipe themselves out in a nuclear war. In fact, keep a lid on their foolish conflicts over resources and status.

Corrigibility-based alignment of superintelligence doesn't give you wisdom; value learning-based alignment of superintelligence does. Superintelligence without wisdom is an x-risk — it's probably slower to kill us than unaligned superintelligence, but we still all die.

For more detail on this, see my posts "Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV" and "Requirements for a Basin of Attraction to Alignment".

A corrigible AI will increasingly learn to understand what the principal wants

Oh! Well then sure, if you include this in the definition, of course everything follows. It's basically saying that to be confident we've got corrigibility, we should solve value learning as a useful step.

More importantly, the preschooler's alternatives to corrigibility suck. Would the preschooler instead do a good enough job of training an AI to reflect the preschooler's values? Would the preschooler write good enough rules for a Constitutional AI?

... would the preschooler do a good job of building corrigible AI? The preschooler just seems to be in deep trouble.

Does corrigibility make it harder to get such an international agreement, compared to alternative strategies? I can imagine an effect: the corrigible AI would serve the goals of just one principal, making it less trustworthy than, say, a constitutional AI. That effect seems hard to analyze. I expect that an AI will be able to mitigate that risk by a combination of being persuasive and being able to show that there are large risks to an arms race.

It's easy to get a corrigible ASI not to use its persuasiveness on you: you tell it not to do that.

I think you need to think harder about that "hard to analyze" bit — it's the fatal flaw (as in x-risk) of the corrigibility based approach. You don't get behavior any wiser or more moral than the principal. And 2–4% of principals are sociopaths (the figures for tech titans and heads of authoritarian states may well be higher).

I assume there will be a nontrivial period when the AI behaves corrigibly in situations that closely resemble its training environment, but would behave incorrigibly in some unusual situations.

As a rule of thumb, anything smart enough to be dangerous is dangerous because it can do scientific research and self-improve. If it can't tell when it's out-of-distribution and might need to generate some new hypotheses, it can't do scientific research, so it's not that dangerous. So yes, there might be some very unusual situation that decreased corrigibility: but for an ASI, just taking it out-of-training-distribution should pretty-reliably cause it to say "I know that I don't know what I'm doing, so I should be extra cautious/pessimistic, and that includes being extra-corrigible."

The engineering feedback loop will use up all its fuel

I discussed this with Jeremy Gillen in the comments of his post, and I'm still not clear what he meant by 'fuel' here. Possibly something to do with the problem of "fully-updated deference", a.k.a. the right to keep arbitrarily and inconsistently changing our minds?

A corrigible AI will increasingly learn to understand what the principal wants. In the limit, that means that the AI will increasingly do what the principal wants, without need for the principal to issue instructions.

I believe this to be a severe misunderstanding of corrigibility. It may on its own be fatal to this argument; I'm unsure.

Distinguish three things:

Do what I say
Do what I mean
Do what I want

'Do what I say' is obviously bad - from Tithonus to the Sorcerer's Apprentice and Amelia Bedelia we have extensive cultural warnings against it. 'Do what I want' is the ideal, the AI which does not need to be asked - in short, Friendly. CAST AI is neither. It is 'Do what I mean.' It will determine what you intended, and warn you about negative consequences, but if you clarify that you want it to do it anyway regardless of the reasons it believes (arguendo, correctly) that you will regret it, it will still proceed. In the limit it may suggest tasks it thinks you may desire, or ask for broad permission for actions on your behalf, but it will not act beyond its instructions. This is at the root of both why it is easier than value alignment and why it is safer to gradually approach than value alignment.

I don’t agree that complexity of what the AGI does constitutes a reason for avoiding corrigibility. By assumption, the AGI is doing its best to inform the principal of the consequences of the AGI’s actions.

That does not imply it is successful. CAST is valuable because, if true, at every stage it is amenable to being corrected if its interpretation of the principal's desires is in error. As complexity grows, so does the complexity of those errors. We should expect some detailed errors to be hidden before the scale and complexity of the AI's potential actions grows; if this causes the helpful property to disappear as correction grows impossible, that reduces the value of CAST significantly.

And again, why would we expect an alternative to corrigibility to do better?

I did not read any of the CAST sequence as claiming a relative advantage, but an absolute one. I read it as saying 'Here is an approach with a high chance of success, and a much higher feasibility than value alignment.' If these caveats leave it as best of a bad lot, but no longer with a high chance of success, then they are appropriate.

Also, if corrigibility fails only as power and intelligence grow, then it is effectively deceptively aligned.

I expect that for most people, "what I mean" will converge with "what I want" given superhuman help. I expect they will give increasingly broad instructions to the CAST AI, which will eventually approach "do what I want".

I guess I should replace "without need for the principal to issue instructions" with: without a need for a continuing set of instructions.

