This post will not address how hard it may be to ensure that an AI is corrigible, or the conflicts associated with an AI being corrigible to just one principal versus multiple principals. There may be important risks from AI being corrigible to the wrong person / people, but those are mostly outside the scope of this post.
This is where the difference between Corrigibility and Value Learning really kicks in. Consider two or more opposed groups of humans (tech titans, nation states, whatever) with corrigible aligned ASIs: let's assume the ASIs are smart, and learn to predict what their principals would correct, and how to extrapolate this correctly to situations too complex for the principals to understand. But they do not do anything more moral or less confrontational than their principals — they just pursue their principals' goals with superhuman intelligence. This seems like a winner-take-all competition between principals who hopefully aren't actually sociopaths, don't personally want to die, and thus don't want humanity to go extinct, but who also don't want to lose at power politics or games of Chicken.
On the other hand, suppose they had Value Learning ASIs. These learn human values, including, first of all: don't kill all the humans. Extinction is forever, and the badness of killing all the humans is roughly minus the number of quality-adjusted life years there would have been in humanity's future lightcone if you hadn't killed all of them. This is hard to predict, but dominated by a long tail in which things go really well and humanity ends up spreading across the galaxy, giving a huge, literally astronomical number (like -10^25 or -10^30 quality-adjusted life years). So really, don't kill all the humans. Also, don't let them wipe themselves out in a nuclear war. In fact, keep a lid on their foolish conflicts over resources and status.
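To give a sense of where a number like 10^30 could come from, here is one very rough Fermi sketch (the specific inputs are illustrative numbers of my own, not ones from the original argument):

$$
\underbrace{10^{11}}_{\text{star systems settled}} \times \underbrace{10^{10}}_{\text{people per system}} \times \underbrace{10^{9}}_{\text{years of future}} \approx 10^{30}\ \text{quality-adjusted life years}
$$

Shrink any of those inputs by several orders of magnitude and the total is still astronomically large, which is the point.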
Corrigibility-based alignment of superintelligence doesn't give you wisdom; value learning-based alignment of superintelligence does. Superintelligence without wisdom is an x-risk — it's probably slower to kill us than unaligned superintelligence, but we still all die.
For more detail on this, see my posts Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV and Requirements for a Basin of Attraction to Alignment.
A corrigible AI will increasingly learn to understand what the principal wants
Oh! Well then sure, if you include this in the definition, of course everything follows. It's basically saying that to be confident we've got corrigibility, we should solve value learning as a useful step.
More importantly, the preschooler's alternatives to corrigibility suck. Would the preschooler instead do a good enough job of training an AI to reflect the preschooler's values? Would the preschooler write good enough rules for a Constitutional AI?
... would the preschooler do a good job of building corrigible AI? The preschooler just seems to be in deep trouble.
Does corrigibility make it harder to get such an international agreement, compared to alternative strategies? I can imagine an effect: the corrigible AI would serve the goals of just one principal, making it less trustworthy than, say, a constitutional AI. That effect seems hard to analyze. I expect that an AI will be able to mitigate that risk by a combination of being persuasive and being able to show that there are large risks to an arms race.
It's easy to get a corrigible ASI not to use its persuasiveness on you: you tell it not to do that.
I think you need to think harder about that "hard to analyze" bit — it's the fatal flaw (as in x-risk) of the corrigibility-based approach. You don't get behavior any wiser or more moral than the principal. And 2–4% of principals are sociopaths (the figures for tech titans and heads of authoritarian states may well be higher).
I assume there will be a nontrivial period when the AI behaves corrigibly in situations that closely resemble its training environment, but would behave incorrigibly in some unusual situations.
As a rule of thumb, anything smart enough to be dangerous is dangerous because it can do scientific research and self-improve. If it can't tell when it's out of distribution and might need to generate some new hypotheses, it can't do scientific research, so it's not that dangerous. So yes, there might be some very unusual situation that decreased corrigibility: but for an ASI, just taking it out of its training distribution should pretty reliably cause it to say "I know that I don't know what I'm doing, so I should be extra cautious/pessimistic, and that includes being extra-corrigible."
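To make the mechanism I have in mind concrete, here's a toy sketch (entirely my own illustration; the class, names, and threshold are made up, and nothing like this appears in CAST): an agent scores how far a new situation is from its training distribution, and becomes more deferential to its principal as that score grows.

```python
import numpy as np

class DeferentialAgent:
    def __init__(self, training_situations: np.ndarray):
        # Fit a simple Gaussian model of the situations seen in training.
        self.mean = training_situations.mean(axis=0)
        self.cov_inv = np.linalg.inv(np.cov(training_situations, rowvar=False))

    def ood_score(self, situation: np.ndarray) -> float:
        # Mahalanobis distance: larger means less like anything seen in training.
        diff = situation - self.mean
        return float(np.sqrt(diff @ self.cov_inv @ diff))

    def act(self, situation: np.ndarray, threshold: float = 3.0) -> str:
        # The further out-of-distribution the situation, the more the agent
        # defers to its principal instead of acting autonomously.
        if self.ood_score(situation) > threshold:
            return "defer: flag uncertainty and ask the principal before acting"
        return "act autonomously under existing instructions"

# Train on familiar situations, then watch the policy flip on a novel one.
rng = np.random.default_rng(0)
agent = DeferentialAgent(rng.normal(0.0, 1.0, size=(1000, 3)))
print(agent.act(np.array([0.1, -0.2, 0.3])))  # in-distribution: act
print(agent.act(np.array([8.0, 8.0, 8.0])))   # far out-of-distribution: defer
```

Obviously an ASI would detect novelty in some far more sophisticated way; the sketch is only meant to show that "notice you're out of distribution, then become extra-corrigible" is a coherent policy, not to claim this is how it would be implemented.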
The engineering feedback loop will use up all its fuel
I discussed this with Jeremy Gillen in the comments of his post, and I'm still not clear what he meant by 'fuel' here. Possibly something to do with the problem of "fully-updated deference", a.k.a. the right to keep arbitrarily and inconsistently changing our minds?
Epistemic status: speculation, with a mix of medium-confidence and low-confidence conclusions.
I argue that corrigibility is all we need in order to make an AI permanently aligned to a principal.
This post will not address how hard it may be to ensure that an AI is corrigible, or the conflicts associated with an AI being corrigible to just one principal versus multiple principals. There may be important risks from AI being corrigible to the wrong person / people, but those are mostly outside the scope of this post.
I am specifically using the word corrigible to mean Max Harms' concept (CAST).
Max Harms writes that corrigibility won't scale to superintelligence:
I don't see it breaking down. To the contrary, I have a story for how a corrigible AI scales fairly naturally to a superintelligence that is approximately value aligned.
A corrigible AI will increasingly learn to understand what the principal wants. In the limit, that means that the AI will increasingly do what the principal wants, without need for the principal to issue instructions. Probably that eventually becomes indistinguishable from a value-aligned AI.
It might still differ from a value-aligned AI if a principal's instructions differ from what the principal values. If the principal can't learn to minimize that with the assistance of an advanced AI, then I don't know what to recommend. A sufficiently capable AI ought to be able to enlighten the principal as to what their values are. Any alignment approach requires some translation of values to behavior. Under corrigibility, the need to solve this seems to happen later than under other approaches. It's probably safer to handle it later, when better AI assistance is available.
Corrigibility doesn't guarantee a good outcome. My main point is that I don't see any step in this process where existential risks are reduced by switching from corrigibility to something else.
Vetting
Why can't a preschooler vet the plans of a corrigible human? When I try to analyze my intuitions about this scenario, I find that I'm tempted to subconsciously substitute an actual human for the corrigible human. An actual human would have goals that are likely to interfere with corrigibility.
More importantly, the preschooler's alternatives to corrigibility suck. Would the preschooler instead do a good enough job of training an AI to reflect the preschooler's values? Would the preschooler write good enough rules for a Constitutional AI? A provably safe AI would be a good alternative, but achieving one looks like a long shot.
Now I'm starting to wonder who the preschooler is an analogy for. I'm fairly sure that Max wants the AI to be corrigible to the most responsible humans until some period of acute risk is replaced by a safer era.
Incompletely Corrigible Stages of AI
We should assume that the benefits of corrigibility depend on how reasonable the principal is.
I assume there will be a nontrivial period when the AI behaves corrigibly in situations that closely resemble its training environment, but would behave incorrigibly in some unusual situations.
That means a sensible principal would give high priority to achieving full corrigibility, while being careful to avoid exposing the AI to unusual situations. The importance of avoiding unusual situations is an argument for slowing or halting capabilities advances, but I don't see how it's an argument for replacing corrigibility with another goal.
When in this path to full corrigibility and full ASI does the principal's ability to correct the AI decrease? The principal is becoming more capable, due to having an increasingly smart AI helping them, and due to learning more about the AI. The principal's trust in the AI's advice ought to be increasing as the evidence accumulates that the AI is helpful, so I see decreasing risk that the principal will do something foolish and against the AI's advice.
Maybe there's some risk of the principal getting in the habit of always endorsing the AI's proposals? I'm unsure how to analyze that. It seems easier to avoid than the biggest risks. It still presents enough risk that I encourage you to analyze it more deeply than I have so far.
Jeremy Gillen says in The corrigibility basin of attraction is a misleading gloss that we won't reach full corrigibility because
This seems like a general claim that alignment is hard, not a claim that corrigibility causes risks compared to other strategies.
Suppose it's really hard to achieve full corrigibility. We should be able to see this risk better when we're assisted by a slightly better than human AI than we can now. With a better than human AI, we should have increased ability to arrange an international agreement to slow or halt AI advances.
Does corrigibility make it harder to get such an international agreement, compared to alternative strategies? I can imagine an effect: the corrigible AI would serve the goals of just one principal, making it less trustworthy than, say, a constitutional AI. That effect seems hard to analyze. I expect that an AI will be able to mitigate that risk by a combination of being persuasive and being able to show that there are large risks to an arms race.
Complexity
Max writes:
I don't agree that the complexity of what the AGI does constitutes a reason for avoiding corrigibility. By assumption, the AGI is doing its best to inform the principal of the consequences of the AGI's actions.
A corrigible AI or human would be careful to give the preschooler the best practical advice about the consequences of decisions. It would work much like consulting a human expert. E.g. if I trust an auto mechanic to be honest, I don't need to understand his reasoning if he says better wheel alignment will produce better fuel efficiency.
Complexity does limit one important method of checking the AI's honesty. But there are multiple ways to evaluate honesty. Probably more ways with AIs than with humans, due to our ability to get some sort of evidence from the AI's internals. Evaluating honesty isn't obviously harder than what we'd need to evaluate for a value-aligned AI.
And again, why would we expect an alternative to corrigibility to do better?
Speed
A need for hasty decisions is a harder issue.
My main hope is that if there's a danger of a race between multiple AGIs that would pressure AGIs to act faster than they can be supervised, then the corrigible AGIs would prioritize a worldwide agreement to slow down whatever is creating such a race.
But suppose there's some need for decisions that are too urgent to consult the principal? That's a real problem, but it's unclear why that would be an argument for switching away from corrigibility.
How plausible is it that an AI will be persuasive enough at the critical time to coordinate a global agreement? My intuition is that the answer is pretty similar regardless of whether we stick with corrigibility or switch to an alternative.
The only scenario that I see where switching would make sense is if we know how to make a safe incorrigible AI, but not a corrigible AI that's fully aligned. That seems to require that corrigibility be a feature that makes it harder to incorporate other safety features. I'm interested in reading more arguments on this topic, but my intuition is saying that keeping corrigibility is at least as safe as any alternative.
Conclusion
Corrigibility appears to be a path that leads in the direction of full value alignment. No alternative begins to look better as the system scales to superintelligence.
Most of the doubts about corrigibility seem to amount to worrying that a human will be in charge, and humans aren't up to the job.
Corrigible AI won't be as safe as I'd like during the critical path to superintelligence. I see little hope of getting something safer.
To quote Seth Herd:
I'm slightly less optimistic than Seth about what will be tried, but more optimistic about how corrigibility will work if tried by the right people.