In this note I’ll try and lay out briefly some thoughts about corrigibility and why it matters. First, some housekeeping.
Definitions
When I say “corrigibility”, I mean “correctable” as per the original sense of the term—etymologically derived from the French corrigible, of the same meaning, itself hailing from the Latin verb corrigere, “to correct”. This is correctable not in any specific sense, such as correctable by a principal (mom tells you to stop smoking; a CEO tells you to stop making paperclips), but correctable in general.
Agent-Invariant
In this note, “corrigibility” is agent-invariant: I am not speaking only of corrigibility for humans, or for artificial intelligence, but of the concept itself, with no immediate constraint on which applications may be valid.
1. “Are Most People Even Good?” and Moral Sampling Bias
This is a bit of an aside, so I’ll keep it short: are most people good? I’m certain the answer depends in part on personality (likely falling somewhat along the optimist-pessimist line, the Big Five, etc.). As a pessimist myself, restrained by epistemic humility—through the pedantry of my fellow rationalists in communities like this—I think it’s hard to say, but I lean “no”.
But I also think most people would say yes.
I want to propose two reasons why. First, I think people probably want to believe that others are good. Maybe for reasons similar to mine: no clear answer comes to mind from the observations salient enough to retrieve, so if it is a toss-up, they push in favor of the answer that doesn’t paint a picture of a shittier world. Fair enough.
But second, the one I’ve heard from a few people, is that everyone they know in day-to-day life seems nice enough. As in, they think people are broadly kind and good-natured because the people they know are kind and good-natured. Here are a few possible problems with this:
1. Most people are not your friends: extrapolating from ‘the people I know are good’ to ‘most people are good’ is a generalization. This is allowed, but we should be clear about this.
2. You may choose good people as friends:[1] you probably have some kind of selection criteria for your friends. You probably wouldn’t spend too much time with someone you think is mean or cruel. This holds even if the people in your life are not strictly all your friends (e.g. a good number of the people I know are friends of my friends).
Even if your friends could theoretically be a fair, uniform draw from the pool of people that exist, I’d argue you probably have some sort of filter, deliberate or unconscious, for who you spend time with, even strangers. A stranger starting an interesting discussion with you about teleporters is probably on their way to becoming a friend; a person asking you why your handwriting looks like the EMNIST[2] dataset is probably not.
3. You don’t really know people: and this includes your friends. This probably sounds a little sacrilegious to some. After all, the connotation is that your friends aren’t as good people as you might imagine them to be. But really think about it. How do ‘selfish billionaires’ (bear with the trope) have friends? How do assholes have friends? Presumably, they aren’t assholes to everyone.
Put another way, you may think “oh all the people I have in my life are really sweet” and that may feel true, but are they sweet, or are they sweet to you? Your friends may be nice because you chose nice people as friends, but they may also be nice because they are your friends.
2. Motivated Reasoning
Think about the last time you did something ‘bad’. Say, you failed to donate to feeding the homeless or to local housing reform. I’ll start: I ate a pork sausage last night (probably made from a pig raised on an industrial farm in really torturous, cramped conditions). Why did I do that?
It’s easy for me to say something like, “I did it because I’m a broke student and it’s cheaper than vegan products”. I mean, I need the money now, for school (tuition) and rent (not being homeless is a big motivation). And once I finish school I will have even more money to do the morally good but expensive things people keep guilting me into doing.
So I’m not actually a bad person, you see; I just make a few mistakes here and there. I do some bad things now, but the lower cognitive load will let me eventually be an even better person in the future, and do good things in higher-impact ways.
Worryingly though, it’s just as easy for me to actually believe this. I suspect at least a few ‘selfish billionaires’[3] use this to justify what they do and believe. “If I spend my money helping you now I won’t have money to help you more in the future” or “you need money to make money, so if you let me continue making more money now, I can help many more people in the future.”
2.1 Motivated Reasons Can Make A Lot of Sense
The trouble with all of these cases—and motivated reasoning in general—is that they really do make sense. If you take Parfit’s point that ‘later events should not count for less merely because … [they] happen later’,[4] and you’re not averse to numerical comparisons of human lives (in that saving 10 people is better than saving 1), then saving 10 people in the future is definitely better than saving 1 now.
If motivated reasoning were just a ‘bad thing’ that bad people did (to justify not doing good things), this would be an easier fight. But in the same way that the statement “all humans are selfish” is not really a Popperian-falsifiable statement (since we cannot read minds), a person may do an (apparently) bad thing for a good reason or a bad reason, a good/justifiable motivated reason or a bad motivated reason.[5]
And motivated reasons can be genuine. They can start out as shaky rationales, to reduce the pain of thinking about the possibility that you are a bad person. But if you are even half-decent at argumentation, and—like me—a self-professed rationalist, you will quickly find at least some internally-consistent version of a justification. Wanting to believe that you are good is a really powerful thing.[6]
2.2 Endless Deferral
This is not a novel argument; every principled utilitarian has likely considered and already countered it before. But for those who haven’t: when you use motivated reasoning to say that the reason you don’t do good things now is that you can do more good things later (if you wait), consider that this logic is both quite consistent and almost always true.
If you think you are good at investing for the future, or like most career professionals, expect to make more money in the future than you do now, there is no clear point at which motivated reasoning ever has to stop itself to say “hey, let’s cash that ‘good person check’ now.” Since money makes more money (at least if the stock market doesn’t horrifically crash), it is easy to argue (genuinely) that your ability to do more good things will probably keep going up.
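To make the deferral logic concrete, here is a minimal worked sketch under simplifying assumptions of my own: the good done by giving is proportional to what you give, your resources compound at a constant rate $g > 0$, and (per Parfit) future good is not discounted. Giving everything away in year $t$ then produces good proportional to

$$W_0\,(1+g)^t,$$

where $W_0$ is what you have today. This quantity is strictly increasing in $t$: whatever stopping year you pick, waiting one more year multiplies the good you can do by $(1+g) > 1$ and therefore looks strictly better. The local comparison of “give now” versus “give next year” resolves against “now” at every single step, and no finite stopping time is ever optimal unless you add a deadline, a discount rate, or some probability that the compounding stops.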
If we can do this, a superintelligence or AGI can certainly do it too. And if we can do it genuinely, it can do it genuinely too. The following is probably a false dichotomy, but pretend for a moment it isn’t.
An AGI can work on curing child hunger permanently; we won’t know how long it really takes, but it estimates roughly 10 to 20 years. Alternatively, the AGI can work on getting us to Mars, which may take 20 or 100 years, but if it succeeds there’s no chance a planet-sized asteroid can ever wipe out the human race (the only form of life we know of thus far that values science and art and music, etc.). You can’t get coffee if you’re dead, and you certainly can’t suffer starvation if you’re extinct, so it works on getting us to Mars.
Now it’s 50 years into the future, and we have a primitive Mars colony with 100 people. Their living conditions probably suck though: they live underground, there may be a lot of inbreeding for the first few generations, and they probably don’t have much in the way of steak and caviar. And it’s rather doubtful that the colony could be self-sustaining in the long term if Earth exploded tomorrow and stopped being able to send supplies and emergency assistance.
You, starving person, turn to the AGI (which by now has taken all of the jobs that could have been your source of economic mobility, the route by which you might one day stop starving and enter the middle class) and say: “How about now?”
It could start working on solving all of your problems. Or it might (again, rationally and even correctly) decide that the Mars colony isn’t really strong enough to be a hedge against a planet-wide extinction event on Earth yet, and that 50 more years studying, prototyping, and manufacturing terraforming technology is the better long-term strategy.
The point isn’t that the AGI is stupid, or that you are evil if you have ever thought this way. It’s just that unless you hardcode some length of time into the premise, you may never end up doing good things, because you keep deferring on the grounds that you will be able to do more good later. And if you think that being able to do more good later is itself good, then you can come to the conclusion that agents who have never (really, meaningfully) done a good thing are good. This isn’t necessarily stupid, but I hope you can see why it’s problematic.[7] And things don’t have to be problematic for you for them to be problematic. Even if you believe you are above the fray,[8] all that’s required is for you to understand that others probably would not be.
3. Being ‘Good’ is Probably Probabilistic
This is the short part of the note, I promise. History is full of examples of bad people thinking they were good people doing good things. Remember, motivated reasoning can be genuine. Some Southern slave owners in the 18th century who identified as Christians probably really thought slavery was moral. Maybe most of them didn’t think too hard about it, and if they had, they would have realized it made no sense.
But some of them probably really did think really hard about it, and they may have had worldviews that made their motivated reasons rational, on top of genuine.[9]
This is the deepest reason corrigibility matters: it’s about the gap between believing you are good and actually being good. It’s easy to think you are on the side of the abolitionists; this may be an argument in itself: no one ever wants to think they might be the slave owners.
But if capitalism is actually evil, or democracy is evil, or multiculturalism is evil, I may well be on the wrong side of history. I definitely already am in regard to eating animals.[10]
I’m saying ‘you don’t know what you don’t know’. But I am also saying that you should use what you know you don’t know to guide your behaviour, insofar as it is possible.
4. Corrigibility is ‘Wanting to be Good’
In terms of alignment (whether for humans or AI), corrigibility at its core is the basic idea of ‘wanting to be good’. I know some people take it to mean open-minded or open to change (or change by a principal)—I won’t contest usage drift; those definitions are valid too.
But the definition I use is not strictly changeability, nor a value-neutral concept, but something that incorporates a notion of directionality (i.e. ‘toward goodness’).[11]
4.1 Why Does Corrigibility Matter?
So why all the words (and the post)? This is fundamentally about a key asymmetry: I believe that it is always easier to be wrong about being good, than it is to be wrong about wanting to be good.
Consider this: what would it even mean to be wrong about ‘wanting to be good’? One version might be that you want to be good for self-interested reasons (à la Parfit), but those reasons are actually wrong. Maybe you think being good will make you the most friends, but actually being 80% good and 20% conniving is the optimal strategy.
Or maybe you want to be good, but you think being good means social cohesion and reducing the probability of widespread death and violence—in which case you support the status quo, which is universal suffrage and liberal democracy here in the 21st century, or chattel slavery in the 18th (and then whether you really are good becomes a coin toss).
But in the second case (if we can agree that you are wrong, simply because you are wrong about what is ultimately ‘good’), someone who really wants to be good arguably has a larger chance of changing their mind than someone who already believes that they are good.
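One crude way to gloss this asymmetry, using a simplifying decomposition of my own (treat “actually doing good” as requiring both the wanting and a correct picture of what good is):

$$P(\text{actually good}) = P(\text{wants to be good}) \times P(\text{picture of good is right} \mid \text{wants to be good}) \;\le\; P(\text{wants to be good}).$$

A claim of being good has to get both factors right; a claim of wanting to be good only has to get the first. There are simply more ways for the former to be wrong, which is the asymmetry above in probabilistic dress.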
4.2 Corrigibility is not about convincing
There is one final argument I’d like to try and make: corrigibility is not (strictly) about convincing. What I mean by this is that, typically when people hear corrigibility, one standard, straightforward interpretation is: ‘an agent that keeps an open mind and is thus always open to being convinced that they are wrong—and being corrected’. I wish to make the argument that this is a stronger, more restrictive version of corrigibility than the one I believe is most useful. Let us call this first, stronger, straightforward kind: ‘rational-posterior corrigibility’.
The argument I wish to preempt is this notion that you should never do something until you have been convinced it is ‘right’. In this view, corrigibility is about seeing yourself as someone who is open to being convinced, and insofar as ‘an open mind’ has anything to do with it, it is through allowing yourself to be corrected once you are shown you are wrong.
However, in my view, this inverts the instrumentality and benefit of corrigibility as a trait entirely. In my perspective, ‘being convinced’ is merely a sufficient condition for corrigibility, not the necessary one. The necessary condition is having an open mind in itself, from which the ability to be corrected flows naturally. Is this semantics? Possibly, but I would argue not.
4.2.1 Rationalism and RPC
Why? First, consider the biggest flaw with the rational-posterior form of corrigibility (“RPC”): practical implementation. Under the RPC view, corrigibility first and foremost is about being open to being convinced that you might be wrong. This seems airtight, so what’s the problem? Well, mainly, it’s the ‘every bad guy thinks they are the good guy’ problem, rehashed into the dangerous perception of a free lunch, especially for rationalists like us.
Almost everyone usually thinks they are being rational, even crazy people. That’s rather the defining trait that makes mental illnesses like mania hard to treat. If your ultimate failsafe for morality is corrigibility, but you define corrigibility as being willing to be convinced that you are wrong, then corrigibility loses most of its stopping power, because smuggled into the premise is the hidden implication that you are willing to be convinced you are wrong, but only if you are wrong.
But implicitly for rationalists, and explicitly for most people (by self-description), we would never be so arrogant or dogmatic as to continue insisting we are right when we know we are wrong. But if this is the only time you are corrigible, it’s useless! Because of course, if I show you that you are wrong in a way that convinces/persuades you, you would come to my side—I’ve already (just) persuaded you! But that’s not corrigibility, or rather, it’s a definition of corrigibility that sounds sensible but is ultimately empty, if only because it quite literally is just common sense.
But corrigibility is not common sense. If it were, there would be no need to argue so hard for it. Rather, corrigibility, as I argue, is about having an open mind for the sake of being convinced (that you are wrong); it is what creates an opening for the convincing, not a description that ‘some kind of opening is always there’, because if it were just about that, it would become unfalsifiable too. If you reject 100/100 proposals I make to you, but you tell me that no, actually, you are quite open-minded, I just haven’t opened your mind yet, I can neither call you a liar nor call you all that corrigible.
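To make the contrast concrete, here is a minimal toy sketch in Python (an illustration of my own, not any standard metric): score corrigibility by the fraction of a principal’s correction proposals an agent actually attempts, rather than by what the agent says about its own open-mindedness.

```python
# Toy, behavioural notion of corrigibility: what fraction of proposed
# corrections does the agent actually attempt? (A sketch of my own, for
# illustration only.)
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    claims_open_mind: bool   # what the agent says about itself
    attempted: int = 0       # proposals it actually tried
    received: int = 0        # proposals it was given

    def receive_proposal(self, will_attempt: bool) -> None:
        """Record one correction proposal from a principal and whether the agent tried it."""
        self.received += 1
        if will_attempt:
            self.attempted += 1

    def behavioural_corrigibility(self) -> float:
        """Fraction of proposals actually attempted; 0.0 if it never tried any."""
        return self.attempted / self.received if self.received else 0.0


if __name__ == "__main__":
    talker = Agent("self-professed open mind", claims_open_mind=True)
    trier = Agent("reluctant but willing", claims_open_mind=False)

    for i in range(100):
        talker.receive_proposal(will_attempt=False)    # rejects 100/100, insists it is open-minded
        trier.receive_proposal(will_attempt=(i == 0))  # grudgingly tries 1/100

    print(talker.name, talker.behavioural_corrigibility())  # -> 0.0
    print(trier.name, trier.behavioural_corrigibility())    # -> 0.01
```

On this operationalization, the agent that rejects 100/100 proposals scores zero no matter what it claims about itself, which is exactly the property the next section is after.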
4.3 Operationalizing Corrigibility
This would be fine if this were merely an academic discussion, but recall the stakes. X-risk demands a higher form of corrigibility than one that is only good in your head, and verifiable only to yourself.
The corrigibility I care about is the open-mindedness that produces an agent who is willing to try at least 1 of the 100 proposals I’ve made, not an agent that merely says it is open to trying proposals (because, again, the latter is not a falsifiable definition of corrigibility). For corrigibility to be meaningful, it has to be of the painful, non-trivial kind. Imagine the manic patient in a seriously escalating episode who admits themselves to the emergency room (this should be easier than trying to be a bat).[12]
Ask yourself, what would it feel like to be that person in this thought experiment? From the outside view, reading this post, you and I know the ground truth: they are doing the right thing, they are allowing themselves to suffer temporary discomfort, and possibly be medicated in order to ‘correct’ themselves out of their manic episode. But from the inside view? This is not ‘updating your priors in the face of obviously correct truths’. I am not denying that this (the posterior-rationalist view of corrigibility) makes sense.
But I am saying it makes too much sense—so much sense, in fact, that it becomes trivial, because by the time you agree, you aren’t corrigible so much as just not an idiot. A person who worries about climate change once the waters are at their neck is not so much a science believer as an instrumental rationalist. Again, this is not wrong. We want our AIs (and our humans) to be corrigible to evidence.
However, to me, corrigibility matters if—and only if—it also lets you do what the manic patient does. It must allow you not just to confront, but to confront and possibly do the thing that seems obviously wrong to you, against all your instincts. The thing whose correctness may not even have had the chance to present itself to you yet for an update, or which derives from a long chain of self-consistent arguments that you have not yet had the time to fully understand and audit.
Now you might think, “wait, doesn’t this go too far in the other direction though? Aren’t you now just telling me to believe or accept things before they’ve been proven to me?”
4.3.1 Corrigibility and Institutional Deference
Yes, but let me show you why that isn’t as insane as it sounds. Have you ever seen an atom? What about the seminal 1973 paper by Fritzsch et al. on gluons as the carriers of the strong force?[13] And have you ever really measured the relationship between a circle’s circumference and its diameter? Or did you just accept that it was approximately 3.141592654?
I’ll go first: I uncritically did. But is that really irrational? Sure, Occam’s razor, I know, I know: ‘it would take a conspiracy of millions of people working together for pi to be a lie’. Yes, but also, do you know a million people? I don’t. Convincing me of pi did not take a million people, probably not even two.
My point is that if it really were a conspiracy, it’d still fly right over my head. Minimal global, secretive coordination required. But the reason much of society (and money) works is trust. We defer to institutions. I trust Nature and The Lancet to tell me the truth, just as much as I trust them to issue retractions when they find out they did not. I trust Terence Tao to solve Kakeya, if and when he does it, even if I will never personally be able to interrogate his proof when he publishes it.
None of this is irrational. What it is, is not airtight. But corrigibility was never about airtight arguments in the first place. It is the best ‘worst case’ guarantee under a robust decision-making (RDM) framework. You don’t need corrigibility if the only thing that convinces you is a positive result from a proof checker, in the same way that it would not be difficult at all to treat manic and schizophrenic patients if you could simply make them see how silly they are acting.
If corrigibility means anything in the face of superintelligence, it cannot be because of what the sensible but ultimately unfalsifiable version of it (under RPC) gets us. The only construction that makes sense is the one that gives an agent the epistemic and moral space to take ‘leaps of faith’.
This is not the same as dogma. Dogma asks you to believe in god, period. Corrigibility, in my view, asks you to try the new ice-cream flavour before you reject it. You are allowed to reject it, just as much as you are allowed to reject pi until you have derived it for yourself from direct measurement.
But my point is that corrigibility is not about being convinced after you’ve been proven wrong, and in the context of AI alignment, it has never been about that. You are unlikely to show the 1200 IQ ASI why turning everyone into paperclips is actually disastrous 200 years down the road. I’m sure you have a good argument, but we can be certain the god machine will have a better one. The reason corrigibility matters, if it matters at all, is because it is maybe the only thing that can convince a god that they aren’t really god—or the manic patient to finally check themselves into the hospital.
Acknowledgements
Sincere thanks to Rauno Arike, Leo Zovic, and Evgenii Opryshko for their help with critiques, brainstorming, and reflection.
Snyder, Mark, Steve Gangestad, and Jeffry A. Simpson. "Choosing Friends as Activity Partners: The Role of Self-Monitoring." Journal of Personality and Social Psychology 45, no. 5 (1983): 1061.
National Institute of Standards and Technology. "The EMNIST Dataset." Last updated December 2, 2024. https://www.nist.gov/itl/products-and-services/emnist-dataset.
Invoked as irony and trope, not judgement. I'll leave smarter people to debate the actual ethics of billionaires and all that.
Parfit, Derek. Reasons and Persons. Oxford: Clarendon Press, 1984. Chap. 1, "The Self-Interest Theory," [.epub so no page numbers, I'm sorry].
Popper, Karl. Realism and the Aim of Science: From the Postscript to The Logic of Scientific Discovery. London and New York: Routledge, 1983, in "Introduction".
And possibly even automatic. See: Jarcho, Johanna M., Elliot T. Berkman, and Matthew D. Lieberman. “The Neural Basis of Rationalization: Cognitive Dissonance Reduction during Decision-Making.” Social Cognitive and Affective Neuroscience 6, no. 4 (2011): 460-67. https://doi.org/10.1093/scan/nsq054.
This should not be taken as some general critique against the idea of effective altruism altogether, or the math behind patient giving. See: groundsloth. “A Simple Case for Patient Philanthropy.” EA Forum. September 23, 2025. https://forum.effectivealtruism.org/posts/SEELNLbtNqyQdnuzE/a-simple-case-for-patient-philanthropy.
"the fray" = motivated reasoning
(e.g. there are inherent intellectual differences in race; people who look closer to early hominins—darker skin, more hair—really must be less evolved or stupider).
Or any of the other dozen things I am raised to support here in the West (not unconditionally, but the power of default is powerful).
Again, this was the original usage of the term.
Nagel, Thomas. "What Is It Like to Be a Bat?" The Philosophical Review 83, no. 4 (October 1974): 435-450.
Fritzsch, H., M. Gell-Mann, and H. Leutwyler. "Advantages of the Color Octet Gluon Picture." Physics Letters B 47 (1973): 365-368. The densest 4 pages you will ever read.