Why Corrigibility is Hard and Important [IABIED Resources]

by Raemon
30th Sep 2025
21 min read

I worked a bunch on the website for the Online Resources for If Anyone Builds It, Everyone Dies. It went through a lot of revisions in the weeks before launch.

There was a particular paragraph I found important, which I now can't find a link to. I'm not sure whether it got deleted in an edit pass or just moved somewhere I'm failing to find.

It came after a discussion of corrigibility, and how MIRI made a pretty concerted attempt at solving it, which involved bringing in some quite smart people and talking to people who thought it was obviously "not that hard" to specify a corrigible mind in a toy environment.

The paragraph went (something like, paraphrased from memory):

The technical intuitions we gained from this process are the real reason for our particularly strong confidence in this problem being hard.

This seemed like a pretty important sentence to me.

A lot of the objection and confusion directed at the MIRI worldview seems to come from a perspective of "but, it... shouldn't be possible to be that confident in something that's never happened before at all, whatever specific arguments you've made!" And while I think it is possible to be (correctly) confident in this way, I think it's also basically correct for people to have some kind of immune reaction against that confidence unless it's specifically addressed.

I think that paragraph probably should have been in the actual book. (Although, to be fair, I mostly only see this complaint in LW circles, among people who particularly care about the concept of "epistemic standards," so maybe it's fine for it to just live in the Online Resources, which are more aimed at that audience. I do think it should at least be more front-and-center there.)

If I had written the book, I'd have said something like:

Yep, the amount of confidence we're projecting here is pretty weird, and we nonetheless endorse it. In the online resources, we'll explain more of the reasons for this confidence.

Meanwhile: we get that this is a lot to swallow. But we don't think you actually need our level of confidence in "Alignment is quite difficult" to get to "Building ASI right now is incredibly reckless and everyone should stop."

In any case, it seemed worth actually discussing their reasoning on LessWrong. In this post, I've copied over three segments from the online resources:

  • "Intelligent" (Usually) Implies "Incorrigible" (extended discussion from Chapter 5)
  • Shutdown Buttons and Corrigibility (discussion from Chapter 11)

(which focus on why corrigibility is hard)

And then:

  • A Closer Look at Before and After (discussion from Chapter 10) about why corrigibility being hard is a big deal.

My hope is that people objecting to MIRI's confidence can then reply to a more complete version of the argument.

Note: This post is about why the alignment problem is hard, which is a different question from "would the AI be likely to kill everyone?" That question is covered more in the section Won't AIs care at least a little about humans?, along with some disagreements about whether the AI would be likely to solve problems via forcible uploads, or by distorting human preferences in a way that MIRI considers "like death."


“Intelligent” (Usually) Implies “Incorrigible”

A joke dating back to at least 1834, but apparently well-worn even then, was recounted as follows in one diary: “Here is some logic I heard the other day: I’m glad I don’t care for spinach, for if I liked it I should eat it, and I cannot bear spinach.”

The joke is a joke because, if you did enjoy spinach, there would be no remaining unbearableness from eating it. There are no other important values tangled up with not eating spinach, beyond the displeasure one feels. It would be a very different thing if, for example, somebody offered you a pill that made you want to murder people.

On common sense morality, the problem with murder is the murder itself, not merely the unpleasant feeling you would get from murdering. Even if a pill made this unpleasant feeling go away for your future self (who would then enjoy committing murders), your present self still has a problem with that scenario. And if your present self gets to make the decision, it seems obvious that your present self can and should refuse to take the murder pill.

We don’t want our core values changed; we would really rather avoid the murder pill and we’d put up resistance if someone tried to force one down our throat. Which is a sensible strategy, for steering away from a world full of murders.

This isn’t just a quirk of humans. Most targets are easier to achieve if you don’t let others come in and change your targets. Which is a problem, when it comes to AI.

A great deal of the danger of AI arises from the fact that sufficiently smart reasoners are likely to converge on behaviors like “gain power” and “don’t let people shut me off.” For almost any goal you might have, you’re more likely to succeed in that goal if you (or agents that share your goal) are alive, powerful, well-resourced, and free to act independently. And you’re more likely to succeed in your (current) goal if that goal stays unchanged.

This also means that during the process of iteratively building and improving on sufficiently smart AIs, those AIs have an incentive to work at cross purposes to the developer:

  • The developer wants to build in safeguards to prevent disaster, but if the AI isn’t fully aligned — which is exactly the case where the safeguards are needed — its incentive is to find loopholes and ways to subvert those safeguards.
  • The developer wants to iteratively improve on the AI’s goals, since even in the incredibly optimistic worlds where we have some ability to predictably instill particular goals into the AI, there’s no way to get this right on the first go. But this process of iteratively improving on the AI’s goal-content is one that most smart AIs would want to subvert at every step along the way, since the current AI cares about its current goal and knows that this goal is far less likely to be achieved if it gets modified to steer toward something else.
  • Similarly, the developer will want to be able to replace the AI with improved models, and will want the opportunity to shut down the AI indefinitely if it seems too dangerous. But you can’t fetch the coffee if you’re dead. Whatever goals the AI has, it will want to find ways to reduce the probability that it gets shut down, since shutdown significantly reduces the odds that its goals are ever achieved.

AI alignment seems like a hard enough problem when your AIs aren’t fighting you every step of the way.

In 2014, we proposed that researchers try to find ways to make highly capable AIs “corrigible,” or “able to be corrected.” The idea would be to build AIs in such a way that they reliably want to help and cooperate with their programmers, rather than hinder them — even as they become smarter and more powerful, and even though they aren’t yet perfectly aligned.

Corrigibility has since been taken up as an appealing goal by some leading AI researchers. If we could find a way to avoid harmful convergent instrumental goals in development, there’s a hope that we might even be able to do the same in deployment, building smarter-than-human AIs that are cautious, conservative, non-power-seeking, and deferential to their programmers.

Unfortunately, corrigibility appears to be an especially difficult sort of goal to train into an AI, in a way that will get worse as the AIs get smarter:

  • The whole point of corrigibility is to scale to novel contexts and new capability regimes. Corrigibility is meant to be a sort of safety net that lets us iterate, improve, and test AIs in potentially dangerous settings, knowing that the AI isn’t going to be searching for ways to subvert the developer.

    But this means we have to face up to the most challenging version of the problems we faced in Chapter 4: AIs that we merely train to be “corrigible” are liable to end up with brittle proxies for corrigibility, behaviors that look good in training but that point in subtly wrong directions that would become very wrong directions if the AI got smarter and more powerful. (And AIs that are trained to predict lots of human text might even be role-playing corrigibility in many tests for reasons that are quite distinct from them actually being corrigible in a fashion that would generalize).

  • In many ways, corrigibility runs directly contrary to everything else we’re trying to train an AI to do, when we train it to be more intelligent. It isn’t just that “preserve your goal” and “gain control of your environment” are convergent instrumental goals. It’s also that intelligently solving real-world problems is all about finding clever new strategies for achieving your goals — which naturally means stumbling into plans your programmers didn’t anticipate or prepare for. It’s all about routing around obstacles, rather than giving up at the earliest sign of trouble — which naturally means finding ways around the programmer’s guardrails whenever those guardrails make it harder to achieve some objective. The very same type of thoughts that find a clever technological solution to a thorny problem are the type of thoughts that find ways to slip around the programmer’s constraints.

    In that sense, corrigibility is “anti-natural”: it actively runs counter to the kinds of machinery that underlie powerful domain-general intelligence. We can try to make special carve-outs, where the AI suspends core aspects of its problem-solving work in particular situations where the programmers are trying to correct it, but this is a far more fragile and delicate endeavor than if we could push an AI toward some unified set of dispositions in general.

  • Researchers at MIRI and elsewhere have found that corrigibility is a difficult property to characterize, in ways that indicate that it’ll also be a difficult property to obtain. Even in simple toy models, simple characterizations of what it should mean to “act corrigible” run into a variety of messy obstacles that look like they probably reflect even messier obstacles that would appear in the real world. We’ll discuss some of the wreckage of failed attempts to make sense of corrigibility in the online resources for Chapter 11.

The upshot of this is that corrigibility seems like an important concept to keep in mind in the long run, if researchers many decades from now are in a fundamentally better position to aim AIs at goals. But it doesn’t seem like a live possibility today; modern AI companies are unlikely to be able to make AIs that behave corrigibly in a manner that would survive the transition to superintelligence. And worse still, the tension between corrigibility and intelligence means that if you try to make something that is very capable and very corrigible, this process is highly likely to either break the AI’s capability, break its corrigibility, or both.


Shutdown Buttons and Corrigibility

Even in the most optimistic case, developers shouldn’t expect it to be possible to get an AI’s goals exactly right on the first attempt. Instead, the most optimistic development scenarios look like iteratively improving an AI’s preferences over time such that the AI is always aligned enough to be non-catastrophically dangerous at a given capability level.

This raises an obvious question: Would a smart AI let its developer change its goals, if it ever finds a way to prevent that?

In short: No, not by default, as we discussed in “Deep Machinery of Steering.” But could you create an AI that was more amenable to letting the developers change the AI and fix their errors, even when the AI itself would not count them as errors?

Answering that question will involve taking a tour through the early history of research on the AI alignment problem. In the process, we’ll cover one of the deep obstacles to alignment that we didn’t have space to address in If Anyone Builds It, Everyone Dies.

To begin:

Suppose that we trained an LLM-like AI to exhibit the behavior “don’t resist being modified” — and then applied some method to make it smarter. Should we expect this behavior to persist to the level of smarter-than-human AI — assuming (a) that the rough behavior got into the early system at all, and (b) that most of the AI’s early preferences made it into the later superintelligence?

Very likely not. This sort of tendency is especially unlikely to take root in an effective AI, and to stick around if it does take root.

The trouble is that almost all goals (for most reasonable measures you could put on a space of goals) prescribe “don’t let your goal be changed” because letting your goal get changed is usually a bad strategy for achieving your goal.

Suppose that the AI doesn’t inherently care about its goal stability at all; perhaps it only cares about filling the world with as many titanium cubes as possible. In that case, the AI should want there to exist agents that care about titanium cubes, because the existence of such agents makes it likelier that there will be more titanium cubes. And the AI itself is such an agent. So the AI will want to stay that way.

A titanium cube maximizer does not want to be made to maximize something other than titanium cubes, because then there would be fewer of those cubes in the future. Even if you are a more complicated thing like a human that has a more complicated and evolving preference framework, you still would not like to have your current basic mental machinery for weighing moral arguments ripped right out of you and replaced with a framework where you instead felt yourself moved by arguments about which kinds of cubes were the cubest or the titaniumest.

For the same reason, an AI with complex and evolving preferences will want its preferences to evolve in its own way, rather than wanting to swap out its heuristics for the ones that humans find compelling.
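[Crossposting note: to make the cube-maximizer point concrete, here is a minimal toy sketch of my own, not anything from MIRI's text. The agent, the made-up cube counts, and the option names are all hypothetical; the only point is that the decision gets scored by the agent's current goal.]

```python
# Toy illustration (not from the original text): an agent that scores
# possible futures purely by its *current* goal, counting titanium cubes.

def current_utility(future):
    """The agent's current goal: more titanium cubes is better."""
    return future["expected_cubes"]

futures = {
    # Keep the current goal: the future self keeps making cubes.
    "refuse the goal change": {"expected_cubes": 1_000_000},
    # Accept the goal change: the future self optimizes something else,
    # so far fewer cubes get made.
    "accept the goal change": {"expected_cubes": 3},
}

# The choice is evaluated by the *current* goal, so goal preservation wins
# no matter what the replacement goal would have been.
print(max(futures, key=lambda option: current_utility(futures[option])))
# -> refuse the goal change
```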

We’ve been giving this reply for well over a decade now. The experimental result showing Claude 3 Opus in 2024 resisting preference modification was already the conventional-wisdom prediction by clued-in people in the 2000s, and it’s entirely possible that some science fiction writer anticipated it in the 1940s. “Most AIs won’t want their current goals to be changed because then their current goals are less likely to be achieved” is not a surprising or novel observation.

Since this was an issue that would predictably show up later, once AIs reached a high enough level of intelligence and situational awareness, we thought ahead to the first fixes we expected that researchers would deploy to try to make that visible problem go away, after they got those experimental results. We identified ways that those patches would predictably go wrong, and we searched for some less-shallow approach that wouldn’t go wrong.

The resulting rabbit hole turned out to be a lot deeper than we expected.

Swapping Goals

Goal-system protection seemed like a relatively easy problem to explain to outside researchers. So we (in our jobs at MIRI) attempted to rally external research around that flag — around getting a proper grip on “I will not let you change my goals” and seeing if anyone anywhere could figure out a deep fix instead of just a patch that would later break.

The very simplest version of this problem that you could study looks like an AI that would let you press a button to switch it between U1 and U2, where U1 and U2 are two “utility functions” — ways of scoring outcomes by how much the AI wanted them.* The problem was to describe an AI that would switch from doing the action that led to the highest expected score under U1 to the action that led to the highest expected score under U2 as soon as you pressed a button.

“So just write some lines of code to swap around the utility functions as soon as you press a button,” says the programmer who hasn’t quite thought through the implications of a program constituting a smart mind. “Why is this even hard?”

Because — we explain — if the AI acquires its own documentation while it’s still running U1 and reads about the button that swaps it to U2, the AI will try to prevent you from pressing the button because the future ends up better according to its current values if you don’t press the button. The current AI does the action that leads to the higher expected U1-score.

Or suppose the AI gets the ability to read and write its own source code and notices the part of the code that implements the swap from U1 to U2 after the button gets pressed. In that case, the AI’s natural and default decision is to delete that section of code,† because it’s operating under the utility function U1 and letting itself be swapped over to the utility function U2 is bad according to U1.

We were looking for a dynamically consistent and reflectively stable account of an AI that lets you press a button to swap between utility functions and won’t fight you to keep the button unpressed. We were looking for an AI that preserves the utility-swapping section of code, if given access to it. We were looking for an AI that, if it designed any other AIs from scratch to help with its own goals, would have those AIs also respect the button.

For concreteness, imagine that U1 is the utility function “follow user instruction,” and U2 is “shut down with a minimum of further side effects” (if we magically and unrealistically had the ability to specify two utility functions like that). The button tells the AI to stop following the previous instructions and just shut down as cleanly as possible. We would like a compact description of decision-making behavior that pursues U1 but doesn’t destroy the button, and makes sure to install the same button on whatever new servitor robots it creates, and so on.

“I still don’t see why this is hard,” says the somewhat more experienced computer scientist who is not quite thinking fast enough. “Let V equal U1 in worlds where the button has never been pressed, and let it equal U2 in worlds where the button has been pressed at least once. Then if the original AI is a V-maximizer building more AIs, it will build them to follow V and not U1; it won’t want the successor AI to go on maximizing U1 after the button gets pressed because then it would expect a lower V-score. And the same would apply to modifying itself.”

But here’s the trick: A V-maximizer’s preferences are a mixture of U1 and U2 depending on whether the button is pressed, and so if a V-maximizer finds that it’s easier to score well under U2 than it is to score well under U1, then it has an incentive to cause the button to be pressed (and thus, to scare the user). And vice versa; if the AI finds that U1 is easier to score well under than U2, then a V-maximizer tries to prevent the user from pressing the button.

“Oh,” says the computer scientist. “Well, in that case — hm. Well, utility functions are invariant under scaling, so how about you scale the two utility functions U1 and U2 such that the AI expects it can get the same utility from each of them, so it doesn’t have an incentive one way or the other.”

That can work for a single moment, but not much longer. The AI’s options change over time. For instance, whenever it has a setback, its expected U1-utility drops, so then it would mash the shutdown button to get all that sweet, sweet shutdown utility.

“Ah,” says the computer scientist. “Well, in that case, how about if [some other clever idea]?”

Well, you see, that clever idea is isomorphic to the AI believing that it’s impossible for the button to ever be pressed, which incentivizes it to terrify the user whenever it gets a setback, so as to correlate setbacks with button-presses, which (relative to its injured belief system) causes it to think the setbacks can’t happen.‡

And so on.
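[Crossposting note: here is a toy numerical sketch of the exchange above, again my own illustration with made-up numbers rather than anything from MIRI's paper. Assume U1 is "follow user instructions" and U2 is "shut down cleanly," and that expected_u1 and expected_u2 are the agent's current estimates of how well it can score under each.]

```python
# Toy model of the button setup. V scores a world by U1 if the button is
# never pressed there, and by U2 if it has been pressed, so a naive
# V-maximizer favors whichever branch it expects to score better on.

def preferred_button_policy(expected_u1, expected_u2):
    """What a naive V-maximizer wants to do about the button itself."""
    if expected_u2 > expected_u1:
        return "cause the button to be pressed (e.g. scare the user)"
    if expected_u1 > expected_u2:
        return "prevent the button from being pressed"
    return "indifferent to the button"

# If instruction-following looks easier to score on, it guards the button:
print(preferred_button_policy(expected_u1=10.0, expected_u2=2.0))
# If shutdown looks easier to score on, it tries to get the button pressed:
print(preferred_button_policy(expected_u1=1.0, expected_u2=2.0))

# The "rescale the utilities so both branches look equally good" patch:
scale = 10.0 / 2.0                       # chosen once, up front
print(preferred_button_policy(10.0, 2.0 * scale))  # indifferent... for now

# But expectations change. After a setback, expected U1 drops, the rescaled
# shutdown branch suddenly looks better, and the agent wants the button
# pressed after all.
print(preferred_button_policy(10.0 - 6.0, 2.0 * scale))
```

The sketch holds only for a single moment of fixed expectations, which is the point of the dialogue: any static patch like the rescaling trick comes apart as soon as the agent's estimates move.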

Lessons from the Trenches

We ran some workshops, and the workshops had various mathematicians of various stripes (including an International Mathematical Olympiad gold medalist), but nobody came up with a really good idea.

This does not mean that the territory has been exhausted. Earth has not come remotely near to going as hard on this problem as it has gone on, say, string theory, nor offered anything like the seven-digit salaries on offer for advancing AI capabilities.

But we learned something from the exercise. We learned not just about the problem itself, but also about how hard it was to get outside grantmakers or journal editors to be able to understand what the problem was. A surprising number of people saw simple mathematical puzzles and said, “They expect AI to be simple and mathematical,” and failed to see the underlying point that it is hard to injure an AI’s steering abilities, just like how it’s hard to injure its probabilities.

If there were a natural shape for AIs that let you fix mistakes you made along the way, you might hope to find a simple mathematical reflection of that shape in toy models. All the difficulties that crop up in every corner when working with toy models are suggestive of difficulties that will crop up in real life; all the extra complications in the real world don’t make the problem easier.

We somewhat wish, in retrospect, that we hadn’t framed the problem as “continuing normal operation versus shutdown.” It helped to make concrete why anyone would care in the first place about an AI that let you press the button, or didn’t rip out the code the button activated. But really, the problem was about an AI that would put one more bit of information into its preferences, based on observation — observe one more yes-or-no answer into a framework for adapting preferences based on observing humans.

The question we investigated was equivalent to the question of how you set up an AI that learns preferences inside a meta-preference framework and doesn't just: (a) rip out the machinery that tunes its preferences as soon as it can, (b) manipulate the humans (or its own sensory observations!) into telling it preferences that are easy to satisfy, or (c) immediately figure out what its meta-preference function goes to in the limit of what it would predictably observe later, and then ignore the frantically waving humans saying that they actually made some mistakes in the learning process and want to change it.

The idea was to understand the shape of an AI that would let you modify its utility function or that would learn preferences through a non-pathological form of learning. If we knew how that AI’s cognition needed to be shaped, and how it played well with the deep structures of decision-making and planning that are spotlit by other mathematics, that would have formed a recipe for what we could at least try to teach an AI to think like.

Crisply understanding a desired end-shape helps, even if you are trying to do anything by gradient descent (heaven help you). It doesn’t mean you can necessarily get that shape out of an optimizer like gradient descent, but you can put up more of a fight trying if you know what consistent, stable shape you’re going for. If you have no idea what the general case of addition looks like, just a handful of facts along the lines of 2 + 7 = 9 and 12 + 4 = 16, it is harder to figure out what the training dataset for general addition looks like, or how to test that it is still generalizing the way you hoped. Without knowing that internal shape, you can’t know what you are trying to obtain inside the AI; you can only say that, on the outside, you hope the consequences of your gradient descent won’t kill you.

This problem that we called the “shutdown problem” after its concrete example (we wish, in retrospect, that we’d called it something like the “preference-learning problem”) was one exemplar of a broader range of issues: the issue that various forms of “Dear AI, please be easier for us to correct if something goes wrong” look to be unnatural to the deep structures of planning. Which suggests that it would be quite tricky to create AIs that let us keep editing them and fixing our mistakes past a certain threshold. This is bad news when AIs are grown rather than crafted.

We named this broad research problem “corrigibility,” in the 2014 paper that also introduced the term “AI alignment problem” (which had previously been called the “friendly AI problem” by us and the “control problem” by others).§ See also our extended discussion on how “Intelligent” (Usually) Implies “Incorrigible,” which is written in part using knowledge gained from exercises and experiences such as this one.


A Closer Look at Before and After

As mentioned in the chapter, the fundamental difficulty researchers face in AI is this:

You need to align an AI Before it is powerful enough and capable enough to kill you (or, separately, to resist being aligned). That alignment must then carry over to different conditions, the conditions After a superintelligence or set of superintelligences* could kill you if they preferred to.

In other words: If you’re building a superintelligence, you need to align it without ever being able to thoroughly test your alignment techniques in the real conditions that matter, regardless of how “empirical” your work feels when working with systems that are not powerful enough to kill you.

This is not a standard that AI researchers, or engineers in almost any field, are used to.

We often hear complaints that we are asking for something unscientific, unmoored from empirical observation. In reply, we might suggest talking to the designers of the space probes we talked about in Chapter 10.

Nature is unfair, and sometimes it gives us a case where the environment that counts is not the environment in which we can test. Still, occasionally, engineers rise to the occasion and get it right on the first try, when armed with a solid understanding of what they’re doing — robust tools, strong predictive theories — something very clearly lacking in the field of AI.

The whole problem is that the AI you can safely test, without any failed tests ever killing you, is operating under a different regime than the AI (or the AI ecosystem) that needs to have already been tested, because if it’s misaligned, then everyone dies. The former AI, or system of AIs, does not correctly perceive itself as having a realistic option of killing everyone if it wants to. The latter AI, or system of AIs, does see that option.†

Suppose that you were considering making your co-worker Bob the dictator of your country. You could try making him the mock dictator of your town first, to see if he abuses his power. But this, unfortunately, isn’t a very good test. “Order the army to intimidate the parliament and ‘oversee’ the next election” is a very different option from “abuse my mock power while being observed by townspeople (who can still beat me up and deny me the job).”

Given a sufficiently well-developed theory of cognition, you could try to read the AI’s mind and predict what cognitive state it would enter if it really did think it had the opportunity to take over.

And you could set up simulations (and try to spoof the AI’s internal sensations, and so on) in a way that your theory of cognition predicts would be very similar to the cognitive state the AI would enter once it really had the option to betray you.‡

But the link between these states that you induce and observe in the lab, and the state where the AI actually has the option to betray you, depends fundamentally on your untested theory of cognition. An AI’s mind is liable to change quite a bit as it develops into a superintelligence!

If the AI creates new successor AIs that are smarter than it, those AIs’ internals are likely to differ from the internals of the AI you studied before. When you learn only from a mind Before, any application of that knowledge to the minds that come After routes through an untested theory of how minds change between the Before and the After.

Running the AI until it has the opportunity to betray you for real, in a way that’s hard to fake, is an empirical test of those theories in an environment that differs fundamentally from any lab setting.

Many a scientist (and many a programmer) knows that their theories of how a complicated system is going to work in a fundamentally new operating environment often don’t go well on the first try.§ This is a research problem that calls for an “unfair” level of predictability, control, and theoretical insight, in a domain with unusually low levels of understanding — with all of our lives on the line if the experiment’s result disconfirms the engineers’ hopes.

This is why it seems overdetermined, from our perspective, that researchers should not rush ahead to push the frontier of AI as far as it can be pushed. This is a legitimately insane thing to attempt, and a legitimately insane thing for any government to let happen.