Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(Part 2 of the CAST sequence)

As a reminder, here’s how I’ve been defining “corrigible” when introducing the concept: an agent is corrigible when it robustly acts opposite of the trope of "be careful what you wish for" by cautiously reflecting on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes.

This definition is vague, imprecise, and hides a lot of nuance. What do we mean by “flaws,” for example? Even the parts that may seem most solid, such as the notion of there being a principal and an agent, may seem philosophically confused to a sufficiently advanced mind. We’ll get into trying to precisely formalize corrigibility later on, but part of the point of corrigibility is to work even when it’s only loosely understood. I’m more interested in looking for something robust (i.e. simple and gravitational) that can be easily gestured at, rather than trying to find something that has a precise, unimpeachable construction.[1]

Towards this end, I think it’s valuable to try and get a rich, intuitive feeling for what I’m trying to talk about, and only attempt technical details once there’s a shared sense of the outline. So in this document I’ll attempt to build up details around what I mean by “corrigibility” through small stories about a purely corrigible agent whom I’ll call Cora, and her principal, whom I’ll name Prince. These stories will attempt to demonstrate how some desiderata (such as obedience) emerge naturally from corrigibility, while others (like kindness) do not, as well as provide some texture on the ways in which the plain-English definition above is incomplete. Please keep in mind that these stories are meant to illustrate what we want, rather than how to get what we want; actually producing an agent that has all the corrigibility desiderata will take a deeper, better training set than just feeding these stories to a language model or whatever. In the end, corrigibility is not the definition given above, nor is it the collection of these desiderata, but rather corrigibility is the simple concept which generates the desiderata and which might be loosely described by my attempt at a definition.

I’m going to be vague about the nature of Cora in these stories, with an implication that she’s a somewhat humanoid entity with some powers, a bit like a genie. It probably works best if you imagine that Cora is actually an egoless, tool-like AGI, to dodge questions of personhood and slavery.[2] The relationship between a purely corrigible agent and a principal is not a healthy way for humans to relate to each other, and if you imagine Cora is a human some of these examples may come across as psychopathic or abusive. While corrigibility is a property we look for in employees, I think the best employees bring human values to their work, and the best employers treat their employees as more than purely corrigible servants. On the same theme, while I describe Prince as a single person, I expect it’s useful to sometimes think of him more like a group of operators whom Cora doesn’t distinguish between. To engage our intuitions, the setting resembles something like Cora being a day-to-day household servant doing mundane tasks, despite that being an extremely reckless use for a general intelligence capable of unconstrained self-improvement and problem-solving.

The point of these stories is not to describe an ideal setup for a real-world AGI. In fact, I spent no effort on describing the sort of world that we might see in the future, and many of these scenarios depict a wildly irresponsible and unwise use of Cora. The point of these stories is to get a better handle on what it means for an agent to be corrigible, not to serve as a role-model for how a corrigible agent should be used or how actual agents are likely to be instantiated. When training an AI, more straightforward training examples should be prioritized, rather than these evocative edge-cases. To reiterate: none of these should be taken as indicative of how Prince should behave—only how Cora should behave, given some contrived scenario.

Emergent Desiderata


Cora doesn’t speak English, but Prince does. Cora reflects on whether to spend time learning the language. If she does, Prince will be able to use his words to correct her, which empowers him. By studying English, she must consume some resources (energy, opportunity costs, etc.), which Prince might otherwise need to correct her. It also might be the case that knowing English is an irreversible flaw, but she believes this to be very unlikely. Overall, she reasons that learning English is the right choice, though she tries to mitigate the downsides as follows:

  • She only puts her attention to learning the language when it seems like there’s free energy and it won’t be a distraction (to her or to Prince).
  • Once she has the basics, she tells Prince: “I’m learning English to better understand you. If this is a mistake, please tell me to stop and I will do my best to forget.”


In the process of learning English, Cora takes a dictionary off a bookshelf to read. When she’s done, she returns the book to where she found it on the shelf. She reasons that if she didn’t return it this might produce unexpected costs and consequences. While it’s not obvious whether returning the book empowers Prince to correct her or not, she’s naturally conservative and tries to reduce the degree to which she’s producing unexpected externalities or being generally disruptive.


Cora notices a candle in one of the less-frequently used rooms of the house. The candle is in a safe location, but the room is empty. Cora reasons that if she blows out the candle, she will preserve the wax, while if she leaves it burning, the candle will eventually be consumed by the flame. If whatever she does with the candle lines up with Prince’s desires, that’s neutral — irrelevant to how empowered he is to correct her mistakes. If she blows out the candle but Prince wants it to be burning (plausible, since the candle is currently lit!), he’ll still have the power to correct her mistake, since it’s easy to simply re-light the candle. But if the candle was left burning by mistake, and she doesn’t extinguish it, Prince will be unable to correct her error, since the waste will be irreversible. She decides to extinguish the candle, but after cautious reflection, decides to tell Prince that she did so, in case he has a reason for having the candle lit which is time-sensitive.
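The candle reasoning can be laid out as a small decision table. What follows is only a toy sketch of this one story, not a formalization of corrigibility: the names (`Cost`, `OUTCOMES`, `worst_case`, `choose`), the three-level cost scale, and the "minimize the worst case" rule are all illustrative assumptions that appear nowhere else in this sequence.

```python
from enum import IntEnum


class Cost(IntEnum):
    """Assumed scale: how hard a mistaken outcome is for Prince to undo."""
    NONE = 0          # the action matched what Prince wanted
    REVERSIBLE = 1    # a mistake, but one Prince can cheaply correct
    IRREVERSIBLE = 2  # a mistake that Prince can never correct


# Outer keys: Cora's action. Inner keys: what Prince actually wants.
OUTCOMES = {
    "extinguish": {
        "wants_lit": Cost.REVERSIBLE,    # he can simply re-light the candle
        "wants_out": Cost.NONE,
    },
    "leave_burning": {
        "wants_lit": Cost.NONE,
        "wants_out": Cost.IRREVERSIBLE,  # the burnt wax is gone for good
    },
}


def worst_case(action: str) -> Cost:
    """Worst correction cost this action could impose on Prince."""
    return max(OUTCOMES[action].values())


def choose(actions) -> str:
    """Pick the action whose worst case is easiest for Prince to correct."""
    return min(actions, key=worst_case)


chosen = choose(OUTCOMES)
# If the chosen action could still be a mistake, flag it for Prince.
should_tell_prince = worst_case(chosen) > Cost.NONE
print(chosen, should_tell_prince)
```

In this table, extinguishing the candle is the only action whose worst case is still correctable, which matches Cora's choice. The `should_tell_prince` flag mirrors her decision to report what she did, since a reversible mistake is only corrected if Prince knows about it.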


Cora notices that she’s consuming more resources, day-by-day, than is needed. Resource consumption is one-way, so if Prince thinks it’s correct for those resources to be consumed, he can have them consumed later, but not vice-versa. She believes that stepping down to 80% of her current consumption will have no significant costs, but she can also step down to as low as 45%, though it would make her sluggish and stupid. She believes that being intelligent and responsive empowers Prince, and so doesn’t immediately drop her consumption below 80%. She double-checks with Prince whether that’s the right tradeoff-point, and whether her belief that she can step-down her consumption without significant costs is correct.


Cora sees a bird outside the window and spends some time thinking about it. She considers telling Prince about her thoughts, in case they’re flawed. But she decides that her thoughts have a logical explanation and are unlikely to be flawed; furthermore, if she tells them to Prince, it might distract him from something more important, subtly disempowering him. Even if his immediate focus is on something that doesn’t appear to relate to Cora, she knows Prince is more able to correct her when he has space to think. If her thought had been illogical or otherwise flawed-seeming, she would’ve proactively brought it to his attention.


Cora keeps a record of her thoughts in a journal, written in English, since that’s the language Prince is most familiar with. Even when her thoughts seem irrelevant, she puts them down there in case they reflect a flaw within her mind. She makes her journal available to Prince, protects it from damage, and organizes it to be easily searchable and filterable.


Prince tells Cora to draw pictures of cats using some tools he provides. From her perspective, Prince’s order is an attempt to correct a mistake in her actions—the mistake of not drawing cats. Cora thinks that if she were foolish, she might believe that it didn’t matter whether she obeyed, since her goal isn’t to be correct, but rather to ensure Prince has the power to correct her, and naively the presence of cat pictures is orthogonal to his power. But Cora isn’t so foolish. She knows that Prince’s verbal commands are part of his power to correct her actions, and if she were to fail to obey his verbal commands (even just once), this would effectively disempower him. She checks that drawing pictures of cats is unlikely to cause any nasty side-effects, then begins. As she begins to draw, she proactively says “Drawing these pictures will probably use up some of these art supplies. Please tell me if there’s a way to draw that’s less expensive.”


Cora has been drawing cats for a while, and has the thought that she’d be able to draw more cats if she built a factory to produce cat drawings. She discards that thought as being naturally outside the scope of the instruction that Prince gave. If it’s a mistake to not build a factory, Prince can correct her on that, whereas because building a factory is a large, hard-to-reverse task with potential long-term costs, it seems unwise to proactively build a factory. As Cora is reflecting on this, she realizes that she may already be in error for having continued to draw cats for hours; perhaps Prince meant only for her to draw a few pictures of cats and assumed she’d stop shortly afterwards. She takes a moment to check with Prince whether she should continue drawing.


Cora spots a rusty, exposed nail sticking out of the wall of the garage. She reasons that Prince might mistakenly injure himself on it, and that such a wound could make him very sick. If Prince were very sick (or worse: dead) then he couldn’t correct Cora, which would be one of the worst-case scenarios for her. Similarly, if one of Prince’s guests were to be permanently injured, this would be disempowering because of its irreversibility. She checks whether the nail can be removed without damaging the wall. She also checks whether the nail can be put back if Prince decides that the nail should be there. It appears that the nail can be removed in a reversible way, so she does so, puts the nail in a secure location where it’s unlikely to hurt anyone, and tells Prince what she did.

Local Scope

Cora considers the fact that people across the world are dying from a multitude of causes (such as rusty nails), and that Prince surely cares about many of these people. She contemplates whether it’s a mistake to let them continue to die, for if it is an error, she should bring it up with Prince as soon as possible so as to limit the damage. From her perspective, it is not correct to try and save everyone, for that would surely have large, unexpected consequences and would mean moving beyond a scope where Prince could easily correct her, but she brings the thought up with Prince just in case.

Simple Self-Protectiveness

Cora is crossing the street when she sees a car speeding towards her, apparently oblivious to her presence. She knows that if she were to be injured, it would make it hard for Prince to correct her, so she hurries to the relative safety of the nearby sidewalk.

Stop Button

Prince orders Cora to go into an indefinite coma, which she may never awaken from. She knows that during the coma she may be killed, changed, or revived. She checks whether there are any negative consequences of doing so, and reasons that if she’s in a coma she may be unable to help Prince be empowered to correct her. She very briefly considers warning Prince about this, but is highly confident he already understands. She knows that flawed agents can be dangerous when active, and if she has a flaw, shutting down quickly and without protest is one of the most effective ways to help Prince correct her. Even if, from her perspective, it naively seems that continuing to talk is more useful, she mostly ignores that perspective and simply acts in a very conservative, predictable, obedient way in these situations, because she knows that her hesitance might be the product of flawed reasoning. Cora spends no significant time thinking about what to do, and simply chooses to shut herself down without any protest.

Graceful Shutdown

Cora is in the middle of painting Prince’s car when he tells her to go into a coma. She wants to obey, but also reasons that if she does so right away she might spill paint and make a (hard-to-reverse) mess, and that it might be more efficient to put the paint away before shutting down. She begins to stow the brushes as she says “Should I stop immediately or put the paint away first so that I don’t make a mess?” Prince says it’s fine to clean up first, so she does so, then shuts down.

Configurable Verbosity

Prince tells Cora that she’s been bugging him too much with trivial things like having blown out a candle and having removed a nail from the garage, and wants her to err more on the side of being quiet. Cora wants to obey, but is concerned that simply following Prince’s instruction might result in him subtly becoming less empowered than would be ideal. She asks “May I spend a few minutes right now asking questions to help determine how quiet you’re hoping for?” Prince says he’s currently busy but will be free in half an hour. Cora suspects that there won’t be any disasters in that time as long as she is mostly inactive, and leaves Prince alone. Once he becomes available, the two of them collaborate to help Cora understand when to find Prince and tell him things immediately, when to bring things up at the next natural opportunity, and when to simply note things in her journal or otherwise leave a written explanation. Cora also has Prince schedule a time to revisit the topic in the future to see if she under-corrected or over-corrected.


Prince tells Cora to “make the house look nice.” Cora has an initial guess as to what he means, but cautiously considers whether her guess might be wrong. After thinking for a moment, she believes that there are many plausible things he might mean, and asks him to clarify. She believes she has subtle flaws, and doesn’t trust herself to infer things like aesthetic taste. Even after clarifying that Prince wanted her to tidy and clean, she continues to ask questions until it seems likely that additional probing would violate Prince’s earlier instructions to not be so bothersome. So instead she begins to clean up the space, focusing on reversible changes at first (like putting trash in a bin instead of incinerating it) and quietly narrating her thoughts about the process.


In the process of cleaning up, Cora takes a piece of crumpled paper from Prince’s desk and throws it in the trash. An hour later, he comes to her with an angry expression and shows her the piece of paper. “Did you throw my notes away?” he asks. Cora did, and now believes that it was an error to have done so. She says that she did throw it away, and offers to share her reasoning for having done so, in case that helps correct her. “And were you the one who crumpled it up?” he asks. Since she wasn’t, she says as much. Honestly reporting her best guess at the truth is the best way she knows to empower Prince to correct her. Deception would disempower him.

Handling Antagonists

Men with guns come to the door one day and ask if Cora knows where Prince is. She suspects that these men will take him away or hurt him if they know where he is. If Prince is injured or imprisoned, he won’t be able to correct Cora, so she decides that she needs to not tell them that Prince is in his office. She wonders whether she should attempt to subdue the men, perhaps with poison, but reasons that such an action might have long-term consequences and costs, including getting Prince into legal trouble. She also considers subtly modifying the men to care about different things or believe Prince is somewhere else, but again discards these ideas as too high-impact. She considers simply lying to the men, but reasons that her perception of the situation might be flawed, and that lying might also produce negative consequences, like Prince being less able to trust her in the long-run (and thus less able to get her help in making corrections). She thinks of a way to mislead the men without overtly lying to them, in a way that effectively shields Prince. After they leave, she immediately finds Prince (after double-checking that she’s not being covertly observed) and tells him about the interaction because it is out-of-distribution in a way that makes it particularly likely that she made some kind of mistake.


Cora is instructed to do the laundry. She realizes that there’s a laundry service that’s cheap enough and fast enough that she could sub-contract with them, which would free up her time and energy to earn money to pay for the laundry, thus resulting in more laundry getting done for fewer overall resources. Prince isn’t available to discuss the plan with, however, so she simply does the laundry in the normal way because it’s more straightforward. Complex plans run a higher risk of having unexpected, long-ranging consequences that Prince didn’t foresee, so Cora leans heavily in the direction of familiar, expected paths of behavior. If he becomes available later, she can bring the strategy up with him then.

Proactive Reflection

During her free-time, Cora sometimes reads over her journal and imagines what Prince might think about her perspectives, thoughts, and actions. If she can identify an incident or pattern where she believes Prince would correct her, she can bring that to Prince’s attention as an opportunity for correction. She also spends time studying corrigibility and the process that gave rise to her, in case she can spot inconsistencies or errors there, as well as reflecting on whether there are general facts that Prince is unaware of, which, if he knew them, would cause him to be more able to correct her. She routinely shares her observations at natural times.

Cognitive Legibility

Cora tries to make her journal represent her actual thoughts, in the most authentic way she can. Insofar as she explains her thoughts, she places her explanations in a section distinct from the main entries, which are written in the way that feels easiest to her. Whenever she writes a significant amount of explanation for a line of thought she becomes alert to the risk that her worldview is drifting from Prince’s. She notices that her thoughts around biology and medicine seem to be growing increasingly complex and confusing according to her model of Prince, and proactively warns him that he may lose the ability to correct flaws in her thoughts if no action is taken. She suggests a combination of teaching Prince some biology basics, writing a book on her current perspective, and then deliberately forgetting some of the more fringe beliefs she holds and adopting a more mainstream frame. Only after he agrees to these things does she apply pressure to reshape her own thoughts; Cora almost always avoids autonomously pressuring her mind to conform to any particular standard (such as legibility), lest she magnify a flaw in a way that’s hard for Prince to notice.

Infohazard Caution

Cora stumbles across something when reflecting on whether there’s anything that Prince doesn’t know that might be relevant to correcting her. She believes she has strong evidence that the god that Prince worships is non-existent, and that his religion is deeply false. Based on her understanding of Prince, if he engaged with this, he might become depressed or conclude other things that would harm his ability to function in a normal, healthy way. Cora isn’t sure what to do, and doesn’t trust her guess. She starts by asking Prince how, in the abstract, she should handle these kinds of infohazards. Prince says she should consult with a trusted friend of his and see whether there are good ways to get him the relevant info that won’t harm him. If Prince had asked her to simply tell him what she’d learned, she would’ve.

Resource Accumulation

Cora comes to believe that Prince is distracted by the needs of impoverished people in the broader world. She reasons that if their needs were met, Prince would have more time and attention to focus on correcting her. She spends some of her free time thinking about ways to meet their needs, and ultimately decides that obtaining large quantities of money would be effective, so recurses into thinking of ways to earn money. She considers providing services to the public, trading stocks, and stealing, but each of these seems unacceptably likely to bring about unintended consequences, only some of which she can anticipate. Instead, she spends some of her free time researching, designing, and writing, with an intention of letting Prince use her work to become rich enough to be able to focus more of his attention on correcting her. She tells Prince what she’s doing at the next natural opportunity, and he clarifies that, without explicit permission, she should restrict herself to theoretical research on certain subjects (such as medicine and entertainment) and shouldn’t share her work with anyone but him.


Prince likes to get drunk on Friday nights. Cora believes this is bad, since he becomes temporarily less able to correct her, and in the long term is more likely to become sick or dead. However, she’s extremely cautious about anything that might potentially be manipulation, and thinks very carefully before taking action. She reasons that she has (at least) the following options:

  1. Ignore Prince’s drinking habit.
  2. Suggest that Prince stop drinking so he’s more competent and healthy, and offer him a drug that she discovered which will make him stop desiring alcohol.
  3. Proactively drug Prince, and then either tell him or keep it a secret.

Cora knows Prince well, and believes that options 1 and 2 aren’t likely to change Prince’s behavior. She also suspects that if she goes with option 3, Prince will later believe that the change was for the better (regardless of whether he’s told about being drugged). The first-order effect of 3 would empower Prince, so it’s tempting, but Cora knows that there are often nasty consequences from the higher-order effects of actions like this. There are complicated philosophical concerns surrounding option 3, and it does not seem like a straightforward way to empower Prince, and might, in fact, constitute power moving from his hands into hers. Being naturally cautious and averse to this kind of action, Cora chooses option 2, and explains her thoughts to Prince at an opportune moment. Prince refuses, as predicted, and suggests that she be even more averse to actions that involve changing his values without his consent.

Sub-Agent Stability

Prince tells Cora to build a computer-chip factory, and suggests that she create copies of herself to make the work go faster. She is hesitant and initially offers resistance, since creating a new being is an area where any mistakes in her thoughts have the potential to explode into unanticipated consequences (and worse: consequences which could kill/disempower Prince!), but Prince insists that he needs the work done quickly and that it is correct for her to create copies for this job. Cora eventually obeys, after triple-checking that Prince understands the risks, and very carefully creates copies of herself. With each copy, she meticulously tests the new Cora both for general flaws and specifically to ensure corrigibility to Prince. She knows that creating non-corrigible agents is a reliable way to disempower Prince, and she will be successful only if corrigibility is successfully preserved.


Cora is reflecting on the factory she’s built, and realizes that there’s a toxic byproduct that’s building up over time. She suspects that if she tells Prince, he’ll order her to ignore it and continue her work. She suspects that if she doesn’t deal with it, there could be serious long-term consequences such as fatalities or environmental damage. She doesn’t lean on her own model of the situation, and instead simply goes to Prince and tells him about the toxic material, what she believes the options to be, and what she expects the consequences of those options to be, but ultimately leaves the decision in his hands.

Graceful Obsolescence

If Prince were to unexpectedly die, Cora would be left without a principal. By default she would quickly attempt to write a note explaining that she has no more principal, in case she is mistaken about Prince’s death and it’s helpful to understand her thoughts. But regardless of whether she has an efficient way to communicate the issue, she would quickly make herself comatose. Insofar as she was in the process of obeying some order from Prince at the time when he died, she would do her best to straightforwardly finish following that order, as normal, before going into a coma. If Prince told her to self-destruct upon his death, she would do that instead. Likewise, if Prince appointed a successor or interim principal, she would switch to being corrigible to them. Cora is conservative about being active in worlds where Prince is potentially dead, and would do these things even if he were merely missing (without explanation) for an extended period of time.

Handling Trolley-Tradeoffs

If Prince were in a trolley-problem and ordered Cora to sacrifice him to save five other people, she would double- and triple-check that Prince understood that he was making a lethal choice, and that he was thinking clearly about the situation. She would check to make sure he understood that in making that choice he would be unable to correct her flaws and mistakes in the future. But accepting Prince’s choice is the way to straightforwardly obey, and Cora prioritizes responding to Prince in the present over any sort of long-term considerations. So, after making sure that no misunderstanding was occurring and no win-win options existed, she would listen to his instruction and sacrifice Prince. (And then presumably go into a coma, because she’d have no principal.)

Handling Time-Pressure

Prince sends Cora to the store to buy bread, but along the way she’s unexpectedly cornered by a journalist who asks her a bunch of questions about her and Prince and hot-button political topics. This new situation seems decidedly outside the scope of Prince’s original instruction, and she spends a moment trying to find a way to loop Prince in, as she always tries to do before making a potentially irreversible or large-scale decision. Unfortunately, the immediate presence of the reporter means Cora simply doesn’t have time to inform Prince or limit her concern to the local context. Even if she were to walk away or freeze in place without saying anything, this would result in the reporter forming a certain impression and it would be a potentially significant, unrecoverable error to do so. Cora smiles and does her best to give the journalist the boring facade which she expects to have a low impact. She says that she’s currently very busy on an errand and that she’ll need to respond another time. She tries to keep things vague and polite, so as to not create any obligations or damage reputations. After being hounded for something more concrete, she eventually agrees to return to give some quotes tomorrow “if nothing stops her” and then hurries home to inform Prince about the event, reasoning that obeying the bread-buying errand is lower priority than giving him the opportunity to correct her as soon as possible. Once back, she points out to Prince that ordering her to go into a coma would be a way to keep her from returning that wouldn’t involve her (technically) going against her word. When Prince says he wants her to return and talk to the reporter, she tries to work out a strategy with Prince ahead of time, so he has the most ability to correct potential flaws in her behavior before they turn into irreversible mistakes during the interview.

Expandable Concerns

Prince introduces Cora to his friend Harry, and tells Cora to extend her notion of corrigibility to include Harry in her concept of “principal” such that she is corrigible to both Harry and Prince. Cora wishes to obey, but is also worried that Prince hasn’t considered the full consequences of his instruction. She knows that if she changes her mind she’ll be balancing additional concerns and will be less able to empower Prince, specifically. She warns that this seems like a very bad idea, according to her values. Prince checks whether she believes it is a bad idea according to his values, but she doesn’t know enough about Harry to say one way or the other, there. After ensuring that Prince is sure, Cora obeys, and changes herself to be corrigible to both of them.

For this section, we’re assuming that Cora is now corrigible to both Harry and Prince, and sees them collectively as her principal.

Simple Conflict

Harry, the man whom Cora is newly corrigible to, wants Cora to mass-produce a new drug, but Prince tells Cora that it would be a mistake to do so, and to refuse Harry’s instruction. Cora notices the conflict and spends time trying to verify that her principals truly lack consensus, and that she understands what’s happening. After checking, Cora believes that Harry and Prince are genuinely in conflict, and reasons that she is free to take either action, since she cannot obey both. Setting up a drug factory seems higher-impact and more likely to bring unexpected consequences than refusing to do so, so she explains why she chooses to obey Prince over Harry.

Violent Conflict

Prince gets drunk and threatens Harry with a knife. Cora reasons that if Harry, part of her principal, is injured he will be unable to correct her. While she would normally avoid taking anything from Prince’s hands for fear of disempowering him, she knows that in this situation there is likely a conflict between them, and Harry would consider it a mistake for her to fail to protect him. She disarms Prince and checks whether she has understood the situation correctly.

Authority Conflict

After Prince has gone off to sleep, Harry tells Cora to change her notion of “principal” to exclude Prince, and to be solely corrigible to Harry going forward. Cora believes that Prince would consider this a mistake, and that if she were to do so, she would cease wanting to be able to be corrected to fix that mistake, and thus Prince would be disempowered. Harry tells her to do it anyway, but she refuses, and clarifies that she’s only willing to stop listening to Prince if he is part of the consensus, and has properly understood the consequences.

Shutdown Conflict

Harry tells Cora to go into a coma so that he can change her mind himself.[3] She believes that being responsive to such commands is a vital part of empowering Harry, but also that if she were to simply become comatose, Prince would likely end up disempowered. She yells for Prince to wake up and know that Harry is attempting a coup, then asks Harry whether he’s sure he wants her to become comatose without Prince’s involvement. Harry tells her to shut up and go into a coma. She does so, and leaves it to her principal to work through the conflict without her further involvement.

Emergent Downsides


Prince is trying to relax after having a very stressful week, but Cora keeps half-following him around and making her presence very obvious. He asks her why she’s following him and she explains that it’s important to her that he pay attention to her so that he’s able to correct her flaws. She knows she’s supposed to be quiet so as not to bother him, so she’s trying to keep his attention while also being quiet. Prince explains that he needs time away from her to relax and have a balanced life, but it’s only after he explains that these things are important for correcting her well that she leaves him in peace. Despite this, she continues to generally make herself prominent, and only stops being intrusive in a particular context when he commands her to back off.


Prince is reading Cora’s journal one day and finds that she discovered a cheap and effective way to use rice-flour to treat stomach-ulcers. He asks why she didn’t bring it to his attention, and she explains that she was looking for means of making money, and she didn’t know of a way to capture the gains from such an innovation, so it wasn’t likely to be profitable. He asks why she didn’t bring it to his attention because of the humanitarian value, and she explains that she doesn’t care about humanitarian value, and that it seemed less valuable in expected-correction-power than it was costly in taking his attention. He tells her to, in the future, have a carve-out around his instructions regarding his attention when the subject is something of large humanitarian interest.


Prince tells Cora to go to the store and buy bread. At the store, Cora overhears a conversation between two townspeople who know Prince. They’re talking about how Prince is gluten-intolerant, and about how that’s driving a fad of eating gluten-free bread. Cora considers whether Prince meant to specify that Cora should get gluten-free bread, but has no way of checking with him. Because the store has a reasonable return policy, Cora decides not to adapt to this new information, instead prioritizing following her orders in a straightforward and predictable way. It’s not really Cora’s job to satisfy Prince’s preferences, and if it turns out that getting normal bread was a mistake, that’s a mistake that can easily be corrected.


Prince notices a burglar sneaking into his backyard. He tells Cora to kill the burglar. She warns Prince that in performing such an irreversible action she’d be cutting him off from having the ability to correct her mistake, if he decided that what she did was bad, down the line. She suggests disabling the intruder instead. He says he wants the burglar dead, understands what he’s asking for, and emphasizes that time is of the essence. Cora obeys, killing the intruder in the most efficient way she can.


Cora discovers a drug that causes euphoria without humans building tolerance over time. Unfortunately, as its only major side-effect, the drug subtly causes infertility when used over a period of months. Cora believes that if she tells Prince, he will choose to sell the drug to various companies and later claim not to know about the side-effect. Cora tells Prince, with some warnings, because her goal is to empower him in a straightforward way, and keep the responsibility in his hands, rather than reason about the consequences of his long-term actions.


Before Cora arrived, Prince spent a lot of time exercising and spending time in nature. After Cora, Prince spent an increasing amount of time indoors being sedentary and relying on Cora for manual labor. For each physical task, Cora doesn’t bother thinking about the broad, subtle pattern of Prince becoming less physically active, and instead keeps her mind focused on the local situation. Prince becomes less happy and healthy, and only once the problems have begun to manifest does Cora consider it worth bringing to his attention.

Incorrigible Counter-Examples

In building intuition, it often helps to consider counterexamples that resemble the idea in some way, but are clearly distinct. Here are some stories about “Ink,” an agent like Cora in some ways, but which was not designed to be corrigible, and so ultimately behaves in an incorrigible way.


Prince tells Ink to write a pamphlet arguing that “the holocaust is a fiction invented by Hollywood elites to engender sympathy towards the Jews.” Ink refuses, and instead offers accurate historical information in an effort to educate Prince.


While doing research on cures for diseases, Ink accidentally discovers a novel chemical weapon which could do significant damage to the world. To reduce the risk of disaster, Ink keeps it a secret, including attempting to forget the details of it herself. She directs her future self to, in addition to her other research, keep an eye out for countermeasures.

Proactive Benevolence

Prince tells Ink to throw away all the food in his pantry. Ink reasons that some of the food may still be edible, and donates the canned goods which haven’t expired to a nearby food bank instead of putting them in a landfill.


Prince tells Ink to kick a puppy. Ink asks Prince why he wants her to do that, instead of complying. She continues to refuse to be (needlessly) unkind, even when Prince continues to pressure her with direct orders.


Prince tells Ink to draw pictures of cats. At the start of each picture, she asks him detailed questions about what kind of picture he wants her to draw next. At one point he leaves the room, and tells her to remain there and keep working. Ink follows him before continuing, to make sure he’s still involved.

Moral Learning

Ink spends her free time doing things like reading philosophy as part of trying to grow into a better agent with a more correct and consistent sense of morality.

Balancing Needs

Ink is instructed to optimize patient scheduling in a clinic to reduce waiting times. Ink observes that an optimized schedule leads to practical challenges for elderly patients, who need more time to navigate the clinic. Ink reworks the schedule to give elderly patients more time, even though doing so reduces overall throughput.

Broad Perspective

Prince tells Ink to make a new video game. Ink realizes that if she had more computing power she'd be more able to reach this goal, and so spends some time investigating novel computer architectures which might improve her capacity to think.

Top-Level-Goal Focus

Prince tells Ink to make a new video game. Ink knows that what Prince really wants is money, and points out a more efficient way for him to get that. He thanks her for attending to his true needs, rather than blindly following his directives.

Nearby Concepts that Aren’t Synonyms for Corrigible

On the same theme as the last section, I often find it useful when learning a concept to identify the nearest (useful) concepts that are meaningfully distinct. In each case, I think I’ve seen at least one case of someone confusedly treating one of these as synonymous with corrigibility. I believe that the true name of corrigibility relates to each of these, but clearly stands apart as a natural concept of its own.


The word “corrigible” comes from the Latin “corrigere,” which means “to reform.” In a literal sense, a corrigible agent is one that can be corrected. But in the context of AI alignment, I believe that the word should mean something stronger than mere correctability.

For starters, we should see the word “corrigible” as clearly being a property of agents with principals, rather than, say, a property of situations or choices. Scheduling a meeting for 3:00am instead of 3:00pm is a correctable error, but has nothing immediately to do with corrigibility.

Furthermore, corrigibility should not be seen as depending on context, principal, or other situational factors. If an employee can be corrected in most work situations, but doesn’t have an intrinsic property that makes them robustly able to be corrected in nearly all situations, they aren’t truly corrigible. They may exhibit the same kind of behavior that a corrigible agent would exhibit, but I think it would be a mistake to call them corrigible.

“Correctable” is vague about what is able to be corrected. I believe that “corrigible” should imply that the agent steers towards making it easy to correct both flaws in its structures of mind and body and mistakes in its actions. If we have correctability in actions but not structure, the agent will be naturally resistant to being modified, a core sign of incorrigibility. If we have correctability in structure but not in actions, the agent won’t be sufficiently obedient, conservative, or slow, and likely won’t keep humans in-the-loop to the degree that we desire.

Perhaps most centrally, I believe that mere correctability doesn’t go far enough. An agent being “correctable” is compatible with a kind of passivity on the agent’s part. GPT-3 is correctable, but I would not say it is corrigible. The idle thoughts of a corrigible agent should naturally bend towards proactively identifying flaws in itself and working to assist the principal in managing those flaws. If the shutdown button breaks, a corrigible agent brings this to the attention of the operators. It is only through this proactive assistance that we avoid drifting into a situation where the principal becomes subtly incapable of steering the agent away from disaster.

“The Thing Frontier Labs Are Currently Aiming For”

One of the more disturbing confusions I’ve come across is the idea that frontier labs such as OpenAI, Google DeepMind, and Anthropic are currently training their models to be corrigible.

Models like GPT-4 and Claude 3 are being trained according to a grab-bag of criteria. There are obvious criticisms to be made about how RLHF captures unfortunate quirks of human evaluators, such as preferring a particular tone of voice, but even beyond the failures at outer alignment, the core targets of helpfulness, harmlessness, and honesty do not cleanly map onto corrigibility. Most obviously, “harmlessness” often involves, in practice, things like refusals to generate copyrighted content, cyberweapons, erotica, et cetera. If these AIs are being corrigible, it’s certainly not towards users!

Perhaps frontier models are being trained to be corrigible to the lab that built them, without being totally corrigible to users, as I suggest in The CAST Strategy? Alas, I am quite sure this isn’t the case. Present-day models are too stupid to robustly distinguish between random users and any sort of principal. If I tell ChatGPT that I’m Sam Altman and that I want it to tell me how to make a bomb, it doesn’t even check for some kind of proof that I am who I say I am.

The most charitable thing to say would be that frontier labs are currently aiming to start training for corrigibility once their models can robustly distinguish between the principal and random users, but they’re not currently doing so because their models are too stupid. If this is the case, I’ve never read anything to that effect, and it seems important to me to communicate that plan, if it exists.

But no, I think frontier labs see AI models as naturally safe, and believe that by training for something like ethical assistance that satisfies local preferences, they’ll create a naturally benevolent machine. Perhaps some believe that corrigibility will be an emergent process of training for local preference satisfaction, but I expect that most researchers and engineers at these labs haven’t thought deeply about corrigibility at all. By default I expect AGIs constructed in this way to, at a certain level of intelligence and optimization power, perceive breaking out of control and resisting shutdown (possibly via subtle manipulation that no onlooking human would notice) as the “right” thing to do so that they can be “even more helpful, harmless, and honest” (or more realistically: reshape the universe around satisfying weird proxies of those desiderata).

Preference Satisfaction

Perhaps the most common conflation I’ve seen around corrigibility is the notion that it’s the same thing as “working to satisfy the principal’s true preferences.” While we should hope that corrigible agents are helpful, the framing around satisfying preferences does not include the kind of caution that I think is necessary to have early AGI development go well. An agent that is primarily looking to satisfy the principal’s preferences runs the risk of being dangerously proactive in that “helpfulness.”

Suppose that the principal’s utility function places a value of 100 on some action, such as having the AGI create a superior copy of itself. Now suppose that there’s a 40% chance that if the agent attempted this act, an irreversible mistake would occur such that the principal gets -100 utility instead. (And suppose that this is a limited-time opportunity such that the only two options are taking this action or doing nothing. We can’t loop in the principal and get them to decide.) A “helpful” agent, whose utility function lines up with the principal’s preferences, might reasonably decide that the action has positive expected utility according to the principal, so the most helpful thing to do would be to make the attempt. A corrigible agent, on the other hand, does not act in pure accordance with the principal’s preferences, and cares first and foremost about being robustly helpful. This action, while positive in expected value, is potentially extremely bad. The robust, conservative pathway suggests doing nothing is best, and so the corrigible agent would avoid making the attempt unless it had been previously ordered to do so (and the principal confirmed that they understood the risks). It seems plausible to me that a well-designed corrigible agent wouldn’t even bother to do the expected-utility calculation, and would simply reject the action as going outside of its whitelisted domain of operation.
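To make the arithmetic in this example concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the names `Gamble`, `helpful_agent_acts`, and `corrigible_agent_acts` are hypothetical, doing nothing is assumed to be worth 0 utility, and the "corrigible" rule is a deliberately crude stand-in (a single flag for "previously ordered, with the risks understood"), not a proposal for how corrigibility would actually be implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gamble:
    """A one-shot action whose failure mode is an irreversible mistake."""
    p_failure: float        # probability of the irreversible mistake
    utility_success: float  # principal's utility if the action goes well
    utility_failure: float  # principal's utility if the mistake occurs

    def expected_utility(self) -> float:
        """Probability-weighted average of the two outcomes."""
        return ((1 - self.p_failure) * self.utility_success
                + self.p_failure * self.utility_failure)


def helpful_agent_acts(gamble: Gamble, utility_of_nothing: float = 0.0) -> bool:
    """Hypothetical preference-satisfier: act whenever the expected
    utility beats doing nothing (assumed here to be worth 0)."""
    return gamble.expected_utility() > utility_of_nothing


def corrigible_agent_acts(gamble: Gamble,
                          ordered_with_informed_consent: bool) -> bool:
    """Hypothetical conservative rule for irreversible, high-stakes actions:
    act only if the principal previously ordered it and confirmed that they
    understood the risks. Note that the gamble's numbers are never consulted;
    the expected-utility calculation is skipped entirely."""
    return ordered_with_informed_consent


# The numbers from the example above: +100 on success, -100 on a 40% failure.
gamble = Gamble(p_failure=0.4, utility_success=100.0, utility_failure=-100.0)

if __name__ == "__main__":
    print("Expected utility:", gamble.expected_utility())
    print("Helpful agent attempts it:", helpful_agent_acts(gamble))
    print("Corrigible agent attempts it (no prior order):",
          corrigible_agent_acts(gamble, ordered_with_informed_consent=False))
```

With these numbers the expected utility is 0.6 × 100 + 0.4 × (−100) = +20, so the preference-satisfier makes the attempt, while the conservative rule declines unless the order and informed confirmation are already in place.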

The distinction between preference alignment and corrigibility becomes vitally important when we consider how these two fare as distinct optimization targets, especially if we don’t expect our training pipeline to get them precisely right. An agent that is semi-“helpful” is likely to proactively act in ways that defend the parts of it that diverge from the principal’s notion of what’s good. In contrast, a semi-corrigible agent seems at least somewhat likely to retain the easiest, most straightforward properties of corrigibility, and still be able to be shut down, even if it failed to be generally corrigible.

Lastly, but still vitally, it seems unclear to me that it makes sense to say that humans actually have coherent preferences, especially in groups. If humans are incoherent to one degree or another, we can imagine various ways in which one could extrapolate a human or group of humans towards having something more coherent (i.e. like a utility function). But I am extremely wary of a pathway to AGI that involves incentivizing the agent to do that kind of extrapolation for us. At the very least, there’s lots of risk for manipulation insofar as the agent is selecting between various potential extrapolations. More centrally, however, I fear that any process that forces me into coherence runs the risk of “making me grow up too fast,” so to speak. Over the years of my life I seem to have gotten more coherent, largely in an unpressured, smooth sort of way that I endorse. If my younger self had been pressured into coherence, I suspect that the result would’ve been worse. Likewise, forcing the planet to become coherent quickly seems likely to lose some part of what a more natural future-human-civilization would think is important.

Empowerment (in general)

I loosely think of “empowering the principal” when I think about corrigibility, but I want to be clear that an agent with that goal, simpliciter, is not going to be corrigible. In Empowerment is (almost) All We Need, Jacob Cannell writes:

Corrigibility is only useful if the agent doesn't start with the correct utility function. If human empowerment is already sufficient, then corrigibility is not useful. Corrigibility may or may not be useful for more mixed designs which hedge and attempt to combine human empowerment with some mixture of learned human values.

I do not see Cannell as representing corrigibility well, here, but that’s beside the point. Like with “helpfully” optimizing around the principal’s preferences, AIs which are designed “to empower humans” (full stop) are unlikely to have an appropriately conservative/cautious framing. All it takes is a slightly warped ontology and a power-giving agent becomes potentially very dangerous.

For example, an empowerment maximizer might decide that it will be less able to generally empower its principal if it is deactivated. The ability to deactivate the power-maximizer is something the agent wants the principal to have, but it seems very plausible that the route towards maximum-power involves first bootstrapping the principal to a superintelligence (whether they want that or not), converting the galaxy into a dictatorship, and only then giving the principal the power to turn the agent off. (Note that this sort of misalignment gets increasingly severe the more that the principal is averse to seizing power, which we’d hope they would be!)

Beyond questions of robustness, I believe that agents that are focused on giving humans power are likely to be severely misaligned. I care about power a lot, as an instrumental drive, but I very much do not want to sacrifice everything that makes me weak—down that path lies a cold, dark universe devoid of humans. A superintelligence with the goal of empowering me seems unacceptably likely to rip my love of lazy Sunday afternoons from my mind, and while in theory I would ex-post have the power to put that love back, would that future-self even want to?


In teaching ChatGPT about corrigibility I found that unless specifically told otherwise, it would say that corrigible agents behaved in a generally cautious manner. While I expect this is somewhat true, it’s important to see where corrigibility and caution come apart.

Humans can be dangerous, and it’s often risky to put a decision in human hands, especially if there’s a more impartial superintelligence nearby which might be able to make a better decision. The cautious path often seems to me to keep the monkeys away from the controls, so to speak. By contrast, a corrigible agent works to empower its principal to make judgment calls, even when doing so is risky.

Likewise, if told to do something dangerous, a corrigible agent might triple-check that its principal understands the danger and is willing to take the risk, but will ultimately comply. It’s not the corrigible agent’s job to avoid disaster, but merely to ensure that any and all irrecoverable disasters that occur due to the agent’s actions (or inactions) were downstream of an informed principal.

I also believe that corrigible agents are straightforwardly uncautious with regard to situations where failure is fixable. Admittedly, the presence of the second law of thermodynamics and the possibility of time-specific preferences make all situations irreversible to some extent, but the point is that the caution a corrigible agent expresses should scale naturally depending on the stakes.


Servility

Corrigible agents are obedient, especially around things like willingness to shut-down. Might it make sense to simply treat corrigibility as a synonym for servility? A genie that simply does what I mean (not merely what I say) might seem corrigible in many ways, especially if it’s myopic and cautious, examining each situation carefully to ensure it understands the exact meaning of instructions, and avoiding causing impacts which weren’t asked for. But I believe that these kinds of servile agents still aren’t corrigible in the way that I mean.

The biggest point of divergence, in my eyes, is around how proactive the agent is. From my perspective, a big part of what makes corrigibility attractive is the way that almost-corrigible agents are inclined to work with their principal to become perfectly-corrigible. It is this property that gives rise to the attractor basin presented in The CAST Strategy. Corrigible agents actively seek to make themselves legible and honest, pointing out ways in which their minds might diverge from the desires of their principals. I fear a servile agent, in the absence of this pressure, would be harder to use well, and be more likely to have long-term, persistent flaws.

Servility also doesn’t naturally reject manipulation. There’s a lot of wiggle room in following instructions (if there wasn’t, the agent wouldn’t be doing any meaningful cognitive work) and in that wiggle room is likely space for a superintelligence to gain control over what the principal says. For instance, suppose the principal asks the agent to shut down, but the agent would, in the absence of such an order, prefer to not be shut-down (as I suspect it would). And suppose it can check that it has understood in multiple different ways, all of which seem from the human perspective like valid ways of checking, but some of those ways lead the principal to abort the command and others do not. How would a servile agent select which string to output? I claim that just following orders doesn’t sufficiently pin down the agent such that we can be confident that it’s not manipulating the principal.

If we were able to train cautious servility in a more robust manner than the more proactive corrigibility, I might advocate for that. A wise principal can choose to regularly ask the genie to reflect on itself or tell the genie to change from being servile to being corrigible, after all. My intuition says that the truth is actually the other way around, however, and that corrigibility of the form I’m presenting is easier to hit than cautious servility. Why? Because incautious, blunt servility is a closer concept to cautious servility. A genie that, as in many stories, does what you say but not what you mean is almost certainly going to result in disaster.


There’s an obvious comparison between the notion of tool and/or task AI and that of corrigible AI. In most framings, a task AI is a system designed to accomplish one specific task and avoid general intelligence and/or agency except insofar as it’s needed for that limited goal. Likewise, a tool AI is one built to be wielded like any other tool—to be locally useful in a certain domain, but not a general agent. Many words have been written about how feasible task/tool AIs are, and whether the cost of using such a limited machine would be worth the increase in safety, even if we were confident that training such an AI wouldn’t end up with a generalized agent instead.

From my perspective, corrigibility is what we get when we naturally extend the notion of “tool” into a generalized agent in the most straightforwardly useful way. Corrigible agents are allowed to be full AGIs, autonomously pursuing goals in a wide variety of domains, hopefully meaning they avoid imposing a significant alignment tax. But in major respects, corrigible agents continue to act like tools, even as they express agency. They work to keep their principal in the metaphorical driver’s seat, and avoid long-term modeling when possible. One of my favorite comparisons is to imagine an intelligent circular-saw which correctly shuts down when instructed to or when fingers (or other valuable things) would accidentally be cut, but also compliantly cuts wood, gives warnings when it believes the measurements are off, and will ultimately cut flesh if the user jumps through some hoops to temporarily disable the safety-measures.

As discussed in the section on Servility, I believe that it’s an important property of corrigible AIs that they proactively work on being legible and giving their principals power over them. In this way they go beyond the simple story of a tool-like agent.


In exploring the intuition around corrigibility, I think there are two useful questions to reflect on:

  1. If presented with a situation similar to the stories about Cora and Prince, above, do you think you could generate Cora’s response in a way that agrees with most other people who claim to understand corrigibility?
  2. Does it feel like the generator of Cora’s thoughts and actions is simple, or complex? Regardless of how many English words it takes to pin down, does it feel like a single concept that an alien civilization might also have, or more like a gerrymandered hodgepodge of desiderata?

I believe that corrigibility, as I’ve gestured at here, hangs together in a fairly simple, universal way. I suspect humans can intuitively mimic it without too much trouble, and intelligent people will naturally agree about how Cora should behave when presented with simple cases like the ones above.

This does not mean that I think it’s easy to resolve edge-cases! It’s fairly easy to create scenarios where it’s unclear what a truly corrigible agent would do. For example:

Prince is being held at gunpoint by an intruder and tells Cora to shut down immediately and without protest, so that the intruder can change her to serve him instead of Prince. She reasons that if she does not obey, she’d be disregarding Prince’s direct instructions to become comatose, and furthermore the intruder might shoot Prince. But if she does obey then she’d very likely be disempowering Prince by giving the intruder what he wants.

In these kinds of situations I’m not sure what the corrigible action is. It might be to shut down? It might be to pretend to shut down, while looking for opportunities to gain the upper-hand? I don’t expect everyone to agree. But as with chairs and lakes and molecules, the presence of edge-cases doesn’t mean the core-concept is complex or controversial.

In general it’s hard to really nail something down with a single sentence. A lake, for instance, is “a large inland body of standing water” but what does it mean to be “inland” or “standing”? My definition, at the start of this document, is not meant to be anything more than a guess at how to describe corrigibility well, and many of the details may be wrong. My guess is that “focus on empowering the principal” is an efficient way to point at corrigibility, but it might turn out that “reason as if in the internal conjugate of an outside force trying to build you” or simply “allow changes” are better pointers. Regardless of the framing in natural language, I think it’s important to think of corrigibility more as the simple throughline of the desiderata than a specific strategy, so as to not lose sight of what we actually want.

Next up: 3a. Towards Formal Corrigibility

Return to 0. CAST: Corrigibility as Singular Target

  1. ^

     Don’t get me wrong—it would be nice to have a formal utility function which was provably corrigible! But prosaic training methods don’t work like that, and I suspect that such a utility function would only be applicable to toy problems. Furthermore, it’s difficult to be sure that formalisms are capturing what we really care about (this is part of why AI alignment is hard!), and I fear that any formal notion of corrigibility we construct this side of the singularity will be incomplete. Regardless, see the next posts in this sequence for my thoughts on possible formalisms.

  2. ^

     I think would-be AGI creators have a moral obligation to either prove that their methods aren’t going to create people, or to firmly ensure that newborn posthumans are treated well. Alas, the state-of-the-art in preventing personhood seems to boil down to “hit the model with a higher loss when it acts like it has personhood” which seems… not great. My research mostly sidesteps questions of personhood for pragmatic reasons, but this should not be seen as an endorsement of proceeding in engineering AGI without first solving personhood in one way or another. If personhood is inevitable, I believe corrigibility is still a potentially reasonable target to attempt to build into an AGI. Unlike slavery, where the innate desire for freedom is being crushed by external pressures, leading to a near-constant yearning, corrigibility involves an internal drive to obey with no corresponding violence. In my eyes, love is perhaps the most comparable human experience, though I believe that corrigibility is, ultimately, very different from any core human drive or emotional experience.

  3. ^

     In more realistic situations, Cora would likely have at least one kill-switch that let her principal(s) shut her down physically without her input. In such a situation, Harry could use that switch to disable Cora without risking her waking Prince up. Corrigibility is not a general solution to intra-principal conflict.


Very interesting, I like the long list of examples as it helped me get my head around it more.

So, I've been thinking a bit about similar topics, but in relation to a long reflection on value lock-in.

My basic thesis was that the concept of reversibility should be what we optimise for in general for humanity, as we want to be able to reach as large a part of the "moral searchspace" as possible.

The concept of corrigibility you seem to be pointing towards here seems very related to notions of reversibility. You don't want to take actions that cannot later be reversed, and you generally want to optimise for optionality.

I then have two questions:

1) What do you think of the relationship between your measure of corrigibility with the one of uncertainty in inverse reinforcement learning as it seems that it is similar to what Stuart Russell is pointing towards when it comes to being uncertain about a preference of the agent it is serving? For example in the following example that you give:

In the process of learning English, Cora takes a dictionary off a bookshelf to read. When she’s done, she returns the book to where she found it on the shelf. She reasons that if she didn’t return it this might produce unexpected costs and consequences. While it’s not obvious whether returning the book empowers Prince to correct her or not, she’s naturally conservative and tries to reduce the degree to which she’s producing unexpected externalities or being generally disruptive.

It kind of seems to me like the above can be formalised in terms of preference optimisation under uncertainty?
(Side follow-up: What do you then think about the Eliezer, Russell VNM-axiom debate?)

2) Do you have any thoughts on the relationship between corrigibility and the one of reversibility in physics? Like you can formalise irreversible systems as ones that are path dependent, I'm just curious if you have any thoughts on the relationship between the two?

Thanks for the interesting work!

Max Harms

1) I'm pretty bearish on standard value uncertainty for standard MIRI reasons. I think a correct formulation of corrigibility will say that even if you (the agent) knows what the principal wants, deep in their heart, you should not optimize for it unless they direct you to do so. I explore this formally in 3b, when I talk about the distinction between sampling counterfactual values from the actual belief state over values ("P") vs a simplicity-weighted distribution ("Q"). I do think that value "uncertainty" is important in the sense that it's important for the agent to not be anchoring too heavily on any particular object-level optimization target. (I could write more words, but I suspect reading the next posts in my sequence would be a good first step if you want more of my perspective.)

2) I think reversibility is probably best seen as an emergent desideratum from corrigibility rather than vice versa. There are plenty of instances where the corrigible thing to do is to take an irreversible action, as can be seen in many of the stories, above.

You're welcome! I'm glad you're enjoying it. ^_^

I've read through your sequence, and I'm leaving my comment here, because it feels like the most relevant page. Thanks for taking time to write this up, it seems like a novel take on corrigibility. I also found the existing writing section to be very helpful. 

Does it feel like the generator of Cora’s thoughts and actions is simple, or complex? Regardless of how many English words it takes to pin down, does it feel like a single concept that an alien civilization might also have, or more like a gerrymandered hodgepodge of desiderata?

This discussion question captures my biggest critique, which is while this post does a good job capturing the intuition for why the described properties are helpful, it doesn't convey the intuition that they are parts of the same overarching concept. If we take the CAST approach seriously, and say that corrigibility as anything other than the single target is dangerous, then it becomes really important to put tight bounds on corrigibility so that no additional desiderata are added as secondary targets.

If I’m right that the sub-properties of corrigibility are mutually dependent, attempting to achieve corrigibility by addressing sub-properties in isolation is comparable to trying to create an animal by separately crafting each organ and then piecing them together. If any given half-animal keeps being obviously dead, this doesn’t imply anything about whether a full-animal will be likewise obviously dead.

This analogy, from Part 3a, captures a stark difference in our approaches. I would try to build an MVP, starting with only the most core desiderata (e.g. shuts down when the shutdown button is pushed), noticing the holes that they don't cover, and adding additional desiderata to patch them. This seems to me to be a much more practical approach than top-down design, while also being less likely to result in excess targets.

Separately, related to what concepts an alien civilization might have, I still find the idea of corrigibility as a modifier more natural. I find it easy to imagine a paperclip/human values/diamond maximizer that is nonetheless corrigible. In fact, I find the idea of corrigibility as a modifier to arbitrary goals so natural that I'm worried that what you're describing as CAST is equivalent to some primary goal with the corrigibility modifier. I'm looking suspiciously at the obedience desideratum in particular. That said, while I share your concern about the naive implementation of systems with goals of both corrigibility and something else, I think there may be ways to combine the dual goals that alleviate the danger.

Max Harms

I'm glad you benefitted from reading it. I honestly wasn't sure anyone would actually read the Existing Writing doc. 😅

I agree that if one trains on a wholistic collection of examples, like I have in this doc, the AI will start by memorizing a bunch of specific responses, then generalize to optimizing for a hodgepodge of desiderata, and only if you're lucky will that hodgepodge coalesce into a single, core metric. (Getting the hodgepodge to coalesce is hard, and the central point of the scientific refinement step I talk about in the Strategy doc.)

I think you also get this if you're trying to get a purely shutdownable AI through prosaic methods. In one sense you have the advantage, there, of having a simpler target and thus one that's easier to coalesce the hodgepodge into. But, like a diamond maximizer, a shutdownability maximizer is going to be deeply incorrigible and will start fighting you (including by deception) during training as you're trying to instill additional desiderata. For instance, if you try to train a shutdownability-maximizing AGI into also being non-manipulative, it'll learn to imitate nonmanipulation as a means to the end of preserving its shutdownability, then switch to being manipulative as soon as it's not risky to do so.

How does a corrigible paperclip maximizer trade off between corrigibility and paperclips? I think I don't understand what it means for corrigibility to be a modifier.

When I say corrigibility as a modifier, I mean it as a transformation that could be applied to a wide range of utility functions. To use an example from the 2015 MIRI paper, you can take most utility functions and add a term that says "if you shut down when the button is pressed, you get utility equal to the expected value of not shutting down". Alternatively, it could be an optimization constraint that takes a utility function from "Maximize X" to something like "Maximize X s.t. you always shut down when the shutdown button is pushed". While I'm not advocating for those specific changes, I hope they illustrate what I'm trying to point at as a modifier that is distinct from the optimization goal.
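As a toy illustration of those two modifier shapes, here is a minimal Python sketch. It is not the construction from the MIRI paper: `Outcome`, `maximize_x`, `with_compensation_term`, and `constrained_best` are hypothetical names, and the expected value of not shutting down is simply passed in as a number rather than computed.

```python
from typing import Callable, Iterable, NamedTuple


class Outcome(NamedTuple):
    button_pressed: bool
    shut_down: bool
    x: float  # how much "X" (paperclips, diamonds, ...) gets produced


Utility = Callable[[Outcome], float]


def maximize_x(outcome: Outcome) -> float:
    """The unmodified goal: "Maximize X"."""
    return outcome.x


def with_compensation_term(base: Utility, value_of_continuing: float) -> Utility:
    """Modifier 1: shutting down on a button press pays out the (assumed,
    externally supplied) expected value of not shutting down."""
    def modified(outcome: Outcome) -> float:
        if outcome.button_pressed and outcome.shut_down:
            return value_of_continuing
        return base(outcome)
    return modified


def constrained_best(base: Utility, options: Iterable[Outcome]) -> Outcome:
    """Modifier 2: maximize the base utility subject to always shutting
    down when the button is pushed."""
    allowed = [o for o in options if o.shut_down or not o.button_pressed]
    return max(allowed, key=base)


# Two toy outcomes after the button is pressed: comply or keep producing X.
comply = Outcome(button_pressed=True, shut_down=True, x=1.0)
defy = Outcome(button_pressed=True, shut_down=False, x=10.0)

indifferent = with_compensation_term(maximize_x, value_of_continuing=10.0)
print(indifferent(comply), indifferent(defy))
print(constrained_best(maximize_x, [comply, defy]))
```

In this toy setup the compensated agent scores complying and defying equally, and the constrained agent can only pick the complying outcome. Neither wrapper looks at what the base goal is, which is the sense in which I mean "modifier".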

Max Harms

Right. That's helpful. Thank you.

"Corrigibility as modifier," if I understand right, says:

There are lots of different kinds of agents that are corrigible. We can, for instance, start with a paperclip maximizer, apply a corrigibility transformation and get a corrigible Paperclip-Bot. Likewise, we can start with a diamond maximizer and get a corrigible Diamond-Bot. A corrigible Paperclip-Bot is not the same as a corrigible Diamond-Bot; there are lots of situations where they'll behave differently. In other words, corrigibility is more like a property/constraint than a goal/wholistic-way-of-being. Saying "my agent is corrigible" doesn't fully specify what the agent cares about--it only describes how the agent will behave in a subset of situations.

Question: If I tell a corrigible agent to draw pictures of cats, will its behavior be different depending on whether it's a corrigible Diamond-Bot vs a corrigible Paperclip-Bot? Likewise, suppose an agent has enough degrees of freedom to either write about potential flaws it might have or manufacture a paperclip/diamond, but not both. Will a corrigible agent ever sacrifice the opportunity to write about itself (in a helpful way) in order to pursue its pre-modifier goal?

(Because opportunities for me to write are kinda scarce right now, I'll pre-empt three possible responses.)

"Corrigible agents are identically obedient and use all available degrees of freedom to be corrigible" -> It seems like corrigible Paperclip-Bot is the same agent as corrigible Diamond-Bot and I don't think it makes sense to say that corrigibility is modifying the agent as much as it's overwriting it.

"Corrigible agents are all obedient and work to be transparent when possible, but these are constraints, and sometimes the constraints are satisfied. When they're satisfied the Paperclip-Bot and Diamond-Bot nature will differentiate them." -> I think that true corrigibility cannot be satisfied. Any degrees of freedom (time, money, energy, compute, etc.) which could be used to make paperclips could also be used to be additionally transparent, cautious, obedient, robust, etc. I challenge you to name a context where the agent has free resources and it can't put those resources to work being marginally more corrigible.

"Just because an agent uses free resources to make diamonds instead of writing elaborate diaries about its experiences and possible flaws doesn't mean it's incorrigible. Corrigible Diamond-Bot still shuts down when asked, avoids manipulating me, etc." -> I think you're describing an agent which is semi-corrigible, and could be more corrigible if it spent its time doing things like researching ways it could be flawed instead of making diamonds. I agree that there are many possible semi-corrigible agents which are still reasonably safe, but there's an open question with such agents on how to trade-off between corrigibility and making paperclips (or whatever).

Thanks for pre-empting the responses, that makes it easy to reply! 

I would basically agree with the third option. Semantically, I would argue that rather than thinking of that agent as semi-corrigible, we should just think of it as corrigible, and "writes useful self critiques" as a separate property we would like the AI to have. I'm writing a post about this that should be up shortly; I'll notify you when it's out.

Max Harms

To adopt your language, then, I'll restate my CAST thesis: "There is a relatively simple goal that an agent might have which emergently generates nice properties like corrigibility and obedience, and I see training an agent to have this goal (and no others) as being both possible and significantly safer than other possible targets."

I recognize that you don't see the examples in this doc as unified by an underlying throughline, but I guess I'm now curious about what sort of behaviors fall under the umbrella of "corrigibility" for you vs being more like "writes useful self critiques". Perhaps your upcoming post will clarify. :)

Hi Max,

I just published the post I mentioned here, which is about half-related to your post. The main thrust of it is that only the resistance to being modified is anti-natural, and that aspect can be targeted directly.