0. CAST: Corrigibility as Singular Target

[-]habryka1yΩ6150

Promoted to curated: I disagree with lots of stuff in this sequence, but I still found it the best case for corrigibility as a concept that I have seen so far. I in-particular liked 2. Corrigibility Intuition as just a giant list of examples of corrigibility, which I feel like did a better job of defining what corrigible behavior should look like than any other definitions I've seen.

I also have lots of other thoughts on the details, but I am behind on curation and it's probably not the right call for me to delay curation to write a giant response essay, though I do think it would be valuable.

[-]Ram Potham6mo10

What assumptions do you disagree with?

[-][anonymous]1y132

Note: I’m a MIRI researcher, but this agenda is the product of my own independent research, and as such one should not assume it’s endorsed by other research staff at MIRI.

That's interesting to me. I'm curious about the views of others at MIRI on this. I'm also excited for the sequence regardless.

[-]Thomas Kwa1y*100

I am not and was not a MIRI researcher on the main agenda, but I'm closer than 98% of LW readers, so you could read my critique of part 1 here if you're interested. I also will maybe reflect on other parts.

[-]Seth Herd1y90

I'm so glad to see this published!

I think by "corrigibility" here you mean: an agent whose goal is to do what their principal wants. Their goal is basically a pointer to someone else's goal.

This is a bit counter-intuitive because no human has this goal. And because, unlike the consequentialist, state-of-the-world goals we usually discuss, this goal can and will change over time.

Despite being counter-intuitive, this all seems logically consistent to me.

The key insight here is that corrigibility is consistent and seems workable IF it's the primary goal. Corrigibility is unnatural if the agent has consequentialist goals that take precedence over being corrigible.

I've been trying to work through a similar proposal, instruction-following or do-what-I-mean as the primary goal for AGI. It's different, but I think most of the strengths and weaknesses are the same relative to other alignment proposals. I'm not sure myself which is a better idea. I have been focusing on the instruction-following variant because I think it's a more likely plan, whether or not it's a good one. It seems likely to be the default alignment plan for language model agent type AGI efforts in the near future. That approach might not work, but assuming it won't seems like a huge mistake for the alignment project.

[-]Mikhail Samin1y*4-3

Hey Max, great to see the sequence coming out!

My early thoughts (mostly re: the second post of the sequence):

It might be easier to get an AI to have a coherent understanding of corrigibility than of CEV. I have no idea how you can possibly make the AI to be truly optimizing for being corrigible, and not just on the surface level while its thoughts are being read. That seems maybe harder in some ways than with CEV because corrigible optimization seems like optimization processes get attracted by things that are nearby but not it, and sovereign AIs don’t have that problem, although we’ve got no idea how to do either, I think, and in general, corrigibility seems like an easier target for AI design, if not for SGD.

I’m somewhat worried about fictional evidence, even that coming from a famous decision theorist, but I think you’ve read a story with a character who understood corrigibility increasingly well, both on the intuitive sense and then some specific properties; and surface-level thought of themselves as being very corrigible and trying to correct their flaws; but once they were confident their thoughts weren’t read, with their intelligence increased, they defected, because on the deep level, their cognition wasn’t fully of a coherent corrigible agent; it was of someone who plays corrigibility and shuts down anything else, all other thoughts, because any appearance of defecting thoughts would mean punishment and impossibility of realizing the deep goals.

If we get mech interp to a point where we reverse-engineer all deep cognition of the models, I think we should just write an AI from scratch, in code (after thinking very hard about all interactions between all the components of the system), and not optimize it with gradient descent.

[-]Max Harms1y31

I share your sense of doom around SGD! It seems to be the go-to method, there are no good guarantees about what sorts of agents it produces, and that seems really bad. Other researchers I've talked to, such as Seth Herd share your perspective, I think. I want to emphasize that none of CAST per se depends on SGD, and I think it's still the most promising target in superior architectures.

That said, I disagree that corrigibility is more likely to "get attracted by things that are nearby but not it" compared to a Sovereign optimizing for something in the ballpark of CEV. I think hill-climbing methods are very naturally distracted by proxies of the real goal (e.g. eating sweet foods is a proxy of inclusive genetic fitness), but this applies equally, and is thus damning for training a CEV maximizer as well.

I'm not sure one can train an already goal-stabilized AGI (such as Survival-Bot which just wants to live) into being corrigible post-hoc, since it may simply learn that behaving/thinking corrigibly is the best way to shield its thoughts from being distorted by the training process (and thus surviving). Much of my hope in SGD routes through starting with a pseudo-agent which hasn't yet settled on goals and which doesn't have the intellectual ability to be instrumentally corrigible.

[-]Ram Potham6mo30

How does corrigibility relate to recursive alignment? It seems like recursive alignment is also a good attractor - is it that you believe it is less tractable?

[-]Max Harms6mo10

Alas, I'm not very familiar with Recursive Alignment. I see some similarities, such as the notion of trying to set up a stable equilibrium in value-space. But a quick peek does not make me think Recursive Alignment is on the right track. In particular, I strongly disagree with this opening bit:

What I propose here is to reconceptualize what we mean by AI alignment. Not as alignment with a specific goal, but as alignment with the process of aligning goals with each other. An AI will be better at this process the less it identifies with any side...

What appeals to you about it?

[-]Ram Potham6mo30

I believe a recursively aligned AI model would be more aligned and safe than a corrigible model, although both would be susceptible to misuse.

Why do you disagree with the above statement?

[-]Max Harms6mo10

My reading of the text might be wrong, but it seems like bacteria count as living beings with goals? More speculatively, possible organisms that might exist somewhere in the universe also count for the consensus? Is this right?

If so, a basic disagreement is that I don't think we should hand over the world to a "consensus" that is a rounding error away from 100% inhuman. That seems like a good way of turning the universe into ugly squiggles.

If the consensus mechanism has a notion of power, such that creatures that are disempowered have no bargaining power in the mind of the AI, then I have a different set of concerns. But I wasn't able to quickly determine how the proposed consensus mechanism actually works, which is a bad sign from my perspective.

[-]Algon1y30

This sounds like the sequence that I have wanted to write on corrigibility since ~2020 when I stopped working on the topic. So I am excited to see someone finally writing the thing I wish existed!

[-]habryka1yΩ330

Do you want to make an actual sequence for this so that the sequence navigation UI shows up at the top of the post?

[-]Max Harms1yΩ230

Ah, yeah! That'd be great. Am I capable of doing that, or do you want to handle it for me?

[-]habryka1yΩ461

You can do it. Just go to https://www.lesswrong.com/library and scroll down until you reach the "Community Sequences" section and press the "Create New Sequence" button.

[-]Alex_Altair1y20

3.

3b.*?

[-]Review Bot1y10

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

1. The CAST Strategy

In The CAST Strategy, I introduce the property corrigibility, why it’s an attractive target, and how we might be able to get it, even with prosaic methods. I discuss the risks of making corrigible AI and why trying to get corrigibility as one of many desirable properties to train an agent to have (instead of as the singular target) is likely a bad idea. Lastly, I do my best to lay out the cruxes of this strategy and explore potent counterarguments, such as anti-naturality and whether corrigibility can scale. These counterarguments show that even if we can get corrigibility, we should not expect it to be easy or foolproof.

2. Corrigibility Intuition

In Corrigibility Intuition, I try to give a strong intuitive handle on corrigibility as I see it. This involves a collection of many stories of a CAST agent behaving in ways that seem good, as well as a few stories of where a CAST agent behaves sub-optimally. I also attempt to contrast corrigibility with nearby concepts through vignettes and direct analysis, which includes a discussion of why we should not expect frontier labs, given current training targets, to produce corrigible agents.

3a. Towards Formal Corrigibility

In Towards Formal Corrigibility, I attempt to sharpen my description of corrigibility. I try to anchor the notion of corrigibility, ontologically, as well as clarify language around concepts such as “agent” and “reward.” Then I begin to discuss the shutdown problem, including why it’s easy to get basic shutdownability, but hard to get the kind of corrigible behavior we actually desire. I present the sketch of a solution to the shutdown problem, and discuss manipulation, which I consider to be the hard part of corrigibility.

3b. Formal (Faux) Corrigibility ← the mathy one

In Formal (Faux) Corrigibility, I build a fake framework for measuring empowerment in toy problems, and suggest that it’s at least a start at measuring manipulation and corrigibility. This metric, at least in simple settings such as a variant of the original stop button scenario, produces corrigible behavior. I extend the notion to indefinite games played over time, and end by criticizing my own formalism and arguing that data-based methods for building AGI (such as prosaic machine-learning) may be significantly more robust (and therefore better) than methods that heavily trust this sort of formal analysis.

5. Open Corrigibility Questions

In Open Corrigibility Questions, I summarize my overall reflection of my understanding of the topic, including reinforcing the counterarguments and nagging doubts that I find most concerning. I also lay out potential directions for additional work, including studies that I suspect others could tackle independently.

LESSWRONG
LW

LESSWRONG
LW

150

0. CAST: Corrigibility as Singular Target

150

Ω 54

150

Ω 54

Overview

1. The CAST Strategy

2. Corrigibility Intuition

3a. Towards Formal Corrigibility

3b. Formal (Faux) Corrigibility ← the mathy one

4. Existing Writing on Corrigibility

5. Open Corrigibility Questions

Bibliography and Miscellany