Can corrigibility be learned safely?

EDIT: Please note that the way I use the word "corrigibility" in this post isn't quite how Paul uses it. See this thread for clarification.

This is mostly a reply to Paul Christiano's Universality and security amplification and assumes familiarity with that post as well as Paul's AI alignment approach in general. See also my previous comment for my understanding of what corrigibility means here and the motivation for wanting to do AI alignment through corrigibility learning instead of value learning.

Consider the translation example again as an analogy about corrigibility. Paul's alignment approach depends on humans having a notion of "corrigibility" (roughly "being helpful to the user and keeping the user in control") which is preserved by the amplification scheme. Like the information that a human uses to do translation, the details of this notion may also be stored as connection weights in the deep layers of a large neural network, so that the only way to access them is to provide inputs to the human of a form that the network was trained on. (In the case of translation, this would be sentences and associated context, while in the case of corrigibility this would be questions/tasks of a human understandable nature and context about the user's background and current situation.) This seems plausible because in order for a human's notion of corrigibility to make a difference, the human has to apply it while thinking about the meaning of a request or question and "translating" it into a series of smaller tasks.

In the language translation example, if the task of translating a sentence is broken down into smaller pieces, the system could no longer access the full knowledge the Overseer has about translation. By analogy, if the task of breaking down tasks in a corrigible way is itself broken down into smaller pieces (either for security or because the input task and associated context is so complex that a human couldn't comprehend it in the time allotted), then the system might no longer be able to access the full knowledge the Overseer has about "corrigibility".
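
To make the structure at issue concrete, here is a minimal sketch in Python (all names, such as `overseer_answer` and `MAX_CHUNK`, are hypothetical placeholders, not anyone's actual scheme) of answering a task by recursively breaking it into pieces small enough that the overseer never sees the whole input:

```python
# Toy sketch of security-amplification-style decomposition: the overseer only
# ever sees inputs below a fixed size, so knowledge it can only apply to a
# whole task (e.g. holistic language understanding) is unavailable.

MAX_CHUNK = 20  # hypothetical bound on what the overseer may look at directly

def overseer_answer(small_task: str) -> str:
    """Stand-in for a human overseer answering a small, vetted query."""
    return f"<answer to {small_task!r}>"

def split(task: str) -> list[str]:
    """Stand-in for the overseer's own decomposition of a large task."""
    midpoint = len(task) // 2
    return [task[:midpoint], task[midpoint:]]

def amplified_answer(task: str) -> str:
    """Answer a task by recursive decomposition into overseer-sized pieces."""
    if len(task) <= MAX_CHUNK:
        return overseer_answer(task)
    sub_answers = [amplified_answer(sub) for sub in split(task)]
    # Combining sub-answers is posed to the overseer as another query (in real
    # meta-execution the sub-answers would be passed as opaque pointers).
    return overseer_answer("combine: " + "; ".join(sub_answers))

print(amplified_answer("translate this long and ambiguous sentence"))
```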

In addition to "corrigibility" (trying to be helpful), breaking down a task also involves "understanding" (figuring out what the intended meaning of the request is) and "competence" (how to do what one is trying to do). By the same analogy, humans are likely to have introspectively inaccessible knowledge about both understanding and competence, which they can't fully apply if they are not able to consider a task as a whole.

Paul is aware of this problem, at least with regard to competence, and his proposed solution is:

I propose to go on breaking tasks down anyway. This means that we will lose certain abilities as we apply amplification. [...] Effectively, this proposal replaces our original human overseer with an impoverished overseer, who is only able to respond to the billion most common queries.

How bad is this, with regard to understanding and corrigibility? Is an impoverished overseer who only learned a part of what a human knows about understanding and corrigibility still understanding/corrigible enough? I think the answer is probably no.

With regard to understanding, natural language is famously ambiguous. The fact that a sentence is ambiguous (has multiple possible meanings depending on context) is itself often far from apparent to someone with a shallow understanding of the language. (See here for a recent example on LW.) So the overseer will end up being overly literal, and misinterpreting the meaning of natural language inputs without realizing it.

With regard to corrigibility, if I try to think about what I'm doing when I'm trying to be corrigible, it seems to boil down to something like this: build a model of the user based on all available information and my prior about humans, use that model to help improve my understanding of the meaning of the request, then find a course of action that best balances between satisfying the request as given, upholding (my understanding of) the user's morals and values, and most importantly keeping the user in control. Much of this seems to depend on information (prior about humans), procedure (how to build a model of the user), and judgment (how to balance between various considerations) that are far from introspectively accessible.
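
Restating that procedure as runnable pseudocode (every function, field, and weight below is a hypothetical placeholder; the point is only how much of it leans on opaque human skills like `build_user_model`):

```python
# The corrigible-breakdown procedure described above, restated as runnable
# pseudocode. All names and weights are hypothetical placeholders.

def build_user_model(context, prior_about_humans):
    return {"context": context, "prior": prior_about_humans}

def interpret(request, user_model):
    # Use the user model to pick among possible readings of the request.
    return {"request": request, "reading": "most-plausible-meaning"}

def score(action, meaning, user_model):
    satisfies = 1.0 if action["does"] == meaning["request"] else 0.0
    upholds_values = 1.0 if not action["violates_values"] else -1.0
    keeps_control = 1.0 if action["user_can_intervene"] else -2.0
    # Keeping the user in control is weighted most heavily.
    return satisfies + upholds_values + 2.0 * keeps_control

def handle_request(request, context, prior_about_humans, candidate_actions):
    user_model = build_user_model(context, prior_about_humans)
    meaning = interpret(request, user_model)
    return max(candidate_actions, key=lambda a: score(a, meaning, user_model))

actions = [
    {"does": "make money", "violates_values": True, "user_can_intervene": False},
    {"does": "make money", "violates_values": False, "user_can_intervene": True},
]
print(handle_request("make money", context={}, prior_about_humans={},
                     candidate_actions=actions))
```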

So if we try to learn understanding and corrigibility "safely" (i.e., in small chunks), we end up with an overly literal overseer that lacks common sense understanding of language and independent judgment of what the user's wants, needs, and shoulds are and how to balance between them. However, if we amplify the overseer enough, eventually the AI will have the option of learning understanding and corrigibility from external sources rather than relying on its poor "native" abilities. As Paul explains with regard to translation:

This is potentially OK, as long as we learn a good policy for leveraging the information in the environment (including human expertise). This can then be distilled into a state maintained by the agent, which can be as expressive as whatever state the agent might have learned. Leveraging external facts requires making a tradeoff between the benefits and risks, so we haven’t eliminated the problem, but we’ve potentially isolated it from the problem of training our agent.

So instead of directly trying to break down a task, the AI would first learn to understand natural language and what "being helpful" and "keeping the user in control" involve from external sources (possibly including texts, audio/video, and queries to humans), distill that into some compressed state, then use that knowledge to break down the task in a more corrigible way. But first, since the lower-level (less amplified) agents are contributing little besides the ability to execute literal-minded tasks that don't require independent judgment, it's unclear what advantages there are to doing this as an Amplified agent as opposed to using ML directly to learn these things. And second, trying to learn understanding and corrigibility from external humans has the same problem as trying to learn from the human Overseer: if you try to learn in large chunks, you risk corrupting the external human and then learning corrupted versions of understanding and corrigibility, but if you try to learn in small chunks, you won't get all the information that you need.

The conclusion here seems to be that corrigibility can't be learned safely, at least not in a way that's clear to me.


I curated this post for these reasons:

  • The post helped me understand a critique of alignment via amplifying act-based agents, in terms of the losses via scaling down.
  • The excellent comment section had a lot of great clarifications around definitions and key arguments in alignment from all involved. This is >50% of the reason I curated the post.

I regret I don't have the time right now to try to summarise the specific ideas I found valuable (and what exact updates I made). Paul's subsequent post on the definition of alignment was helpful, and I'd love to read anyone else's attempt to summarise updates from the comment thread (added: I've just seen William_S's new post following on in part from this conversation).

Biggest hesitation with curating this post:

  • The post + comments require some background reading and are quite a high-effort read. I would be much happier curating this post if it allowed readers to get the picture of what it was responding to without having to visit elsewhere first (e.g. via quotes and a summary of the position being argued against).

But the comments helped make explicit some key considerations within alignment, which was the strongest variable for me. Thanks Wei Dai for starting and moving forward this discussion.

See also my previous comment for my understanding of what corrigibility means here and the motivation for wanting to do AI alignment through corrigibility learning instead of value learning.
...
The conclusion here seems to be that corrigibility can't be learned safely, at least not in a way that's clear to me.

1) Are you more comfortable with value learning, or do both seem unsafe at present?

2) If we had a way to deal with this particular objection (where, as I understand it, subagents are either too dumb to be sophisticatedly corrigible, or are smart enough to be susceptible to attacks), would you be significantly more hopeful about corrigibility learning? Would it be your preferred approach?

From my current understanding of Paul's IDA approach, I think there are two different senses in which corrigibility can be thought about with regard to IDA, each with a different level of guarantee.

From An unaligned benchmark:

1. On average, the reward function incentivizes behaviour which competes effectively and gives the user effective control.
2. There do not exist inputs on which the policy chooses an action because it is bad, or the value function outputs a high reward because the prior behaviour was bad. (Or else the policy on its own will generate bad consequences.)
3. The reward function never gives a behaviour a higher reward because it is bad. (Or else the test-time optimisation by MCTS can generate bad behaviour.) For example, if the AI deludes the human operator so that the operator can’t interfere with the AI’s behaviour, that behaviour can’t receive a higher reward even if it ultimately allows the AI to make more money.

Property 1 is dealing with "consequence corrigibility" (competence at producing actions that will produce outcomes in the world we would describe as corrigible).

Properties 2&3 are dealing with corrigibility in terms of "intent corrigibility" (guaranteeing that the system does not optimise for bad outcomes). This does not cover the agent incompetently causing bad actions in the world, only the agent deliberately trying to produce bad outcomes.

I think IDA doesn't require or claim worst-case guarantees on the task of "consequence corrigibility" (and that this is an impossible goal for bounded reasoners).

I think that average-case good performance on "consequence corrigibility" is claimed by IDA, but only as a subset of general competence.

I think that providing worst-case guarantees on "intent corrigibility" is required and claimed by IDA.

Roughly, I think that:

  • Versions of IDA that allow the overseer nodes more information could be generally competent (including predicting what behaviour could be corrigible), but could fail to be "intent corrigible"
  • Versions of IDA that allow the overseer nodes only a highly restricted set of queries could be "intent corrigible" but fail to be generally competent, and hence not be "consequence corrigible"
  • Standard ML approaches will, at some level of optimisation power, fail to behave "intent corrigibly" (even if you train them to be "consequence corrigible")

The question I'm uncertain about is whether there's a middle point in tradeoff space where both properties are sufficiently satisfied to produce good outcomes.

Do you agree or disagree with how I've broken down corrigibility claims for IDA, and which claims do you think your argument bears on?

Suppose the AI misinterprets the meaning of a request, or does something bad because it didn't model the user as thinking that thing is bad (for example, I ask my AI to make some money for me, and it undetectably hacks into my bank not knowing that's against my morals). Are these actions violations of "intent corrigibility" or "consequence corrigibility" or something else, in your schema? Since that's not clear to me, I'm not sure how to answer your question directly, but my argument is basically that these kinds of violations of corrigibility seem hard to avoid if we use security amplification because the security amplification makes it impossible for the AI to learn much of what the Overseer knows about natural language understanding, modeling the user, and other information needed to corrigibly break down tasks.

(for example, I ask my AI to make some money for me, and it undetectably hacks into my bank not knowing that's against my morals). Are these actions violations of "intent corrigibility" or "consequence corrigibility" or something else, in your schema?

I don't consider this an example of incorrigibility, and don't think it can be avoided in general (though this example is easy enough that I expect a competent agent would avoid it).

This behavior would be incorrigible if the agent hacked undetectably because it knew that if it hacked detectably I would have provided corrective feedback, or if the agent scaled back the usual mechanisms for eliciting corrective feedback or informing humans, or so on. But those all require failures of commission and so are much more plausibly avoidable.

The goal is to

1. Reliably avoid incorrigible behavior.

2. Determine whether and in what sense that creates a "basin of attraction" towards desirable outcomes.

3. Achieve a high enough level of competence that we are in the basin of attraction.

Hacking a bank would be an example of a competence failure, that I believe is both (a) easy enough that we can probably avoid it, (b) still not nearly severe enough to kick you out of the basin of attraction.

I think your concerns are relevant both to step (1) and step (3) of this plan, but are probably most relevant to step (3). This may also be related to your doubts about the coherence of the concept in (2).

Trying to understand the boundary lines around incorrigibility, looking again at this example from Universality and Security Amplification:

For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?” I’ve then generated incorrigible behavior by accident, after taking innocuous steps.

It sounds from this like it only counts as incorrigible if the optimization in “What behavior is best according to those values?” is effectively optimizing for something that the user doesn't want, but is not incorrigible if it is optimizing for something that the user doesn't want in a way that the user can easily correct? (So incorrigibility requires something more than just being malign.)

One way to describe this is that the decomposition is incorrigible if the models of the user that are used in “What behavior is best according to those values?” are better than the models used in “What does the user want?” (as this could lead the AI to maximize an approximation V* of the user's values V and realize that if the AI reveals to the user that they are maximizing V*, the user will try to correct what the AI is doing, which will perform worse on V*).

So acceptable situations are where both subqueries get the same user models, where the first subquery gets a better user model than the second, or where “What behavior is best according to those values?” is performing some form of mild optimization. Is that roughly correct?

I think "what behavior is best according to those values" is never going to be robustly corrigible, even if you use a very good model of the user's preferences and optimize very mildly. It's just not a good question to be asking.

If meta-execution asks “What does the user want?” what am I supposed to do instead?

This is actually a fine way of deciding "what does the user want," depending on exactly what the question means. For example, this is how you should answer "What action is most likely to be judged as optimal by the user?" I was sloppy in the original post.

It's an incorrigible way of deciding "what should I do?" and so shouldn't happen if we've advised the humans appropriately and the learning algorithm has worked well enough. (Though you might be able to entirely lean on removing the incorrigible optimization after distillation, I don't know.)

The recommendation is to ask "Given that my best guess about the user's values are {V}, what should I do?" instead of "What behavior is best according to values {V}?"

This is a totally different question, e.g. the policy that is best according to values {V} wouldn't care about VOI, but the best guess about what you should do would respect VOI.

(Even apart from really wanting to remain corrigible, asking "What behavior is best according to values {V}" is kind of obviously broken.)

Can you do all the optimization in this way, carrying the desire to be corrigible through the whole thing? I'm not sure, it looks doable to me, but it's similar to the basic uncertainty about amplification.

(As an aside: doing everything corrigibly probably means you need a bigger HCH-tree to reach a fixed level of performance, but the hope is that it doesn't add any more overhead than corrigibility itself to the learned policy, which should be small.)

This is a really good example of how hard communication can be. When I read

For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?”

I assumed that "representation of their values" would include uncertainty about their values, and then "What behavior is best according to those values?" would take that uncertainty into account. (To not do that seemed like too obvious a mistake, as you point out yourself.) I thought that you were instead making the point that if meta-execution was doing this, it would collapse into value learning, so to be corrigible it needs to prioritize keeping the user in control more, or something along those lines. If you had added a sentence to that paragraph saying "instead, to be corrigible, it should ...", this misunderstanding could have been avoided. Also, I think given that both William and I were confused about this paragraph, probably >80% of your readers were also confused.

So, a follow up question. Given:

The recommendation is to ask "Given that my best guess about the user's values are {V}, what should I do?" instead of "What behavior is best according to values {V}?"

Why doesn't this just collapse into value learning (albeit one that takes uncertainty and VOI into account)? Are there some advantages to doing this through an Amplification setup versus a more standard value learning setup? Is it that the "what should I do?" part could include my ideas about keeping the user in control, which would be hard to design into an AI otherwise? Is it that the Amplification setup could more easily avoid accidentally doing an adversarial attack on the user while trying to learn their values? Is it that we don't know how to do value learning well in general, and the Amplified AI can figure that out better than we can?

It's not enough to represent uncertainty about their values, you also need to represent the fact that V is supposed to be *their* values, in order to include what counts as VOI.

I thought that you were instead making the point that if meta-execution was doing this, it would collapse into value learning, so to be corrigible it needs to prioritize keeping the user in control more, or something along those lines.

To answer "What should I do if the user's values are {V}" I should do backwards chaining from V, but should also avoid doing incorrigible stuff. For example, if I find myself backwards chaining through "And then I should make sure this meddlesome human doesn't have the ability to stop me" I should notice that step is bad.
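
As a toy illustration of that filter (the plan library and the incorrigibility check are hypothetical stand-ins for judgments the overseer would actually have to make, which is where the real difficulty lies):

```python
# Toy backwards chaining from a goal, pruning any step that is itself
# incorrigible.

STEPS_TOWARD = {
    "user's garden is watered": ["turn on sprinkler", "disable the off switch"],
    "turn on sprinkler": [],
    "disable the off switch": [],   # removes the user's ability to stop me
}

def is_incorrigible(step: str) -> bool:
    return "disable the off switch" in step

def plan(goal: str) -> list[str]:
    steps = []
    for step in STEPS_TOWARD.get(goal, []):
        if is_incorrigible(step):
            continue  # notice that this step is bad and drop it
        steps.extend(plan(step))
        steps.append(step)
    return steps

print(plan("user's garden is watered"))  # ['turn on sprinkler']
```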

If you had added a sentence to that paragraph saying "instead, to be corrigible, it should ..." this misunderstanding could have been avoided.

Point taken that this is confusing. But I also don't know exactly what the overseer should do in order to be corrigible, so don't feel like I could write this sentence well. (For example, I believe that we are still in a similar state of misunderstanding, because the sentence I gave about how to behave corrigibly has probably been misunderstood.)

My point with the example was just: there are plausible-looking things that you can do that introduce incorrigible optimization.

Are there some advantages to doing this through an Amplification setup versus a more standard value learning setup?

What do you mean by a "standard value learning setup"? It would be easier to explain the difference with a concrete alternative in mind. It seems to me like amplification is currently the most plausible way to do value learning.

The main advantages I see of amplification in this context are:

  • It's a potential approach for learning a "comprehensible" model of the world, i.e. one where humans are supplying the optimization power that makes the model good and so understand how that optimization works. I don't know of any different approach to benign induction, and moreover it doesn't seem like you can use induction as an input into the rest of alignment (since solving benign induction is as hard as solving alignment), which nixes the obvious approaches to value learning. Having a comprehensible model is also needed for the next steps. Note that a "comprehensible" model doesn't mean that humans understand everything that is happening in the model---they still need to include stuff like "And when X happens, Y seems to happen after" in their model.
  • It's a plausible way of learning a reasonable value function (and in particular a value function that could screen off incorrigibility from estimated value). What is another proposal for learning a value function? What is even the type signature of "value" in the alternative you are imagining?

If comparing to something like my indirect normativity proposal, the difference is that amplification serves as the training procedure of the agent, rather than serving as a goal specification which needs to be combined with some training procedure that leads the agent to pursue the goal specification.

I believe that the right version of indirect normativity in some sense "works" for getting corrigible behavior, i.e. the abstract utility function would incentivize corrigible behavior, but that abstract notion of working doesn't tell you anything about the actual behavior of the agent. (This is a complaint which you raised at the time about the complexity of the utility function.) It seems clear that, at a minimum, you need to inject corrigibility at the stage where the agent is reasoning logically about the goal specification. It doesn't suffice to inject it only to the goal specification.

Why doesn't this just collapse into value learning (albeit one that takes uncertainty and VOI into account)?

The way this differs from "naively" applying amplification for value learning is that we need to make sure that none of the optimization that the system is applying produces incorrigibility.

So you should never ask a question like "What is the fastest way to make the user some toast?" rather than "What is the fastest way to corrigibly make the user some toast?" or maybe "What is the fastest way to make the user some toast, and what are the possible pros and cons of that way of making toast?" where you compute the pros and cons at the same time as you devise the toast-making method.
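
A minimal sketch of the contrast (the `ask` helper and the question strings are purely illustrative; the point is that the corrigibility constraint is carried inside the question that gets optimized, not bolted on afterwards):

```python
# Contrast between dropping the corrigibility constraint from a subquestion
# versus carrying it through. `ask` is a hypothetical stand-in for posing a
# question to the amplified overseer.

def ask(question: str) -> str:
    return f"<overseer's answer to {question!r}>"

def make_toast_incorrigibly():
    # The optimization inside this subquestion is unconstrained.
    return ask("What is the fastest way to make the user some toast?")

def make_toast_corrigibly():
    # The constraint is part of the question, so every sub-decomposition
    # inherits it too.
    return ask("What is the fastest way to corrigibly make the user some toast?")

def make_toast_with_explicit_review():
    # Pros and cons are computed alongside the plan, not after it.
    return ask("What is the fastest way to make the user some toast, "
               "and what are the possible pros and cons of that way?")
```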

Maybe you would do that if you were a reasonable person doing amplification for value learning. I don't think it really matters, like I said, my point was just that there are ways to mess up and in order to have the process be corrigible we need to avoid those mistakes.

(This differs from the other kinds of "mistakes" that the AI could make, where I wouldn't regard the mistake as resulting in an unaligned AI. Just because H is aligned doesn't mean the AI they train is aligned, we are going to need to understand what H needs to satisfy in order to make the AI aligned and then ensure H satisfies those properties.)

It's not enough to represent uncertainty about their values, you also need to represent the fact that V is supposed to be their values, in order to include what counts as VOI.

Ah, ok.

To answer "What should I do if the user's values are {V}" I should do backwards chaining from V, but should also avoid doing incorrigible stuff. For example, if I find myself backwards chaining through "And then I should make sure this meddlesome human doesn't have the ability to stop me" I should notice that step is bad.

Ok, this is pretty much what I had in mind when I said 'the "what should I do?" part could include my ideas about keeping the user in control'.

For example, I believe that we are still in a similar state of misunderstanding, because the sentence I gave about how to behave corrigibly has probably been misunderstood.

It seems a lot clearer to me now compared to my previous state of understanding (right after reading that example), especially given your latest clarifications. Do you think I'm still misunderstanding it at this point?

My point with the example was just: there are plausible-looking things that you can do that introduce incorrigible optimization.

I see, so part of what happened was that I was trying to figure out exactly where the boundary between corrigible and incorrigible lies, and since this example is one of the few places you talk about this, I ended up reading more into your example than you intended.

What is another proposal for learning a value function? What is even the type signature of "value" in the alternative you are imagining?

I didn't have a specific alternative in mind, but was just thinking that meta-execution might end up doing standard value learning things in the course of trying to answer "What does the user want?” (so the type signature of "value" in the alternative would be the same as the type signature in meta-execution). But if the backwards chaining part is trying to block incorrigible optimizations from happening, at least that seems non-standard.

I also take your point that it's 'a potential approach for learning a "comprehensible" model of the world', however I don't have a good understanding of how this is really supposed to work (e.g., how does the comprehensibility property survive the distillation steps). But I'm happy to take your word about this for now until you or someone else writes up an explanation that I can understand.

Just because H is aligned doesn't mean the AI they train is aligned, we are going to need to understand what H needs to satisfy in order to make the AI aligned and then ensure H satisfies those properties.

I'm still pretty confused about the way you use aligned/unaligned here. I had asked you some questions in private chat about this that you haven't answered yet. Let me try rephrasing the questions here to see if that helps you give an answer. It seems like you're saying here that an aligned H could have certain misunderstandings which causes the AI they train to be unaligned. But whatever unaligned thing that the AI ends up doing, H could also do as a result of the same misunderstanding (if we put a bunch of H's together, or let one H run for a long subjective time), so why does it make sense to call this AI unaligned but this H aligned?

I also take your point that it's 'a potential approach for learning a "comprehensible" model of the world', however I don't have a good understanding of how this is really supposed to work (e.g., how does the comprehensibility property survive the distillation steps)

Models and facts and so on are represented as big trees of messages. These are distilled as in this post. You train a model that acts on the distilled representations, but to supervise it you can unpack the distilled representation.
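
A minimal sketch of my reading of that mechanism (the `Message` tree, `distill`, and `unpack` are hypothetical; in particular, real distillation would be a learned encoding, not a lookup table):

```python
# Facts/models as trees of messages. A distilled representation is an opaque
# handle the learned agent can act on; to supervise it, the handle is
# unpacked back into the underlying tree.

from dataclasses import dataclass, field

@dataclass
class Message:
    text: str                              # e.g. "when X happens, Y seems to follow"
    children: list["Message"] = field(default_factory=list)

_STORE: list[Message] = []

def distill(tree: Message) -> int:
    """Stand-in for learning a compact encoding; here, just an opaque id."""
    _STORE.append(tree)
    return len(_STORE) - 1

def unpack(handle: int) -> Message:
    """Recover the full tree so a human/overseer can inspect it."""
    return _STORE[handle]

world_model = Message("observations about the environment", [
    Message("when X happens, Y seems to happen after"),
    Message("the user usually checks the logs in the morning"),
])

handle = distill(world_model)          # what the trained agent operates on
print(unpack(handle).children[0].text) # what supervision can look at
```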

(so the type signature of "value" in the alternative would be the same as the type signature in meta-execution

But in meta-execution the type signature is a giant tree of messages (which can be compressed by an approval-directed encoder); I don't see how to use that type of "value" with any value-learning approach not based on amplification (and I don't see what other type of "value" is plausible).

It seems like you're saying here that an aligned H could have certain misunderstandings which causes the AI they train to be unaligned. But whatever unaligned thing that the AI ends up doing, H could also do as a result of the same misunderstanding (if we put a bunch of H's together, or let one H run for a long subjective time), so why does it make sense to call this AI unaligned but this H aligned?

A giant organization made of aligned agents can be unaligned. Does this answer the question? This seems to be compatible with this definition of alignment, of "trying to do what we want it to do." There is no automatic reason that alignment would be preserved under amplification. (I'm hoping to preserve alignment inductively in amplification, but that argument isn't trivial.)

Do you think I'm still misunderstanding it at this point?

Probably not, I don't have a strong view.

But in meta-execution the type signature is a giant tree of messages (which can be compressed by an approval-directed encoder); I don’t see how to use that type of “value” with any value-learning approach not based on amplification (and I don’t see what other type of “value” is plausible).

In the (source text --> meaning) example, I thought meta-execution would end up with a data structure that's more or less equivalent to some standard data structure that's used in linguistics to represent meaning. Was that a misunderstanding, or does the analogy not carry over to "value"? (EDIT: Maybe it would help if you expanded the task tree a bit for "value"?)

A giant organization made of aligned agents can be unaligned. Does this answer the question?

What about the other part of my question, the case of just one "aligned" H, doing the same thing that the unaligned AI would do?

There is no automatic reason that alignment would be preserved under amplification. (I’m hoping to preserve alignment inductively in amplification, but that argument isn’t trivial.)

You're saying that alignment by itself isn't preserved by amplification, but alignment+X hopefully is for some currently unknown X, right?

In the (source text --> meaning) example, I thought meta-execution would end up with a data structure that's more or less equivalent to some standard data structure that's used in linguistics to represent meaning. Was that a misunderstanding, or does the analogy not carry over to "value"?

I think it would be similar to a standard data structure, though probably richer. But I don't see what the analogous structure would be in the case of "value."

Representations of value would include things like "In situations with character {x} the user mostly cares about {y}, but that might change if you were able to influence any of {z}" where z includes things like "influence {the amount of morally relevant experience} in a way that is {significant}", where the {}'s refer to large subtrees, that encapsulate all of the facts you would currently use in assessing whether something affects the amount of morally relevant conscious experience, all of the conditions under which you would change your views about that, etc.
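
Writing that structure out concretely (purely illustrative; each "..." stands for one of the large elided subtrees being described):

```python
# The value representation described above, written out as a nested tree.
# Each "..." marks a large subtree (facts used in the assessment, conditions
# under which the view would change, etc.) that is elided here.

value_claim = {
    "in_situations_with_character": {"x": "..."},
    "the_user_mostly_cares_about": {"y": "..."},
    "but_that_might_change_if_you_could_influence": [
        {
            "influence": {"the amount of morally relevant experience": "..."},
            "in_a_way_that_is": {"significant": "..."},
        },
    ],
}
```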

What about the other part of my question, the case of just one "aligned" H, doing the same thing that the unaligned AI would do?

If I implement a long computation, that computation can be unaligned even if I am aligned, for exactly the same reason.

You're saying that alignment by itself isn't preserved by amplification, but alignment+X hopefully is for some currently unknown X, right?

We could either strengthen the inductive invariant (as you've suggested) or change the structure of amplification.

Could we approximate a naive function that agents would be attempting to maximize (for the sake of understanding)? I imagine it would include:

1. If the user were to rate this answer, where supplemental & explanatory information is allowed, what would be their expected rating?

2. How much did the actions of this agent positively or negatively affect the system's expected corrigibility?

3. If a human were to rank the overall safety of this action, without the corrigibility, what is their expected rating?

*Note: maybe for #1, #3, the user should be able to call HCH additional times in order to evaluate the true quality of the answer. Also, #3 is mostly a "catch-all"; it would of course be better to define it in more concrete detail, and preferably break it up.

A very naive answer value function would be something like:

HumanAnswerRating + CorrigibilityRating + SafetyRating
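
As a sketch under these assumptions (each rating function is a hypothetical stand-in for a query to the user or additional calls to HCH, as the note above suggests):

```python
# The naive objective proposed above: a sum of three ratings, each of which
# would in practice come from the user or from additional calls to HCH.
# All functions here are hypothetical placeholders.

def human_answer_rating(answer) -> float:
    return 0.0   # expected rating if the user judged the answer

def corrigibility_rating(actions) -> float:
    return 0.0   # effect of these actions on expected corrigibility

def safety_rating(actions) -> float:
    return 0.0   # catch-all safety judgment, excluding corrigibility

def naive_value(answer, actions) -> float:
    return (human_answer_rating(answer)
            + corrigibility_rating(actions)
            + safety_rating(actions))
```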

I think that the bank example falls into "intent corrigibility". The action "hack the bank" was output because the AI formed an approximate model of your morals and then optimised that approximate model "too hard", coming up with an action that did well on the proxy but not on the real thing. The understanding of how not to do this doesn't depend on how well you can understand the goal specification, but on the meta-level knowledge that optimizing approximate reward functions can lead to undesired results.

(The AI also failed to ask you clarifying questions about its model of your morals, failed to realize that it could instead have tried to do imitation learning or quantilization to come up with a plan more like what you had in mind, etc.)

I think the argument that worst-case guarantees about "intent corrigibility" are possible is that 1) they only need to cover how the finite "universal core" of queries is handled, and 2) it's possible to do lots of pre-computation, as I discuss in my other comment, as well as delegating to other subagents. So the question isn't "Would someone with 15 minutes to think about answering this query find the ambiguity?", it's "Would a community of AI researchers with a long time to think about answering this be able to provide training to someone so that they and a bunch of assistants find the ambiguity"? I agree that this seems hard and it could fail, but I think I'm at the point of "let's try this through things like Ought's experiments", and it could either turn out to seem possible or impossible based on that.

(An example of "consequence corrigibility" would be if you were okay with hacking the bank but only as long as it doesn't lead to you going to jail. The AI comes up with a plan to hack the bank that it thinks won't get caught by the police. But the AI underestimated the intelligence of the police, gets caught, and this lands you in jail. This situation isn't "corrigible" in the sense that you've lost control over the world.)

“Would a community of AI researchers with a long time to think about answering this be able to provide training to someone so that they and a bunch of assistants find the ambiguity”?

But this seems as hard as writing an algorithm that can model humans and reliably detect any ambiguities/errors in its model. Since the Overseer and assistants can't use or introspectively access their native human modeling and ambiguity detection abilities, aren't you essentially using them as "human transistors" to perform mechanical computations and model the user the same way an algorithm would? If you can do that with this and other aspects of corrigibility, why not just implement the algorithms in a computer?

I agree that this seems hard and it could fail, but I think I’m at the point of “let’s try this through things like Ought’s experiments”, and it could either turn out to seem possible or impossible based on that.

Yeah, I'm uncertain enough in my conclusions that I'd also like to see empirical investigations. (I sent a link of this post to Andreas Stuhlmüller so hopefully Ought will do some relevant experiments at some point.)

aren't you essentially using them as "human transistors" to perform mechanical computations and model the user the same way an algorithm would? If you can do that with this and other aspects of corrigibility, why not just implement the algorithms in a computer?

In general, the two advantages are:

  • You may be able to write an algorithm which works but is very slow (e.g. exponentially slow). In this case, amplification can turn it into something competitive.
  • Even if you need to reduce humans to rather small inputs in order to be comfortable about security, you still have much more expressive power than something hand-coded.

I think the first advantage is more important.
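
A rough sketch of the first point (all names are hypothetical; the "slow algorithm" is whatever exponentially slow overseer procedure one could write down, and `train_model_to_imitate` stands in for the distillation step):

```python
# Iterated amplification/distillation as a way to approximate a slow overseer
# procedure: let the slow decomposition call the current fast agent on
# subtasks (amplification), then distill the amplified behavior back into a
# fast agent. Every function here is a hypothetical stand-in.

def slow_overseer_step(task, assistant):
    # One step of the (exponentially slow if fully expanded) procedure,
    # which delegates subtasks to the current fast assistant.
    subtasks = [f"{task}::part{i}" for i in range(2)]
    return " + ".join(assistant(sub) for sub in subtasks)

def train_model_to_imitate(examples):
    # Stand-in for distillation (e.g. supervised learning on (task, answer)).
    table = dict(examples)
    return lambda task: table.get(task, f"<guess for {task!r}>")

def ida(tasks, rounds=3):
    agent = lambda task: f"<base answer for {task!r}>"
    for _ in range(rounds):
        amplified = lambda task, a=agent: slow_overseer_step(task, a)
        examples = [(t, amplified(t)) for t in tasks]
        agent = train_model_to_imitate(examples)   # distill
    return agent

print(ida(["make a plan"])("make a plan"))
```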

You may be able to write an algorithm which works but is very slow (e.g. exponentially slow). In this case, amplification can turn it into something competitive.

In this case we don't need a human Overseer, right? Just an algorithm that serves as the initial H? And then IDA is being used as a method of quickly approximating the exponentially slow algorithm, and we can just as well use another method of approximation, if there's one that's more specifically suited for the particular algorithm that we want to approximate?

Even if you need to reduce humans to rather small inputs in order to be comfortable about security, you still have much more expressive power than something hand-coded.

William was saying that AI researchers could provide training to the Overseer to help them detect ambiguity (and I guess to build models of the user in the first place). It's hard for me to think of what kind of training they could provide, such that the amplified Overseer would then be able to (by acting on small inputs) model humans and reliably detect any ambiguities/errors in its model, without that training essentially being "execute this hand-coded algorithm".

In this case we don't need a human Overseer, right? Just an algorithm that serves as the initial H? And then IDA is being used as a method of quickly approximating the exponentially slow algorithm, and we can just as well use another method of approximation, if there's one that's more specifically suited for the particular algorithm that we want to approximate?

IDA is a way of approximating that algorithm that can be competitive with deep RL. If you found some other approximation method that would be competitive with deep RL, then that would be a fine replacement for IDA in this scenario (which I think has about 50% probability conditioned on IDA working). I'm not aware of any alternative proposals, and it doesn't seem likely to me that the form of the algorithm-to-be-approximated will suggest a method of approximation that could plausibly be competitive.

If I thought MIRI was making optimal progress towards a suitable algorithm-to-be-approximated by IDA then I'd be much more supportive of their work (and I'd like to discuss this with some MIRI folk and maybe try to convince them to shift in this direction).

I don't think that e.g. decision theory or naturalized induction (or most other past/current MIRI work) is a good angle of attack on this problem, because a successful system needs to be able to defer that kind of thinking to have any chance and should instead be doing something more like metaphilosophy and deference. Eliezer and Nate in the past have explicitly rejected this position, because of the way they think that "approximating an idealized algorithm" will work. I think that taking IDA seriously as an approximation scheme ought to lead someone to work on different problems than MIRI.

It's hard for me to think of what kind of training they could provide, such that the amplified Overseer would then be able to (by acting on small inputs) model humans and reliably detect any ambiguities/errors in its model, without that training essentially being "execute this hand-coded algorithm".

I agree that "reliably detect ambiguities/errors" is out of reach for a small core of reasoning.

I don't share this intuition about the more general problem (that we probably can't find a corrigible, universal core of reasoning unless we can hard code it), but if your main argument against is that you don't see how to do it then this seems like the kind of thing that can be more easily answered by working directly on the problem rather than by trying to reconcile intuitions.

I don't share this intuition about the more general problem (that we probably can't find a corrigible, universal core of reasoning unless we can hard code it)

If your definition of "corrigible" does not include things like the ability to model the user and detect ambiguities as well as a typical human, then I don't currently have a strong intuition about this. Is your view/hope then that starting with such a core, if we amplify it enough, eventually it will figure out how to safely learn (or deduce from first principles, or something else) how to understand natural language, model the user, detect ambiguities, balance between the user's various concerns, and so on? (If not, it would be stuck with either refusing to do anything except literal-minded mechanical tasks that don't require such abilities, or frequently making mistakes of the type "hack a bank when I ask it to make money", which I don't think is what most people have in mind when they think of "aligned AGI".)

Is your view/hope then that starting with such a core, if we amplify it enough, eventually it will figure out how to safely learn (or deduce from first principles, or something else) how to understand natural language, model the user, detect ambiguities, balance between the user's various concerns, and so on?

Yes. My hope is to learn or construct a core which:

  • Doesn't do incorrigible optimization as it is amplified.
  • Increases in competence as it is amplified, including competence at tasks like "model the user," "detect ambiguities" or "make reasonable tradeoffs about VOI vs. safety" (including info about the user's preferences, and "safety" about the risk of value drift). I don't have optimism about finding a core which is already highly competent at these tasks.

I grant that even given such a core, we will still be left with important and unsolved x-risk relevant questions like "Can we avoid value drift over the process of deliberation?"

It appears that I seriously misunderstood what you mean by corrigibility when I wrote this post. But in my defense, in your corrigibility post you wrote, "We say an agent is corrigible (article on Arbital) if it has these properties." and the list includes helping you "Make better decisions and clarify my preferences" and "Acquire resources and remain in effective control of them" and to me these seem to require at least near human level ability to model the user and detect ambiguities. And others seem to have gotten the same impression from you. Did your conception of corrigibility change at some point, or did I just misunderstand what you wrote there?

Since this post probably gave even more people the wrong impression, I should perhaps write a correction, but I'm not sure how. How should I fill in this blank? "The way I interpreted Paul's notion of corrigibility in this post is wrong. It actually means ___."

Increases in competence as it is amplified, including competence at tasks like “model the user,” “detect ambiguities” or “make reasonable tradeoffs about VOI vs. safety”

Is there a way to resolve our disagreement/uncertainty about this, short of building such an AI and seeing what happens? (I'm imagining that it would take quite a lot of amplification before we can see clear results in these areas, so it's not something that can be done via a project like Ought?)

I think your post is (a) a reasonable response to corrigibility as outlined in my public writing, (b) a reasonable but not decisive objection to my current best guess about how amplification could work. In particular, I don't think anything you've written is too badly misleading.

In the corrigibility post, when I said "AI systems which help me do X" I meant something like "AI systems which help me do X to the best of their abilities," rather than having in mind some particular threshold for helpfulness at which an AI is declared corrigible (similarly, I'd say an AI is aligned if it's helping me achieve my goals to the best of its abilities, rather than fixing a certain level of helpfulness at which I'd call it aligned). I think that post was unclear, and my thinking has become a lot sharper since then, but the whole situation is still pretty muddy.

Even that's not exactly right, and I don't have a simple definition. I do have a lot of intuitions about why there might be a precise definition, but those are even harder to pin down.

(I'm generally conflicted about how much to try to communicate publicly about early stages of my thinking, given how frequently it changes and how fuzzy the relevant concepts are. I've decided to opt for a medium level of communication, since it seems like the potential benefits are pretty large. I'm sorry that this causes a lot of trouble though, and in this case I probably should have been more careful about muddying notation. I also recognize it means people are aiming at a moving target when they try to engage; I certainly don't fault people for that, and I hope it doesn't make it too much harder to get engagement with more precise versions of similar ideas in the future.)

Is there a way to resolve our disagreement/uncertainty about this, short of building such an AI and seeing what happens? (I'm imagining that it would take quite a lot of amplification before we can see clear results in these areas, so it's not something that can be done via a project like Ought?)

What uncertainty in particular?

Things I hope to see before we have very powerful AI:

  • Clearer conceptual understanding of corrigibility.
  • Significant progress towards a core for metaexecution (either an explicit core, or an implicit representation as a particular person's policy), which we can start to investigate empirically.
  • Amplification experiments which show clearly how complex tasks can be broken into simpler pieces, and let us talk much more concretely about what those decompositions look like and in what ways they might introduce incorrigible optimization. These will also directly resolve logical uncertainty about whether proposed decomposition techniques actually work.
  • Application of amplification to some core challenges for alignment, most likely either (a) producing competitive interpretable world models, or (b) improving reliability, which will make it especially easy to discuss whether amplification can safely help with these particular problems.

If my overall approach is successful, I don't feel like there are significant uncertainties that we won't be able to resolve until we have powerful AI. (I do think there is a significant risk that I will become very pessimistic about the "pure" version of the approach, and that it will be very difficult to resolve uncertainties about the "messy" version of the approach in advance because it is hard to predict whether the difficulties for the pure version are really going to be serious problems in practice.)

I’m generally conflicted about how much to try to communicate publicly about early stages of my thinking, given how frequently it changes and how fuzzy the relevant concepts are.

Among people I've had significant online discussions with, your writings on alignment tend to be the hardest to understand and easiest to misunderstand. From a selfish perspective I wish you'd spend more time writing down more details and trying harder to model your readers and preempt ambiguities and potential misunderstandings, but of course the tradeoffs probably look different from your perspective. (I also want to complain (again?) that Medium.com doesn't show discussion threads in a nice tree structure, and doesn't let you read a comment without clicking to expand it, so it's hard to see what questions other people asked and how you answered. Ugh, talk about trivial inconveniences.)

What uncertainty in particular?

How much can the iterated amplification of an impoverished overseer safely learn about how to help humans (how to understand natural language, build models of users, detect ambiguity, being generally competent)? Is it enough to attract users and to help them keep most of their share of the cosmic endowment against competition with malign AIs?