Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

AI Alignment is a confusing topic in general, but even compared to other alignment topics, IDA seems especially confusing. Some of it is surely just due to the nature of communicating subtle and unfinished research ideas, but other confusions can be cleared up with more specific language or additional explanations. To help people avoid some of the confusions I or others fell into in the past while trying to understand IDA (and to remind myself about them in the future), I came up with this list of past confusions that I think have mostly been resolved at this point. (However there's some chance that I'm still confused about some of these issues and just don't realize it. I've included references to the original discussions where I think the confusions were cleared up so you can judge for yourself.)

I will try to maintain this list as a public reference so please provide your own resolved confusions in the comments.

alignment = intent alignment

At some point Paul started using "alignment" refer to the top-level problem that he is trying to solve, and this problem is narrower (i.e., leaves more safety problems to be solved elsewhere) than the problem that other people were using "alignment" to describe. He eventually settled upon "intent alignment" as the formal term to describe his narrower problem, but occasionally still uses just "aligned" or "alignment" as shorthand for it. Source

short-term preferences ≠ narrow preferences

At some point Paul used "short-term preferences" and "narrow preferences" interchangeably, but no longer does (or at least no longer endorses doing so). Source

preferences = "actual" preferences (e.g., preferences-on-reflection)

When Paul talks about preferences he usually means "actual" preferences (for example the preferences someone would arrive at after having a long time to think about it while having access to helpful AI assistants, if that's a good way to find someone's "actual" preferences). He does not mean their current revealed preferences or the preferences they would state or endorse now if you were to ask them. Source

corrigibility ≠ based on short-term preferences

I had misunderstood Paul to be using "corrigibility to X" as synonymous with "based on X's short-term preferences". Actually "based on X's short-term preferences" is a way to achieve corrigibility to X, because X's short-term preferences likely includes "be corrigible to X" as a preference. "Corrigibility" itself means something like "allows X to modify the agent" or a generalization of this concept. Source

act-based = based on short-term preferences-on-reflection

My understanding is that "act-based agent" used to mean something different (i.e., a simpler kind of AI that tries to do the same kind of action that a human would), but most people nowadays use it to mean an AI that is designed to satisfy someone's short-term preferences-on-reflection, even though that no longer seems particularly "act-based". Source

act-based corrigibility

Evan Hubinger used "act-based corrigibility" to mean both a method of achieving corrigibility (based on short-term preferences) and the kind of corrigibility achieved by that method. (I'm not sure if he still endorses using the term this way.) Source

learning user preferences for corrigibility isn't enough for corrigible behavior

Because an act-based agent is about "actual" preferences not "current" preferences, it may be incorrigible even if it correctly learns that the user currently prefers the agent to be corrigible, if it incorrectly infers or extrapolates the user's "actual" preferences, or if the user's "actual" preferences do not actually include corrigibility as a preference. (ETA: Although in the latter case presumably the "actual" preferences include something even better than corrigibility.) Source

distill ≈ RL

Summaries of IDA often describe the "distill" step as using supervised learning, but Paul and others working on IDA today usually have RL in mind for that step. Source

outer alignment problem exists? = yes

The existing literature on IDA (including a post about "reward engineering") seems to have neglected to describe an outer alignment problem associated with using RL for distillation. (Analogous problems may also exist if using other ML techniques such as SL.) Source

corrigible to the user? ≈ no

IDA is typically described as being corrigible to the user. But in reality it would be trying to satisfy a combination of preferences coming from the end user, the AI developer/overseer, and even law enforcement or other government agencies. I think this means that "corrigible to the user" is very misleading, because the AI is actually not likely to respect the user's preferences to modify (most aspects of) the AI or to be "in control" of the AI. Sources: this comment and a talk by Paul at an AI safety workshop

strategy stealing ≠ literally stealing strategies

When Paul says "strategy stealing" he doesn't mean observing and copying someone else's strategy. It's a term borrowed from game theory that he's using to refer to coming up with strategies that are as effective as someone else's strategy in terms of gaining resources and other forms of flexible influence. Source

New to LessWrong?

New Comment
18 comments, sorted by Click to highlight new comments since: Today at 6:00 AM
[-]Ben Pace5yΩ10250

This is a great post! I know there's been lots of conversations here and elsewhere about this topic, often going for dozens of comments, and I felt like a lot of them needed summarising else they'd be lost to history. Thanks for summarising them briefly and linking back to them.

[-]Wei Dai5yΩ10230

Thanks! Yeah, one of my motivations for this post is that I was losing track of these discussions myself and falling back into confusion that was already cleared up. For example, after reading one of Paul's latest clarifications, I had a strong feeling that he had told me that already on a previous occasion, but I couldn't remember when. Another push came from my discussion with Raymond Arnold (Raemon) about distillation where we talked about how it's weird to summarize a debate/disagreement as one of the participants, and it kind of made me realize that summarizing resolved confusions has less of this problem.

[-]Ben Pace5yΩ3110

Curated.

I'm not sure how "resolved" this confusion is, but I've gone back and forth a few times on what's the core reason(s) that we're supposed to expect IDA to create systems that won't do anything catastrophic: (1) because we're starting with human imitation / human approval which is safe, and the amplification step won't make it unsafe? (2) because "Corrigibility marks out a broad basin of attraction"? (3) because we're going to invent something along the lines of Techniques for optimizing worst-case performance? and/or (4) something else?

For example, in Challenges to Christiano’s capability amplification proposal Eliezer seemed to be under the impression that it's (1), but Paul replied that it was really (3), if I'm reading it correctly..?

act-based = based on short-term preferences-on-reflection

For others who were confused about what "short-term preferences-on-reflection" would mean, I found this comment and its reply to be helpful.

Putting it into my own words: short-term preferences-on-reflection are about what you would want to happen in the near term, if you had a long time to think about it.

By way of illustration, AlphaZero's long-term preference is to win the chess game, its short-term preference is whatever its policy network spits out as the best move to make next, and its short-term preference-on-reflection is the move it wants to make next after doing a fuck-ton of MCTS.

By way of illustration, AlphaZero's long-term preference is to win the chess game, its short-term preference is whatever its policy network spits out as the best move to make next, and its short-term preference-on-reflection is the move it wants to make next after doing a fuck-ton of MCTS.

Short-term preferences are the value function one or a few moves out. If the algorithm is "reasonable," then its short-term preference-on-reflection are the true function P(I win the game|I make this move). You could also talk about intermediate degrees of reflection.

I used to think that after the initial distillation step, the AI would be basically human-level. Now I understand that after the initial distillation step, the AI will be superhuman in some respects and subhuman in others, but wouldn't be "basically human" in any sense. Source

[-]Ofer5yΩ350
The existing literature on IDA (including a post about "reward engineering") seems to have neglected to describe an outer alignment problem associated with using RL for distillation. (Analogous problems may also exist if using other ML techniques such as SL.) Source

I'm confused about what outer alignment problems might exist when using supervised learning for distillation (though maybe this is just due to me using an incorrect/narrower interpretation of "outer alignment problems" or "using supervised learning for distillation").

I still feel confused about "distill ≈ RL". In RL+Imitation (which I assume is also talking about distillation, and which was written after Semi-supervised reinforcement learning), Paul says things like "In the same way that we can reason about AI control by taking as given a powerful RL system or powerful generative modeling, we could take as given a powerful solution to RL+imitation. I think that this is probably a better assumption to work with" and "Going forward, I’ll preferentially design AI control schemes using imitation+RL rather than imitation, episodic RL, or some other assumption".

Was there a later place where Paul went back to just RL? Or is RL+Imitation about something other than distillation? Or is the imitation part such a small contribution that writing "distill ≈ RL" is still accurate?

ETA: From the FAQ for Paul's agenda:

1.2.2: OK, so given this am­plified al­igned agent, how do you get the dis­til­led agent?

Train a new agent via some com­bi­na­tion of imi­ta­tion learn­ing (pre­dict­ing the ac­tions of the am­plified al­igned agent), semi-su­per­vised re­in­force­ment learn­ing (where the am­plified al­igned agent helps spec­ify the re­ward), and tech­niques for op­ti­miz­ing ro­bust­ness (e.g. cre­at­ing red teams that gen­er­ate sce­nar­ios that in­cen­tivize sub­ver­sion).

and:

The imi­ta­tion learn­ing is more about get­ting this new agent off the ground than about en­sur­ing al­ign­ment. The bulk of the al­ign­ment guaran­tee comes from the semi-su­per­vised re­in­force­ment learn­ing, where we train it to work on a wide range of tasks and an­swer ques­tions about its cog­ni­tion.

At some point Paul used "short-term preferences" and "narrow preferences" interchangeably, but no longer does (or at least no longer endorses doing so).

I would like to have these two terms defined. Let me offer my understanding from reading the relevant thread.

short-term preferences = short-term preferences-on-reflection ≠ narrow preferences

Short-term preferences refer to the most useful action I can take next, given my ultimate goals. This is to be contrasted with my current best guess about the outcome of that process. It's what I would want, not what I do want.

An AI optimising for my short-term preferences may reasonably say "No, don't take this action, because you'd actually prefer this alternative action if you only thought longer. It fits your true short-term preferences, you're just mistaken about them." This is in contrast with something you might call narrow preferences, which is where you tell the AI to do what you said anyway.

My understanding is that Paul never meant to introduce the term "narrow preferences" (i.e. "narrow" is not an adjective that applies to preferences), and the fact that he talked about narrow preferences in the act-based agents post was an accident/something he no longer endorses.

Instead, when Paul says "narrow", he's talking not about preferences but about narrow vs ambitious value learning. This is what Paul means when he says "I've only ever used [the term "narrow"] in the context of value learning, in order to make this particular distinction between two different goals you might have when doing value learning."

See also this comment and the ambitious vs narrow value learning post.

Oh, okay. Is it not important to have a name for the class of thing we could accidentally train an ML system to optimise for that isn't our ultimate preferences? Is there a term for that?

I think Paul calls that "preferences-as-elicited", so if we're talking about act-based agents, it would be "short-term preferences-as-elicited" (see this comment).

Seems odd to have the idealistic goal get to be the standard name, and the dime-a-dozen failure mode be a longer name that is more confusing.

I note that Wei says a similar thing happened to 'act-based':

My understanding is that "act-based agent" used to mean something different (i.e., a simpler kind of AI that tries to do the same kind of action that a human would), but most people nowadays use it to mean an AI that is designed to satisfy someone's short-term preferences-on-reflection, even though that no longer seems particularly "act-based".

Is there a reason why the standard terms are not being used to refer to the standard, short-term results?

(I suppose that economics assumes rational agents who know their preferences, so taking language from economics might lead to this situation with the 'short-term preferences' decision.)

In the post Wei contrasts "current" and "actual" preferences. "Stated" vs "reflective" preferences also seem like nice alternatives too.

Seems odd to have the idealistic goal get to be the standard name, and the dime-a-dozen failure mode be a longer name that is more confusing.

I agree this is confusing.

Is there a reason why the standard terms are not being used to refer to the standard, short-term results?

As far as I know, Paul hasn't explained his choice in detail. One reason he does mention, in this comment, is that in the context of strategy-stealing, preferences like "help me stay in control and be well-informed" do not make sense when interpreted as preferences-as-elicited, since the current user has no way to know if they are in control or well-informed.

In the post Wei contrasts "current" and "actual" preferences. "Stated" vs "reflective" preferences also seem like nice alternatives too.

I think current=elicited=stated, but actual≈reflective (because there is the possibility that undergoing reflection isn't a good way to find out our actual preferences, or as Paul says 'There’s a hy­poth­e­sis that “what I’d say af­ter some par­tic­u­lar ideal­ized pro­cess of re­flec­tion” is a rea­son­able way to cap­ture “ac­tual prefer­ences,” but I think that’s up for de­bate—e.g. it could fail if me-on-re­flec­tion is self­ish and has val­ues op­posed to cur­rent-me, and cer­tainly it could fail for any par­tic­u­lar pro­cess of re­flec­tion and so it might just hap­pen to be the case that there is no pro­cess of re­flec­tion that satis­fies it.')

As far as I know, Paul hasn't explained his choice in detail. One reason he does mention, in this comment, is that in the context of strategy-stealing, preferences like "help me stay in control and be well-informed" do not make sense when interpreted as preferences-as-elicited, since the current user has no way to know if they are in control or well-informed.

I agree this example adds nuance, and I'm unsure how to correctly categorise it.

You have a section titled

learning user preferences for corrigibility isn't enough for corrigible behavior

Would this be more consistently titled "Learning narrow preferences for corrigibility isn't enough for corrigible behavior"?

I understand Paul to be saying that he hopes that corrigibility will fall out if we train an AI to score well on your short-term preferences, not just your narrow-preferences.