List of resolved confusions about IDA

Ben Pace (6y)

This is a great post! I know there have been lots of conversations here and elsewhere about this topic, often running to dozens of comments, and I felt like a lot of them needed summarising, or else they'd be lost to history. Thanks for summarising them briefly and linking back to them.

Wei Dai (6y)

Thanks! Yeah, one of my motivations for this post is that I was losing track of these discussions myself and falling back into confusion that was already cleared up. For example, after reading one of Paul's latest clarifications, I had a strong feeling that he had told me that already on a previous occasion, but I couldn't remember when. Another push came from my discussion with Raymond Arnold (Raemon) about distillation where we talked about how it's weird to summarize a debate/disagreement as one of the participants, and it kind of made me realize that summarizing resolved confusions has less of this problem.

Ben Pace (6y)

Curated.

Steven Byrnes (6y)

I'm not sure how "resolved" this confusion is, but I've gone back and forth a few times on what the core reason(s) are that we're supposed to expect IDA to create systems that won't do anything catastrophic: (1) because we're starting with human imitation / human approval, which is safe, and the amplification step won't make it unsafe? (2) because "Corrigibility marks out a broad basin of attraction"? (3) because we're going to invent something along the lines of Techniques for optimizing worst-case performance? and/or (4) something else?

For example, in Challenges to Christiano’s capability amplification proposal, Eliezer seemed to be under the impression that it's (1), but Paul replied that it was really (3), if I'm reading it correctly?

ESRogs (6y)

act-based = based on short-term preferences-on-reflection

For others who were confused about what "short-term preferences-on-reflection" would mean, I found this comment and its reply to be helpful.

Putting it into my own words: short-term preferences-on-reflection are about what you would want to happen in the near term, if you had a long time to think about it.

By way of illustration, AlphaZero's long-term preference is to win the chess game, its short-term preference is whatever its policy network spits out as the best move to make next, and its short-term preference-on-reflection is the move it wants to make next after doing a fuck-ton of MCTS.

paulfchristiano (6y)

By way of illustration, AlphaZero's long-term preference is to win the chess game, its short-term preference is whatever its policy network spits out as the best move to make next, and its short-term preference-on-reflection is the move it wants to make next after doing a fuck-ton of MCTS.

Short-term preferences are the value function one or a few moves out. If the algorithm is "reasonable," then its short-term preferences-on-reflection are the true function P(I win the game|I make this move). You could also talk about intermediate degrees of reflection.
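
As a rough illustration of the three notions above, here is a minimal sketch that swaps chess for a toy subtraction game (take 1 to 3 stones; whoever takes the last stone wins) and swaps MCTS for plain depth-limited minimax. Everything in it (`heuristic`, `value`, `move_values`, `preferred_move`, and the game itself) is a hypothetical stand-in, not anything AlphaZero actually uses. One ply of lookahead plays the role of the short-term preference, deeper lookahead gives intermediate degrees of reflection, and lookahead to the end of the game recovers the true win probability of each move, which against an optimal opponent in this deterministic game is always 0 or 1.

```python
MOVES = (1, 2, 3)  # a player may remove 1, 2, or 3 stones per turn


def legal_moves(pile):
    """Moves available when `pile` stones remain."""
    return [m for m in MOVES if m <= pile]


def heuristic(pile):
    """Crude, deliberately flawed stand-in for a learned value function.

    Returns a guessed win chance for the player to move. It wrongly assumes
    that facing more stones is always slightly better (an assumption made
    up for this example).
    """
    if pile == 0:
        return 0.0  # the previous player took the last stone and won
    return min(0.9, 0.5 + 0.01 * pile)


def value(pile, depth):
    """Estimated win chance for the player to move, after `depth` plies of lookahead."""
    if pile == 0:
        return 0.0
    if depth <= 0:
        return heuristic(pile)  # out of thinking time: fall back on the guess
    # My best move leaves the opponent in the position that is worst for them.
    return max(1.0 - value(pile - m, depth - 1) for m in legal_moves(pile))


def move_values(pile, depth):
    """Estimated win chance of each legal move, using `depth` >= 1 plies of lookahead."""
    return {m: 1.0 - value(pile - m, depth - 1) for m in legal_moves(pile)}


def preferred_move(pile, depth):
    """The move the agent 'wants' at this degree of reflection."""
    vals = move_values(pile, depth)
    return max(vals, key=vals.get)


if __name__ == "__main__":
    pile = 5
    print(preferred_move(pile, depth=1))     # short-term preference: 3 (greedy, and losing)
    print(preferred_move(pile, depth=2))     # intermediate reflection: 1
    print(preferred_move(pile, depth=pile))  # preference-on-reflection: 1
    print(move_values(pile, depth=pile))     # {1: 1.0, 2: 0.0, 3: 0.0}
```

With five stones, the one-ply preference is to grab three stones, which hands the opponent a win, while any deeper lookahead prefers taking one stone and leaving the opponent with four. The long-term preference (winning) never changes; only the agent's estimate of which next move serves it does.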

riceissa (6y)

I used to think that after the initial distillation step, the AI would be basically human-level. Now I understand that after the initial distillation step, the AI will be superhuman in some respects and subhuman in others, but wouldn't be "basically human" in any sense. Source

Ofer (6y)

The existing literature on IDA (including a post about "reward engineering") seems to have neglected to describe an outer alignment problem associated with using RL for distillation. (Analogous problems may also exist if using other ML techniques such as SL.) Source

I'm confused about what outer alignment problems might exist when using supervised learning for distillation (though maybe this is just due to me using an incorrect/narrower interpretation of "outer alignment problems" or "using supervised learning for distillation").

riceissa (6y)

I still feel confused about "distill ≈ RL". In RL+Imitation (which I assume is also talking about distillation, and which was written after Semi-supervised reinforcement learning), Paul says things like "In the same way that we can reason about AI control by taking as given a powerful RL system or powerful generative modeling, we could take as given a powerful solution to RL+imitation. I think that this is probably a better assumption to work with" and "Going forward, I’ll preferentially design AI control schemes using imitation+RL rather than imitation, episodic RL, or some other assumption".

Was there a later place where Paul went back to just RL? Or is RL+Imitation about something other than distillation? Or is the imitation part such a small contribution that writing "distill ≈ RL" is still accurate?

ETA: From the FAQ for Paul's agenda:

1.2.2: OK, so given this amplified aligned agent, how do you get the distilled agent?

Train a new agent via some combination of imitation learning (predicting the actions of the amplified aligned agent), semi-supervised reinforcement learning (where the amplified aligned agent helps specify the reward), and techniques for optimizing robustness (e.g. creating red teams that generate scenarios that incentivize subversion).

and:

The imitation learning is more about getting this new agent off the ground than about ensuring alignment. The bulk of the alignment guarantee comes from the semi-supervised reinforcement learning, where we train it to work on a wide range of tasks and answer questions about its cognition.

Ben Pace (6y)

At some point Paul used "short-term preferences" and "narrow preferences" interchangeably, but no longer does (or at least no longer endorses doing so).

I would like to have these two terms defined. Let me offer my understanding from reading the relevant thread.

short-term preferences = short-term preferences-on-reflection ≠ narrow preferences

Short-term preferences refer to the most useful action I can take next, given my ultimate goals. This is to be contrasted with my current best guess about the outcome of that process. It's what I would want, not what I do want.

An AI optimising for my short-term preferences may reasonably say "No, don't take this action, because you'd actually prefer this alternative action if you only thought longer. It fits your true short-term preferences, you're just mistaken about them." This is in contrast with something you might call narrow preferences, which is where you tell the AI to do what you said anyway.

riceissa (6y)

My understanding is that Paul never meant to introduce the term "narrow preferences" (i.e. "narrow" is not an adjective that applies to preferences), and the fact that he talked about narrow preferences in the act-based agents post was an accident/something he no longer endorses.

Instead, when Paul says "narrow", he's talking not about preferences but about narrow vs ambitious value learning. This is what Paul means when he says "I've only ever used [the term "narrow"] in the context of value learning, in order to make this particular distinction between two different goals you might have when doing value learning."

Ben Pace (6y)

Oh, okay. Is it not important to have a name for the class of thing we could accidentally train an ML system to optimise for that isn't our ultimate preferences? Is there a term for that?

riceissa (6y)

I think Paul calls that "preferences-as-elicited", so if we're talking about act-based agents, it would be "short-term preferences-as-elicited" (see this comment).

Ben Pace (6y)

Seems odd to have the idealistic goal get to be the standard name, and the dime-a-dozen failure mode be a longer name that is more confusing.

I note that Wei says a similar thing happened to 'act-based':

My understanding is that "act-based agent" used to mean something different (i.e., a simpler kind of AI that tries to do the same kind of action that a human would), but most people nowadays use it to mean an AI that is designed to satisfy someone's short-term preferences-on-reflection, even though that no longer seems particularly "act-based".

Is there a reason why the standard terms are not being used to refer to the standard, short-term results?

(I suppose that economics assumes rational agents who know their preferences, so taking language from economics might lead to this situation with the 'short-term preferences' decision.)

In the post Wei contrasts "current" and "actual" preferences. "Stated" vs "reflective" preferences also seem like nice alternatives.

riceissa (6y)

Seems odd to have the idealistic goal get to be the standard name, and the dime-a-dozen failure mode be a longer name that is more confusing.

I agree this is confusing.

Is there a reason why the standard terms are not being used to refer to the standard, short-term results?

As far as I know, Paul hasn't explained his choice in detail. One reason he does mention, in this comment, is that in the context of strategy-stealing, preferences like "help me stay in control and be well-informed" do not make sense when interpreted as preferences-as-elicited, since the current user has no way to know if they are in control or well-informed.

In the post Wei contrasts "current" and "actual" preferences. "Stated" vs "reflective" preferences also seem like nice alternatives.

I think current=elicited=stated, but actual≈reflective (because there is the possibility that undergoing reflection isn't a good way to find out our actual preferences, or as Paul says 'There’s a hypothesis that “what I’d say after some particular idealized process of reflection” is a reasonable way to capture “actual preferences,” but I think that’s up for debate—e.g. it could fail if me-on-reflection is selfish and has values opposed to current-me, and certainly it could fail for any particular process of reflection and so it might just happen to be the case that there is no process of reflection that satisfies it.')

Ben Pace (6y)

As far as I know, Paul hasn't explained his choice in detail. One reason he does mention, in this comment, is that in the context of strategy-stealing, preferences like "help me stay in control and be well-informed" do not make sense when interpreted as preferences-as-elicited, since the current user has no way to know if they are in control or well-informed.

I agree this example adds nuance, and I'm unsure how to correctly categorise it.

Ben Pace (6y)

You have a section titled

learning user preferences for corrigibility isn't enough for corrigible behavior

Would this be more consistently titled "Learning narrow preferences for corrigibility isn't enough for corrigible behavior"?

Ben Pace (6y)

I understand Paul to be saying that he hopes corrigibility will fall out if we train an AI to score well on your short-term preferences, not just your narrow preferences.

