Richard Ngo writes:

Since Stuart Russell's proposed alignment solution in Human Compatible is the most publicly-prominent alignment agenda, I should be more explicit about my belief that it almost entirely fails to address the core problems I expect on realistic pathways to AGI.

Specifying an update rule which converges to a desirable goal is just a reframing of the problem of specifying a desirable goal, with the "uncertainty" part as a red herring. https://arbital.com/p/updated_deference/ In other words, Russell gives a wrong-way reduction.

I originally included CIRL in my curriculum (https://docs.google.com/document/d/1mTm_sT2YQx3mRXQD6J2xD2QJG1c3kHyvX8kQc_IQ0ns/edit?usp=drivesdk) out of some kind of deference-to-the-academic-mainstream instinct. Probably a mistake; my current annoyance about deferential thinking has reminded me to take it out.

Howie writes:

My impression is that ~everyone I know in the alignment community is very pessimistic about SR's agenda. Does it sound right that your view is basically a consensus? (There's prob some selection bias in who I know).

Richard responds:

I think it's fair to say that this is a pretty widespread opinion. Partly it's because Stuart is much more skeptical of deep learning (and even machine learning more generally!) than almost any other alignment researcher, and so he's working in a different paradigm.

Is Richard correct, and if so, why? (I would also like a clearer explanation of why Richard is skeptical of Stuart's agenda. I agree that the reframing doesn't completely solve the problem, but I don't understand why it can't be a useful piece.)



His hostility to the program, as I understand it, is that CIRL doesn't answer the question of how to specify a learning procedure that would go from observations of a human being to a correct model of that human being's utility function. This is the hard part of the problem. This is why he says "specifying an update rule which converges to a desirable goal is just a reframing of the problem of specifying a desirable goal, with the 'uncertainty' part a red herring".

One of the big things that CIRL was claimed to have going for it is that this uncertainty about what the true reward function was would lead to deferential properties which would lead to a more corrigible system (it would let you shut it down for example). This doesn't seem like it holds up because a CIRL agent would probably eventually stop treating you as a source of new information once it had learned a lot from you, at which point it would stop being deferential. 

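To make the fully-updated-deference worry concrete, here is a toy sketch (mine, not from the thread; the hypotheses, likelihoods, and threshold are all made-up illustrative numbers). A Bayesian agent with reward uncertainty consults the human only while its posterior over reward functions is uncertain; once the posterior concentrates, further human input carries essentially no information and the deference goes away.

```python
# Toy model of "fully updated deference": two candidate reward functions,
# a Bayesian update on noisy human signals, and a deference rule that
# consults the human only while the posterior is still uncertain.

# The agent starts maximally uncertain between two reward hypotheses.
posterior = {"reward_A": 0.5, "reward_B": 0.5}

# Likelihood of each human signal under each hypothesis (assumed numbers).
likelihood = {
    "prefers_A": {"reward_A": 0.9, "reward_B": 0.1},
    "prefers_B": {"reward_A": 0.1, "reward_B": 0.9},
}

def update(posterior, signal):
    """Standard Bayesian update on one observation of human behavior."""
    unnorm = {h: p * likelihood[signal][h] for h, p in posterior.items()}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def defers(posterior, threshold=0.99):
    """The agent treats the human as informative only while uncertain."""
    return max(posterior.values()) < threshold

signals = ["prefers_A"] * 5  # the human consistently signals hypothesis A
for step, signal in enumerate(signals):
    if not defers(posterior):
        print(f"step {step}: posterior settled; agent stops consulting the human")
        break
    posterior = update(posterior, signal)
    print(f"step {step}: P(reward_A) = {posterior['reward_A']:.4f}")
```

After three consistent signals the posterior exceeds the threshold, and on the next step the agent stops treating the human as a source of new information: the corrigibility was a byproduct of uncertainty, not a stable property.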

But maybe continuing to be deferential (in many/most situations) would be part of the utility function it converged towards? Not saying this consideration refutes your point, but it is a consideration.

(I don't have much of an opinion regarding the study-worthiness of CIRL, btw, and I know very little about CIRL.)

CIRL, like similar procedures, relies on having a satisfactory model of how the human's preferences ultimately relate to real-world observations. We do not have this. Also, the inference process scales impractically as you make the environment bigger and longer-running. So even if you like CIRL (which I do), it's not a solution; it's a first step in a direction that has lots of unsolved problems.

CIRL lacks many properties that have been proposed as corrigibility goals. But I just want an AI that does good things and not bad things. Fully updated deference is not a sine qua non. (Though other people are probably more attached to it than I.)


I think CIRL is pretty promising as an alignment approach for certain approaches to building AGI (notably though, not promising for scaled-up deep learning).

I also think most of the reasons people give for being skeptical of CIRL (including everything currently on this post) are pretty bad.

I'm not going to defend this view here, when I've tried in the past it hasn't made any difference.

Do you have available URLs to comments/posts where you have done so in the past?

I've done it most via in-person conversations and private Slacks, but here's one. I also endorse Paul's comment.

Why isn't it promising for scaled-up deep learning specifically, and what kind of approach might it be promising for?

Copying a private comment I wrote recently:

I broadly agree that assistance games are a pretty great framework. The main reason I don't work on them is that they don't seem to work as a solution if you expect AGI via scaled-up deep learning. (Whereas I'd be pretty excited about pushing forward on them if it looked like we were getting AGI via things like explicit hierarchical planning or search algorithms.)

The main difference in the deep learning case is that with scaled-up deep learning, it looks like you are doing a search over programs for one that performs well on your loss function, and the intelligent thing is the learned program, as opposed to the search that found it. If you wanted assistance-style safety, then the learned program would need to reason in an assistance-like way (i.e. maintain uncertainty over what the humans want, and narrow down that uncertainty by observing human behavior).

But then you run into a major problem, which is that we have no idea how to design the learned program, precisely because it is learned — all we do is constrain the behavior of the learned program on the particular inputs that we trained on, and there are many programs you could learn that have that behavior, some of which reason in a CIRL-like way and some of which don’t. (If you then try to solve this problem, you end up regenerating many of the directions that other alignment people work on.)
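The underdetermination point can be made concrete with a toy sketch (mine, not Rohin's; the "programs" and training set are deliberately trivial stand-ins for learned models): training only pins down behavior on the training inputs, and many distinct programs share that behavior while diverging elsewhere.

```python
# Two "programs" that are indistinguishable on the training distribution
# but behave differently off-distribution, illustrating that a loss
# function over training inputs does not determine how a learned
# program reasons internally.

train_inputs = [0, 1, 2, 3]

def program_a(x):
    # One program consistent with the training data.
    return 2 * x

def program_b(x):
    # A different program: identical on every training input,
    # but it does something else on novel inputs.
    return 2 * x if x in train_inputs else -1

# Training behavior alone cannot tell these apart...
assert all(program_a(x) == program_b(x) for x in train_inputs)

# ...yet they diverge on an input outside the training set.
print(program_a(10), program_b(10))  # 20 vs -1
```

By analogy, a CIRL-style learned program and a non-CIRL-style one can fit the same training behavior, so getting the assistance-like reasoning on the inside is an additional, unsolved problem.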

I'm not sure why Rohin thinks the arguments against CIRL are bad, but I wrote a post today on why I think the argument from fully updated deference / corrigibility is weak. I also found Paul Christiano's response very helpful as an outline of objections to the utility uncertainty agenda.

Also relevant is this old comment from Rohin on difficulties with utility uncertainty.

I also just remembered this comment, which is more recent and has more details. Also I agree with Paul's response.

If you had a defense of the idea, or a link to one I could read, I would be very interested to hear it. I wasn't trying to be dogmatically skeptical. 

Responded above