Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The purpose of this book is to explain why [superintelligence] might be the last event in human history and how to make sure that it is not... The book is intended for a general audience but will, I hope, be of value in convincing specialists in artificial intelligence to rethink their fundamental assumptions.

Yesterday, I eagerly opened my copy of Stuart Russell's Human Compatible (its title mirroring that of his Center for Human-Compatible AI, where I've worked the past two summers). I've been curious about Russell's research agenda, and also about how Russell argued the case so convincingly as to garner the following endorsements from two Turing Award winners:

Human Compatible made me a convert to Russell's concerns with our ability to control our upcoming creation—super-intelligent machines. Unlike outside alarmists and futurists, Russell is a leading authority on AI. His new book will educate the public about AI more than any book I can think of, and is a delightful and uplifting read.—Judea Pearl

This beautifully written book addresses a fundamental challenge for humanity: increasingly intelligent machines that do what we ask but not what we really intend. Essential reading if you care about our future. —Yoshua Bengio

Bengio even recently lent a reasoned voice to a debate on instrumental convergence!

Bringing the AI community up-to-speed

I think the book will greatly help AI professionals understand key arguments, avoid classic missteps, and appreciate the serious challenge humanity faces. Russell straightforwardly debunks common objections, writing with both candor and charm.

I must admit, it's great to see such a prominent debunking; I still remember, early in my concern about alignment, hearing one professional respond to the entire idea of being concerned about AGI with a lazy ad hominem dismissal. Like, hello? This is our future we're talking about!

But Russell realizes that most people don't intentionally argue in bad faith; he structures his arguments with the understanding and charity required to ease the difficulty of changing one's mind. (Although I wish he'd be a little less sassy with LeCun, understandable as his frustration may be.)

More important than having fish, however, is knowing how to fish; Russell helps train the right mental motions in his readers:

With a bit of practice, you can learn to identify ways in which the achievement of more or less any fixed objective can result in arbitrarily bad outcomes. [Russell goes on to describe specific examples and strategies] (p139)

He somehow explains the difference between the Platonic assumptions of RL and the reality of a human-level reasoner, while also introducing wireheading. He covers the utility-reward gap, explaining that our understanding of real-world agency is so crude that we can't even coherently talk about the "purpose" of, e.g., AlphaGo. He explains instrumental subgoals. These bits are so, so good.

Now for the main course, for those already familiar with the basic arguments:

The agenda

Please realize that I'm replying to my understanding of Russell's agenda as communicated in a nontechnical book for the general public; I also don't have a mental model of Russell personally. Still, I'm working with what I've got.

Here's my summary: reward uncertainty through some extension of a CIRL-like setup, accounting for human irrationality using our scientific knowledge, doing aggregate preference utilitarianism for all of the humans on the planet, discounting people by how well their beliefs map to reality, and perhaps downweighting motivations such as envy (to mitigate the problem of everyone wanting positional goods). One open challenge is deciding which preference-shaping situations the robot should guide us toward (maybe we need meta-preference learning?). Russell also envisions many agents, each working to reasonably pursue the wishes of its owner while being considerate of others.
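
To make the moving parts concrete, here's a minimal sketch of how I picture the aggregation step. This is my own toy formalization, not anything from the book; the names (`belief_accuracy`, `envy_weight`) and numbers are hypothetical stand-ins.

```python
import numpy as np

# Toy sketch of the aggregation step as I read it: a weighted sum of
# individual utilities, discounting each person by how well their beliefs
# track reality and downweighting the "envy-like" (positional) part of
# their preferences. Everything here is illustrative.

def aggregate_value(base_utils, envy_utils, belief_accuracy, envy_weight=0.2):
    base_utils = np.asarray(base_utils, dtype=float)    # non-positional preferences
    envy_utils = np.asarray(envy_utils, dtype=float)    # positional / envy-driven part
    weights = np.asarray(belief_accuracy, dtype=float)  # hypothetical scores in [0, 1]
    per_person = base_utils + envy_weight * envy_utils
    return float(np.dot(weights, per_person))

# Three people; the second holds badly calibrated beliefs and mostly
# positional preferences, so they contribute relatively little.
print(aggregate_value(base_utils=[1.0, 0.2, 0.8],
                      envy_utils=[0.0, 0.9, -0.1],
                      belief_accuracy=[0.9, 0.3, 0.8]))
```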

I'm going to simplify the situation and just express my concerns about the case of one irrational human, one robot.

There's fully updated deference:

One possible scheme in AI alignment is to give the AI a state of moral uncertainty implying that we know more than the AI does about its own utility function, as the AI's meta-utility function defines its ideal target. Then we could tell the AI, "You should let us shut you down because we know something about your ideal target that you don't, and we estimate that we can optimize your ideal target better without you."

The obstacle to this scheme is that belief states of this type also tend to imply that an even better option for the AI would be to learn its ideal target by observing us. Then, having 'fully updated', the AI would have no further reason to 'defer' to us, and could proceed to directly optimize its ideal target.

which Russell partially addresses by advocating that we ensure realizability and avoid feature misspecification by (somehow) allowing for the dynamic addition of previously unknown features (see also Incorrigibility in the CIRL Framework). But supposing we don't have this kind of model misspecification, I don't see how the issue of "the AI simply fully computes the human's policy, updates, and then no longer lets us correct it" is addressed. If you're really confident that computing the human policy lets you just extract the true preferences under the realizability assumptions, maybe this is fine? I suspect Russell has more to say here that didn't make it onto the printed page.
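
To see the worry concretely, here's a toy numerical sketch (my own construction, not Russell's or Eliezer's): an agent uncertain between two reward hypotheses watches a Boltzmann-rational human and does exact Bayesian updating under realizability. Once its posterior is a near-point-mass, further human corrections carry almost no information, so deference buys it nothing.

```python
import numpy as np

# Toy model of "fully updated deference" under realizability: two reward
# hypotheses, a noisily rational human, and an AI that does exact Bayesian
# updating on observed human choices. All details are illustrative.

rng = np.random.default_rng(0)
thetas = ["prefers_A", "prefers_B"]
posterior = np.array([0.5, 0.5])               # prior over reward hypotheses

def human_policy(theta):
    # Boltzmann-rational human: picks the action matching theta more often.
    rewards = np.array([1.0, 0.0]) if theta == "prefers_A" else np.array([0.0, 1.0])
    exps = np.exp(rewards)
    return exps / exps.sum()

true_theta = "prefers_A"
for _ in range(100):                           # AI observes 100 human choices
    action = rng.choice(2, p=human_policy(true_theta))
    likelihood = np.array([human_policy(t)[action] for t in thetas])
    posterior *= likelihood
    posterior /= posterior.sum()

print("posterior:", dict(zip(thetas, posterior.round(4))))

# Remaining uncertainty (entropy) over reward hypotheses. Once this is ~0,
# an additional human correction has almost no information to offer, so the
# fully updated agent has no instrumental reason left to defer to us.
entropy = -(posterior * np.log(posterior + 1e-12)).sum()
print(f"remaining uncertainty: {entropy:.6f} nats")
```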

There's also the issue of getting a good enough human mistake model, and figuring out people's beliefs, all while attempting to learn their preferences (see the value learning sequence).

Now, it would be pretty silly to reply to an outlined research agenda with "but specific problems X, Y, and Z!", because the whole point of further research is to solve problems. However, my concerns are more structural. Certain AI designs are more robust to things going wrong, whether in specification or training, or simply because they rely on fewer assumptions. It seems to me that the uncertainty-based approach is quite demanding, requiring us to get component after component "right enough".

Let me give you an example of something which is intuitively "more robust" to me: approval-directed agency.

Consider a human Hugh, and an agent Arthur who uses the following procedure to choose each action:

Estimate the expected rating Hugh would give each action if he considered it at length. Take the action with the highest expected rating.

Here, the approval-policy does what a predictor says to do at each time step, which is different from maximizing a signal. Its shape feels different to me; the policy isn't shaped to maximize some reward signal (and pursue instrumental subgoals). Errors in prediction almost certainly don't produce a policy adversarial to human interests.
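
In code, this whole policy is just an argmax over a learned approval predictor. Here's a minimal sketch of that reading; `choose_action` and `approval_model` are hypothetical stand-ins of my own, not anything from the book or from Paul Christiano's posts.

```python
from typing import Callable, Sequence

# Minimal sketch of approval-directed action selection, following the
# description quoted above. `approval_model` stands in for a learned
# predictor of the rating Hugh would give after considering the action
# at length.

def choose_action(actions: Sequence[str],
                  approval_model: Callable[[str], float]) -> str:
    # The policy simply takes the action with the highest predicted rating;
    # it is not optimizing a reward signal over future world states.
    return max(actions, key=approval_model)

# Toy usage with a stand-in predictor.
ratings = {"fetch coffee": 0.8, "disable oversight": -1.0, "ask for clarification": 0.9}
print(choose_action(list(ratings), ratings.get))   # -> "ask for clarification"
```

The point of the sketch is only that errors in `approval_model` change which action wins the argmax; they don't reshape the policy into something that pursues instrumental subgoals.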

How does this compare with the uncertainty approach? Let's consider one thing it seems we need to get right:

Where in the world is the human?

How will the agent robustly locate the human whose preferences it's learning, and why do we need to worry about this?

Well, a novice might worry "what if the AI doesn't properly cleave reality at its joints, relying on a bad representation of the world?". But, having good predictive accuracy is instrumentally useful for maximizing the reward signal, so we can expect that its implicit representation of the world continually improves (i.e., it comes to find a nice efficient encoding). We don't have to worry about this - the AI is incentivized to get this right.

However, if the AI is meant to deduce and further the preferences of that single human, it has to find that human. But, before the AI is operational, how do we point to our concept of "this person" in a yet-unformed model whose encoding probably doesn't cleave reality along those same lines? Even if we fix the structure of the AI's model so we can point to that human, it might then have instrumental incentives to modify the model so it can make better predictions.

Why does it matter so much that we point exactly to the human? Well, if we don't, then we're extrapolating the "preferences" of something that is not the person (or a person?) - the predicted human policy in this case seems highly sensitive to the details of the person or entity being pointed to. This seems like it could easily end in tragedy, and (strong belief, weakly held) it doesn't seem like the kind of problem that has a clean solution. This sort of thing seems to happen quite often for proposals which hinge on things-in-ontologies.

Human action models, mistake models, etc. are also difficult in this way, and we have to get them right. I'm not necessarily worried about the difficulties themselves, but that the framework seems so sensitive to them.

Conclusion

This book is most definitely an important read for both the general public and AI specialists, presenting a thought-provoking agenda with worthwhile insights (even if I don't see how it all ends up fitting together). To me, this seems like a key tool for outreach.

Just think: in how many worlds does alignment research benefit from the advocacy of one of the most distinguished AI researchers ever?

Comments

Reading this made me realize a pretty general idea, which we can call "decoupling action from utility".

Consequentialist AI: figure out which action, if carried out, would maximize paperclips; then carry out that action.

Decoupled AI 1: figure out which action, if carried out, would maximize paperclips; then print a description of that action.

Decoupled AI 2: figure out which action, if described to a human, would be approved; then carry out that action. (Approval-directed agent)

Decoupled AI 3: figure out which prediction, if erased by a low probability event, would be true; then print that prediction. (Counterfactual oracle)
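
A crude way to see the pattern (my own illustration of this framing, not an actual design): the planning step can be identical across designs, and only the final step differs. The planner and actuator functions below are hypothetical stubs.

```python
# Crude illustration of "decoupling action from utility": same planner,
# different final step. Nothing here is a real AI component.

def best_action_for(objective: str) -> str:
    # Stand-in for an arbitrarily powerful planner.
    return f"<the plan that maximizes {objective}>"

def execute(action: str) -> None:
    # Stand-in for actuators; deliberately left as a stub.
    print(f"(acting on the world: {action})")

def consequentialist_ai(objective: str) -> None:
    execute(best_action_for(objective))       # plans, then acts directly

def decoupled_ai_1(objective: str) -> None:
    print(best_action_for(objective))         # plans, then only describes the plan

consequentialist_ai("paperclips")
decoupled_ai_1("paperclips")
```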

Any other ideas for "decoupled" AIs, or risks that apply to this approach in general?

(See also the concept of "decoupled RL" from some DeepMind folks.)

I'm all for it! See my post here advocating for research in that direction. I don't think there's any known fundamental problem, just that we need to figure out how to do it :-)

For example, with end-to-end training, it's hard to distinguish the desired "optimize for X then print your plan to the screen" from the super-dangerous "optimize the probability that the human operator thinks they are looking at a plan for X". (This is probably the kind of inner alignment problem that ofer is referring to.)

I proposed here that maybe we can make this kind of decoupled system with self-supervised learning, although there are still many open questions about that approach, including the possibility that it's less safe than it first appears.

Incidentally, I like the idea of mixing Decoupled AI 1 and Decoupled AI 3 to get:

Decoupled AI 5: "Consider the (counterfactual) Earth with no AGIs, and figure out the most probable scenario in which a small group (achieves world peace / cures cancer / whatever), and then describe that scenario."

I think this one would be likelier to give a reasonable, human-compatible plan on the first try (though you should still ask follow-up questions before actually doing it!).

Any other ideas for "decoupled" AIs, or risks that apply to this approach in general?

If the question is about all the risks that apply, rather than special risks with this specific approach, then I'll note that the usual risks from the inner alignment problem seem to apply.

Yes, decoupling seems to address a broad class of incentive problems in safety, which includes the shutdown problem and various forms of tampering / wireheading. Other examples of decoupling include causal counterfactual agents and counterfactual reward modeling.

Another design: imitation learning. Generally, there seems to be a pattern here: policies which aren't selected on the basis of maximizing some kind of return.

The class of non-agent AIs (not choosing actions based on the predicted resulting utility) seems very broad. We could choose actions alphabetically, or use an expert system representing the outside view, or use a biased/inaccurate model when predicting consequences, or include preferences about which actions are good or bad in themselves.

I don't think there's any general failure mode (there are certainly specific ones), but if we condition on this AI being selected by humans, maybe we select something that's doing enough optimization that it will take a highly-optimizing action like rewriting itself to be an agent.

Design ~~2~~ 1 may happen to reply "Convince the director to undecouple the AI design by telling him <convincing argument>." which could convince the operator that reads it and therefore fail as ~~3~~ 2 fails.

Design ~~2~~ 1 may also model distant superintelligences that break out of the box by predictably maximizing paperclips iff we draw a runic circle that, when printed as a plan, convinces the reader or hacks the computer.

Why would such "dual purpose" plans have higher approval value than some other plan designed purely to maximize approval?

Oh, damn it, I mixed up the designs. Edited.

Can’t quite read your edit, did you mean 3?

[This comment is no longer endorsed by its author]

Yeah, then I agree with both points. Sneaky!

FWIW, this reminds me of Holden Karnofsky's formulation of Tool AI (from his 2012 post, Thoughts on the Singularity Institute):

Another way of putting this is that a "tool" has an underlying instruction set that conceptually looks like: "(1) Calculate which action A would maximize parameter P, based on existing data set D. (2) Summarize this calculation in a user-friendly manner, including what Action A is, what likely intermediate outcomes it would cause, what other actions would result in high values of P, etc." An "agent," by contrast, has an underlying instruction set that conceptually looks like: "(1) Calculate which action, A, would maximize parameter P, based on existing data set D. (2) Execute Action A." In any AI where (1) is separable (by the programmers) as a distinct step, (2) can be set to the "tool" version rather than the "agent" version, and this separability is in fact present with most/all modern software. Note that in the "tool" version, neither step (1) nor step (2) (nor the combination) constitutes an instruction to maximize a parameter - to describe a program of this kind as "wanting" something is a category error, and there is no reason to expect its step (2) to be deceptive.

If I understand correctly, his "agent" is your Consequentialist AI, and his "tool" is your Decoupled AI 1.

I can't make sense of 3. Most predictions' truth is not contingent on whether they have been erased or not. Stipulations are. Successful action recommendations are stipulations. How does any action recommendation get through that?

You can read about counterfactual oracles in this paper. Stuart also ran a contest on LW about them.

Decoupled AI 4: figure out which action will reach the goal, without affecting outside world (low-impact AI)

I don't think that low impact is decoupled, and it might be misleading to view it from that frame / lend a false sense of security. The policy is still very much shaped by utility, unlike approval.

Risks: Any decoupled AI "wants" to be coupled. That is, it will converge to the solutions which will actually affect the world, as they will provide highest expected utility.

I agree for 3, but not for 2.

But, having good predictive accuracy is instrumentally useful for maximizing the reward signal, so we can expect that its implicit representation of the world continually improves (i.e., it comes to find a nice efficient encoding). We don't have to worry about this - the AI is incentivized to get this right.

The AI is incentivized to get this right only in directions that increase approval. If the AI discovers something the human operator would disapprove of learning, it is incentivized to obscure that fact or act as if it didn't know it. (This works both for "oh, here's an easy way to kill all humans" and "oh, it turns out God isn't real.")

yes, but its underlying model is still accurate, even if it doesn't reveal that to us? I wasn’t claiming that the AI would reveal to us all of the truths it learns.

Perhaps I misunderstand your point.

yes, but its underlying model is still accurate, even if it doesn't reveal that to us?

This depends on whether it thinks we would approve more of it having an accurate model and deceiving us, or of it having a model that is inaccurate in just the ways we want it to be less accurate. Some algorithmic bias work is of the form "the system shouldn't take in inputs X, or draw conclusions Y, because that violates a deontological rule, and simple accuracy-maximization doesn't incentivize following that rule."

My point is something like "the genius of approval-directed agency is that it grounds out every meta-level in 'approval,' but this is also (potentially) the drawback of approval-directed agency." Specifically, for any potentially good property the system might have (like epistemic accuracy) you need to check whether that actually in-all-cases for-all-users maximizes approval, because if it doesn't, then the approval-directed agent is incentivized to not have that property.

[The deeper philosophical question here is something like "does ethics backchain or forwardchain?", as we're either grounding things out in what we will believe or in what we believe now, and approval-direction is more the latter, and CEV-like things are more the former.]

Note that I wasn’t talking about approval directed agents in the part you originally quoted. I was saying that normal maximizers will learn to build good models as part of capability generalization.

Oh! Sorry, I missed the "How does this compare with" line.

Where in the world is the human?

I'm a bit confused because you're citing this in comparison with approval-directed agency, but doesn't approval-directed agency also have this problem?

Human action models, mistake models, etc. are also difficult in this way

Approval-directed agency also has to correctly learn or specify what "considered it at length" means (i.e., to learn/specify reflection or preferences-for-reflection) and a baseline model of the current human user as a starting point for reflection, so it's not obvious to me that it's much more robust.

I think overall I do lean towards Paul’s approach though, if only because I understand it a lot better. I wonder why Professor Russell doesn’t describe his agenda in more technical detail, or engage much with the technical AI safety community, to the extent that even grad students at CHAI apparently do not know much about his approach. (Edit: This last paragraph was temporarily deleted while I consulted with Rohin then added back.)

I wonder why Professor Russell doesn't describe his agenda in more technical detail, or engage much with the technical AI safety community, to the extent that even grad students at CHAI apparently do not know much about his approach.

For the sake of explaining this: for quite a while, he's been engaging with academics and policymakers, and writing a book; it's not that he's been doing research and not talking to anyone about it.

Fyi, when you quote people who work at an organization saying something that has a negative implication about that organization, you make it less likely that people will say things like that in the future. I'm not saying that you did anything wrong here; I just want to make sure that you know of this effect, and that it does make me in particular more likely to be silent the next time you ask about CHAI rather than responding.

Clarification: For me, the general worry is something like "if I get quoted, I need to make sure that it's not misleading (which can happen even if the person quoting me didn't mean to be misleading), and that takes time and effort and noticing all the places where I'm quoted, and it's just easier to not say things at all".

(Other people may have more worries, like "If I say something that could be interpreted as being critical of the organization, and that becomes sufficiently well-publicized, then I might get fired, so I'll just never say anything like that.")

Note: I've only started to delve into the literature about Paul's agenda, so these opinions are lightly held.

Before I respond to specific points, recall that I wrote

I'm not necessarily worried about the difficulties themselves, but that the [uncertainty] framework seems so sensitive to them.

and

the approval-policy does what a predictor says to do at each time step, which is different from maximizing a signal. Its shape feels different to me; the policy isn't shaped to maximize some reward signal (and pursue instrumental subgoals). Errors in prediction almost certainly don't produce a policy adversarial to human interests.

The approval agent is taking actions according to the output of an ML-trained approval predictor; the fact that the policy isn't selected to maximize a signal is critical, and part of why I find approval-based methods so intriguing. There's a very specific kind of policy you need in order to pursue instrumental subgoals, which is reliably produced by maximization, but which otherwise seems to be vanishingly unlikely.

I'm a bit confused because you're citing this in comparison with approval-directed agency, but doesn't approval-directed agency also have this problem?

The contrast is in failing gracefully, not (necessarily) in the specific problems.

In addition to the above (even if approval-directed agents have this problem, this doesn't mean disaster, just reduced performance), my understanding is that approval doesn't require actually locating the person, just imitating the output of their approval after reflection. This should be able to be trained in the normal fashion, right? (see the learning from examples section)

Suppose we train the predictor Approval using examples and high-powered ML. Then we have the agent take the action most highly rated by Approval at each time step. This seems to fail much more gracefully as the quality of Approval degrades?

Why does it matter so much that we point exactly to be human?

Should that be "to the human" instead of "to be human"? Wan't sure if you meant to say simply that, or if more words got dropped.

Or maybe it was supposed to be: "matter so much that what we point exactly to be human?"

Uh, the former - looks like I didn’t catch the dictation mistake.

Also, did you mean “wasn’t”? :)

Also, did you mean “wasn’t”? :)

Lol, you got me.

Here's my summary: reward uncertainty through some extension of a CIRL-like setup, accounting for human irrationality through our scientific knowledge, doing aggregate preference utilitarianism for all of the humans on the planet, discounting people by how well their beliefs map to reality, perhaps downweighting motivations such as envy (to mitigate the problem of everyone wanting positional goods).

Perhaps a dumb question, but is "reward" being used as a noun or verb here? Are we rewarding uncertainty, or is "reward uncertainty" a goal we're trying to achieve?

As a noun: "reward uncertainty" refers to uncertainty about how valuable various states of the world are, and usually also implies some way of updating beliefs about that based on something like 'human actions', under the assumption that humans to some degree/in some way know which states of the world are more valuable and act accordingly.