Book review: Human Compatible, by Stuart Russell.

Human Compatible provides an analysis of the long-term risks from artificial intelligence, by someone with a good deal more of the relevant prestige than any prior author on this subject.

What should I make of Russell? I skimmed his best-known book, Artificial Intelligence: A Modern Approach, and got the impression that it taught a bunch of ideas that were popular among academics, but which weren't the focus of the people who were getting interesting AI results. So I guessed that people would be better off reading Deep Learning by Goodfellow, Bengio, and Courville instead. Human Compatible neither confirms nor dispels the impression that Russell is a bit too academic.

However, I now see that he was one of the pioneers of inverse reinforcement learning, which looks like a fairly significant advance that will likely become important someday (if it hasn't already). So I'm inclined to treat him as a moderately good authority on AI.

The first half of the book is a somewhat historical view of AI, intended for readers who don't know much about AI. It's ok.

Key proposals

Russell focuses a moderate amount on criticizing what he calls the standard model of AI, in which someone creates an intelligent agent, and then feeds it a goal or utility function.

I'm not too clear how standard that model is. It's not like there's a consensus of experts who are promoting it as the primary way to think of AI. It's more like people find the model to be a simple way to think about goals when they're being fairly abstract. Few people seem to be defending the standard model against Russell's criticism (and it's unclear whether Russell is claiming they are doing so). Most of the disagreements in this area are about which questions we should be asking, rather than about how to answer the questions that Russell asks.

Russell gives a fairly cautious overview of why AI might create risks that are as serious as the risks gorillas face from humans. Then he outlines an approach that might avoid those risks, using these three rules:

  1. The machine's only objective is to maximize the realization of human preferences.
  2. The machine is initially uncertain about what those preferences are.
  3. The ultimate source of information about human preferences is human behavior.

Note that these are high-level guidelines for researchers; he's not at all claiming they're rules that are ready to be written into an AI.
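
As a toy illustration of how the three rules fit together (my own sketch, not anything Russell specifies), suppose the machine keeps a probability distribution over a couple of hypothetical preference profiles, updates it by Bayes' rule when it sees a human make a choice, and then acts to maximize expected preference satisfaction. The candidate profiles, the softmax (Boltzmann) choice model, and the `rationality` parameter are all assumptions invented for the example.

```python
import math

# Hypothetical preference profiles (utilities over outcomes); illustrative only.
CANDIDATES = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0, "nothing": 0.2},
    "likes_tea": {"coffee": 0.0, "tea": 1.0, "nothing": 0.2},
}


def choice_likelihood(utility, chosen, options, rationality=3.0):
    """P(human picks `chosen` from `options`) under an assumed softmax choice model."""
    weights = {o: math.exp(rationality * utility[o]) for o in options}
    return weights[chosen] / sum(weights.values())


def update(belief, chosen, options):
    """Rule 3: revise the belief over profiles after observing one human choice."""
    unnormalized = {
        name: p * choice_likelihood(CANDIDATES[name], chosen, options)
        for name, p in belief.items()
    }
    total = sum(unnormalized.values())
    return {name: w / total for name, w in unnormalized.items()}


def best_action(belief, actions):
    """Rule 1: pick the action with the highest expected utility under the belief."""
    def expected_utility(action):
        return sum(p * CANDIDATES[name][action] for name, p in belief.items())
    return max(actions, key=expected_utility)


# Rule 2: the machine starts out uncertain about which profile is true.
prior = {"likes_coffee": 0.5, "likes_tea": 0.5}

# The human is seen choosing tea over coffee once.
posterior = update(prior, chosen="tea", options=["coffee", "tea"])
action = best_action(posterior, ["coffee", "tea", "nothing"])
print(posterior, action)
```

The hard part that this sketch skips is exactly what I worry about below: here the set of candidate preferences and the meaning of "a human choice" are handed to the program, whereas a real system would have to learn them.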

Russell complains that the AI community has ignored the possibility of creating AIs that are uncertain about their objective, and calls that "a huge blind spot".

I'm unclear on whether this qualifies as a blind spot. I can imagine a future in which it's important. But for AI as it exists today, it looks like uncertainty would add complexity, without producing any clear benefit. So I think it has been appropriate for most AI researchers to have postponed analyzing it so far.

An aside: Russell points out that uncertainty provides an interesting way to avoid wireheading: if the reward is defined so that it can't be observed directly, then the AI will know that hacking its own reward signal won't create more brownie points in heaven.


Russell is fairly convincing in his claim that AIs which are designed according to his rules will be relatively safe. That's a much better achievement than most authors manage on this topic.

I'm a bit less convinced that this approach is easy enough to implement that it will be competitive with other, possibly less safe, approaches.

Some of my doubt derives from the difficulty, using current techniques, of encoding the relevant kind of abstract objectives into an AI.

The objectives that Russell wants don't look much like the kind of objectives that AI researchers know how to put into an AI.

It's fairly well known how to give an AI objectives either by using a large number of concrete examples of the "correct" result, or by specifying a readily quantifiable reward. Even a dilettante such as myself knows the basics of how to go about either of those approaches.

In contrast, it's unclear how to encode an objective that depends on high-level concepts such as "human" or "preference that is inferred from behavior" without the AI already having done large amounts of learning.

Maybe there's some way to use predictions about observed preferences as if the predictions quantified the actual objective? That looks partly right. But how do we tell the AI that the predictions aren't the real objective? If we don't succeed at that, we risk something like the King Midas problem: a naive new AI might predict that King Midas's preferences will be better satisfied if everything he touches turns to gold. But if that prediction becomes the AI's objective, then the AI will resist learning that the King regrets his new ability, since that might interfere with its objective of turning anything he touches into gold.
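
Here's a minimal sketch of that failure mode, using made-up numbers. One agent has frozen its prediction into an objective ("produce gold") and scores the option of listening to the King by that objective. The other still treats the King's true preference as uncertain. The probability and utility values are hypothetical, chosen only to show the direction of the effect.

```python
# Hypothetical numbers, for illustration only.
P_WANTS_GOLD = 0.7   # agent's belief that the King really wants the golden touch
U_WANTS = 1.0        # King's utility if the golden touch continues and he wants it
U_REGRETS = -10.0    # King's utility if it continues and he regrets it
U_STOP = 0.0         # King's utility if the golden touch is switched off

ACTIONS = ["listen", "ignore"]


def fixed_objective_value(action):
    """Agent whose objective is the prediction itself: expected gold produced."""
    if action == "ignore":
        return 1.0                 # gold production continues for certain
    # Listening risks hearing "stop", which would reduce gold production.
    return P_WANTS_GOLD * 1.0


def uncertain_objective_value(action):
    """Agent whose objective is the King's actual, still uncertain, preference."""
    if action == "ignore":
        return P_WANTS_GOLD * U_WANTS + (1 - P_WANTS_GOLD) * U_REGRETS
    # Listening lets the agent stop in the case where the King regrets the gift.
    return P_WANTS_GOLD * U_WANTS + (1 - P_WANTS_GOLD) * U_STOP


def choose(value_fn):
    """Return the action that the given value function rates highest."""
    return max(ACTIONS, key=value_fn)


print(choose(fixed_objective_value))      # agent with the frozen objective
print(choose(uncertain_objective_value))  # agent that stays uncertain
```

With these numbers the first agent prefers to ignore the King and the second prefers to listen. That says nothing about how to build the second kind of agent, which is the part I'm unsure about.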

AI researchers have likely not yet tried to teach their systems about hard-to-observe concepts such as utopia, or heaven. Teaching an AI to value not-yet-observed preferences seems hard in roughly the same way. It seems to require using a much more sophisticated language than is currently used to encode objectives.

I'll guess that someone would need to hard code many guesses about what human preferences are, to have somewhere to start, otherwise it's unclear how the AI would initially prefer any action over another. How is it possible to do that without the system already having learned a lot about the world? And how is it possible for the AI to start learning without already having some sort of (possibly implicit) objective?

Is there some way to start a system with a much easier objective than maximizing human preferences, then switch to Russell's proposed objective after the system understands concepts such as "human" and "preference"? How hard is it to identify the right time to do that?

I gather that some smart people believe some of these questions need to be tackled head on. My impression is that most of those people think AI safety is a really hard problem. I'm unclear on how hard Russell thinks AI safety is.

It's quite possible that there are simple ways to implement Russell's rules, but I'm moderately confident that doing so would require a fairly large detour from what looks like the default path to human-level AI.

Compare Russell's approach to Drexler's ideas of only putting narrow, short-term goals into any one system. (I think Drexler's writings were circulating somewhat widely before Russell finished writing Human Compatible, but maybe Russell finished his book before he could get access to Drexler's writings.)

If Drexler's approach is a good way to generate human-level AI, then I expect it to be implemented sooner than Russell's approach will be implemented.

Still, we're at a stage where generating more approaches to AI safety seems more valuable than deciding which one is best. Odds are that the researchers who actually implement the first human-level AIs will have better insights than we do into which approaches are most feasible. So I want to encourage more books of this general nature.

Russell's rules show enough promise to be worth a fair amount of research, but I'm guessing they only have something like a 5% or 10% chance of being a good solution to AI risks.


Russell's ideas often sound closer to those of Bostrom and MIRI than to those of mainstream AI, yet he dismisses recursive self-improvement and fast takeoff. His reasons sound suspicious - I can't tell whether he's got good intuitions that he has failed to explain, or whether he ignores those scenarios because they're insufficiently mainstream.

Russell makes the strange claim that, because existing AI is poor at generalizing across domains,

"when people talk about 'machine IQ' increasing rapidly and threatening to exceed human IQ, they are talking nonsense."

But Russell seems to take the opposite position 100 pages later, when he's dismissing Kevin Kelly's The Myth of a Superhuman AI. I'm disappointed that Russell didn't cite the satire of Kelly that argues against the feasibility of bigger-than-human machines.

Russell has a strange response to Bostrom's proposal to use one good AI to defend against any undesirable AIs. Russell says that we'd end up "huddling in bunkers" due to the "titanic forces" involved in battles between AIs. Yet Bostrom's position is clearly dependent on the assumption of a large power asymmetry between the dominant AI (or possibly a dominant coalition of AIs?) and any new bad AI - why would there be much of a battle? I'd expect something more like Stuxnet.

There are lots of opinions about how much power disparity there will be between the most powerful AI and a typical new AI, and no obvious way to predict which one is correct. Russell says little about this issue.

But suppose such battles are a big problem. How is this concern specific to Bostrom's vision? If battles between AI are dangerous to bystanders, what's the alternative to good AI(s) fighting bad AIs? Does someone have a plan to guarantee that nobody ever creates a bad AI? Russell shows no sign of having such a plan. Russell might be correct here, but if so, the issue deserves more analysis than Russell's dismissal suggests.


Russell concludes with a philosophical section that tackles issues relating to morality.

It includes some good thoughts about the difficulties of inferring preferences, and some rather ordinary ideas about utilitarianism, including some standard worries about the repugnant conclusion.

Here's one of Russell's stranger claims:

"in a sense, all humans are utility monsters relative to, say, rats and bacteria, which is why we pay little attention to the preferences of rats and bacteria in setting public policy."

Is that why we ignore their preferences? My intuition says it's mostly because we're selfish and not trying to cooperate with them. I don't think I'm paying enough attention to their preferences to have figured out whether we're utility monsters compared to them.


I'll end with a more hopeful note (taken from right after Russell emphasizes that machines won't imitate the behavior of people they observe):

"It's possible, in fact, that if we humans find ourselves in the unfamiliar situation of dealing with purely altruistic entities on a daily basis, we may learn to be better people ourselves - more altruistic and less driven by pride and envy."

Human Compatible will be somewhat effective at increasing the diversity of AI safety research, while heading off risks that AI debate will polarize into two tribes.

See also this review from someone who, unlike me, is doing real AI safety research.

2 comments

Copying over a comment:

“I’m not too clear how standard that model is. It’s not like there’s a consensus of experts who are promoting it as the primary way to think of AI. It’s more like people find the model to be a simple way to think about goals when they’re being fairly abstract. Few people seem to be defending the standard model against Russell’s criticism (and it’s unclear whether Russell is claiming they are doing so).”

It’s not that AI researchers are saying “clearly we should be writing down an objective function that captures our goal with certainty”. It’s that if you look at the actual algorithms that the field of AI produces, nearly all of them assume the existence of some kind of specification that says what the goal is, because that is just the way that you do AI research. There wasn’t a deliberate decision to use this “standard model”; but given that all the work produced does fit in this standard model, it seems pretty reasonable to call it “standard”.

This is not specific to deep learning — it also applies to traditional AI algorithms like search, constraint satisfaction, logic, reinforcement learning, etc. The one exception I know of is the field of human-robot interaction, which has grappled with the problem that objectives are hard to write down.

Thanks for this perspective! I really should get around to reading this book...

Have you ever played the game Hanabi? Some of the statements you make imply, "why would he say them otherwise?" style, that your error bars aren't big enough.

So, depending on how you feel about statements like, e.g., "Human Compatible neither confirms nor dispels the impression that Russell is a bit too academic", I think you should either widen your error bars, or do a better job of communicating wide error bars.