Epistemic status: I am just learning about alignment and just read Human Compatible. Below is a summary of the paradigm Stuart Russell outlines for aligning AI in the last third of the book, and the questions I have about this project as a new reader.

AI researcher Stuart Russell’s 2019 book Human Compatible is one of the most popular and widely-circulated books on AI alignment right now. It argues that we need to change the paradigm of AI development in general from what he calls goal-directed behavior, where a machine optimizes a reward function written by humans, to behavior that attempts to learn and follow human objectives.

Russell provides a set of general design principles that someone building this kind of AI would have to follow, quoted directly below:

  1. The machine’s only objective is to maximize the realization of human preferences.
  2. The machine is initially uncertain about what those preferences are.
  3. The ultimate source of information about human preferences is human behavior.

He makes sure to state that he means “preferences” in the way an economist would define them: a “preference” is whatever, given a set of options, a person would rather choose. He distinguishes this from optimizing for “happiness” in a utilitarian framework, which might require some metric of how happy people are that doesn’t just depend on people’s choices, and from optimizing for some other ethical system encoded by a human. So, instead of having to get an objective right the first time, which is insanely hard, you have an AI that uses machine learning to figure out what a person, or people, actually want, by observing their behavior.

Russell connects this to work he and others have done in Inverse Reinforcement Learning (IRL), which is machine learning that observes an agent’s behavior and infers the reward function the agent is optimizing for. For example, Russell describes observing a cricket walking on various terrains and wanting to understand its behavior well enough to emulate it. To do that, he first needed to determine the cricket’s reward function for walking: what feedback did the cricket seem to be using to decide it was walking properly? The process basically starts with a prior over possible reward functions for the agent and adjusts until the inferred reward function properly predicts the agent’s behavior. That reward function can then be used to predict the agent’s behavior in novel situations.
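To make that loop concrete, here is a minimal toy sketch of the idea in Python. It is my own illustration, not Russell’s actual cricket model: the terrains, the candidate reward values, and the softmax choice model are all assumptions made up for the example.

```python
# Toy IRL sketch: guess a reward function, score how well it predicts the
# observed choices, and keep the guess that explains the behavior best.
import itertools
import math

terrains = ["sand", "gravel", "grass"]
# Observed behavior: the agent repeatedly picks a terrain to walk on.
observations = ["grass", "grass", "sand", "grass", "sand", "grass"]

def choice_probability(reward, choice):
    """Softmax choice model: higher-reward options get chosen more often."""
    total = sum(math.exp(reward[t]) for t in terrains)
    return math.exp(reward[choice]) / total

def log_likelihood(reward):
    """How well this candidate reward function predicts the observed choices."""
    return sum(math.log(choice_probability(reward, obs)) for obs in observations)

# Prior guesses: every assignment of reward values in {0, 1, 2} to the terrains.
candidates = [dict(zip(terrains, values))
              for values in itertools.product([0.0, 1.0, 2.0], repeat=3)]

best = max(candidates, key=log_likelihood)
print("Inferred reward function:", best)
# The inferred rewards can then be used to predict behavior on terrain
# combinations we never observed.
```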

In the context of a human and an AI, that means something like this: in Russell’s smaller-scale example of a personal assistant robot, the robot observes whether the human it’s assisting prefers a window or an aisle seat on an airplane, and adjusts accordingly. For this setting, Russell has outlined a process he calls Cooperative Inverse Reinforcement Learning (CIRL), where the AI learns the person’s reward function at the same time as it attempts to satisfy their preferences. In other words, while plain IRL just tries to find the reward function that explains an agent’s behavior so it can emulate that behavior, the CIRL robot is learning the reward function in order to help the human, while trying to help the human in real time. Russell argues that this assistant AI, when uncertain about the person’s preferences, would ask the person about them, and would allow itself to be switched off, because it benefits from giving the person options and collecting new information when it is uncertain.
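Here is a tiny sketch of why uncertainty pushes the robot toward deferring. It is my own toy calculation, loosely inspired by the CIRL / off-switch argument rather than Russell’s formal model; the utility values and the asking cost are made-up assumptions.

```python
# Toy calculation: an assistant robot deciding whether to book a seat outright
# or to check with the human first (asking plays the same role as letting
# itself be switched off: it hands control back when it is unsure).

def expected_utility_act(p_window):
    """Book a window seat outright: right with probability p_window, wrong otherwise."""
    utility_right, utility_wrong = 1.0, -1.0
    return p_window * utility_right + (1 - p_window) * utility_wrong

def expected_utility_ask(asking_cost=0.1):
    """Ask first, then book whatever the human says they want."""
    return 1.0 - asking_cost

for p in (0.55, 0.9, 0.99):
    act, ask = expected_utility_act(p), expected_utility_ask()
    choice = "ask the human" if ask > act else "act on its own"
    print(f"P(prefers window) = {p:.2f}: robot should {choice} "
          f"(act={act:.2f}, ask={ask:.2f})")
# When the robot is unsure, deferring has higher expected value, because the
# human's response carries information about the very preferences the robot
# is trying to satisfy. Only when it is nearly certain does acting alone win.
```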

Russell thinks this kind of model is hard to extend to many people at once, but that if done successfully, it would do things people want it to do and would not do something catastrophic like killing everyone. He says the ultimate end of this kind of research would be to create a system that uses IRL to determine objectives, and follows those objectives, in a way that is “provably beneficial.” By “provably beneficial” he means that you can write, from the basic principles of the system, a proof that with high probability (in the case of ML, as he says, you can’t get a hard guarantee) the system will do what people want and not kill everyone.
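Purely as my own schematic gloss (this is not a formula from the book), such a statement might take a PAC-style shape:

\[ \Pr\big[\, U_{\text{human}}(\pi_{\text{learned}}) \;\ge\; U_{\text{human}}(\pi^{*}) - \epsilon \,\big] \;\ge\; 1 - \delta \]

where \(\pi_{\text{learned}}\) is the policy the machine settles on after observing human behavior, \(\pi^{*}\) is the best policy under the human’s true preferences, and \(\epsilon\) and \(\delta\) shrink as the machine gathers more evidence.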

I’m all for a clear research agenda that involves AI not killing anyone, but this one raises a number of fears in me about where it could go wrong. Below I’ve outlined several questions that seem to me to need answers for this to work, along with some of Russell’s responses to them.

1. If you have a personal assistant AI that uses this model to satisfy only its owner’s preferences, couldn’t it both get really powerful and follow only its owner’s preferences, making it really likely to do a bunch of unethical things to other people in service of its owner?

Russell answers yes, it probably would. Even if you prevented it from breaking laws, you probably couldn’t prevent it from exploiting loopholes to do something crazy in pursuit of, say, obtaining horn from an endangered African rhino for its owner, or getting money for its owner unethically. This is basically a big warning for anyone considering using CIRL or something similar to sell intelligent personal assistants on the market, and it makes me think you’d want to heavily regulate, or outright prevent, anyone who tried to do this.

Russell’s response to this problem is that you should make an AI that takes into account literally everyone’s preferences, and optimizes for an average/linear combination of what everyone wants.
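Concretely, I read this as something like a weighted sum of everyone’s utilities (the notation below is mine, not a formula from the book):

\[ U_{\text{social}}(x) \;=\; \sum_{i=1}^{N} w_i \, U_i(x) \]

with equal weights \(w_i = 1/N\) in the simplest case, and the machine choosing whatever \(x\) maximizes \(U_{\text{social}}\). Question 4 below is essentially about where the scales of the \(U_i\) and the weights \(w_i\) are supposed to come from.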

2. What if what everyone wants is morally wrong (e.g. factory farming, not caring about climate change, etc.)?

Russell responds to this objection by basically saying: too bad, deal with it; it’s better than one programmer imposing their ideals on everyone else, and as our ideas change and become more enlightened, the AI’s behavior will change along with them.

3. OK, I’m still not sure you can understand a person’s preferences so clearly: people want one thing one minute, and another thing the next, and are often irrational. Do you optimize for what they want in a given time slice? Do you optimize for what they tell you they want their life to look like in ten years? When they die?

Russell tries to solve this problem by saying that the AI should think of a person’s preferences in terms of whole future lives: if the person were shown a movie of each possible life from start to finish, which would they pick? One issue is that the different lives a person could live span so many possibilities that no person could conceptualize all of them fully, and probably even the most complex computer we could build (which would have insanely more processing power than we can ourselves imagine) still couldn’t choose between literally all future atomic states of a human life. It would have to narrow the possibilities and abstract things a bit, and where do you start the abstraction?

There’s also the issue that people have day-to-day instrumental preferences as well as larger-scale overarching preferences, and it’s hard to tell them apart. Russell says that to really figure this out we’re going to have to get a better understanding of human psychology, and that if we (and then the AI) do, we’ll be able to get a picture of true human preferences that we just don’t grasp yet. This seems kinda sketchy to me.

4. How do you compare people’s preferences? How do you weigh preferences across future lives?

A classic problem for utilitarianism; Russell rightly concedes that we’d need a much better understanding of this to align this kind of AI. It seems very non-obvious to me how you make this tradeoff, especially if you’re thinking in terms of preferences, which don’t seem to have a straightforward scale of, say, x% of preferences satisfied.
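As a small illustration of why this is non-obvious (my own example, not Russell’s), the “winner” under a summed utility can flip depending on how each person’s preference scale is normalized:

```python
# Toy example: whether option A or B wins depends on how we scale Bob's
# reported preference intensities relative to Alice's.
alice = {"option_a": 3.0, "option_b": 1.0}   # Alice mildly prefers A
bob   = {"option_a": 0.0, "option_b": 10.0}  # Bob reports a strong preference for B

def aggregate(scale_bob):
    """Sum the two people's utilities, rescaling Bob's numbers by scale_bob."""
    return {opt: alice[opt] + scale_bob * bob[opt] for opt in alice}

print(aggregate(scale_bob=1.0))  # take Bob's numbers at face value -> B wins
print(aggregate(scale_bob=0.1))  # squash Bob's numbers onto a 0-1 range -> A wins
```

Neither choice of scaling is obviously the right one, which is exactly the problem.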

5. How do we know it won’t try to control human preferences to make them easier to satisfy?

Russell tells us that we can sidestep this problem by having the robot optimize for “the reward itself”, human preferences, rather than a proxy for them. I don’t understand how this would work; even I, when making decisions, am always optimizing using a proxy (I would like to give my body nutrition, so I do something that makes me feel like I’m drinking a substance called water, etc.). As too many philosophers have told us, you can’t simply see the thing in itself. Maybe I’m being naive and he just means “reflect really thoughtfully on how to set up the reward for a really long time,” which seems hard.

6. OK, Stuart Russell. Suppose we can solve human psychology and compare all our preferences to get the best utility function. Even with some technology to do this, there seem to be intermediate research steps before we could deploy this perfectly aligned AI. To serve as a good guide for the AI, that is, to provide correct parameters for what human preferences are, that research seems to require something with a really high level of intelligence, maybe high enough to itself become an unaligned intelligence. Either we supply proper specifications of human preferences ourselves, which seems to demand that kind of capability, or we risk the AI being wrong about human preferences.

This is my main concern coming out of this post. Maybe if we start with an incomplete idea but are confident the AI can learn the rest properly, it can work. But we’d better be pretty sure before betting on it.

So, tl;dr: Russell has outlined a way he thinks aligned AI might work using AI that learns human preferences, hopefully even better than we understand them now, and optimizes for them. Before we think this is a good idea, we’ll probably have to answer some questions about human psychology, how to prevent this thing from just controlling human preferences, how to compare different people’s preferences, and what it means to have a “preference.” This might work, but it seems to me like we have a long way to go, and should maybe be cautious about this kind of AI in the meantime.

Maybe this stuff has been fleshed out more in other posts/elsewhere; I’m newish to this stuff, and if some of these Qs have been answered, would love to hear in the comments!

3 comments:

Huh, there have been a few threads about CIRL recently. I'm too lazy to do a thorough search, but here's one.

I have sympathy for your fears, especially that "not killing everyone" is not sufficient for an AI to be considered well-aligned (though arguably "killing everyone" at least prevents the worst-case scenarios from being realized). This seems to be an area where the line separating AI research from general ethics is blurry, and perhaps technically intractable for that reason.

This seems very similar to coherent extrapolated volition: https://intelligence.org/files/CEV.pdf

Which I think seems like a good idea, and is one way to define an objective function that is otherwise very difficult to specify -- but as you note, it's very hard to see how to do it in a practical way.

I hope that at least some researchers work on solving the practical problems. I think it's a long shot but that's a lot better than no shot at all.