I’m having a problem understanding why Stuart Russell thinks that AI learning human preferences is a good idea. I think it’s a bad idea, but I assume I’m wrong and that I simply don’t understand something. So, help me out here, please. I’m not looking for an argument but rather to understand. Let me explain.

I have watched Stuart’s four-hour series of Reith Lectures on the BBC. Highly recommended. I have watched several other videos featuring him as well, and I am currently reading his book, Human Compatible. I am not an academic; I am now retired and write science fiction about advanced social robots as a hobby.

Reading the chapter “AI: A Different Approach” in Stuart’s book, I am still bothered by something about the preferences issue. My understanding of Stuart’s “new model for AI” is that it would learn what our preferences are by observing our behavior. I understand why he thinks “preferences” is a better word than “values” to describe these behaviors but, at the risk of confusing things, let me use the word “values” to explain my confusion.

As I understand it, humans have different kinds of values:

1) Those that are evolved and which we all share as a species, like finding sugar tasty or glossy hair attractive.
2) Those that reflect our own individuality and make each of us unique, including those that some twin studies reveal.
3) Those that our culture, family, society, or what have you impose on us.

I believe the first two kinds are genetic and the third kind learned. Let me classify the first two as biological values and the third kind as social values. It would appear that the third category accounts for the majority of the recent evolution of our physical brains.

Let’s consider three values of each type, just as simple examples. Biological values might be greed, selfishness, and competition, while social values might be trust, altruism, and cooperation. Humans are a blend of all six of these values and will exhibit preferences based on them in different situations. A lot of the time they are going to choose behaviors based on biological values, as the nightly news makes clear.

If AI learns our preferences based on our behaviors, it’s going to learn a lot of “bad” things like lying, stealing and cheating, and other much worse things. From a biological point of view, these behaviors are “good” because they maximize the return on calories invested by getting others to do the work while we reap the benefits; think of parasites and cuckoo birds, for example.

In his Reith lectures, Stuart states that an AI trained on preferences will not turn out evil, but he never explains why not. There is no mention (so far) in his book of the issue of human preferences and anything we would consider negative, bad, or evil. I simply don’t understand how an AI observing our behavior is going to end up being exclusively benevolent, or “provably beneficial” to use Stuart’s term.

I think an AI learning from our preferences would be a terrible idea. What am I not understanding? 
 


My take is that AGI learning human preferences/values/needs/desires/goals/etc. is a necessary but not sufficient condition for achieving alignment.

“If AI learns our preferences based on our behaviors, it’s going to learn a lot of ‘bad’ things like lying, stealing and cheating, and other much worse things.”

First of all, it's important to note that just because the AGI learns that humans engage in certain behaviors does not by itself imply that it will have any inclination to engage in those behaviors itself. Secondly, there is a difference between behaviors and preferences. Lying, theft, and murder are all natural things that members of our species engage in, yes, but they are typically only ever used as a means of fulfilling actual preferences.

Preferences, values, etc. are what motivate our actions. Behaviors are reinforced that tend to satisfy our preferences, stimulate pleasure, alleviate pain, move us toward our goals and away from our antigoals. It's those preferences, those motivators of action rather than the actions themselves, that I believe Stuart Russell is talking about.
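
To make that distinction concrete, here's a toy sketch. It is entirely my own illustration, not Russell's actual method, and every name and score in it is invented: the learner infers the preference that best explains an observed behavior, then serves that preference with its own action rather than copying the human's means.

```python
# Toy illustration only (my own, not Russell's algorithm; names and
# numbers are invented). Infer the preference behind an observed
# behavior, then satisfy it without reproducing the behavior itself.

observed_behavior = {"action": "lies_to_get_food", "outcome": "gets_food"}

# Candidate latent preferences that might explain the behavior.
candidate_preferences = ["wants_food", "enjoys_deception", "wants_status"]

def explanatory_score(preference: str, behavior: dict) -> float:
    """Crude hand-made 'likelihood': how well does the preference explain the outcome?"""
    if preference == "wants_food" and behavior["outcome"] == "gets_food":
        return 1.0
    return 0.1

# Step 1: infer the motive, not the action.
inferred = max(candidate_preferences,
               key=lambda p: explanatory_score(p, observed_behavior))

# Step 2: choose the machine's own way to serve that motive;
# lying never enters the picture.
own_policy = {
    "wants_food": "offer_the_human_food",
    "enjoys_deception": "do_nothing",
    "wants_status": "compliment_the_human",
}

print(inferred, "->", own_policy[inferred])  # wants_food -> offer_the_human_food
```

The thing being optimized is the inferred motive, not the observed action, which is why learning *about* lying is not the same as learning *to* lie.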

An aligned AGI will necessarily have to take our preferences into account, whether to help us achieve them more intelligently than we could on our own or to steer us toward actions that minimize conflict with others' preferences. It would ideally take on our values (or the coherent extrapolation or aggregation of them) as its own in some way without necessarily taking on our behavioral policies as its own (though of course it would need to learn about those too).

An unaligned AGI would almost certainly learn human values as well, once it's past a certain level of intelligence, but it would lack the empathy to care about them beyond the need to plan around them as it pursues its own goals.

“…human preferences/values/needs/desires/goals/etc. is a necessary but not sufficient condition for achieving alignment.”

I have to agree with you in this regard and on most of your other points. My concern, however, is that Stuart’s communications give the impression that the preferences approach addresses the problem of AI learning things we consider bad, when in fact it doesn’t.

The model of AI learning our preferences by observing our behavior and then proceeding with uncertainty makes sense to me. However, just as Asimov’s robot characters eventually derive a Zeroth Law that overrides the original three laws, Stuart’s “Three Principles” model seems incomplete. Preferences do not appear to me, in themselves, to deal with the issue of evil.
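
To check my own understanding of the “proceeding with uncertainty” part, here is how I picture it in toy form. The numbers and names are entirely my own invention, not anything from the book: the machine keeps probabilities over what we prefer and, when those probabilities don’t clearly favor one action, asking us first beats acting.

```python
# My own toy picture of "proceeding with uncertainty" (invented numbers,
# not code from Human Compatible). The machine keeps probabilities over
# what the human prefers and asks first when acting on its best guess
# isn't clearly better than deferring.

belief = {"prefers_option_A": 0.55, "prefers_option_B": 0.45}
PAYOFF_RIGHT, PAYOFF_WRONG, ASK_COST = 1.0, -1.0, 0.05

def expected_value(option: str) -> float:
    p = belief["prefers_" + option]
    return p * PAYOFF_RIGHT + (1 - p) * PAYOFF_WRONG

best_guess = max(("option_A", "option_B"), key=expected_value)

# Asking costs a little but guarantees the right choice afterwards.
if expected_value(best_guess) < PAYOFF_RIGHT - ASK_COST:
    decision = "ask_the_human_first"
else:
    decision = "do_" + best_guess

print(decision)  # with these numbers: ask_the_human_first
```

That mechanism I can follow; my worry is about what the learned preferences themselves turn out to be.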

“Biological values might be greed, selfishness, and competition, while social values might be trust, altruism, and cooperation.”

I believe that this "biological = selfish, social = cooperative" dichotomy is wrong. It is a popular mistake to make, because it provides legitimacy to all kinds of political regimes, by allowing them to take credit for everything good that humans living in them do. It also allows one to express "edgy" opinions about human nature.

But if homo sapiens actually had no biological foundations for trust, altruism, and cooperation, then... it would be extremely difficult for our societies to instill such values in humans; and most likely we wouldn't even try, because we simply wouldn't think about such things as desirable. The very idea that fundamentally uncooperative humans somehow decided to cooperate at creating a society that brainwashes humans into being capable of cooperation is... somewhat self-contradictory.

(The usual argument is that it would make sense, even for a perfectly selfish asshole, to brainwash other people into becoming cooperative altruists. The problem with this argument is that the hypothetical perfectly selfish wannabe social engineer couldn't accomplish such a project alone. And when many people start cooperating on brainwashing the next generation, what makes even more sense for a perfectly selfish asshole is to... shirk their duty at creating the utopia, or even try to somehow profit from undermining this communal effort.)

Instead, I think we have dozens of instincts which under certain circumstances nudge us towards more or less cooperation. Perhaps in the ancient environment they were balanced in the way that maximized survival (sometimes by cooperation with others, sometimes at their expense), but currently the environment changes too fast for humans to adapt...

I agree with your main point: it is not obvious how training an AI on human preferences (which are sometimes "good" and sometimes "evil") would help us achieve the goal (separating the "good" from "evil" from "neutral").

Thanks for responding, Viliam. Totally agree with you that “if homo sapiens actually had no biological foundations for trust, altruism, and cooperation, then... it would be extremely difficult for our societies to instill such values”.

As you say, we have a blend of values that shift as required by our environment. I appreciate your agreement that it’s not really clear how training an AI on human preferences solves the issue raised here.

Of all the things I have ever discussed in person or online, values are the most challenging. I was interested in human values for decades before AI came along, and historically there is very little hard science to be found on the subject. I’m delighted that AI is causing values to be studied widely for the first time; however, in my view, we are only about where the ancient Greeks were with regard to the structure of matter, or where Gregor Mendel’s pea plants were with regard to genetics. Both fields turned out to be unimaginably complex. Like them, I expect the study of values will go on indefinitely as we discover how complicated they really are.

I can see how the math involved likely precludes us from writing the necessary code by hand, and that “self-teaching” (sorry, I don’t know the correct term) is the only way an AI could learn human values, but again it seems as if Stuart’s approach is missing a critical component. I’ve finished his book now, and although he goes on at length about different scenarios, he never definitively addresses the issue I raise here. I think the analogy that children learn many things from their parents, not all of them “good”, applies here, and Stuart’s response to this problem still seems to gloss over the issue.

If something is capable of fulfilling human preferences in its actions, and you can convince it to do so, you're already most of the way to getting it to do things humans will judge as positive. Then you only need to specify which preferences are to be considered good in an equally compelling manner. This is obviously a matter of much debate, but it's an arena we know a lot about operating in. We teach children these things all the time.

In a later chapter, Stuart does say something along the same lines as what you point out; however, I felt it detracted from his idea of three principles:

   1. The machine's only objective is to maximize the realization of human preferences.

   2. The machine is initially uncertain about what those preferences are.

   3. The ultimate source of information about human preferences is human behavior.
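
The way I picture those three principles in toy form is something like the sketch below. It is entirely my own invention, with made-up behaviors, candidates, and numbers, not code from the book: start uncertain about our preferences, treat observed behavior as evidence, and choose whatever action best serves the current estimate.

```python
# My own toy rendering of the three principles (invented names, behaviors,
# and numbers; not code from the book).

# Principle 2: start uncertain about which preferences the human holds.
candidates = ["values_honesty", "values_wealth", "values_leisure"]
belief = {c: 1.0 / len(candidates) for c in candidates}

# Principle 3: observed human behavior is evidence about those preferences.
def likelihood(candidate: str, behavior: str) -> float:
    """Hand-made guesses at how probable the behavior is under each candidate."""
    table = {
        ("values_honesty", "returns_lost_wallet"): 0.8,
        ("values_wealth", "returns_lost_wallet"): 0.2,
        ("values_leisure", "returns_lost_wallet"): 0.4,
    }
    return table.get((candidate, behavior), 0.3)

def update(belief, behavior):
    unnormalized = {c: p * likelihood(c, behavior) for c, p in belief.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}

belief = update(belief, "returns_lost_wallet")

# Principle 1: the only objective is the expected realization of human
# preferences under the current belief (action scores are invented).
action_scores = {
    "tell_the_truth": {"values_honesty": 1.0, "values_wealth": 0.2, "values_leisure": 0.3},
    "cut_corners": {"values_honesty": 0.0, "values_wealth": 0.9, "values_leisure": 0.8},
}
best = max(action_scores,
           key=lambda a: sum(belief[c] * action_scores[a][c] for c in candidates))

print(belief)  # belief shifts toward values_honesty after the observation
print(best)    # tell_the_truth
```

What strikes me is that nothing in this loop distinguishes “good” evidence from “bad”: a lifetime of observed cheating would shift the belief just as readily, which is exactly the issue I keep tripping over.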

He goes on at such length to qualify and add special cases that the word “ultimate” in principle #3 seems to have been a poor choice because it becomes so watered down as to lose its authority.

If things like laws, ethics, and morality are used to constrain what the AI learns from preferences (which seems both sensible and necessary, as in the parent/child example you provide), then I don’t see how preferences are “the ultimate source of information”; they are simply one of many training streams. I don’t see that principle #3, by itself, deals with the issue of evil.

As you point out, this whole area is “a matter of much debate”, and I’m pretty confident that, like philosophical discussions generally, it will go on (as it should) forever. However, I am not entirely confident that Stuart’s model won’t end up meeting the same fate as Marvin Minsky’s “Society of Mind”.

That does sound problematic for his views, if he actually holds these positions. I am not really familiar with him, even though he wrote the textbook for my AI class (third edition) back when I was in college. At that point there wasn't much in it about the now-current techniques, and I don't remember him talking about this sort of thing (though we might simply have skipped such a section).

You could consider it that we have preferences about our preferences too. That's a bit too self-referential, but it's actually a key part of being a person. You could determine the things we consider to be 'right' directly from how we act when knowingly pursuing those objectives, though this requires much more insight.

You're right, the debate will keep going on in philosophical style, but whether or not it works as an approach for something other than humans could change that.