Stuart Russell: AI value alignment problem must be an "intrinsic part" of the field's mainstream agenda

by Rob Bensinger, 26th Nov 2014

Edge.org has recently been discussing "the myth of AI". Unfortunately, although Superintelligence is cited in the opening, most of the participants don't seem to have looked into Bostrom's arguments. (Luke has written a brief response to some of the misunderstandings Pinker and others exhibit.) The most interesting comment is Stuart Russell's, at the very bottom:

Of Myths and Moonshine

"We switched everything off and went home. That night, there was very little doubt in my mind that the world was headed for grief."

So wrote Leo Szilard, describing the events of March 3, 1939, when he demonstrated a neutron-induced uranium fission reaction. According to the historian Richard Rhodes, Szilard had the idea for a neutron-induced chain reaction on September 12, 1933, while crossing the road next to Russell Square in London. The previous day, Ernest Rutherford, a world authority on radioactivity, had given a "warning…to those who seek a source of power in the transmutation of atoms – such expectations are the merest moonshine."

Thus, the gap between authoritative statements of technological impossibility and the "miracle of understanding" (to borrow a phrase from Nathan Myhrvold) that renders the impossible possible may sometimes be measured not in centuries, as Rod Brooks suggests, but in hours.

None of this proves that AI, or gray goo, or strangelets, will be the end of the world. But there is no need for a proof, just a convincing argument pointing to a more-than-infinitesimal possibility. There have been many unconvincing arguments – especially those involving blunt applications of Moore's law or the spontaneous emergence of consciousness and evil intent. Many of the contributors to this conversation seem to be responding to those arguments and ignoring the more substantial arguments proposed by Omohundro, Bostrom, and others.

The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:

1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.

2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want. A highly capable decision maker – especially one connected through the Internet to all the world's information and billions of screens and most of our infrastructure – can have an irreversible impact on humanity.
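A toy illustration of the point above (not part of Russell's essay): a minimal sketch, assuming two variables that share a hypothetical fixed budget, where the objective looks at only one of them. The names `production`, `untouched`, and `BUDGET` are invented for the example, and brute-force search stands in for whatever optimizer a real system would use.

```python
from itertools import product

BUDGET = 10  # hypothetical shared resource, split between two uses


def objective(production, untouched):
    # The objective depends only on `production` (k = 1 of n = 2 variables).
    # `untouched` stands for something we care about but never wrote down.
    return production


def best_allocation(budget=BUDGET):
    # Enumerate every way of splitting the whole budget between the two uses.
    feasible = [
        (p, u)
        for p, u in product(range(budget + 1), repeat=2)
        if p + u == budget
    ]
    # Pick the split with the highest objective value.
    return max(feasible, key=lambda split: objective(*split))


production, untouched = best_allocation()
print("production:", production, "untouched:", untouched)
```

Because nothing in the objective rewards leaving anything in `untouched`, the search drives it to the edge of its range. The optimizer is not hostile to that variable; it simply has no reason to protect it.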

This is not a minor difficulty. Improving decision quality, irrespective of the utility function chosen, has been the goal of AI research – the mainstream goal on which we now spend billions per year, not the secret plot of some lone evil genius. AI research has been accelerating rapidly as pieces of the conceptual framework fall into place, the building blocks gain in size and strength, and commercial investment outstrips academic research activity. Senior AI researchers express noticeably more optimism about the field's prospects than was the case even a few years ago, and correspondingly greater concern about the potential risks.

No one in the field is calling for regulation of basic research; given the potential benefits of AI for humanity, that seems both infeasible and misdirected. The right response seems to be to change the goals of the field itself; instead of pure intelligence, we need to build intelligence that is provably aligned with human values. For practical reasons, we will need to solve the value alignment problem even for relatively unintelligent AI systems that operate in the human environment. There is cause for optimism, if we understand that this issue is an intrinsic part of AI, much as containment is an intrinsic part of modern nuclear fusion research. The world need not be headed for grief.

I'd quibble with a point or two, but this strikes me as an extraordinarily good introduction to the issue. I hope it gets reposted somewhere it can stand on its own.

Russell has previously written on this topic in Artificial Intelligence: A Modern Approach and the essays "The long-term future of AI," "Transcending complacency on superintelligent machines," and "An AI researcher enjoys watching his own execution." He's also been interviewed by GiveWell.

AI value alignment problem / AI goal alignment problem

Incidentally, rephrasing and presenting the problem in this manner is how MIRI could possibly gain additional traction in the private sector and its associated academia. As self-modifying code gains popularity, there will be plenty of people encountering just this hurdle, i.e. "but how can I make sure that my modified agent still optimizes for X?" Establishing that as a well-delineated subfield, unrelated to the whole "fate of humanity"-thing, could both prompt/shape additional external research and lay the groundwork for the whole "the new goals post-modification may be really damaging" argument, reducing the inferential distance to one of scope alone.

A company making a lot of money off of their self-modifying stock-brokering algorithms* (a couple of years down the line), in a perpetual tug-of-war with their competitor's self-modifying stock-brokering algorithms* will be quite interested in proofs that their modified-beyond-recognition agent will still try to make them a profit.

I imagine that a host of expert systems, medium term, will increasingly rely on self-modification. Now, compare: "There is an institute concerned with the friendliness of AI, in the x-risk sense" versus "an institute concerned with preserving AI goals across modifications as an invariant which I can use to keep the money flowing" in terms of industry attractiveness.

Even if both match to the same research activities at MIRI, modulo the "what values should the AI have?", which in large parts isn't a technical question anyways and brings us into the murky maelstrom between game theory / morality / psychology / personal dogmatism. Just a branding suggestion, since the market on the goal alignment question in the concrete economic sense can still be captured. Could even lead to some Legg-itimate investments.

* Using algorithm/tool/agent interchangeably, since they aren't separated by more than a trivial modification.

I worry about the phrase "provably aligned with human values."

Yep, I noticed it too, it really is a great response. A shame it appeared much later than the others, would have been good to have it there amongst the other comments when the article was getting the early traffic.

Russell is an entirely respectable and mainstream researcher, at one of the top CS departments. It's striking that he's now basically articulating something pretty close to the MIRI view. Can somebody comment on whether Russell has personally interacted with MIRI?

If MIRI's work played a role in convincing people like Russell, that seems like a major accomplishment and a demonstration that they have arrived as part of the academic research community. If Russell came to that conclusion on his own, MIRI should still get a fair bit of praise for getting there first and saying it before it was respectable.

In either case, my congratulations to the folks at MIRI and I will up my credence in them, going forwards. (They've been rising steadily in my estimation for the last several years; this is just one of the more dramatic bumps.)

The 3rd edition of Artificial Intelligence: A Modern Approach, which came out in 2009, explains the intelligence explosion concept, cites Yudkowsky's 2008 paper "Artificial Intelligence as a Positive and Negative Factor in Global Risk", and specifically mentions friendly AI and the challenges involved in creating it.

So Russell has more or less agreed with MIRI on a lot of the key issues for quite some time now.

Can somebody comment on whether Russell has personally interacted with MIRI?

His textbook from 2009 mentions Yudkowsky and Omohundro by name, so he very likely is familiar with MIRI's arguments.

Can somebody comment on whether Russell has personally interacted with MIRI?

He has.

I'm a super-dummy when it comes to thinking about AI. I rightly leave it to people better equipped and more motivated than me.

But, can someone explain to me why a solution would not involve some form of "don't do things to people or their property without their permission"? Certainly, that would lead to a sub-optimal use of AI in some people's opinions. But it would completely respect the opinions of those who disagree.

Recognizing that I am probably the least AI-knowledgeable person to have posted a comment here, I ask, what am I missing?

Even leaving aside the matters of 'permission' (which lead into awkward matters of informed consent) as well as the difficulties of defining concepts like 'people' and 'property', define 'do things to X'. Every action affects others. If you so much as speak a word, you're causing others to undergo the experience of hearing that word spoken. For an AGI, even thinking draws a minuscule amount of electricity from the power grid, which has near-negligible but quantifiable effects on the power industry, which will affect humans in any number of different ways. If you take chaos theory seriously, you could take this even further. It may seem obvious to a human that there's a vast difference between innocuous actions like those in the above examples and those that are potentially harmful, but lots of things are intuitively obvious to humans and yet turn out to be extremely difficult to precisely quantify, and this seems like just such a case.

What people permit is broader and vaguer than what they want, and permission does not even aim, in the same sense, at furthering a person's goals. There is also the problem that people could accept a fate they don't want. Whether that is the human being self-unfriendly or the AI being unfriendly is a matter of debate. But it is still a form of unfriendliness.

it's not strictly an AI problem-- any sufficiently rapid optimization process bears the risk of irretrievably converging on an optimum nobody likes before anybody can intervene with an updated optimization target.

individual and property rights are not rigorously specified enough to be a sufficient safeguard against bad outcomes even in an economy moving at human speeds

in other words the science of getting what we ask for advances faster than the science of figuring out what to ask for

If you don't know that you are missing something, or have reason to believe this to be the case, you are unsure about whether you are a dummy when it comes to AI or not. Not knowing whether you should discuss AI is different from knowing not to discuss AI.

Human beings do not have values that are provably aligned with the values of other human beings. Nor can there ever be a proof like this, since "human being" does not have a mathematical definition any more than "baldness" has a definition that would tell you in every edge case whether someone is "truly bald" or not. Consequently there will never be such a proof for AI, since if various human beings have diverging values, there is no way for the AI to be aligned with both.

In any case, I think the main problem is the assumption that human beings have utility functions at all, since they do not. In particular, as I said elsewhere, human beings do not value anything infinitely. Any AI that does value something infinitely will not have human values, and it will be subject to Pascal's Muggings. Consequently, the most important point is to make sure that you do not give an AI any utility function at all, since if you do give it one, it will automatically diverge from human values.

if various human beings have diverging values, there is no way for the AI to be aligned with both.

Yes, it is trivially true that an AI cannot perfectly optimize for one person's values while simultaneously perfectly optimizing for a different person's values. But, by optimizing for some combination of each person's values, there's no reason the AI can't align reasonably well with all of them unless their values are rather dramatically in conflict.

In particular, as I said elsewhere, human beings do not value anything infinitely. Any AI that does value something infinitely will not have human values, and it will be subject to Pascal's Muggings. Consequently, the most important point is to make sure that you do not give an AI any utility function at all, since if you do give it one, it will automatically diverge from human values.

Are you claiming that all utility functions are unbounded? That is not the case. (In fact, if you only consider continuous utility functions on a complete lottery space, then all utility functions are bounded. http://lesswrong.com/lw/gr6/vnm_agents_and_lotteries_involving_an_infinite/)

No, I wasn't saying that all utility functions are unbounded. I was making two points in that paragraph:

1) An AI that values something infinitely will not have anything remotely like human values, since human beings do not value anything infinitely. And if you describe this AI's values with a utility function, it would either be an unbounded function, or a bounded function that behaves in a similar way by approaching a limit (if it didn't behave similarly it would not treat anything as having infinite value.)

2) If you program an AI with an explicit utility function, in practice it will not have human values, because human beings are not made with an explicit utility function, just as if you program an AI with a GLUT, in practice it will not engage in anything like human conversation.

It's true that humans do not have utility functions, but I think it still can make sense to try to fit a utility function to a human that approximates what they want as well as possible, since non-VNM preferences aren't really coherent. It's a good point that it is pretty worrying that the best VNM approximation to human preferences might not fit them all that closely though.

a bounded function that behaves in a similar way by approaching a limit (if it didn't behave similarly it would not treat anything as having infinite value.)

Not sure what you mean by this. Bounded utility functions do not treat anything as having infinite value.

I think there is good reason to think coming up with an actual VNM representation of human preferences would not be a very good approximation. On the other hand as long as you don't program an AI in that way -- with an explicit utility function -- then I think it is unlikely to be dangerous even if it does not have exactly human values. This is why I said the most important thing is to make sure that the AI does not have a utility function. I'm trying to do a discussion post on that now but something's gone wrong (with the posting).

I thought you could map an unbounded function to a bounded one to produce the same behavior, but actually you may be right that this is not really possible since you have to multiply your utilities by probabilities. I would have to think about that more.

It's awfully suspicious to say that the one goal architecture that is coherent enough to analyse easily is dangerous but that all others are safe. More concretely, humans are not VNM-rational (as you pointed out), and often pose threats to other agents anyway. Also, an AI does not have to be programmed with an explicit utility function in order to be VNM rational, and thus to behave like it has a utility function.

I thought you could map an unbounded function to a bounded one to produce the same behavior, but actually you may be right that this is not really possible since you have to multiply your utilities by probabilities.

You can rescale an unbounded utility function to a bounded one that will have the same preferences over known outcomes, but this will change its preferences over gambles; in particular, agents with bounded utility functions cannot be made to care about arbitrarily small probabilities of arbitrarily good/bad outcomes.
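A minimal sketch of this, assuming a hypothetical unbounded utility u(x) = x and the bounded rescaling v(x) = x / (1 + x); the specific lotteries are made up for illustration. Both functions rank sure outcomes identically, but they disagree about a long-shot gamble.

```python
def unbounded(x):
    # Hypothetical unbounded utility: linear in the outcome.
    return x


def bounded(x):
    # Monotone rescaling of `unbounded` for x >= 0; never exceeds 1.
    return x / (1.0 + x)


def expected_utility(utility, lottery):
    # A lottery is a list of (probability, outcome) pairs.
    return sum(p * utility(x) for p, x in lottery)


sure_thing = [(1.0, 10.0)]
long_shot = [(1e-6, 1e9), (1 - 1e-6, 0.0)]  # tiny chance of a huge payoff


def prefers_long_shot(utility):
    return expected_utility(utility, long_shot) > expected_utility(utility, sure_thing)


outcomes = [0.0, 1.0, 10.0, 1e9]
same_ranking = sorted(outcomes, key=unbounded) == sorted(outcomes, key=bounded)

print("same ranking of sure outcomes:", same_ranking)
print("unbounded agent takes the long shot:", prefers_long_shot(unbounded))
print("bounded agent takes the long shot:", prefers_long_shot(bounded))
```

The bounded agent's gain from the huge payoff is capped below 1, so multiplying it by a one-in-a-million probability can never outweigh a decent sure thing.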

Yes, you're right about the effect of rescaling an unbounded function.

I don't see why it's suspicious that less coherent goal systems are safer. Being less coherent is being closer to having no goals at all, and without goals a thing is not particularly dangerous. For example, take a rock. We could theoretically say that the path a rock takes when it falls is determined by a goal system, but it would not be particularly helpful to describe it as using a utility function, and likewise it is not especially dangerous. It is true that you can get killed if it hits you on the head or something, but it is not going to take over the world.

I described in my top-level post what kind of behavior I would expect of an intelligent goal system that was not programmed using an explicit utility function. You might be able to theoretically describe its behavior with a utility function, but this is not the most helpful description. So for example, if we program a chess playing AI, as long as it is programmed to choose chess moves in a deterministic fashion, optimizing based solely on the present chess game (e.g. not choosing its moves based on what it has learned about the current player or whatever, but only based on the current position), then no matter how intelligent it becomes it will never try to take over the universe. In fact, it will never try to do anything except play chess moves, since it is physically impossible for it to do anything else, just as a rock will never do anything except fall.

Notice that this also is closer to having no goals, since the chess playing AI can't try to affect the universe in any particular way. (That is why I said based on the game alone -- if it can base its moves on the person playing or whatever, then in theory it could secretly have various goals such as e.g. driving someone insane on account of losing chess games etc., even if no one programmed these goals explicitly.) But as long as its moves are generated in a deterministic manner based on the current position alone, it cannot have any huge destructive goal, just like a rock does not.

It's true that humans do not have utility functions

Do not have full conscious access to their utility function? Yes. Have an ugly, constantly changing utility function since we don't guard our values against temporal variance? Yes. Whose values cannot with perfect fidelity be described by a utility function in a pragmatic sense, say with a group of humans attempting to do so? Yes.

Whose actual utility function cannot be approximately described, with some bounded error term epsilon? No. Whose goals cannot in principle be expressed by a utility function? No.

Please approximately describe a utility function of an addict who is calling his dealer for another dose, knowing full well that he is doing harm to himself, that he will feel worse the next day, and already feeling depressed because of that, yet still acting in a way which is guaranteed to negatively impact his happiness. The best I can do is "there are two different people, System 1 and System 2, with utility functions UF1 and UF2, where UF1 determines actions while UF2 determines happiness".

The question does come down to definition. I do think most people here are on the same page concerning the subject matter, and only differ on what they're calling a utility function. I'm of the Church-Turing thesis persuasion (the 'iff' goes both ways), and don't see why the aspect of a human governing its behavior should be any different than the world at large.

Whether that's useful is a different question. No doubt the human post-breakfast has a different utility function than pre-breakfast. Do we then say that the utility function takes as a second parameter t, or do we insist that post-breakfast there exists a different agent (strictly speaking, since it has different values) who merely shares some continuity with its hungry predecessor, who sadly no longer exists (RIP)? If so, what would be the granularity, what kind of fuzziness would still be allowed in our constantly changing utility function, which ebbs and flows with our cortisol levels and a myriad of other factors?

If a utility function, even if known, was only applicable in one instant, for one agent, would it even make sense to speak of a global function, if the domain consists of but one action?

In the VNM-sense, it may well be that technically humans don't have a (VNM!)utility function. But meh, unless there's uncomputable magic in there somewhere some kind of function mapping all possible stimuli to a human's behavior should theoretically exist, and I'd call that utility function.

Definitional stuff, which is just wiggly lines fighting each other: squibbles versus squobbles, dictionary fight to the death, for some not[at]ion of death!

ETA: It depends on what you call a utility function, and how ugly a utility function (including assigning different values to different actions each fraction of a second) you're ready to accept. Is there "a function" assigning values to outcomes which would describe a human's behavior over his/her lifetime? Yes, of course there is. (There is one describing the whole universe, so there better be one for a paltry human's behavior. Even if it assigns different values at different times.) Is there a 'simple' function (e.g. time-invariant) which also satisfies the VNM criteria? Probably not.

In the VNM-sense, it may well be that technically humans don't have a (VNM!)utility function. But meh, unless there's uncomputable magic in there somewhere some kind of function mapping all possible stimuli to a human's behavior should theoretically exist, and I'd call that utility function.

Calling it a utility function does not make it a utility function. A utility function maps decisions to utilities, in an entity which decides among its available choices by evaluating that function for each one and making the decision that maximises the value. Or as Wikipedia puts it, in what seems a perfectly sensible summary definition covering all its more detailed uses, utility is "the (perceived) ability of something to satisfy needs or wants." That is the definition of utility and utility functions; that is what everyone means by them. It makes no sense to call something completely different by the same name in order to preserve the truth of the sentence "humans have utility functions". The sentence has remained the same but the proposition it expresses has been changed, and changed into an uninteresting tautology. The original proposition expressed by "humans have utility functions" is still false, or if one is going to argue that it is true, it must be done by showing that humans have utility functions in the generally understood meaning of the term.

some kind of function mapping all possible stimuli to a human's behavior should theoretically exist

No, it should not; it cannot. Behaviour depends not only on current stimuli but the human's entire past history, internal and external. Unless you are going to redefine "stimuli" to mean "entire past light-cone" (which of course the word does not mean) this does not work. Furthermore, that entire past history is also causally influenced by the human's behaviour. Such cyclic patterns of interaction cannot be understood as functions from stimulus to response.

In order to arrive at this subjectively ineluctable ("meh, unless there's uncomputable magic") statement, you have redefined the key words to make them mean what no-one ever means by them. It's the Texas Sharpshooter Utility Function fallacy yet again: look at what the organism does, then label that as having higher "utility" than the things it did not do.

I appreciate your point.

Mostly, I'm concerned that "strictly speaking, humans don't have VNM-utility functions, so that's that, full stop" can be interpreted like a stop sign, when in fact humans do have preferences (clearly) and do tend to choose actions to try to satisfice those preferences at least part of the time. To the extent that we'd deny that, we'd deny the existence of any kind of "agent" instantiated in the physical universe. There is predictable behavior for the most part, which can be modelled. And anything that can be computationally modelled can be described by a function. It may not have some of the nice VNM properties, but we take what we can get.

If there's a more applicable term for the kind of model we need (rather than simply "utility function in a non-VNM sense"), by all means, but then again, "what's in a name" ...

Sorry, I don't understand your point, beyond your apparently reversing your position and agreeing that humans don't have a utility function, not even approximately.

The question is whether AIs can have a fixed UF, specifically whether they can both self-modify and maintain their goals. If they can't, there is no point in loading them with human values upfront (as they won't stick to them anyway), and the problem of corrigibility becomes one of getting them to go in the direction we want, not of getting them to budge at all.

Which is not to say that goal unstable AIs will be safe, but they do present different problems and require different solutions. Which could do with being looked at some time.

In the face of instability, you can rescue the idea of the utility function by feeding in an agent's entire history, but rescuing the UF is not what is important. What matters is stability versus instability. I am still against the use of the phrase "utility function", because when people read it, they think of a time-independent utility function, which is why, I think, there is so little consideration of unstable AI.

Humans do not behave in a way that is even close to VNM-rational, and there's no clear evidence for some underlying VNM preferences that are being deviated from.

Human beings do not have values that are provably aligned with the values of other human beings.

Sure, but we "happily" compromise. AI should be able to understand and implement the compromise that is overall best for everyone.

Any AI that does value something infinitely will not have human values

AI can value the "best compromise" infinitely :). But agreed nothing else.

I'm not sure what it would mean exactly to value the best compromise infinitely, since part of that compromise would be the refusal to accept a sufficiently bad Mugging, which implies a utility bound.

But if an AI can compromise on some fuzzy or simplified set of values, what happened to the full complexity and fragility of human value?

Why does the compromise have to be a function of simplified values? I don't think I implied that.

Good points, shamefully downvoted.

A utility function sounds like the sort of computery thing an AI programme ought to be expected to have, but it is actually an idealized way of describing a rational agent that can't be translated into code.

If your preferences about possible states of the world follow a few very reasonable constraints, then (somewhat surprisingly) your preferences can be modeled by a utility function. An agent with a reasonably coherent set of preferences can be talked about as if it optimizes a utility function, even if that's not the way it was programmed. See VNM rationality.

I agree with this, but that doesn't mean the model has to be useful. For example you could say that I have a utility function that assigns a utility of 1 to all the actions I actually take, and a utility of 0 to all the actions that I don't. But this would be similar to saying that you could make a giant look-up table which would be a model of my responses in conversation. Nonetheless, if you attempt to program an AI with a GLUT for conversation, it will not do well at all in conversation, and if you attempt to program an AI with the above model of human behavior, it will do very badly.

In other words, theoretically there is such a model, but in practice this is not how a human being is made and it shouldn't be how an AI is made.

Here's the argument I was hearing:

Humans can be turned into money pumps. Consequently, the most important point is to make sure that your AI can be turned into a money pump, since if you don't, it will automatically diverge from human values.

If this is what you are arguing, it would take a lot to convince me of that position.

Here's the argument I think you're making:

Don't make AIs try to optimize stuff without bound. If you try to optimize any fixed objective function without bound, you will end up sacrificing all else that you hold dear.

I agree that optimizing without bound seems likely to kill you. If a safe alternative approach is possible, I don't know what it would be. My guess would be most alternative approaches are equivalent to an optimization problem.

Right, the second argument is the one that concerns me, since it should be possible to convince people to adjust their preferences in some way that will make them consistent.

My suggestion here was simply to adopt a hard limit to the utility function. So for example instead of valuing lifespan without limit, there would be some value such that the AI is indifferent to extending it even more. This kind of AI might take the lifespan deal up to a certain point, but it would not keep taking it permanently, and in this way it would avoid driving its probability of survival down to a limit of zero.
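A rough sketch of the difference a hard limit makes, under made-up numbers: each deal multiplies lifespan by 10 and multiplies the probability of survival by 0.9, and the cap is set at 1000 years. These figures and the function names are assumptions for illustration, not the ones from the original lifespan dilemma.

```python
def capped(years, cap=1000.0):
    # Hard-limited utility: indifferent to any lifespan beyond `cap`.
    return min(years, cap)


def unbounded(years):
    # Values every additional year without limit.
    return years


def deals_accepted(utility, years=100.0, max_rounds=50):
    """Accept repeated deals while each one raises expected utility.

    Returns (number of deals accepted, remaining survival probability).
    """
    survival = 1.0
    accepted = 0
    for _ in range(max_rounds):
        new_years, new_survival = years * 10, survival * 0.9
        # Stop as soon as the next deal fails to improve expected utility.
        if new_survival * utility(new_years) <= survival * utility(years):
            break
        years, survival = new_years, new_survival
        accepted += 1
    return accepted, survival


print("capped:", deals_accepted(capped))
print("unbounded:", deals_accepted(unbounded))
```

With these numbers the capped agent stops once the extra years are worth nothing to it, while the unbounded agent takes every deal offered, and only the `max_rounds` cutoff ends the loop.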

I think Eliezer does not like this idea because he claims to value life infinitely, assigning ever greater values to longer lifespans and an infinite value to an infinite lifespan. But he is wrong about his own values, because being a limited being he cannot actually care infinitely about anything, and this is why the lifespan dilemma bothers him. If he actually cared infinitely, as he claims, then he would not mind driving his probability of survival down to zero.

I am not saying (as he has elsewhere described this) that "the utility function is up for grabs." I am saying that if you understand yourself correctly, you will see that you do not yourself assign an infinite value to anything, so it would be a serious and possibly fatal mistake to make a machine that assigns an infinite value to something.

Yeah, I follow. I'll bring up another wrinkle (which you may already be familiar with): Suppose the objective you're maximizing never equals or exceeds 20. You can reach 19.994, 19.9999993, 19.9999999999999995, but never actually reach 20. Then even though your objective function is bounded, you will still try to optimize forever, and may resort to increasingly desperate measures to eke out another .000000000000000000000000001.
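To make the wrinkle concrete, here is a small sketch assuming a hypothetical objective 20 - 20/2^effort, which creeps toward 20 without ever reaching it. Exact fractions are used so that floating-point rounding does not hide the ever-shrinking gains.

```python
from fractions import Fraction

LIMIT = 20


def objective(effort):
    # Hypothetical bounded objective: strictly below LIMIT for every effort level.
    return LIMIT - Fraction(LIMIT, 2 ** effort)


def marginal_gain(effort):
    # What one more unit of effort buys; this shrinks but never hits zero.
    return objective(effort + 1) - objective(effort)


for effort in (1, 10, 100):
    still_worth_it = marginal_gain(effort) > 0
    print(effort, float(objective(effort)), still_worth_it)
```

However much effort has already been spent, the next unit still buys a strictly positive improvement, so a pure maximizer of this objective never has a reason to stop.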

Yes, this would happen if you take an unbounded function and simply map it to a bounded function without actually changing it. That is why I am suggesting admitting that you really don't have an infinite capacity for caring, and describing what you care about as though you did care infinitely is mistaken, whether you describe this with an unbounded or with a bounded function. This requires admitting that scope insensitivity, after a certain point, is not a bias, but just an objective fact that at a certain point you really don't care anymore.