A brief tutorial on preferences in AI

lukeprog

Preferences are important both for rationality and for Friendly AI, so preferences are a major topic of discussion on Less Wrong. We've discussed preferences in the context of economics and decision theory, but I think AI has a more robust set of tools for working with preferences than either economics or decision theory has, so I'd like to introduce Less Wrong to some of these tools. In particular, I think AI's toolset for working with preferences may help us think more clearly about CEV.

In AI, we can think of working with preferences in four steps:

Preference acquisition: In this step, we aim to extract preferences from a user. This can occur either by preference learning or by preference elicitation. Preference learning occurs when preferences are acquired from data about the user's past behavior or past preferences. Preference elicitation occurs as a result of an interactive process with the user, e.g. a question-answer process.
Preference modeling: Our next step is to mathematically express these acquired preferences as preferences between pairwise choices. The properties of a preference model are important. For example, is the relation transitive? (If the model tells us that choice c₁ is preferred to c₂, and c₂ is preferred to c₃, can we conclude that c₁ is preferred to c₃?) And is the relation complete? (Is any choice comparable to any other choice, or are there some incomparabilities?)
Preference representation: Assuming we want to capture and manipulate the user's preferences robustly, we'll next want to represent the preferences model in a preference representation language.
Preferences reasoning: Once a user's preferences are represented in a preference representation language, we can do cool things like preferences aggregation (involving the preferences of multiple agents) and preference revision (a user's new preferences being added to her old preferences). We can also perform the usual computations of decision theory, game theory, and more.
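The two properties named in the modeling step can be checked mechanically. Here is a minimal sketch, assuming a strict preference relation stored as a set of (better, worse) pairs; the function names and the three-choice example are made up for illustration.

```python
from itertools import product

def is_transitive(choices, prefers):
    """True if a > b and b > c always imply a > c."""
    return all(
        (a, c) in prefers
        for a, b, c in product(choices, repeat=3)
        if (a, b) in prefers and (b, c) in prefers
    )

def is_complete(choices, prefers):
    """True if every pair of distinct choices is ordered one way or the other."""
    return all(
        (a, b) in prefers or (b, a) in prefers
        for a, b in product(choices, repeat=2)
        if a != b
    )

choices = ["c1", "c2", "c3"]
# c1 > c2 and c2 > c3 are stated, but nothing is said about c1 versus c3.
chain = {("c1", "c2"), ("c2", "c3")}
# The same relation with the missing comparison filled in.
closed = chain | {("c1", "c3")}

print(is_transitive(choices, chain), is_complete(choices, chain))
print(is_transitive(choices, closed), is_complete(choices, closed))
```

The first relation fails both checks because c₁ and c₃ are left incomparable; adding the single pair (c₁, c₃) makes it both transitive and complete.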

Preference acquisition

Preference learning is typically an application of supervised machine learning (classification). Throw the algorithm at a database containing a user's preferences, and it will learn that user's preferences and make predictions about the preferences not listed in the database, including preferences about pairwise choices the user may never have faced before.
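As a rough illustration of the classification view, here is a minimal sketch that assumes each choice is described by numeric features and that the user's preferences can be approximated by a linear score. It uses a plain perceptron on feature differences, which is only one of many possible learners; the features (price, rating) and the data are invented.

```python
def train_pairwise_perceptron(pairs, n_features, epochs=50):
    """Learn weights w such that w . (a - b) > 0 whenever a was preferred to b."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            diff = [p - r for p, r in zip(preferred, rejected)]
            score = sum(wi * di for wi, di in zip(w, diff))
            if score <= 0:  # misranked or tied: nudge w toward the preferred item
                w = [wi + di for wi, di in zip(w, diff)]
    return w

def predicts_preference(w, a, b):
    """Predict whether a is preferred to b under the learned weights."""
    return sum(wi * (ai - bi) for wi, ai, bi in zip(w, a, b)) > 0

# Hypothetical database of past choices: (preferred item, rejected item),
# each item described as (price, rating).
past_choices = [
    ((1, 4), (3, 4)),  # cheaper item chosen at equal rating
    ((2, 5), (2, 3)),  # better-rated item chosen at equal price
    ((1, 3), (2, 3)),
    ((3, 5), (3, 2)),
]
w = train_pairwise_perceptron(past_choices, n_features=2)

# A pairwise choice the user never faced.
print(predicts_preference(w, (1, 5), (4, 2)))
```

The learned weights penalize price and reward rating, so the model generalizes to the unseen pair at the end.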

Preference elicitation involves asking a user a series of questions, and extracting their preferences from the answers they give. Chen & Pu (2004) survey some of the methods used for this.

In studying CEV, I am interested in methods built for learning a user's utility function from inconsistent behavior (because humans make inconsistent choices). Nielsen & Jensen (2004) provided two computationally tractable algorithms which handle the problem by interpreting inconsistent behavior as random deviations from an underlying "true" utility function. As far as I know, however, nobody in AI has tried to solve the problem with an algorithm informed by the latest data from neuroeconomics on how human choice is the product of at least three valuation systems, only one of which looks anything like an "underlying true utility function."
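To make the "random deviations from an underlying utility function" idea concrete, here is a minimal sketch of a standard logit (Bradley-Terry style) choice model fitted by gradient ascent. This is not Nielsen & Jensen's algorithm, just a simple stand-in for the same general idea; the items and choice counts are invented.

```python
import math

def fit_utilities(observations, items, steps=2000, lr=0.05):
    """Gradient ascent on the log-likelihood of a logit choice model.

    Assumes P(a chosen over b) = 1 / (1 + exp(-(u[a] - u[b]))).
    """
    u = {item: 0.0 for item in items}
    for _ in range(steps):
        grad = {item: 0.0 for item in items}
        for chosen, rejected in observations:
            p = 1.0 / (1.0 + math.exp(-(u[chosen] - u[rejected])))
            grad[chosen] += 1.0 - p
            grad[rejected] -= 1.0 - p
        for item in items:
            u[item] += lr * grad[item]
    return u

# Hypothetical inconsistent behavior: the user usually, but not always,
# picks the same item from each pair.
observations = (
    [("apple", "banana")] * 8 + [("banana", "apple")] * 2
    + [("banana", "cherry")] * 7 + [("cherry", "banana")] * 3
    + [("apple", "cherry")] * 9 + [("cherry", "apple")] * 1
)
u = fit_utilities(observations, ["apple", "banana", "cherry"])
ranking = sorted(u, key=u.get, reverse=True)
print(ranking)
```

The fitted utilities are only meaningful up to an additive constant, and the model treats every inconsistency as noise, which is exactly the assumption the neuroeconomics data calls into question.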

Preference Modeling

A model of a user's preferences describes one of three relations between any two choices ("objects"): a strict preference relation which says that one choice is preferred to another, an indifference relation, and an incomparability relation. Kaci (2011), chapter 2 provides a brief account of preference modeling.
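One common way to obtain these three relations is to start from a single "at least as good as" relation and derive the rest from it. The sketch below assumes that construction; the set-of-pairs encoding and the drink example are hypothetical.

```python
def relation(a, b, at_least_as_good):
    """Classify a pair, given a set of (x, y) pairs meaning 'x is at least as good as y'."""
    ab = (a, b) in at_least_as_good
    ba = (b, a) in at_least_as_good
    if ab and ba:
        return "indifferent"
    if ab:
        return "a strictly preferred"
    if ba:
        return "b strictly preferred"
    return "incomparable"

weak = {("tea", "coffee"), ("coffee", "tea"), ("tea", "water")}

print(relation("tea", "coffee", weak))    # both directions hold
print(relation("tea", "water", weak))     # only one direction holds
print(relation("coffee", "water", weak))  # neither direction holds
```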

Preference Representation

In decision theory, a preference relation is represented by a numerical function that associates a utility value with each choice. But this may not be the best representation. We face an exponential number of choices whose explicit enumeration and evaluation is time-consuming. Moreover, users can't compare all pairwise choices and evaluate how satisfactory each choice is.

Luckily, choices are often made on the basis of a set of attributes, e.g. cost, color, price, etc. You can use a preference representation language to represent partial descriptions of preferences and rank-order possible choices. The challenge of a preference representation language is that it should (1) cope with a user's preferences, (2) faithfully represent the user's preferences such that it rank-orders choices in a way similar to how the user would specify choices if they were able to provide preferences for every pairwise comparison, (3) cope with possibly inconsistent preferences, and (4) offer attractive complexity properties, i.e. the spatial cost of representing partial descriptions of preferences and the time cost of comparing pairwise choices or computing the best choices.

One popular method of preference representation is the graphical representation language of conditional preference networks, or "CP-nets." [Figure: example CP-net diagram, not reproduced here.]
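For readers who have not seen one, here is a minimal sketch of a two-variable CP-net as a plain data structure, together with the usual "forward sweep" for finding the most preferred outcome of an acyclic network. The dinner example (main course and wine) and the dictionary encoding are assumptions made for illustration, not a standard library format.

```python
# Each variable maps to (parents, table). The table maps a tuple of parent
# values to that variable's values, ordered from most to least preferred.
cp_net = {
    "main": ((), {(): ["fish", "meat"]}),
    "wine": (
        ("main",),
        {("fish",): ["white", "red"], ("meat",): ["red", "white"]},
    ),
}

def best_outcome(cp_net, order, evidence=None):
    """Forward sweep: visit variables parents-first and pick each one's top value.

    `order` must list every variable after its parents; `evidence` fixes
    some variables in advance.
    """
    outcome = dict(evidence or {})
    for var in order:
        if var in outcome:
            continue
        parents, table = cp_net[var]
        context = tuple(outcome[p] for p in parents)
        outcome[var] = table[context][0]
    return outcome

print(best_outcome(cp_net, ["main", "wine"]))
print(best_outcome(cp_net, ["main", "wine"], {"main": "meat"}))
```

The point of the representation is compactness: the user states only "fish over meat" and "which wine, given the main course," and the network still determines a best choice without the user ranking all four complete meals.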

Preferences Reasoning

There are a multitude of ways in which one might want to reason algorithmically about preferences. I point the reader to Part II of Kaci (2011) for a very incomplete overview.
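As one small example of such reasoning, here is a sketch of preference aggregation across several agents using the Borda count. Borda is only one of many aggregation rules and is chosen here for brevity; the rankings are invented.

```python
def borda(rankings):
    """Aggregate rankings (each listed best-first) into one collective ranking.

    A choice in position i of an n-item ranking earns n - 1 - i points;
    ties in total score are broken alphabetically.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, choice in enumerate(ranking):
            scores[choice] = scores.get(choice, 0) + (n - 1 - position)
    return sorted(scores, key=lambda c: (-scores[c], c))

# Three hypothetical agents ranking the same three choices.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["b", "c", "a"],
]
collective = borda(rankings)
print(collective)
```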

General Sources:

Domshlak et al. (2011). Preferences in AI: An Overview. Artificial Intelligence 175: 1037-1052.

Fürnkranz & Hüllermeier (2010). Preference Learning. Springer.

Kaci (2011). Working with Preferences: Less is More. Springer.

In studying CEV, I am interested in methods built for learning a user's utility function from inconsistent behavior (because humans make inconsistent choices).

I've always interpreted CEV as saying: simulate a smarter version of me, then ask it what its utility function is. I don't see why looking at people's behavior is either part of CEV as written, or a good idea. Am I missing something?

I'll give the same response to you and to Vladimir Nesov:

I don't know what is going to be required for CEV, so I'm hacking away at the edges. Hopefully, little bits of progress even on wrong paths will inform our intuitions about what the right paths are.

Well, if you aren't, I certainly am. (Of course, if you are, I undoubtedly am as well.)

So, OK, the AI simulates a smarter version of me... call it Dave2. There's three relevant possibilities:

Dave2 has a utility function U() that has some poorly-understood-but-important relationship to my volition and will fully articulate U() when asked. In this case, simulating asking Dave2 what its utility function is and attending to its simulated answer might be of value in figuring out the right thing for the AI to do next.
Dave2 has U() but won't fully articulate U() when asked. In this case, attending to Dave2's simulated answer might be less valuable in figuring out the right thing for the AI to do next than attending to Dave2's expressed preferences.
Dave2 lacks U(), either because it doesn't have a utility function at all, or because it turns out that the parameters the AI used to create Dave2 resulted in Dave2's utility function lacking said relationship to my volition, or for some other reason. In this case, it's not clear that any operation performed on Dave2 is of any value in figuring out the right thing for the AI to do next.

Said more simply: maybe I ask it what its utility function is, maybe I infer its utility function from its behavior, and maybe the whole idea is muddle-headed.

It seems to me you're confident that only the first of those is plausible... can you expand on your reasons for believing that, if you in fact do?

Also, what exactly does "smarter version" mean? Is there only one way to make a smarter version, or are there many possible smarter versions?

What if the smarter versions have different utility functions? Should the AI take some weighted average of their functions?

(nods) Yeah. I take it for granted that there are multiple ways to create the "smarter version" steven0461 was referring to, since the alternative seems implausibly neat, and that it's therefore (hypothetically) up to the AI to figure out how to create a Dave2 whose utterances have the desired value.

Of course, if we live in a convenient universe where there's only one possible "extrapolated Dave," or at least an obviously superior candidate (which of course opens a whole infinite regress problem: how do I build a system I trust to decide which of the many possible simulations of my improved self it should use in order to determine what I would want if I were better? And if I somehow can trust it to do that much in an ethical fashion, haven't I already solved the automated ethics problem? What work is left for CEV to do?)

In the less convenient worlds, the idea of averaging all the possible extrapolated mes into a weighted vector sum, along with all the possible extrapolated everyone elses, had not occurred to me, but is better than anything else I can think of.

In studying CEV, I am interested in methods built for learning a user's utility function from inconsistent behavior ... Nielsen & Jensen (2004) provided two computationally tractable algorithms which handle the problem by interpreting inconsistent behavior as random deviations from an underlying "true" utility function. As far as I know, however, nobody in AI has tried to solve the problem with an algorithm informed by the latest data from neuroeconomics on how human choice is the product of at least three valuation systems ...

If better understanding of neuroscience seems potentially maybe somewhat useful in the long run, this kind of thing is clearly of no use. The only way to win is to eventually form an accurate and technically precise understanding of human preference. Any progress that doesn't potentially contribute to that is irrelevant, even if it is much easier to make and appears to study the same thing, but with no hope of the necessary precision (e.g. experimental psychology).

Having an imperfect utility function to hand (which can be improved upon later) might be useful if circumstances force FAI to be launched early.

Having an imperfect utility function to hand (which can be improved upon later) might be useful if circumstances force FAI to be launched early.

No, it might not, since "imperfect" in this context essentially means "arbitrary". When you've got a tiny target to hit from several miles away, it's of no use to spend time practicing with a bow. And if the time comes when your life hangs upon successfully hitting the target, you don't say that it's useful that you have a master bowman at hand, you just die.

I was assuming that the imperfect utility functions would at least be accurate enough that they would assign preference to "not dying". So your life wouldn't depend on the choice of one of these utility functions vs. the other - it's just that under the imperfect system, the world would be slightly suckier in a possibly non-obvious way.

Of course, the imperfect function would have to be subjected to some tests to make sure "slightly suckier" doesn't equate to "extremely sucky" or "dead". Obviously we don't really know how to do that part yet.

To use the analogy, I'd expect that we might hit the target that way, just not the bullseye.

Inaccurate preference is a wish that you ask of a superintelligent genie (indifferent powerful outcome pump). The problem with wishes is that they get tested on all possible futures that the AI can implement, while you yourself rank their similarity to what you want only on the futures that you can (do) imagine. If there is but one highly implausible (to you) future that implements the wish a little bit better than others, that is the future that will happen, even if you would rank it as morally horrible. A wish that has too few points of contact with your preference has a lot of such futures within its rules.

That is the problem with the notion of similarity for AI wishes: it is brittle with respect to ability to pick out a single possible future that was unrepresentative in the way you ranked the similarity, and the criterion for which future actually gets selected doesn't care about what was representative to you.

I think you can assign a low preference ranking to "everything that I can't imagine". (Obviously that would limit the range of possible futures quite a bit though).

In general though, there are (among others) two risks in any value discovery project:

You don't get your results in time
You end up missing something that you value

Running multiple approaches in parallel would seem to mitigate both of those risks somewhat.

I agree that a neuroscience-based approach feels the least likely to miss out any values, since presumably everything you value is stored in your brain somehow. There are still possibilities for bugs in the extrapolation/aggregation stage though.

If there is but one highly implausible (to you) future that implements the wish a little bit better than others, that is the future that will happen, even if you would rank it as morally horrible. A wish that has too few points of contact with your preference has a lot of such futures within its rules.

X-Files, Je Souhaite:

Mulder wishes for peace on earth and she wipes out the entire population of Earth. Mulder then writes down his third wish to be very specific.

What happens when an entity doesn't have an intrinsic fixed preference about something, but rather it emerges as a consequence of other (possibly changing) underlying properties?

For example, if a naturalist were to consider a flock of a million starlings as a single entity, and try to map out the flock's preferences for moving in a particular direction, they might find a statistical correlation (such as "during winter evenings, the flock is 50% more likely to move South than North.").

I think there needs to be an additional step in "working with preferences": preference testing. I think, once you have a representation of preferences in a format from which you can make predictions of future behaviour, you need to explore the bounds of uncertainty, in the same way that statisticians do confirmatory factor analysis after having done exploratory factor analysis.

At some point you also need to establish a mapping between symbols and the domain (i.e. map/territory correspondence). Inasmuch as I have a utility function, it involves things like "people" which I don't have a mathematical definition of. As far as I know I'm only capable of forming preferences between maps, not territories.

Also, has "preference extrapolation" fallen off the list? I'm not sure how the ethics of this work out at all, but it feels like we need some way of reasoning about outcomes that no-one could have anticipated when writing the questionnaire.

Luke, I was under the impression that the writing on CEV you link to is dated. Is this so? And if so, where could I find material on more recent thinking on the subject?

