# 18

Personal Blog

Utility functions are only defined up to an additive constant and a positive multiplier. For example, if we have a simple universe with only 3 possible states (X, Y, and Z), a utility function u such that u(X)=0, u(Y)=1, and u(Z)=3, and another utility function w such that w(X)=-1, w(Y)=1, and w(Z)=5, then u and w are identical as utility functions (they rank every state and every lottery over states the same way), since w=2u-1.
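To make the equivalence concrete, here is a minimal Python sketch. The helper names (`expected_utility`, `prefers`) and the two sample lotteries are illustrative assumptions of mine, not anything from a library. It checks that w=2u-1 on every state and that both functions make the same choice between a sure Y and a 50/50 gamble on X and Z.

```python
from fractions import Fraction

# The two utility functions from the text.
u = {"X": 0, "Y": 1, "Z": 3}
w = {"X": -1, "Y": 1, "Z": 5}


def expected_utility(utility, lottery):
    """Expected utility of a lottery given as {state: probability}."""
    return sum(p * utility[s] for s, p in lottery.items())


def prefers(utility, lottery_a, lottery_b):
    """True if an agent with this utility strictly prefers lottery_a."""
    return expected_utility(utility, lottery_a) > expected_utility(utility, lottery_b)


# w is a positive affine transformation of u: w = 2u - 1.
is_affine_copy = all(w[s] == 2 * u[s] - 1 for s in u)

# Illustrative lotteries (hypothetical, chosen only for this sketch).
sure_y = {"Y": Fraction(1)}
gamble = {"X": Fraction(1, 2), "Z": Fraction(1, 2)}

# Both functions must agree on which lottery is better.
same_choice = prefers(u, gamble, sure_y) == prefers(w, gamble, sure_y)

print(is_affine_copy, same_choice)
```

Because expectation is linear, the expected utility of any lottery under w is just 2 times its expected utility under u, minus 1, so no comparison between lotteries can ever come out differently.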

Preference utilitarianism suggests maximizing the sum of everyone's utility function. But the fact that utility functions are invariant under multiplication by positive scalars makes this operation poorly defined. For example, suppose your utility function is u (as defined above), and the only other morally relevant agent has a utility function v such that v(X)=0, v(Y)=2000, and v(Z)=1000. He argues that according to utilitarianism, Y is the best state of the universe, since if you add your utility functions together, you get (u+v)(X)=0, (u+v)(Y)=2001, and (u+v)(Z)=1003. You complain that he cheated by multiplying his utility function by a large number, and that if you treat v as v(X)=0, v(Y)=2, and v(Z)=1, then (u+v)(X)=0, (u+v)(Y)=3, and (u+v)(Z)=4, so Z is the best state of the universe according to utilitarianism. There is no objective way to resolve this dispute, but anyone who wants to build a preference utilitarianism machine has to find a way to resolve such disputes that gives reasonable results.
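The dispute can be reproduced in a few lines of Python. This is only a sketch; `best_state` and `rescale` are hypothetical helper names I made up for the illustration. It sums u with v at both scales and reports which state wins.

```python
from fractions import Fraction

u = {"X": 0, "Y": 1, "Z": 3}
v = {"X": 0, "Y": 2000, "Z": 1000}


def rescale(utility, factor):
    """Multiply a utility function by a positive scalar (same preferences)."""
    return {s: factor * x for s, x in utility.items()}


def best_state(*utilities):
    """Return (argmax state, summed utilities) for an unweighted sum."""
    states = utilities[0].keys()
    total = {s: sum(f[s] for f in utilities) for s in states}
    return max(total, key=total.get), total


# His version: v as stated, with values in the thousands.
best_big, total_big = best_state(u, v)

# Your version: the same preferences, with v divided by 1000.
best_small, total_small = best_state(u, rescale(v, Fraction(1, 1000)))

print(best_big, total_big)
print(best_small, total_small)
```

The two calls describe exactly the same pair of agents, yet the first picks Y and the second picks Z. Nothing in the utility functions themselves says which scale is the right one.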

I'm pretty sure that the idea of the previous two paragraphs has been talked about before, but I can't find where. [Edit: here and here]

Anyway, one might argue that if you are not a preference utilitarian, and not planning to build a friendly AI, you have little reason to care about this problem. If you just want to maximize your personal utility function, surely you don't need a solution to that problem, right?

Wrong, unless you know exactly what your preferences are, which humans don't. If you're unsure whether u or v (as described above) describes your true preferences, and you assign a 50% probability to each, then you face the same problem that preference utilitarianism did in the previous example.

Humans are a lot better at getting ordinal utilities straight than they are at figuring out cardinal utilities, but even assuming that you know the order of your preferences, the problems remain. Let's say that, in another 3-state world (with states A, B, and C), you know you prefer B over A, and C over B, but you are uncertain between the possibilities that you prefer C over A by twice the margin that you prefer B over A, and that you prefer C over A by 10 times the margin that you prefer B over A. You assign a 50% probability to each. Now suppose you face a choice between B and a lottery that has a 20% chance of giving you C and an 80% chance of giving you A.

If you define the utility of A as 0 utils and the utility of B as 1 util, then the utility values (in utils) are u1(A)=0, u1(B)=1, u1(C)=2, u2(A)=0, u2(B)=1, u2(C)=10, so the expected utility of choosing B is 1 util, and the expected utility of the lottery is .5*(.2*2 + .8*0) + .5*(.2*10 + .8*0) = 1.2 utils, so the lottery is better.

But if you instead define the utility of A as 0 utils and the utility of C as 1 util, then u1(A)=0, u1(B)=.5, u1(C)=1, u2(A)=0, u2(B)=.1, and u2(C)=1, so the expected utility of B is .5*.5 + .5*.1 = .3 utils, and the expected utility of the lottery is .2*1 + .8*0 = .2 utils, so B is better. The result changes depending on how we define a util, even though we are modeling the same knowledge over preferences in each situation.
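The two calculations above can be checked with a short Python sketch. The names `unit_b`, `unit_c`, and `mixed_expected_utility` are my own labels for this illustration; exact fractions are used so that no rounding gets in the way.

```python
from fractions import Fraction as F

# 50% credence in each hypothesis about your true preferences.
P_HYPOTHESIS = [F(1, 2), F(1, 2)]

# The two options on offer.
sure_b = {"B": F(1)}
lottery = {"C": F(1, 5), "A": F(4, 5)}

# Convention 1: A = 0 utils, B = 1 util.
unit_b = [{"A": 0, "B": 1, "C": 2}, {"A": 0, "B": 1, "C": 10}]

# Convention 2: A = 0 utils, C = 1 util (same preferences, rescaled).
unit_c = [{"A": 0, "B": F(1, 2), "C": 1}, {"A": 0, "B": F(1, 10), "C": 1}]


def mixed_expected_utility(hypotheses, option):
    """Average the option's expected utility over the two hypotheses."""
    return sum(
        p_h * sum(p * util[state] for state, p in option.items())
        for p_h, util in zip(P_HYPOTHESIS, hypotheses)
    )


for label, hyps in [("B = 1 util", unit_b), ("C = 1 util", unit_c)]:
    eu_b = mixed_expected_utility(hyps, sure_b)
    eu_lottery = mixed_expected_utility(hyps, lottery)
    winner = "lottery" if eu_lottery > eu_b else "B"
    print(label, float(eu_b), float(eu_lottery), winner)
```

Under the first convention the lottery wins (1.2 against 1), and under the second B wins (.3 against .2), even though each pair of dictionaries encodes the same two candidate preference orderings.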

Any agent with moral uncertainty, such as a value loading agent, needs to know how to add utility functions; this is not just a problem for utilitarians. I do not have a satisfactory solution to this, although I have come up with 2 attempts.

My first idea was to normalize the standard deviation of each utility function to 1. For example, in the XYZ world, after normalizing u and v so that their values have (population) standard deviation 1, with each state weighted equally, we get (approximately) u(X)=0, u(Y)=.802, u(Z)=2.405, v(X)=0, v(Y)=2.449, v(Z)=1.225, so (u+v)(X)=0, (u+v)(Y)=3.251, and (u+v)(Z)=3.630. Z is thus declared the best option overall. However, if there are infinitely many possible states, then this normalization cannot be done unless we have some sort of a priori probability distribution over the possible states. Even more frightening is the fact that it does not respect independence of irrelevant alternatives. Let's suppose that we find out that X is impossible. Good; no one wanted it anyway, so this shouldn't change anything, right? But if you exclude X, renormalize over the two remaining states, and set Y as the 0 value for each utility function, then we get u(Y)=0, u(Z)=2, v(Y)=0, v(Z)=-2, (u+v)(Y)=0, (u+v)(Z)=0. The aggregator went from preferring Z to being indifferent between Y and Z, even though all we did was exclude an option that everyone already agreed we should avoid.
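Here is a rough Python sketch of that normalization and of the failure of independence of irrelevant alternatives. It assumes the population standard deviation with every state weighted equally, which is what reproduces the numbers above; `normalize` and `aggregate` are hypothetical helper names.

```python
from statistics import pstdev

u = {"X": 0, "Y": 1, "Z": 3}
v = {"X": 0, "Y": 2000, "Z": 1000}


def normalize(utility, zero_state):
    """Scale to population standard deviation 1, then set zero_state to 0."""
    sd = pstdev(list(utility.values()))
    return {s: (x - utility[zero_state]) / sd for s, x in utility.items()}


def aggregate(utilities, zero_state):
    """Sum the normalized utility functions state by state."""
    normalized = [normalize(f, zero_state) for f in utilities]
    return {s: sum(f[s] for f in normalized) for s in utilities[0]}


def drop(utility, state):
    """Remove a state that turned out to be impossible."""
    return {s: x for s, x in utility.items() if s != state}


# All three states available, X pinned to 0.
total_all = aggregate([u, v], zero_state="X")

# X excluded, Y pinned to 0, standard deviations recomputed.
total_no_x = aggregate([drop(u, "X"), drop(v, "X")], zero_state="Y")

print({s: round(x, 3) for s, x in total_all.items()})
print({s: round(x, 3) for s, x in total_no_x.items()})
```

With X present, Z beats Y by about 0.379. With X removed, the two come out tied, because dropping X changes each function's standard deviation by a different factor.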

Then it occurred to me that we have much more knowledge about the relative values of options that we are already quite familiar with, so it seems reasonable to assume that most of our moral uncertainty is about the value of options that we are not so familiar with. For example, in the ABC world, if you make decisions involving A and B all the time, but C is an unfamiliar option that you have not thought much about, it might be tempting to accept the first calculation, which gave B a value of 1 util and the lottery a value of 1.2 utils. This seems like a promising heuristic, but is difficult to formalize, and does not completely solve the problem. For instance, if both B and C are unfamiliar, then this heuristic does not have any advice to give.

# Comments

> I'm pretty sure that the idea of the previous two paragraphs has been talked about before, but I can't find where.

On LessWrong: VNM expected utility theory: uses, abuses, and interpretation (shameless self-citation ;)

On Wikipedia: Limitations of the VNM utility theorem

Thanks!

Suppose I am a VNM-rational friendly AI trying to optimize how the world behaves. Then I have some utility function, which is a function on the set of ways the world can be. I do not see why this utility function necessarily has to be some sum-like operation over the utility functions of the agents I care about (so I do not see why this is necessarily a problem). Even if I am implementing some version of the CEV, my preferences might be more complicated than that. For example, I might have preferences about the utility functions the agents I care about use. This isn't captured by preference utilitarianism as you're using the term, but it isn't ruled out by the VNM theorem at all.

In fact, I am pretty sure that I, as a human being, already have such a preference: I want people to want to be smarter.

True, we wouldn't want a friendly AI to optimize the world with respect to a utility function taken from a "sum" of utility functions over whatever agents happen to exist. But we do want it to optimize the world with respect to a "sum" of the CEV of each person that currently exists (actually, there is some debate over what the reference class should be, but most people do not demand that it only pay attention to their own CEV), and a CEV is structurally the same as a utility function.

Why? (Edit:) First of all, I don't think that's how the CEV is intended to be used. As I understand it, the CEV is something we extrapolate from all of humanity, and it is a single utility function rather than a utility function for each person. Second, in this context I don't see why the CEV has to be a "sum" over any particular function, whatever you want to call it ("welfare"?), over agents. For example, maybe I really value fairness and don't want one agent's preferences to be satisfied too much more than the others; this would be one way to guard against utility monsters.

Another reason taking "sums" is problematic is that it can be gamed by duplicating an agent's preferences in other agents, e.g. for humans by raising a large number of children.

Whatever the CEV of each person that currently exists is, maybe I have more complicated preferences than maximizing just their "sum."

Actually, your personal preferences are your CEV, not some function that also takes into account other people's CEVs. That's what a CEV is. The point of having a friendly AI aggregate different people's individual preferences together is so that everyone will be able to cooperate on making it, instead of some people having an incentive to interfere (and also because we tend to think about friendly AI in far mode, which has strong fairness norms).

You could suggest that everyone's CEV should be aggregated in some non-additive way, but this risks situations where the aggregation makes a choice that everyone whose preferences got aggregated disagrees with. A weighted sum is the only way to aggregate utility functions that consistently avoids this. I've sketched out a proof of this, but I'm getting tired, so I'll write it up tomorrow.

> taking "sums" can be gamed by duplicating a person's preferences in other people, e.g. by raising a large number of children.

Weighted sums are fine, so you can just make the duplicates count less. In fact, as I pointed out in the post, there's no such thing as an unweighted sum.

Edit: Apparently the contents of your comment changed drastically as I was drafting this response. But it looks like this still mostly makes sense as a response.


> Actually, your personal preferences are your CEV, not some function that also takes into account other people's CEVs.

I don't think this is how Eliezer is using the term. From the wiki:

> In developing friendly AI, one acting for our best interests, we would have to take care that it would have implemented, from the beginning, a coherent extrapolated volition of humankind. In calculating CEV, an AI would predict what an idealized version of us would want, "if we knew more, thought faster, were more the people we wished we were, had grown up farther together". It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge.

So this is 1) a single utility function, not a utility function for each human, and 2) being an aggregate of everything humanity wants, it naturally includes information about what each human wants.

> this risks situations where the aggregation makes a choice that everyone whose preferences got aggregated disagrees with. A weighted sum is the only way to aggregate utility functions that consistently avoids this. I've sketched out a proof of this, but I'm getting tired, so I'll write it up tomorrow.

I would be very interested to see this proof! In particular, I want to know what assumptions you're making. As I mentioned way up in the parent comment, I don't see how a weighted sum captures a friendly AI that has preferences about the utility functions that humans use.

> So this is 1) a single utility function, not a utility function for each human, and 2) being an aggregate of everything humanity wants, it naturally includes information about what each human wants.

Okay, but it still aggregates a utility function-like thing for each human. I don't care what you call it.

> I want to know what assumptions you're making.

For the case of aggregating two people's preferences, only that 1) both people and the aggregation are VNM utility agents, 2) whenever both people prefer A to B, the aggregation prefers A to B, and 3) the previous assumption is non-vacuous. Given those, the aggregation must maximize a weighted sum of their utility functions. For the many-person case, I was using analogous assumptions, but I think there might be a flaw in my induction, so I'll get back to you when I have a proof that actually works.

> I don't see how a weighted sum captures a friendly AI that has preferences about the utility functions that humans use.

We currently have preferences about the utility functions that future humans use. So any linear aggregation of our current utility functions will also have preferences about the utility functions that future humans use.

> I'm pretty sure that the idea of the previous two paragraphs has been talked about before, but I can't find where.

It's pretty commonly discussed in the philosophical literature on utilitarianism.

I recommend the book "Fair Division and Collective Welfare" by H. J. Moulin; it discusses some of these problems and several related ones.

That looks like it only discusses interpersonal utility comparisons. I don't see anything about intrapersonal utility comparison in the book description.

They just do interpersonal comparisons; lots of their ideas generalize to intrapersonal comparisons though.

> Let's suppose that we find out that X is impossible. Good; no one wanted it anyway, so this shouldn't change anything, right? But if you exclude X and set Y as the 0 value for each utility function, then we get u(Y)=0, u(Z)=2, v(Y)=0, v(Z)=-2, (u+v)(Y)=0, (u+v)(Z)=0. The relative values of Y and Z in our preference aggregator changed even though all we did was exclude an option that everyone already agreed we should avoid.

this isn't good for my insomnia