Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post presents some old work on combining different possible utility functions, which is worth revealing to the world.

I've written before about the problem of reaching an agreement between agents with different utility functions. The problem re-appears if you yourself are uncertain between two different moral theories.

For example, suppose you gave credence to average utilitarianism and credence to total utilitarianism. In an otherwise empty universe, you can create one person with utility, or a thousand with utility.

If we naively computed the expected utility of both actions, we would get for the first choice, and for the second. It therefore seems that total utilitarianism wins by default, even though it is very unlikely (for you).
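The naive calculation can be sketched as follows. The numbers here (99% and 1% credences, 100 and 99 utility per person) are hypothetical values picked for this example, not figures taken from the post, and the function names are illustrative only.

```python
# Hypothetical credences and options, chosen only to illustrate the effect.
credences = {"average": 0.99, "total": 0.01}

# Each option is (population size, utility per person).
options = {"one_person": (1, 100.0), "thousand_people": (1000, 99.0)}


def theory_value(theory, population, utility_each):
    """Value of an option under a single moral theory."""
    if theory == "average":
        return utility_each
    if theory == "total":
        return population * utility_each
    raise ValueError(f"unknown theory: {theory}")


def naive_expected_utility(option):
    """Credence-weighted sum of the raw, un-normalised theory values."""
    population, utility_each = options[option]
    return sum(credence * theory_value(theory, population, utility_each)
               for theory, credence in credences.items())


naive = {name: naive_expected_utility(name) for name in options}
best_option = max(naive, key=naive.get)
```

With these assumed numbers, the large population is chosen even though total utilitarianism only has 1% credence, because its raw values are a thousand times larger.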

But the situation can be worse. Suppose that there is a third option, which created ten thousand people with each utility. And you have credence on average utilitarianism, credence on total utilitarianism, and credence on exponential utilitarianism, where the average utility is multiplied by two to the power of the population. In this case the third option - and the incredibly unlikely exponential utilitarianism - win out massively.

Normalising utilities

To prevent the large-population-loving utilities from winning out by default, it's clear we need to normalise the utilities in some way before adding them together, similarly to how you normalise the utilities of opposing agents.

I'll distinguish two methods here: individual normalisations, and collective normalisations. For individual normalisations, if you have credences of $p_i$ for utilities $U_i$, then each $U_i$ is normalised into $\hat{U}_i$ using some procedure that is independent of $p_i$, $p_j$, and $U_j$ for $j \neq i$. Then the normalised utilities are added to give your total utility function of:

$$U = \sum_i p_i \hat{U}_i.$$

In collective normalisations, the normalisation of $U_i$ into $\hat{U}_i$ is allowed to depend upon the other utilities and the credences. All Pareto outcomes for the utilities $U_i$ are equivalent (modulo resolving ties) with maximising such a $U$.

The Nash Bargaining Equilibrium and the Kalai-Smorodinsky Bargaining Solution are both collective normalisations; the Mutual Worth Bargaining Solution is an individual normalisation iff the choice of the default point is individual (but doing that violates the spirit of what that method is supposed to achieve).

Note that there are no non-dictatorial Pareto normalisations, whether individual or collective, that are independent of irrelevant alternatives, or that are immune to lying.

Individual normalisations

Here I'll present the work that I did with Owen Cotton-Barratt, Toby Ord, and Will MacAskill, in order to try and come up with a principled way of doing individual normalisations. In a certain sense, this work failed: we didn't find any normalisations that were clearly superior in every way to others. But we did find a lot about the properties of the different normalisations; one interesting thing is that the dumbest normalisation - the zero-one, or min-max - has surprisingly good properties.

Let $O$ be the option set for the agent: the choices that it can make (in our full treatment, we considered a larger set $N \supseteq O$, the normalisation set, but this won't be needed here).

For the purpose of this post, $O$ will be equal to $\Pi$, the set of deterministic policies the agent can follow; this feels like a natural choice, as it's what the agent really has control over.

For any utility $U$ and policy $\pi \in \Pi$, there is the expected utility of $U$ conditional on the agent following policy $\pi$; this will be designated by $\mathbb{E}(U \mid \pi)$.

We may have a probability distribution $P$ over $\Pi$ (maybe defined by the complexity of the policy?). If we don't have such a distribution, and the set of deterministic policies is finite, then we can set $P$ to be the uniform distribution.

Then, given $P$, each $U$ becomes a real-valued random variable, taking value $\mathbb{E}(U \mid \pi)$ with probability $P(\pi)$. We'll normalise these utilities by normalising the properties of this random variable.

First of all, let's exclude any $U$ that are constant on all of $\Pi$; these utilities cannot be changed, in expectation, by the agent's policies, so should make no difference. Then each $U$, seen as a random variable, has the following properties:

  • Maximum: $\max(U) = \max_{\pi \in \Pi} \mathbb{E}(U \mid \pi)$.
  • Minimum: $\min(U) = \min_{\pi \in \Pi} \mathbb{E}(U \mid \pi)$.
  • Mean: $\mu(U) = \sum_{\pi \in \Pi} P(\pi)\, \mathbb{E}(U \mid \pi)$.
  • Variance: $\sigma^2(U) = \sum_{\pi \in \Pi} P(\pi) \left(\mathbb{E}(U \mid \pi) - \mu(U)\right)^2$.
  • Mean difference: $\mathrm{md}(U) = \sum_{\pi, \pi' \in \Pi} P(\pi) P(\pi') \left|\mathbb{E}(U \mid \pi) - \mathbb{E}(U \mid \pi')\right|$.

There are five natural normalisation methods that emerge from these properties. The first and most trivial is the min-max or zero-one normalisation: scale and translate $U$ so that $\max(U)$ takes the value $1$ and $\min(U)$ takes the value $0$ (note that the translation doesn't change the desired policy when summing utilities, so what is actually required is to scale $U$ so that $\max(U) - \min(U) = 1$).

The second normalisation, the mean-max, involves setting $\max(U) - \mu(U) = 1$; by symmetry, the min-mean normalisation involves setting $\mu(U) - \min(U) = 1$.

Finally, the last two normalisations involve setting either the variance, or the mean difference, to $1$.
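As a rough sketch, the five normalisations can be written out for a finite policy set. This assumes each method rescales a utility so that its chosen statistic equals 1, and it represents a utility as a list whose entry `i` is the expected utility of policy `i`. The names `statistic`, `normalise` and `combined_utility` are made up for this example.

```python
import math

METHODS = ("min-max", "mean-max", "min-mean", "variance", "mean-difference")


def statistic(values, probs, method):
    """Quantity that `method` sets to 1. values[i] is the expected utility
    of policy i and probs[i] is the probability of that policy."""
    mean = sum(p * v for p, v in zip(probs, values))
    if method == "min-max":
        return max(values) - min(values)
    if method == "mean-max":
        return max(values) - mean
    if method == "min-mean":
        return mean - min(values)
    if method == "variance":
        return sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    if method == "mean-difference":
        return sum(p * q * abs(v - w)
                   for p, v in zip(probs, values)
                   for q, w in zip(probs, values))
    raise ValueError(f"unknown method: {method}")


def normalise(values, probs, method):
    """Rescale a non-constant utility so its chosen statistic equals 1."""
    stat = statistic(values, probs, method)
    # Variance scales quadratically, so divide by its square root.
    factor = math.sqrt(stat) if method == "variance" else stat
    return [v / factor for v in values]


def combined_utility(utilities, credences, probs, method):
    """Credence-weighted sum of the normalised utilities, one value per policy."""
    normalised = [normalise(u, probs, method) for u in utilities]
    return [sum(c * u[k] for c, u in zip(credences, normalised))
            for k in range(len(probs))]
```

Translation is omitted because, as noted above, it does not change which policy maximises the sum. Constant utilities must be excluded before calling `normalise`, since every statistic would then be zero.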

Meaning of the normalisations

What do these normalisations mean? Well, min-max is a normalisation that cares about the difference between perfect utopia and perfect dystopia: between the best possible and the worst possible expected outcome. Conceptually, this seems problematic - it's not clear why the dystopia matters, which seems like something that opens the utility up to extortion - but, as we'll see, the min-max normalisation has the best formal properties.

The mean-max is the normalisation that most appeals to me; the mean is the expected value of a random policy, while the max is the expected outcome of the best policy. In a sense, that's the job of an agent with a single utility function: to move the outcome from random to best. Thus the max has a meaning that the min, for instance, lacks.

For this reason, I don't see the min-mean normalisation as being anything meaningful; it's the difference between complete disaster and a random policy.

I don't fully grasp the meaning of the variance normalisation; Owen Cotton-Barratt did the most work on it, and showed that, in a certain sense, it was resistant to lying/strategic distortion in certain circumstances, if a given utility didn't 'know' what the other utilities would be. But I didn't fully understand this point. So bear in mind that this normalisation has positive properties that aren't made clear in this post.

Finally, the mean difference normalisation controls the spread between the utilities of the different policies, in a linear way that may seem to be more natural than the variance.

Properties of the normalisation

So, which normalisation is best? Here's where we look at the properties of the normalisations (they will be summarised in a table at the end). As we've seen, independence of irrelevant alternatives always fails, and there can always be an incentive for a utility to "lie" (as in, there are , , , , and , such that would have a higher expected utility under the final if it was replaced with ).

What other properties do all the normalisations share? Well, since they normalise independently, is continuous in . And because the minimum, maximum, variance, etc... are continuous in and in , then is also continuous in that information.

In contrast, the best policy of is not typically continuous in the data. Imagine that there are two utilities and two policies: and . Then for , is the optimal policy (for all the above normalisations for uniform ), while for , is optimal.
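The jump in the best policy can be seen in a toy case. The two utilities below are hypothetical (they are not the values from the example above), and only the min-max normalisation is used, to keep the sketch short.

```python
def best_policy(credence_u1):
    """Index of the policy maximising the credence-weighted sum of two
    min-max normalised utilities over two policies."""
    u1 = [1.0, 0.0]  # hypothetical: U1 prefers policy 0
    u2 = [0.0, 1.0]  # hypothetical: U2 prefers policy 1
    totals = []
    for k in range(2):
        n1 = u1[k] / (max(u1) - min(u1))
        n2 = u2[k] / (max(u2) - min(u2))
        totals.append(credence_u1 * n1 + (1 - credence_u1) * n2)
    return max(range(2), key=totals.__getitem__)


just_above = best_policy(0.51)  # policy preferred by U1
just_below = best_policy(0.49)  # policy preferred by U2
```

The combined utility changes smoothly as the credence moves from 0.51 to 0.49, but the chosen policy flips from one to the other.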

Ok, that's enough of properties that all methods share; what about ones they don't?

First of all, we can look at the negation symmetry between $U$ and $-U$. Min-max, variance, and mean difference all have the same normalisation for $U$ and $-U$; mean-max and min-mean do not, since the mean can be closer to the min than the max (or vice versa).

Then we can consider what happens when some policies $\pi$ and $\pi'$ are clones of each other: imagine that for all utilities $U$, $\mathbb{E}(U \mid \pi) = \mathbb{E}(U \mid \pi')$. Then what happens if we remove the redundant $\pi'$ and normalise on $\Pi \setminus \{\pi'\}$, the policy set without $\pi'$? Well, it's clear that the maximum or minimum value of $U$ cannot change (since if $\pi'$ was a maximum/minimum, then so is $\pi$, which remains), so the min-max normalisation is unaffected.

All the other normalisations change, though. This can be seen in the example , , with , , , , and ; in terms of sets of expected utilities in terms of policies, has while has . Then for uniform , all other normalisation methods change if we remove which is identical to for both utilities.

Thus all the other normalisations change when we add (or remove) clones of existing policies.
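Both the negation and the clone checks can be reproduced numerically. The sketch below assumes a uniform distribution over policies and uses hypothetical expected utilities; `scales` and `changed` are names invented for this example.

```python
import statistics


def scales(values):
    """The five statistics of one utility under a uniform distribution,
    where values[i] is the expected utility of policy i."""
    n = len(values)
    mean = statistics.fmean(values)
    return {
        "min-max": max(values) - min(values),
        "mean-max": max(values) - mean,
        "min-mean": mean - min(values),
        "variance": statistics.pvariance(values),
        "mean-difference": sum(abs(a - b) for a in values for b in values) / n ** 2,
    }


def changed(a, b, tol=1e-12):
    """Names of the statistics that differ between two utilities."""
    sa, sb = scales(a), scales(b)
    return {name for name in sa if abs(sa[name] - sb[name]) > tol}


# Negation: compare a hypothetical utility with its negative.
u = [0.0, 1.0, 3.0]
negation_sensitive = changed(u, [-v for v in u])

# Clones: the last two policies are identical, then one is removed.
clone_sensitive = changed([0.0, 1.0, 1.0], [0.0, 1.0])
```

For these values, `negation_sensitive` contains only mean-max and min-mean, while `clone_sensitive` contains every method except min-max.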

Finally, we can consider what happens if we are in one of several worlds, and the policies/utilities are identical in some of these worlds. This should be treated the same as if those identical worlds were all one.

So, imagine that we are in one of three worlds: , , and , with probabilities , , and , respectively. Before taking any actions, the agent will discover which world it is in. Thus, if is the set of policies in , the complete set of policies is .

The worlds and are, however, indistinguishable for all utilities in . Thus we can identify , with for all . Then a normalisation method combines indistinguishable choices if the normalisation is the same in world and . Then:

  • Min-max, mean-max, and min-mean combine indistinguishable choices. Variance and mean difference normalisations do not.

Proof (sketch): Let be the random variable that is on under the assumption that is the true underlying world. Then on , behaves like the random variable . (this means that has probability of being , not that it adds random variables together). Mean, max, and min all have the property that ; variance and mean difference, on the other hand, do not.
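A small numerical check of this claim, under assumptions that belong to this sketch rather than to the post: a policy on the product set picks one choice per world, its expected utility is the probability-weighted sum over worlds, and the distribution over joint policies is uniform. The world probabilities and utilities are hypothetical.

```python
from itertools import product
import statistics


def policy_values(world_probs, world_utils):
    """Expected utility of every joint policy (one choice per world)."""
    return [sum(p * u for p, u in zip(world_probs, choice))
            for choice in product(*world_utils)]


def summary(values):
    """Statistics of a utility under a uniform distribution over policies."""
    n = len(values)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),
        "mean-difference": sum(abs(a - b) for a in values for b in values) / n ** 2,
    }


# Three worlds, the first two identical, versus the same setup with the
# two identical worlds merged into one world of probability 0.5.
split = summary(policy_values([0.25, 0.25, 0.5],
                              [[0.0, 1.0], [0.0, 1.0], [0.0, 2.0]]))
merged = summary(policy_values([0.5, 0.5],
                               [[0.0, 1.0], [0.0, 2.0]]))
```

In this example the min, max and mean agree between `split` and `merged`, so min-max, mean-max and min-mean give the same scaling, while the variance and the mean difference do not agree.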

Summary of properties

In my view, it is a big worry that the variance and mean difference normalisations fail to combine indistinguishable choices. World and could be strictly identical, except for some irrelevant information that all utility functions agree is irrelevant. We have to worry about whether the light from a distant star is slightly redder or slightly bluer than expected; what colour ink was used in a proposal; the height of the next animal we see, and so on.

This means that we cannot divide the universe into relevant and irrelevant variables, and focus solely on the first.

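In table form, the various properties are:

| Property | Min-max | Mean-max | Min-mean | Variance | Mean difference |
|---|---|---|---|---|---|
| Independence of irrelevant alternatives | no | no | no | no | no |
| Immune to lying | no | no | no | no | no |
| Continuity of the normalised utility | yes | yes | yes | yes | yes |
| Negation symmetry | yes | no | no | yes | yes |
| Unaffected by clones | yes | no | no | no | no |
| Combines indistinguishable choices | yes | yes | yes | no | no |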

As can be seen, the min-max method, simplistic though it is, has all the possible nice properties.

Comments

Another plausible normalization (that seems more likely to yield sane behavior in practice) is to set the value of $1 to be the same for every theory. This has its own problems, but it seems much better than min-max to me, since it avoids having our behavior be determined by extremely counterfactual counterfactuals. What do you think is the strongest argument for min-max over constant values for $1?

I think the best argument against constant-value-of-$1 is that it has its own risk of giving pathological answers for theories that really don't care about dollars. You'd want to ameliorate that by using a very broad basket of resources, e.g. a small sliver of everything you own. Giving high weight to a theory which has "nothing to gain" doesn't seem as scary though, since hopefully it's not going to ask for any of your resources. (Unlike min-max's "foible" of giving overwhelming weight to theories that happen to not care about anything bad happening...)

It's easier for me to see how we could argue that (max-actual) is better than "constant $1."

(ETA: these two proposals both make the same sign error, I was acting as if making a theory better off reduces its weight, but obviously it increases its weight.)

Another option is to punt very explicitly on the aggregation: allow partial negotiation today (say, giving each theory 50% of the resources), but delegate the bulk of the negotiation between value systems to the future. Basically doing the same thing we'd do if we actually had an organization consisting of 5 people with different values.

In general, it seems like we should be trying to preserve the option value of doing aggregation in the future using a better understanding of how to aggregate. So we should be evaluating our theories by how well they work in the interim rather than e.g. aesthetic considerations.

What do you think is the strongest argument for min-max over constant values for $1?

The constant $1 is the marginal current utility of the function, which is a reflection of its local properties only (very close utilities can have very different weightings), while min-max looks at its global properties.

The min-max is in expected utility given a policy, not in maximal utility that could happen, so it's a bit less stupid than it would be in the second case.


1. In general there are diminishing returns to dollars, so global properties constrain local properties. (This is very true if you can gamble)

2. Your actual decisions mostly concern local changes, so it seems like a not-crazy thing to base your policy on.

That said, this proposal suffers from me making the same sign error as the (max-actual) proposal. Consider a theory with log utility in the number of dollars spent on it. As you spend less on it, its utility per dollar goes up and the weight goes down, so you further decrease the number of dollars, in the limit it has 0 dollars and infinite utility per dollar.

(It still seems like a sane approach for value learning, but not for moral uncertainty.)

It seems worth mentioning that anything which involves enumerating over the space of possible actions, or policies, is often not tractable in practice (or will be exploitable by adversarial enumeration).

So another desideratum may be "it's easy to implement using sampling". On this, normalizing by some sort of variance is probably best.
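A minimal sketch of what a sampling-based version might look like, assuming policies are binary action sequences and that a utility can be evaluated on any sampled policy. The policy representation, the `count_ones` utility and the sample size are all hypothetical choices made for this example.

```python
import random
import statistics


def sample_policy(rng, horizon):
    """Draw a uniformly random binary action sequence (a stand-in policy)."""
    return tuple(rng.randrange(2) for _ in range(horizon))


def estimated_scale(utility, rng, horizon, n_samples=2000):
    """Estimate the standard deviation of a utility over random policies,
    without enumerating all 2**horizon of them."""
    samples = [utility(sample_policy(rng, horizon)) for _ in range(n_samples)]
    return statistics.stdev(samples)


def count_ones(policy):
    """Hypothetical utility: the number of times action 1 is taken."""
    return float(sum(policy))


# A fixed seed keeps the estimate reproducible.
scale = estimated_scale(count_ones, random.Random(0), horizon=40)
```

Dividing a utility by `scale` approximates the variance normalisation. The max and min, by contrast, cannot be estimated reliably from a modest sample, which is the tractability concern raised above.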

It seems to me like max-actual would be better than max-min if it could be made to work.

That is, find a distribution over policies + weighting of utility functions such that (a) the distribution is optimal according to the weighting, (b) each utility function is weighted so that the difference between their preferred policy and the actual policy is 1. I think this exists by a simple fixed point argument. I'm not sure if it's unique.

Short of that, if using mean or variance, it seems much better to use the probability distribution "Pick the preferred policy of a random theory" rather than picking a uniformly random policy.

It seems to me like max-actual would be better than max-min if it could be made to work.

That's pretty much the "Mutual worth bargaining solution"

That is, find a distribution over policies + weighting of utility functions such that (a) the distribution is optimal according to the weighting, (b) each utility function is weighted so that the difference between their preferred policy and the actual policy is 1. I think this exists by a simple fixed point argument. I'm not sure if it's unique.

I don't understand (a), but (b) has problems when there are policies that are actually ideal for all/most utilities - you don't want to rule out generally optimal policies if they exist.

That's pretty much the "Mutual worth bargaining solution"

I don't see how it can be the same as the mutual worth bargaining solution. That bargaining solution assumes we were given a default solution, and this proposal doesn't (but see above, this solution doesn't make sense).

I misunderstood your proposal.

ETA: this is based on a sign error, as was the original intuition. Everywhere below I wrote as if getting a higher utility causes your weight to decrease, but it actually causes your weight to increase. So you could do this with (actual-min), or (actual-default) as in Nash, but that's not as appealing.

Existence proof (not totally sure it's right):

  • Given a policy distribution pi, say a new policy is "admissible" if it optimizes the weights 1/(max utility - realized utility under pi).
  • That map is Kakutani, so there is some policy which is in its own admissible set, as desired.

Proof that it's unique:

  • Consider two weights w, w', produced by this procedure, with corresponding profiles of utilities u, u'.
  • We know that every term of (u-u')(w-w') is non-positive, since w decreases whenever u increases.
  • But we can expand the sum as the sum of uw + u'w' - uw' - u'w <= 0
  • Since u and u' were the utilities of the maximizing profiles, we have uw >= u'w, and u'w' >= uw'.
  • Thus the sum of (u-u')(w-w') = 0, so every term is 0, so we have u=u' and w=w' (if one pair is equal the other must be as well, by construction).

That policy, if it exists, need not be Pareto.

The policy was constructed as optimizing a weighted sum of utilities, so it's Pareto efficient, but the uniqueness argument and intuition for reasonableness was based on a sign error.