Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Intertheoretic utility comparison

3rd Jul 2018


Another plausible normalization (that seems more likely to yield sane behavior in practice) is to set the value of $1 to be the same for every theory. This has its own problems, but it seems much better than min-max to me, since it avoids having our behavior be determined by extremely counterfactual counterfactuals. What do you think is the strongest argument for min-max over constant values for $1?

I think the best argument against constant-value-of-$1 is that it has its own risk of giving pathological answers for theories that really don't care about dollars. You'd want to ameliorate that by using a very broad basket of resources, e.g. a small sliver of everything you own. Giving high weight to a theory which has "nothing to gain" doesn't seem as scary though, since hopefully it's not going to ask for any of your resources. (Unlike min-max's "foible" of giving overwhelming weight to theories that happen to not care about anything bad happening...)

It's easier for me to see how we could argue that (max-actual) is better than "constant $1."

(ETA: these two proposals both make the same sign error, I was acting as if making a theory better off reduces its weight, but obviously it increases its weight.)

Another option is to punt very explicitly on the aggregation: allow partial negotiation today (say, giving each theory 50% of the resources), but delegate the bulk of the negotiation between value systems to the future. Basically doing the same thing we'd do if we actually had an organization consisting of 5 people with different values.

In general, it seems like we should be trying to preserve the option value of doing aggregation in the future using a better understanding of how to aggregate. So we should be evaluating our theories by how well they work in the interim rather than by, e.g., aesthetic considerations.

> What do you think is the strongest argument for min-max over constant values for $1?

The constant $1 is the marginal current utility of the function, which is a reflection of its local properties only (very close utilities can have very different weightings), while min-max looks at its global properties.

The min-max is in expected utility given a policy, not in maximal utility that could happen, so it's a bit less stupid than it would be in the second case.

Well:

1. In general there are diminishing returns to dollars, so global properties constrain local properties. (This is very true if you can gamble)

2. Your actual decisions mostly concern local changes, so it seems like a not-crazy thing to base your policy on.

That said, this proposal suffers from me making the same sign error as the (max-actual) proposal. Consider a theory with log utility in the number of dollars spent on it. As you spend less on it, its utility per dollar goes up and the weight goes down, so you further decrease the number of dollars. In the limit it has 0 dollars and infinite utility per dollar.

(It still seems like a sane approach for value learning, but not for moral uncertainty.)

It seems worth mentioning that anything which involves enumerating over the space of possible actions, or policies, is often not tractable in practice (or will be exploitable by adversarial enumeration).

So another desideratum may be "it's easy to implement using sampling". On this, normalizing by some sort of variance is probably best.

It seems to me like max-actual would be better than max-min if it could be made to work.

That is, find a distribution over policies + weighting of utility functions such that (a) the distribution is optimal according to the weighting, (b) each utility function is weighted so that the difference between their preferred policy and the actual policy is 1. I think this exists by a simple fixed point argument. I'm not sure if it's unique.

Short of that, if using mean or variance, it seems much better to use the probability distribution "Pick the preferred policy of a random theory" rather than picking a uniformly random policy.

> It seems to me like max-actual would be better than max-min if it could be made to work.

That's pretty much the "Mutual worth bargaining solution" https://www.lesswrong.com/posts/7kvBxG9ZmYb5rDRiq/gains-from-trade-slug-versus-galaxy-how-much-would-i-give-up

> That is, find a distribution over policies + weighting of utility functions such that (a) the distribution is optimal according to the weighting, (b) each utility function is weighted so that the difference between their preferred policy and the actual policy is 1. I think this exists by a simple fixed point argument. I'm not sure if it's unique.

I don't understand (a), but (b) has problems when there are policies that are actually ideal for all/most utilities - you don't want to rule out generally optimal policies if they exist.

> That's pretty much the "Mutual worth bargaining solution"

I don't see how it can be the same as the mutual worth bargaining solution. That bargaining solution assumes we were given a default solution, and this proposal doesn't (but see above, this solution doesn't make sense).

ETA: this is based on a sign error, as was the original intuition. Everywhere below I wrote as if getting a higher utility causes your weight to *decrease*, but it actually causes your weight to *increase*. So you could do this with (actual-min), or (actual-default) as in Nash, but that's not as appealing.

Existence proof (not totally sure it's right):

- Given a policy distribution pi, say a new policy is "admissible" if it optimizes the weights 1/(max utility - realized utility under pi).
- That map is Kakutani, so there is some policy which is in its own admissible set, as desired.

Proof that it's unique:

- Consider two weights w, w', produced by this procedure, with corresponding profiles of utilities u, u'.
- We know that every term of (u-u')(w-w') is non-positive, since w decreases whenever u increases.
- But we can expand the sum as the sum of uw + u'w' - uw' - u'w <= 0
- Since u and u' were the utilities of the maximizing profiles, we have uw >= u'w, and u'w' >= uw', so the same sum is also >= 0.
- Thus the sum of (u-u')(w-w') = 0, so every term is 0, so we have u=u' and w=w' (if one pair is equal the other must be as well, by construction).

This post presents some old work on combining different possible utility functions that is worth revealing to the world.

I've written before about the problem of reaching an agreement between agents with different utility functions. The problem re-appears if you yourself are uncertain between two different moral theories.

For example, suppose you gave 99% credence to average utilitarianism and 1% credence to total utilitarianism. In an otherwise empty universe, you can create one person with 2 utility, or a thousand with 1 utility.

If we naively computed the expected utility of both actions, we would get 0.99(2)+0.01(2)=2 for the first choice, and 0.99(1)+0.01(1000)=10.99 for the second. It therefore seems that total utilitarianism wins by default, even though it is very unlikely (for you).

But the situation can be worse. Suppose that there is a third option, which creates ten thousand people, each with 0.000001 utility. And you have 99% credence on average utilitarianism, $(1-10^{-100})\%$ credence on total utilitarianism, and $10^{-100}\%$ credence on exponential utilitarianism, where the average utility is multiplied by two to the power of the population. In this case the third option - and the incredibly unlikely exponential utilitarianism - win out massively.
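The arithmetic in these two examples can be checked with a short sketch. The function and option names below are illustrative (they are not from the original work), each theory is modelled as a function of population size and utility per person, and exact fractions are used because $2^{10000}$ overflows floating point.

```python
from fractions import Fraction

# Each moral theory maps (population, utility per person) to a value.
def average(population, per_person):
    return per_person

def total(population, per_person):
    return population * per_person

def exponential(population, per_person):
    # Average utility multiplied by two to the power of the population.
    return per_person * 2 ** population

def naive_scores(credences, theories, options):
    """Credence-weighted sum of raw (un-normalised) utilities for each option."""
    return {
        name: sum(p * theory(*world) for p, theory in zip(credences, theories))
        for name, world in options.items()
    }

# Hypothetical labels for the three options: (population, utility per person).
options = {
    "one_person": (1, Fraction(2)),
    "thousand": (1000, Fraction(1)),
    "ten_thousand": (10_000, Fraction(1, 10**6)),
}

# First example: 99% average, 1% total, only the first two options.
two_options = {k: options[k] for k in ("one_person", "thousand")}
simple = naive_scores([Fraction(99, 100), Fraction(1, 100)],
                      [average, total], two_options)

# Second example: 10^-100 percent credence on exponential utilitarianism.
tiny = Fraction(1, 10**100) / 100
credences = [Fraction(99, 100), Fraction(1, 100) - tiny, tiny]
extended = naive_scores(credences, [average, total, exponential], options)
winner = max(extended, key=extended.get)

print(float(simple["one_person"]), float(simple["thousand"]))
print(winner)
```

Under naive summation the least credible theory decides the outcome simply because its raw numbers are largest, which is the motivation for normalising before adding.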

## Normalising utilities

To prevent the large-population-loving utilities from winning out by default, it's clear we need to normalise the utilities in some way before adding them together, similarly to how you normalise the utilities of opposing agents.

I'll distinguish two methods here: individual normalisations, and collective normalisations. For individual normalisations, if you have credences of $p_i$ for utilities $u_i\in U$, then $u_i$ is normalised into $\hat{u}_i$ using some procedure that is independent of $p_i$, $p_j$, and $u_j$ for $j\neq i$. Then the normalised utilities are added to give your total utility function of:

$$u=\sum_i p_i\hat{u}_i$$

In collective normalisations, the normalisation of $u_i$ into $\hat{u}_i$ is allowed to depend upon the other utilities and the credences. All Pareto outcomes for the utilities are equivalent (modulo resolving ties) with maximising such a $u$.

The Nash Bargaining Equilibrium and the Kalai-Smorodinsky Bargaining Solution are both collective normalisations; the Mutual Worth Bargaining Solution is an individual normalisation iff the choice of the default point is individual (but doing that violates the spirit of what that method is supposed to achieve).

Note that there are no non-dictatorial Pareto normalisations, whether individual or collective, that are independent of irrelevant alternatives, or that are immune to lying.

## Individual normalisations

Here I'll present the work that I did with Owen Cotton-Barratt, Toby Ord, and Will MacAskill, in order to try and come up with a principled way of doing individual normalisations. In a certain sense, this work failed: we didn't find any normalisations that were clearly superior in every way to others. But we did find out a lot about the properties of the different normalisations; one interesting thing is that the dumbest normalisation - the zero-one, or min-max - has surprisingly good properties.

Let $O$ be the *option set* for the agent: the choices that it can make (in our full treatment, we considered a larger set containing $O$, the *normalisation set*, but this won't be needed here). For the purpose of this post, $O$ will be equal to $\Pi=\{\pi_j\}$, the set of deterministic policies the agent can follow; this feels like a natural choice, as it's what the agent really has control over.

For any $u_i\in U$ and $\pi_j\in\Pi$, there is the expected utility of $u_i$ conditional on the agent following policy $\pi_j$; this will be designated by $u_i(\pi_j)$.

We may have a probability distribution $q$ over $O=\Pi$ (maybe defined by the complexity of the policy?). If we don't have such a distribution, and the set of deterministic policies is finite, then we can set $q$ to be the uniform distribution.

Then, given $q$, each $u_i$ becomes a real-valued random variable, taking value $u_i(\pi_j)$ with probability $q(\pi_j)$. We'll normalise these $u_i$ by normalising the properties of this random variable.

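First of all, let's exclude any $u_i$ that are constant on all of $\Pi$; these utilities cannot be changed, in expectation, by the agent's policies, so should make no difference. Then each $u_i$, seen as a random variable, has the following properties:

- Maximum: $\max u_i=\max_{\pi_j}u_i(\pi_j)$.
- Minimum: $\min u_i=\min_{\pi_j}u_i(\pi_j)$.
- Mean: $\mu_i=\sum_j q(\pi_j)u_i(\pi_j)$.
- Variance: $\sigma_i^2=\sum_j q(\pi_j)\left(u_i(\pi_j)-\mu_i\right)^2$.
- Mean difference: $\sum_{j,k}q(\pi_j)q(\pi_k)\left|u_i(\pi_j)-u_i(\pi_k)\right|$.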

There are five natural normalisation methods that emerge from these properties. The first and most trivial is the min-max or zero-one normalisation: scale and translate $u_i$ so that $\min u_i$ takes the value 0 and $\max u_i$ takes the value 1 (note that the translation doesn't change the desired policy when summing utilities, so what is actually required is to scale $u_i$ so that $(\max u_i)-(\min u_i)=1$).

The second normalisation, the mean-max, involves setting $(\max u_i)-\mu_i=1$; by symmetry, the min-mean normalisation involves setting $\mu_i-(\min u_i)=1$.

Finally, the last two normalisations involve setting either the variance, or the mean difference, to 1.
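As a rough illustration, the five normalisations can be sketched as scale factors applied to each utility before the credence-weighted sum. This is a minimal sketch, not the original authors' implementation: the function names are hypothetical, each utility is given as a list of expected utilities (one per policy), and the mean difference is assumed to be the expected absolute difference between two independent draws from $q$.

```python
import math

def scale_factor(values, q, method):
    """Factor c such that c * u_i has its chosen spread measure equal to 1."""
    mean = sum(p * v for p, v in zip(q, values))
    variance = sum(p * (v - mean) ** 2 for p, v in zip(q, values))
    # Assumed definition: expected |X - X'| for independent draws X, X' from q.
    mean_difference = sum(pj * pk * abs(vj - vk)
                          for pj, vj in zip(q, values)
                          for pk, vk in zip(q, values))
    spread = {
        "min-max": max(values) - min(values),
        "mean-max": max(values) - mean,
        "min-mean": mean - min(values),
        # Scaling u by c multiplies the variance by c**2, hence the root.
        "variance": math.sqrt(variance),
        "mean-difference": mean_difference,
    }[method]
    if spread == 0:
        raise ValueError("utility is constant (or has zero spread) on the policies")
    return 1 / spread

def combined_utility(credences, utilities, q, method):
    """Credence-weighted sum of the normalised utilities, one value per policy."""
    scales = [scale_factor(u, q, method) for u in utilities]
    return [sum(p * c * u[j] for p, c, u in zip(credences, scales, utilities))
            for j in range(len(q))]

# Three policies, uniform q, and two utilities on very different raw scales.
q = [1 / 3, 1 / 3, 1 / 3]
u_a = [0.0, 1.0, 2.0]
u_b = [0.0, 1000.0, 0.0]
scores = combined_utility([0.5, 0.5], [u_a, u_b], q, "min-max")
best_policy = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_policy)
```

Translation is omitted because, as noted above, it does not change which policy maximises the sum; only the scale matters.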

## Meaning of the normalisations

What do these normalisations mean? Well, min-max is a normalisation that cares about the difference between perfect utopia and perfect dystopia: between the best possible and the worst possible expected outcome. Conceptually, this seems problematic - it's not clear why the dystopia matters, which seems like something that opens the utility up to extortion - but, as we'll see, the min-max normalisation has the best formal properties.

The mean-max is the normalisation that most appeals to me; the mean is the expected value of a random policy, while the max is the expected outcome of the best policy. In a sense, that's the job of an agent with a single utility function: to move the outcome from random to best. Thus the max has a meaning that the min, for instance, lacks.

For this reason, I don't see the min-mean normalisation as being anything meaningful; it's the difference between complete disaster and a random policy.

I don't fully grasp the meaning of the variance normalisation; Owen Cotton-Barratt did the most work on it, and showed that, in a certain sense, it was resistant to lying/strategic distortion in certain circumstances, if a given utility didn't 'know' what the other utilities would be. But I didn't fully understand this point. So bear in mind that this normalisation has positive properties that aren't made clear in this post.

Finally, the mean difference normalisation controls the spread between the utilities of the different policies, in a linear way that may seem to be more natural than the variance.

## Properties of the normalisation

So, which normalisation is best? Here's where we look at the properties of the normalisations (they will be summarised in a table at the end). As we've seen, *independence of irrelevant alternatives* always fails, and there can always be an incentive for a utility to "lie" (as in, there are $U$, $u_i\in U$, $p$, $\Pi$, and $q$, such that $u_i$ would have a higher expected utility under the final $u$ if it was replaced with $u_i'\neq u_i$).

What other properties do all the normalisations share? Well, since they normalise independently, $u$ is continuous in $p$. And because the minimum, maximum, variance, etc... are continuous in $q$ and in $u_i(\pi_j)$, then $u$ is also *continuous* in that information.

In contrast, the best policy $\operatorname{argmax}_{\pi_j}u(\pi_j)$ of $u$ is *not typically continuous* in the data. Imagine that there are two utilities and two policies: $u_0(\pi_0)=u_1(\pi_1)=1$ and $u_0(\pi_1)=u_1(\pi_0)=0$. Then for $p_0<1/2$, $\pi_1$ is the optimal policy (for all the above normalisations for uniform $q$), while for $p_0>1/2$, $\pi_0$ is optimal.

Ok, that's enough of properties that all methods share; what about ones they don't?

First of all, we can look at the *negation symmetry* between $u_i$ and $-u_i$. Min-max, variance, and mean difference all have the same normalisation for $u_i$ and $-u_i$; mean-max and min-mean do not, since the mean can be closer to the min than to the max (or vice versa).

Then we can consider what happens when some policies $\pi_j$ and $\pi_k$ are *clones* of each other: imagine that for all $u_i\in U$, $u_i(\pi_j)=u_i(\pi_k)$. Then what happens if we remove the redundant $\pi_k$ and normalise on $\Pi\setminus\{\pi_k\}$? Well, it's clear that the maximum or minimum value of $u_i$ cannot change (since if $\pi_k$ was a maximum/minimum, then so is $\pi_j$, which remains), so the min-max normalisation is unaffected.

All the other normalisations change, though. This can be seen in the example $U=\{u_0,u_1\}$, $\Pi=\{\pi_0,\pi_1,\pi_2,\pi_3\}$, with $u_0(\pi_0)=u_1(\pi_0)=-1$, $u_0(\pi_1)=1$, $u_1(\pi_1)=0$, $u_1(\pi_2)=u_1(\pi_3)=1$, and $u_0(\pi_2)=u_0(\pi_3)=0$; as sets of expected utilities over the policies, $u_0$ has $(1,0,0,-1)$ while $u_1$ has $(1,1,0,-1)$. Then for uniform $q$, all other normalisation methods change if we remove $\pi_3$, which is identical to $\pi_2$ for both utilities.

Thus all the other normalisations change when we add (or remove) clones of existing policies.
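The effect of clones can be checked numerically. The sketch below takes the utility whose expected utilities over $(\pi_0,\pi_1,\pi_2,\pi_3)$ are $(-1,0,1,1)$, with $\pi_3$ a clone of $\pi_2$, and compares its normalising quantities under uniform $q$ before and after the clone is removed. The helper name is hypothetical, and the mean difference is assumed to be the average absolute difference over all ordered pairs of policies.

```python
from fractions import Fraction

def spreads(values):
    """Quantity that each normalisation sets to 1, under uniform q over policies."""
    vals = [Fraction(v) for v in values]
    n = len(vals)
    mean = sum(vals) / n
    return {
        "min-max": max(vals) - min(vals),
        "mean-max": max(vals) - mean,
        "min-mean": mean - min(vals),
        "variance": sum((v - mean) ** 2 for v in vals) / n,
        # Assumed definition: average |a - b| over all ordered pairs.
        "mean-difference": sum(abs(a - b) for a in vals for b in vals) / n ** 2,
    }

with_clone = spreads([-1, 0, 1, 1])   # pi_3 duplicates pi_2
without_clone = spreads([-1, 0, 1])   # pi_3 removed

# Methods whose normalising quantity moved when the clone was dropped.
changed = {name for name in with_clone if with_clone[name] != without_clone[name]}
print(sorted(changed))
```

Only the min-max spread is the same in both cases; the other four quantities shift because the clone pulls the mean towards the maximum.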

Finally, we can consider what happens if we are in one of several worlds, and the policies/utilities are identical in some of these worlds. This should be treated the same as if those identical worlds were all one.

So, imagine that we are in one of three worlds: $W_0$, $W_1$, and $W_2$, with probabilities $\rho_0$, $\rho_1$, and $\rho_2$, respectively. Before taking any actions, the agent will discover which world it is in. Thus, if $\Pi_i$ is the set of policies in $W_i$, the complete set of policies is $\Pi_0\times\Pi_1\times\Pi_2$.

The worlds $W_1$ and $W_2$ are, however, indistinguishable for all utilities in $U$. Thus we can identify $f(\Pi_1)\cong\Pi_2$, with $u_i(\pi_j)=u_i(f(\pi_j))$ for all $u_i\in U$. Then a normalisation method has the *combines indistinguishable choices* property if the normalisation is the same in world $\rho_0W_0+\rho_1W_1+\rho_2W_2$ and $\rho_0W_0+(\rho_1+\rho_2)W_1$. Then: min-max, mean-max, and min-mean combine indistinguishable choices, while variance and mean difference do not.

Proof (sketch): Let $u_i^j=u_i|W_j$ be the random variable that is $u_i$ on $W_j$ under the assumption that $W_j$ is the true underlying world. Then on $\rho_0W_0+\rho_1W_1+\rho_2W_2$, $u_i$ behaves like the random variable $\rho_0u_i^0+\rho_1u_i^1+\rho_2u_i^2$ (this means that $u_i$ has probability $\rho_j$ of being $u_i^j$, not that it adds random variables together). Mean, max, and min all have the property that $f(aX+bY)=af(X)+bf(Y)$; variance and mean difference, on the other hand, do not.
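A small numerical check of this property, under stated assumptions: a joint policy picks one sub-policy per world, its expected utility is modelled as the $\rho$-weighted sum of the per-world utilities, and $q$ is uniform over joint policies. The utility used here is hypothetical; it is constant in $W_0$ and takes values 0 or 1 in $W_1$ (and in its copy $W_2$).

```python
from itertools import product

def expected_utilities(world_probs, world_utils):
    """Expected utility of every joint policy (one sub-policy chosen per world)."""
    return [sum(rho * u for rho, u in zip(world_probs, choice))
            for choice in product(*world_utils)]

def min_max_spread(values):
    return max(values) - min(values)

def mean_max_spread(values):
    return max(values) - sum(values) / len(values)

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

u_w0 = [0.0]        # this utility cannot be affected in W_0
u_w1 = [0.0, 1.0]   # two sub-policies in W_1, duplicated in W_2

# Three worlds with W_1 and W_2 indistinguishable, versus the merged version.
split = expected_utilities([0.5, 0.25, 0.25], [u_w0, u_w1, u_w1])
merged = expected_utilities([0.5, 0.5], [u_w0, u_w1])

print(min_max_spread(split), min_max_spread(merged))
print(mean_max_spread(split), mean_max_spread(merged))
print(variance(split), variance(merged))
```

The min-max and mean-max spreads agree between the split and merged descriptions, while the variance does not. A utility that only varies in $W_0$ would be unaffected by the split, so under the variance normalisation the relative weight of the two utilities depends on how the indistinguishable worlds are described.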

## Summary of properties

In my view, it is a big worry that the variance and mean difference normalisations fail to combine indistinguishable choices. Worlds $W_1$ and $W_2$ could be strictly identical, except for some information that all utility functions agree is irrelevant. With those normalisations, we have to worry about whether the light from a distant star is slightly redder or slightly bluer than expected; what colour ink was used in a proposal; the height of the next animal we see, and so on.

This means that we cannot divide the universe into relevant and irrelevant variables, and focus solely on the first.

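In table form, the various properties are:

| Property | Min-max | Mean-max | Min-mean | Variance | Mean difference |
| --- | --- | --- | --- | --- | --- |
| Independence of irrelevant alternatives | No | No | No | No | No |
| Immune to lying | No | No | No | No | No |
| Continuity of $u$ | Yes | Yes | Yes | Yes | Yes |
| Negation symmetry | Yes | No | No | Yes | Yes |
| Unaffected by clones | Yes | No | No | No | No |
| Combines indistinguishable choices | Yes | Yes | Yes | No | No |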

As can be seen, the min-max method, simplistic though it is, has all the possible nice properties.