Best utility normalisation method to date?

by Stuart_Armstrong, 2nd Sep 2019



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

For some time, others and I have been looking at ways of normalising utility functions, so that we can answer questions like:

  • Suppose that you are uncertain between maximising $U_1$ and $U_2$; what do you do?

...without having to worry about how $U_1$ or $U_2$ happen to be scaled (since utility functions are only defined up to positive affine transformations).

I've long liked the mean-max normalisation; in this view, what matters is the difference between a utility's optimal policy and a random policy. So, in a sense, each utility function has an equal shot at moving the outcome away from the expected result of a random policy and towards its own optimum.

The intuition still seems good to me, but the "random policy" is a bit of a problem. First of all, it's not all that well defined: are we talking about a policy that just spits out random outputs, or one that picks randomly among outcomes? Suppose there are three options: option A (if A is output), option B' (if B' is output), or do nothing (any other output). Should we really say that A happens twice as often as B' (since typing out A randomly is twice as likely as typing out B')?

Relatedly, if we add another option C, which is completely equivalent to A for all possible utilities, then this redefines the random policy. There's also a problem with branching: what if option A now leads to twenty choices later, while B leads to no further choices? Are we talking about twenty-one equivalent choices, or twenty equivalent choices and one other one as likely as all of them put together? The concept also has some problems with infinite option sets.

A more fundamental problem is that the random policy includes options that neither $U_1$ nor $U_2$ would ever consider sensible.

Random dictator policy

These problems can be solved by switching instead to the random dictator policy as the default, rather than a random policy.

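Assume we are hesitating between utility functions $U_1$, $U_2$, ... , $U_n$, with $\pi_i$ the optimal policy for utility $U_i$. Then the random dictator policy is just $\pi_{rd}$, which picks a $\pi_i$ at random and then follows that. So

  • $\pi_{rd} = \frac{1}{n}\left(\pi_1 + \pi_2 + \dots + \pi_n\right)$.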
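As a concrete illustration, here is a minimal Python sketch of the random dictator policy in a toy setting. It assumes a finite list of options, represents each utility as a list of values (one per option) and each policy as a probability distribution over options, and breaks ties by taking the first maximising option. The function names and the example numbers are purely illustrative and not part of the original argument.

```python
from fractions import Fraction

def optimal_policy(utility):
    """Deterministic policy that picks the first option maximising `utility`.
    (The tie-breaking rule is an assumption of this sketch.)"""
    best = utility.index(max(utility))
    return [Fraction(int(a == best)) for a in range(len(utility))]

def random_dictator_policy(utilities):
    """Uniform mixture of the optimal policies of all the utilities."""
    n = len(utilities)
    policies = [optimal_policy(u) for u in utilities]
    return [sum(pi[a] for pi in policies) / n for a in range(len(utilities[0]))]

# Three hypothetical utilities over three options.
utilities = [
    [1, 0, 0],   # U_1 prefers option 0
    [0, 1, 0],   # U_2 prefers option 1
    [2, 3, -5],  # U_3 also prefers option 1
]

pi_rd = random_dictator_policy(utilities)
print(pi_rd)  # probabilities 1/3, 2/3 and 0 for options 0, 1 and 2
```

Option 2 gets probability zero here because no utility picks it as its optimum, whereas a uniformly random policy over options would choose it a third of the time.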

Normalising to the random dictator policy

This is an excellent candidate for replacing the random policy in the normalisation. It is well defined, it would never choose options that all utilities object to, and it doesn't care about how options are labelled or about how to count them.

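Therefore we can present the random dictator normalisation: if you are hesitating between utility functions $U_1$, $U_2$, ... , $U_n$, then normalise each one to $\hat{U}_i$ as follows:

  • $\hat{U}_i = \dfrac{U_i - \mathbb{E}(U_i \mid \pi_{rd})}{\mathbb{E}(U_i \mid \pi_i) - \mathbb{E}(U_i \mid \pi_{rd})}$,

where $\mathbb{E}(U_i \mid \pi_i)$ is the expected utility of $U_i$ given its optimal policy, and $\mathbb{E}(U_i \mid \pi_{rd})$ is its expected utility given the random dictator policy. (The constant subtracted in the numerator only fixes the baseline at zero; it does not change which policy maximises the result.)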

Our overall utility to maximise then becomes:

  • $U = \hat{U}_1 + \hat{U}_2 + \dots + \hat{U}_n$.
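The normalisation and the summed utility can be sketched in a few lines of Python. This is only a toy model under stated assumptions: a finite set of options, utilities given as lists of values per option, ties broken by the first maximiser, and a normalisation that rescales each utility so that the random dictator policy scores 0 and the utility's own optimum scores 1. The utilities `U_1` and `U_2` and all helper names are hypothetical.

```python
from fractions import Fraction

def expected(utility, policy):
    """Expected utility of a policy given as a distribution over options."""
    return sum(Fraction(u) * p for u, p in zip(utility, policy))

def random_dictator_policy(utilities):
    """Uniform mixture over each utility's favourite option (first one on ties)."""
    favourites = [u.index(max(u)) for u in utilities]
    n_options = len(utilities[0])
    return [Fraction(favourites.count(a), len(utilities)) for a in range(n_options)]

def normalise(utility, pi_rd):
    """Rescale so the random dictator policy scores 0 and the optimum scores 1."""
    baseline = expected(utility, pi_rd)
    gap = Fraction(max(utility)) - baseline
    if gap == 0:
        raise ValueError("singular case: the random dictator policy is already optimal")
    return [(Fraction(x) - baseline) / gap for x in utility]

# Hypothetical options: 0 is U_1's favourite, 1 is U_2's favourite, 2 is a compromise.
U_1 = [10, 0, 6]
U_2 = [0, 4, 3]

pi_rd = random_dictator_policy([U_1, U_2])
hat_1 = normalise(U_1, pi_rd)
hat_2 = normalise(U_2, pi_rd)
total = [a + b for a, b in zip(hat_1, hat_2)]
best_option = total.index(max(total))
print(total, best_option)
```

In this example each favourite option scores 0 in the sum (a gain of 1 for one utility cancels a loss of 1 for the other), while the compromise option scores 7/10 and is therefore selected.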

Note that this normalisation has a singularity when $\mathbb{E}(U_i \mid \pi_i) = \mathbb{E}(U_i \mid \pi_{rd})$. But realise what that means: it means that the random dictator policy is optimal for $U_i$. That means that every single $\pi_j$ is optimal for $U_i$. So, though the explosion in the normalisation means that we must pick an optimal policy for $U_i$, this set is actually quite large, and we can use the normalisations of the other $U_j$ to pick from among it (so maximising $U_i$ becomes a lexicographic preference for us).
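The step from "the random dictator policy is optimal" to "every dictator's policy is optimal" can be spelled out as follows. This is a short derivation under the assumption that expected utility is linear in the mixture, which holds here because the random dictator picks one policy at the start and then follows it:

$$\mathbb{E}(U_i \mid \pi_{rd}) = \frac{1}{n}\sum_{j=1}^{n} \mathbb{E}(U_i \mid \pi_j) \le \mathbb{E}(U_i \mid \pi_i),$$

since each term satisfies $\mathbb{E}(U_i \mid \pi_j) \le \mathbb{E}(U_i \mid \pi_i)$ by optimality of $\pi_i$ for $U_i$. An average of terms that are each at most $\mathbb{E}(U_i \mid \pi_i)$ can only equal $\mathbb{E}(U_i \mid \pi_i)$ if every term does, so equality forces $\mathbb{E}(U_i \mid \pi_j) = \mathbb{E}(U_i \mid \pi_i)$ for all $j$.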

Normalising a distribution over utilities

Now suppose that there is a distribution over the utilities: we're not equally sure of each $U_i$; instead we assign a probability $p_i$ to each of them. Then the random dictator policy is defined quite obviously as:

  • $\pi_{rd} = p_1 \pi_1 + p_2 \pi_2 + \dots + p_n \pi_n$.

And the normalisation can proceed as before, generating the $\hat{U}_i$, and maximising the normalised sum:

  • $U = p_1 \hat{U}_1 + p_2 \hat{U}_2 + \dots + p_n \hat{U}_n$.
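A minimal sketch of the weighted version, again in a toy finite-option setting. It assumes the same conventions as before (utilities as lists of values per option, first maximiser on ties, normalisation fixing the random dictator policy at 0 and each optimum at 1); the probabilities 3/4 and 1/4 and the utilities are made-up illustrative values.

```python
from fractions import Fraction

def expected(utility, policy):
    """Expected utility of a policy given as a distribution over options."""
    return sum(Fraction(u) * p for u, p in zip(utility, policy))

def weighted_random_dictator(utilities, probs):
    """Mixture of optimal policies, weighting each utility's favourite by its probability."""
    pi_rd = [Fraction(0)] * len(utilities[0])
    for u, p in zip(utilities, probs):
        pi_rd[u.index(max(u))] += p  # first maximiser on ties (an assumption)
    return pi_rd

def weighted_total(utilities, probs):
    """Probability-weighted sum of the normalised utilities, per option."""
    pi_rd = weighted_random_dictator(utilities, probs)
    total = [Fraction(0)] * len(utilities[0])
    for u, p in zip(utilities, probs):
        baseline = expected(u, pi_rd)
        gap = Fraction(max(u)) - baseline  # assumed non-zero in this sketch
        for a, x in enumerate(u):
            total[a] += p * (Fraction(x) - baseline) / gap
    return total

U_1 = [10, 0, 6]   # hypothetical: favours option 0, likes the compromise option 2
U_2 = [0, 4, 3]    # hypothetical: favours option 1, likes the compromise option 2
probs = [Fraction(3, 4), Fraction(1, 4)]

total = weighted_total([U_1, U_2], probs)
best_option = total.index(max(total))
print(total, best_option)
```

With these numbers the summed values are 2/3, -2 and -17/60, so option 0 (the favourite of the more probable utility) is chosen, whereas the same two utilities with equal probabilities would select the compromise option.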


The random dictator normalisation has all the good properties of the mean-max normalisation in this post, namely that the utility is continuous in the data and that it respects indistinguishable choices. It is also invariant under cloning (i.e. adding another option that is completely equivalent to one of the options already there), which the mean-max normalisation is not.
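The cloning claim can be checked on a small example. The sketch below assumes a finite option set and, for the mean-max side, takes the "random policy" to be uniform over the listed options, which is only one of the possible readings discussed earlier. The utilities and helper names are hypothetical.

```python
from fractions import Fraction

def expected(utility, policy):
    """Expected utility of a policy given as a distribution over options."""
    return sum(Fraction(u) * p for u, p in zip(utility, policy))

def uniform_policy(utilities):
    """Random policy for mean-max: uniform over listed options (an assumption)."""
    n_options = len(utilities[0])
    return [Fraction(1, n_options)] * n_options

def random_dictator_policy(utilities):
    """Uniform mixture over each utility's favourite option (first one on ties)."""
    favourites = [u.index(max(u)) for u in utilities]
    return [Fraction(favourites.count(a), len(utilities))
            for a in range(len(utilities[0]))]

def normalised_sum(utilities, baseline_policy):
    """Sum of utilities, each rescaled so the baseline scores 0 and its optimum 1."""
    total = [Fraction(0)] * len(utilities[0])
    for u in utilities:
        baseline = expected(u, baseline_policy)
        gap = Fraction(max(u)) - baseline  # assumed non-zero in this sketch
        for a, x in enumerate(u):
            total[a] += (Fraction(x) - baseline) / gap
    return total

def with_clone(utilities, option):
    """Append a new option that every utility values exactly like `option`."""
    return [u + [u[option]] for u in utilities]

original = [[10, 0, 6], [0, 4, 3]]
cloned = with_clone(original, 0)

rd_before = normalised_sum(original, random_dictator_policy(original))
rd_after = normalised_sum(cloned, random_dictator_policy(cloned))
mm_before = normalised_sum(original, uniform_policy(original))
mm_after = normalised_sum(cloned, uniform_policy(cloned))

print(rd_after[:3] == rd_before)  # random dictator: original options keep their values
print(mm_after[:3] == mm_before)  # mean-max: values shift once the clone is added
```

In this example the first comparison is true and the second is false: adding the clone leaves the random dictator values of the original options untouched, but moves the uniform baseline and hence the mean-max values.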

But note that, unlike all the normalisations in that post, it is not a case of normalising each $U_i$ without looking at the other $U_j$, and only then combining them. Each normalisation of $U_i$ takes the other $U_j$ into account, because of the definition of the random dictator policy.

Problems? Double counting, or the rich get richer

Suppose we are hesitating between utilities $U_1$ (with probability $p_1$) and $U_2$ (with probability $p_2 = 1 - p_1$), where $p_1$ is much larger than $p_2$.

Then $\pi_{rd} = p_1 \pi_1 + p_2 \pi_2$ is the random dictator policy, and is likely to be closer to optimal for $U_1$ than for $U_2$.

Because of this, we expect $U_1$ to get "boosted" more by the normalisation process than $U_2$ does (since the normalisation factor is the inverse of the gap in expected utility between the optimal policy and $\pi_{rd}$).

But then when we take the weighted sum, this advantage is compounded, because the boosted $\hat{U}_1$ is weighted $p_1$ versus $p_2$ for the relatively unboosted $\hat{U}_2$. It seems that the weight of $U_1$ thus gets double-counted.

A similar phenomenon happens when we are equally uncertain between utilities $U_1$, $U_2$, ... , $U_{10}$, if $U_1$, ... , $U_9$ all roughly agree with each other while $U_{10}$ is completely different: the similarity of the first nine utilities seems to give them a double boost effect.
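To get a feel for the size of the effect, here is a toy calculation. It assumes two options and two directly opposed utilities with the same raw range, so a "neutral" weighting would simply follow the probabilities. The multiplier computed for each raw utility is its probability divided by the gap between its optimal expected value and its value under the random dictator policy. The 9/10 versus 1/10 split, the exactly identical nine utilities, and all names are illustrative assumptions.

```python
from fractions import Fraction

def effective_weights(utilities, probs):
    """Multiplier p_i / (E(U_i | pi_i) - E(U_i | pi_rd)) applied to each raw U_i."""
    pi_rd = [Fraction(0)] * len(utilities[0])
    for u, p in zip(utilities, probs):
        pi_rd[u.index(max(u))] += p  # first maximiser on ties (an assumption)
    weights = []
    for u, p in zip(utilities, probs):
        baseline = sum(Fraction(x) * q for x, q in zip(u, pi_rd))
        weights.append(p / (Fraction(max(u)) - baseline))
    return weights

A = [1, 0]  # hypothetical utility that only values option 0
B = [0, 1]  # hypothetical utility that only values option 1

# Case 1: two utilities with probabilities 9/10 and 1/10.
two = effective_weights([A, B], [Fraction(9, 10), Fraction(1, 10)])
ratio_two = two[0] / two[1]

# Case 2: ten equally likely utilities, nine identical copies of A and one B.
ten = effective_weights([A] * 9 + [B], [Fraction(1, 10)] * 10)
ratio_ten = sum(ten[:9]) / ten[9]

print(ratio_two, ratio_ten)  # both come out as 81
```

In both cases the majority side ends up with 81 times the effective weight of the minority side, the square of the 9:1 odds, which is one way of making the double counting concrete in this toy setting.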

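There are some obvious ways to fix this (maybe weighting the normalised utilities by something other than $p_i$), but they all have problems with continuity, either when a probability $p_i \to 0$, or when two utilities converge, $U_i \to U_j$.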

I'm not sure how much of a problem this is.


