Best utility normalisation method to date?

[-]Wei Dai6yΩ350

Stuart, what's your view on the problem I described in "Is the potential astronomical waste in our universe too small to care about?" Translated to this setting, the problem is that if you do a normalisation when you're uncertain about the size of the universe (i.e., $E_{π_{rd}}$ is computed under this uncertainty), and then later find out the actual size of the universe (or just get some information that shifts your expectation of the size of the universe or of how many lives or observer-moments it can support), you'll end up putting almost all of your efforts into Total Utilitarianism (if the shift is towards the universe being bigger) or almost none of your efforts into it (if the shift is in the opposite direction).
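A toy illustration of this concern, with made-up numbers: the two universe sizes, the prior, and the per-universe "spread" of the total-utilitarian utility are all assumptions chosen for the example, and `spread` simply stands in for whatever expected-utility difference the normalisation divides by.

```python
# Hypothetical numbers, for illustration only.
# prior[x]  : P(X = x), credence in each universe size before learning it
# spread[x] : expected-utility difference of the total-utilitarian U in universe x
prior = {"small": 0.99, "large": 0.01}
spread = {"small": 1.0, "large": 1.0e6}


def pooled_factor(prior, spread):
    """Normalising factor computed once, under uncertainty about X."""
    return sum(prior[x] * spread[x] for x in prior)


n_u = pooled_factor(prior, spread)

# After learning X = x, the normalised utility U / n_u still varies by
# spread[x] / n_u. Compare this with a rival utility whose normalised
# spread is 1 in every universe.
relative_weight = {x: spread[x] / n_u for x in spread}

for x, w in relative_weight.items():
    print(f"learned X = {x}: total utilitarianism carries about {w:.4g}x the rival's stakes")
```

With these numbers the normalised total-utilitarian stakes come out roughly a hundred times the rival's after learning the universe is large, and roughly a ten-thousandth after learning it is small.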

[-]Stuart_Armstrong6yΩ120

Hum... It seems that we can stratify here. Let $X$ represent the values of a collection of variables that we are uncertain about, and that we are stratifying on.

When we compute the normalising factor for utility $U$ under two policies $π$ and $π^{'}$ , we normally do it as:

$U \to U / N_{U}$ , with $N_{U} = \sum_{x} P (X = x) (E_{π, X = x} U - E_{π^{'}, X = x} U)$ .

And then we replace $U$ with $U / N_{U}$ .

Instead we might normalise the utility $U$ separately for each value of $x$ :

Conditional on $X = x$ , then $U \to U / N_{U, x}$ , with $N_{U, x} = E_{π, X = x} U - E_{π^{'}, X = x} U$ .

The problem is that, since we're dividing by the $N$ , the expectation of $U / N_{U, x}$ is not the same as $U / N_{U}$ .

Is there an obvious improvement on this?

Note that here, total utilitarianism gets less weight in large universes, and more in small ones.

I'll think more...
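A minimal sketch of the two normalisations described above, assuming a single stratifying variable $X$ (universe size) with two values. The prior and the expected utilities under $π$ and $π^{'}$ are hypothetical numbers, not taken from the post.

```python
# Hypothetical toy data: X is the universe size, U is a total-utilitarian utility.
prior = {"small": 0.5, "large": 0.5}  # P(X = x)
# For each x: (E_{pi, X=x} U, E_{pi', X=x} U)
expected_u = {"small": (10.0, 0.0), "large": (1000.0, 0.0)}


def pooled_factor(prior, expected_u):
    """N_U = sum_x P(X = x) * (E_{pi,X=x} U - E_{pi',X=x} U)."""
    return sum(p * (expected_u[x][0] - expected_u[x][1]) for x, p in prior.items())


def stratified_factors(expected_u):
    """N_{U,x} = E_{pi,X=x} U - E_{pi',X=x} U, one factor per value of x."""
    return {x: hi - lo for x, (hi, lo) in expected_u.items()}


n_pooled = pooled_factor(prior, expected_u)
n_strat = stratified_factors(expected_u)

# Spread of the normalised utility between pi and pi' inside each stratum.
pooled_spread = {x: (hi - lo) / n_pooled for x, (hi, lo) in expected_u.items()}
strat_spread = {x: (hi - lo) / n_strat[x] for x, (hi, lo) in expected_u.items()}

print("pooled:    ", pooled_spread)
print("stratified:", strat_spread)
```

In this toy case the stratified version gives $U$ a spread of exactly 1 in every stratum, whereas the pooled version gives it a spread above 1 in the large universe and below 1 in the small one. Relative to pooling, stratifying therefore shifts weight away from total utilitarianism in large universes and towards it in small ones.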

[-]Gurkenglas6y*50

Desirable properties that this may or may not have:

Partitioning the utilities, aggregating each component, then aggregating the results ought to not depend on the partition.
Any agent ought to want to submit its true utility function.

Taking the limit of introducing many copies of an indifferent utility into the mix recovers mean-max.

What happens when we use the resulting aggregated action as the new normalization pivot, and take a fixed point? The double-counting problem gets worse, but fixing it should also make this work.

If each agent can choose which action to submit to the random dictator policy, they might want to sacrifice a bit of their own utility (which they only currently want to improve their normalization position) in order to ruin other utilities (to worsen their normalization position). Two agents might cooperate by agreeing on an action they both submit.

In addition to the pivot each utility submits, we could take into account pivots selected by an aggregate of a subset of utilities. The full aggregate's pivot would agree with what the others submit (due to the convergent instrumental goal of reflective consistency). This construction might be easy to make invariant under partitioning.

[-]Pattern6y10

I've long liked the mean-max normalisation; in this view, what matters is the difference between a utility's optimal policy, and a random policy. So, in a sense, each utility function has an equal shot of moving the outcome away from an expected random policy, and towards themselves.

So utility normalization is about making a compromise. (I'm visualizing a frontier of some sort*.)

This π_rd is an excellent candidate for replacing the random policy in the normalisation. It is well defined, it would never choose options that all utilities object to, and it doesn't care about how options are labelled or about how to count them.

How related is this to the literature on voting? (There, I understand, there are some issues, including: under some circumstances, if the random dictator policy is used, there is zero probability of choosing an option which is the second choice of all parties.)

where $E_{π_{i}^{*}}[U_{i}]$ is the expected utility of $U_{i}$ given optimal policy, and $E_{π_{rd}}[U_{i}]$ is its expected utility given the random dictator policy.

That was difficult to understand. (In part because of the self reference.**)

It is also invariant under cloning (i.e. adding another option that is completely equivalent to one of the options already there), which the mean-max normalisation is not.

But it isn't invariant to adding another utility function which is identical to one already present.

There are some obvious ways to fix this (maybe use $\sqrt{p_{i}}$ rather than $p_{i}$ ), but they all have problems with continuity, either when $p_{i} \to 0$ , or when $U_{i} \to U_{j}$ .

I didn't entirely follow this. (Would replacing U_i with ln(U_i) help?)

*Like the one mentioned in the post about a multi-round prisoner's dilemma, where one player says they value "utility" while the other says they value "difference in utility", and the solution to the problem was described (abstractly) based on the frontier.

** I guess I'll have to come up with a toy problem involving some options and utilities to figure this out.

[-]Gurkenglas6y20

there is zero probability of choosing an option which is the second choice of all parties

We might get around this by letting each agent submit not only a utility, but also the probability distribution over actions it would choose if it were dictator. If he's a maximizer, this doesn't get around that. If he's a quantilizer, this should. A desirable property would be that an agent wants to not lie about this.

[-]Stuart_Armstrong6y30

Er, this normalisation system may well solve that problem entirely. If $U_{i}$ prefers option $o_{i}$ (utility $1$ ), with second choice $o_{0}$ (utility $1 / 2$ ), and all the other options as third choice (utility $0$ ), then the expected utility of the random dictator is $1 / n$ for all $U_{i}$ (as $p_{i}^{*}$ gives utility $1$ , and $p_{j}^{*}$ gives utility $0$ for all $j \neq i$ ), so the normalised weighted utility to maximise is:

$U = \frac{1}{n - 1} (U_{1} + U_{2} + \dots + U_{n})$ .

Using $(n - 1) U$ (because scaling doesn't change expected utility decisions), the utility of any $o_{i}$ , $i > 0$ , is $1$ , while the utility of $o_{0}$ is $n / 2$ . So if $n > 2$ , the compromise option $o_{0}$ will get chosen.
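The arithmetic in this example can be checked with a short script. This is only a sketch under stated assumptions: each $U_{i}$ gets equal weight $1 / n$, each dictator deterministically picks its favourite option, and the function names are hypothetical. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def utilities(n):
    """table[i][o]: utility that agent i (1..n) assigns to option o (0..n)."""
    table = {}
    for i in range(1, n + 1):
        table[i] = {o: Fraction(0) for o in range(n + 1)}  # third choices
        table[i][i] = Fraction(1)      # favourite option o_i
        table[i][0] = Fraction(1, 2)   # compromise option o_0
    return table


def aggregate(n):
    """Weighted sum of the U_i, each normalised by E_{pi_i^*}[U_i] - E_{pi_rd}[U_i].

    Needs n >= 2; for n = 1 the normalising difference is zero.
    """
    table = utilities(n)
    p = Fraction(1, n)  # assumed equal weight, also the dictator probability
    total = {o: Fraction(0) for o in range(n + 1)}
    for i in range(1, n + 1):
        best = max(table[i].values())                           # equals 1
        e_rd = sum(p * table[i][j] for j in range(1, n + 1))    # equals 1/n
        norm = best - e_rd                                      # equals (n-1)/n
        for o in total:
            total[o] += p * table[i][o] / norm
    return total


def chosen_option(n):
    """Index of the option with the highest aggregate utility."""
    total = aggregate(n)
    return max(total, key=total.get)


for n in (3, 5):
    print(n, chosen_option(n), aggregate(n)[0], aggregate(n)[1])
```

For $n = 3$ the aggregate gives $3 / 4$ to $o_{0}$ and $1 / 2$ to each $o_{i}$, which is $\frac{1}{n - 1}$ times the values $n / 2$ and $1$ quoted above. For $n = 2$ the two values tie, consistent with the condition $n > 2$.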

Don't confuse the problems of the random dictator, with the problems of maximising the weighted sum of the normalisations that used the random dictator (and don't confuse the other way, either; the random dictator is immune to players' lying, this normalisation is not).

[-]Gurkenglas6y20

I was aware, but addressing his objection as though it were justified, which it would be if this were the only place where the agent's preferences matter. This counterfactual is supported by my fondness for linear logic.
