There's been a lot of work on how to reach agreement between people with different preferences or values. In practice, reaching agreement can be tricky, because of issues of extortion/trade and how the negotiations actually play out.
To put those issues aside, let's consider a much simpler case: where a single agent is uncertain about their own utility function. Then there is no issue of extortion, because the agent's opponent is simply itself.
This type of comparison is called intertheoretic, rather than interpersonal.
A question of scale
It would seem that if the agent believed with probability $p$ that it followed utility $u$, and with probability $1-p$ that it followed utility $v$, then it should simply follow utility $w = pu + (1-p)v$.
But this is problematic, because $u$ and $v$ are only defined up to positive affine transformations. Translations are not a problem: sending $u$ to $u + c$ sends $w$ to $w + pc$, which is equivalent to $w$. But scalings are: sending $u$ to $ru$, for some $r > 0$, does not usually send $w$ to any scaled version of $w$.
So if we identify $[u]$ as the equivalence class of utilities equivalent to $u$, then we can write $w = pu + (1-p)v$, but it's not meaningful to write $[w] = p[u] + (1-p)[v]$.
For clarity, we'll call things like $u$ (which map worlds to real values) utility functions, while $[u]$ will be called utility classes.
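The scale problem can be seen in a small numerical sketch. The two worlds, the utility values, and the helper names `mixture` and `best` are all hypothetical, chosen only to illustrate the point: translating one utility leaves the mixture's ranking alone, while rescaling it (which represents exactly the same preferences) can flip the ranking.

```python
def mixture(p, u, v):
    """Pointwise mixture p*u + (1-p)*v of two utility functions (dicts)."""
    return {world: p * u[world] + (1 - p) * v[world] for world in u}


def best(utility):
    """World with the highest utility."""
    return max(utility, key=utility.get)


# Hypothetical utilities over two worlds, purely for illustration.
u = {"a": 1.0, "b": 0.0}
v = {"a": 0.0, "b": 2.0}
p = 0.5

w = mixture(p, u, v)

# Translating u by a constant only translates the mixture.
u_shifted = {world: x + 10.0 for world, x in u.items()}
w_shifted = mixture(p, u_shifted, v)

# Rescaling u by a positive constant represents the same preferences,
# yet the mixture now ranks the worlds differently.
u_scaled = {world: 5.0 * x for world, x in u.items()}
w_scaled = mixture(p, u_scaled, v)

print(best(w), best(w_shifted), best(w_scaled))  # prints: b b a
```

So the mixture depends on which representative of each utility class is picked, which is why some normalisation choice is needed.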
This is work done in collaboration with Toby Ord, Owen Cotton-Barratt, and Will MacAskill. We had some slightly different emphases during that process. In this post, I'll present my preferred version, while adding the more general approach at the end.
We will need the structure described in this post:
1. A finite set $S$ of deterministic strategies the agent can take.
2. A set $U$ of utility classes the agent might follow.
3. A distribution $p$ over $U$, reflecting the agent's uncertainty over its own utility functions.
4. Let $U_p \subseteq U$ be the subset to which $p$ assigns a non-zero weight. We'll assume $p$ puts no weight on trivial, constant utility functions.
We'll assume here that $p$ never gets updated, that is, that the agent never sees any evidence that changes its values. The issue of updating is analysed in the sections on reward-learning agents.
We'll be assuming that there is some function $f$ that takes in $S$ and $p$ and outputs a single utility class $f(S, p)$ reflecting the agent's values.
- Relevant data: If the utility classes $[u]$ and $[v]$ have the same values on all of $S$, then they are interchangeable from $f$'s perspective. Thus, in the terminology of this post, we can identify with .
This gives the structure of , where is a sphere, and corresponds to the trivial utility that is equal on all . The topology of is the standard topology on , and the only open set containing is the whole of .
Then, with a reasonable topology on the space of probability distributions on $U$ (perhaps the weak topology), this leads to the next axiom:
Continuity: the function $f$ is continuous in $p$.
Individual normalisation: there is a function $\phi$ that maps $U$ to individual utility functions, such that $f(S, p) = \left[\int_U \phi([u]) \, dp\right]$ (using $p$ as a measure on $U$).
The previous axiom means that all utility classes get normalised individually, then added together according to their weight in $p$.
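A minimal sketch of this "normalise individually, then add by weight" structure follows. The axiom only says that some normalising function exists; the range (min-max) normalisation used here is an assumption picked for illustration, not the post's method, and the names `range_normalise` and `combine` are hypothetical.

```python
import math


def range_normalise(u):
    """Illustrative normaliser (an assumption): rescale u to span [0, 1]."""
    lo, hi = min(u.values()), max(u.values())
    if hi == lo:
        raise ValueError("trivial constant utility cannot be normalised")
    return {s: (x - lo) / (hi - lo) for s, x in u.items()}


def combine(weighted_utilities, normalise=range_normalise):
    """Weighted sum of individually normalised utilities.

    weighted_utilities: list of (weight, utility) pairs, where each utility
    is a dict over the same strategies.
    """
    strategies = list(weighted_utilities[0][1])
    total = {s: 0.0 for s in strategies}
    for weight, u in weighted_utilities:
        normalised = normalise(u)
        for s in strategies:
            total[s] += weight * normalised[s]
    return total


# Hypothetical utilities over three strategies.
u = {"s1": 0.0, "s2": 10.0, "s3": 4.0}
v = {"s1": 3.0, "s2": 1.0, "s3": 2.0}
combined = combine([(0.25, u), (0.75, v)])

# A different representative of the same class as u gives the same result.
u_other = {s: 100.0 * x + 7.0 for s, x in u.items()}
combined_other = combine([(0.25, u_other), (0.75, v)])
```

Because each utility is normalised before being weighted, the result no longer depends on which representative of each class was supplied.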
- Symmetry: If is a stable permutation of , then .
Symmetry essentially means that the labels of $S$, or the details of how the strategies are implemented, do not matter.
Utility reflection: .
Cloning indifference: If there exist $s, s' \in S$ such that, for all $[u]$ in $U$ on which $p$ is non-zero, $u(s) = u(s')$, then $f(S, p) = f(S \setminus \{s'\}, p)$.
Cloning indifference means that the normalisation procedure does not care about multiple strategies that are equivalent on all possible utilities: it treats these strategies as if they were a single strategy.
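To see what this requirement rules out, here is a small sketch comparing two candidate normalisers when a clone is added. Both normalisers and all names are illustrative assumptions, not the post's own construction. The quantity checked is the normalised gap between two strategies, since that gap sets how much weight the utility carries once it is added to others.

```python
import statistics


def range_normalise(u):
    """Illustrative normaliser: rescale u to span [0, 1]."""
    lo, hi = min(u.values()), max(u.values())
    return {s: (x - lo) / (hi - lo) for s, x in u.items()}


def variance_normalise(u):
    """Illustrative normaliser: zero mean, unit variance over the strategies."""
    mean = statistics.fmean(u.values())
    sd = statistics.pstdev(u.values())
    return {s: (x - mean) / sd for s, x in u.items()}


def gap(normalise, u, s, t):
    """Normalised utility difference between strategies s and t."""
    normalised = normalise(u)
    return normalised[s] - normalised[t]


# Hypothetical utility, with and without a clone of strategy "b".
u = {"a": 0.0, "b": 1.0}
u_cloned = {"a": 0.0, "b": 1.0, "b_clone": 1.0}

range_gaps = (gap(range_normalise, u, "b", "a"),
              gap(range_normalise, u_cloned, "b", "a"))
variance_gaps = (gap(variance_normalise, u, "b", "a"),
                 gap(variance_normalise, u_cloned, "b", "a"))
```

In this example the range-based gap is unchanged by the clone, while the variance-based gap shifts, so a variance-style normaliser over the raw strategy set would not be indifferent to cloning here.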
We might want a stronger result, an independence of irrelevant alternatives. But this clashes with symmetry, so the following axioms attempt to get a weaker version of that requirement.
The above axioms are sufficient for the basics, but, as we'll see, they're compatible with a lot of different ways of combining utilities. The following two axioms attempt to put some sort of limitations on these possibilities.
First of all, we want to define events that are irrelevant. In the terminology of this post, let $h$ be a partial history (ending in an action), with at least two possible observations afterwards: $o$ and $o'$.
Then . Then if there exists a bijection between and such that, for all with , , then the observation versus is irrelevant. See here for more on how to define on in this context.
Thus irrelevance means that the utilities in $U_p$ really do not 'care' about $o$ versus $o'$, and that the increased strategy set it allows is specious. So if we remove $o'$ as a possible observation (substituting $o$ instead), this should make no difference:
Weak irrelevance: If $o$ versus $o'$ given $h$ is irrelevant for $p$, then making $o'$ (xor $o$) impossible does not change $f(S, p)$.
Strong irrelevance: If $o$ versus $o'$ given $h$ is irrelevant for $p$ and there is at least one other possible observation after $h$, then making $o'$ (xor $o$) impossible does not change $f(S, p)$.
In our full analysis, we considered other approaches and properties, and I'll briefly list them here.
First of all, there is a set of prospects/options $\mathcal{O}$ that may be different from the set of strategies $S$. This allows you to add other moral considerations, not just strictly consequentialist expected utility reasoning.
In this context, the $f$ defined above was called a 'rating function', which rated the various utilities. With $\mathcal{O}$, there are two other possibilities: the 'choice function', which selects the best option, and the 'permissibility function', which lists the options you are allowed to take.
If we're considering options as outputs, rather than utilities, then we can do things like requiring the options to be Pareto only. We could also consider that the normalisation should stay the same if we remove the non-Pareto options or strategies. We might also consider that it's the space of possible utilities that we should care about; so, for instance, if $u(s'') = q\,u(s) + (1-q)\,u(s')$, and $v(s'') = q\,v(s) + (1-q)\,v(s')$ for the same $q \in [0, 1]$, and similar results hold for all utilities in $U_p$, then we may as well drop $s''$ from the strategy set, as its image is in the mixture of the other strategies.
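The "Pareto only" requirement can be sketched as a simple filter. This is an illustrative reading, assuming options are compared by their values under each utility the agent might hold; the function names and example numbers are hypothetical.

```python
def dominates(s, t, utilities):
    """True if s is at least as good as t under every utility,
    and strictly better under at least one."""
    return (all(u[s] >= u[t] for u in utilities)
            and any(u[s] > u[t] for u in utilities))


def pareto_options(options, utilities):
    """Options that no other option Pareto-dominates."""
    return [t for t in options
            if not any(dominates(s, t, utilities) for s in options)]


# Hypothetical utilities over three options.
u = {"a": 3.0, "b": 1.0, "c": 2.0}
v = {"a": 0.0, "b": 3.0, "c": 0.0}

undominated = pareto_options(["a", "b", "c"], [u, v])
```

Here option `c` is dropped because `a` is at least as good under both utilities and strictly better under `u`. Domination is unaffected by positive affine transformations of each utility, so the filter is well defined on utility classes.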
Finally, some of the axioms above were presented in weaker forms (e.g. the individual normalisations) or stronger forms (e.g. independence of irrelevant alternatives).