Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

There's been a lot of work on how to reach agreement between people with different preferences or values. In practice, reaching agreement can be tricky, because of issues of extortion/trade and how the negotiations actually play out.

To put those issues aside, let's consider a much simpler case: where a single agent is uncertain about its own utility function. Then there is no issue of extortion, because the agent's opponent is simply itself.

This type of comparison is called intertheoretic, rather than interpersonal.

## A question of scale

It would seem that if the agent believed with probability $p$ that it followed utility $u$, and with probability $1-p$ that it followed utility $v$, then it should simply follow utility $w = pu + (1-p)v$.

But this is problematic, because $u$ and $v$ are only defined up to positive affine transformations. Translations are not a problem: sending $u$ to $u + c$ sends $w$ to $w + pc$. But scalings are: sending $u$ to $ru$ (for $r > 0$) does not usually send $w$ to any scaled version of $w$.

So if we identify $[u]$ as the equivalence class of utilities equivalent to $u$, then we can write $w = pu + (1-p)v$, but it's not meaningful to write $[w] = p[u] + (1-p)[v]$.

For clarity, we'll call things like $u$ (which map worlds to real values) utility functions, while $[u]$ will be called utility classes.
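The scaling problem can be seen in a few lines of Python. This is only an illustrative sketch: the three strategies and their utility values are invented, and the only thing taken from the text is the mixing rule $w = pu + (1-p)v$.

```python
# Hypothetical utilities over three strategies (values invented for illustration).
u = [1.0, 0.0, 0.5]
v = [0.0, 1.0, 0.8]
p = 0.5  # probability that the agent follows u


def mix(p, u, v):
    """Return w = p*u + (1-p)*v, computed strategy by strategy."""
    return [p * a + (1 - p) * b for a, b in zip(u, v)]


def best(w):
    """Index of the strategy that maximises w."""
    return max(range(len(w)), key=lambda i: w[i])


w = mix(p, u, v)

# Translating u by a constant c only translates w by p*c: same preferences.
w_shifted = mix(p, [a + 10.0 for a in u], v)

# Scaling u by r > 0 keeps u in the same utility class, but changes w's ranking.
w_scaled = mix(p, [5.0 * a for a in u], v)

print(best(w), best(w_shifted), best(w_scaled))  # prints: 2 2 0
```

The shifted and the scaled versions of `u` both represent the same utility class as `u`, yet only the scaled one changes which strategy the mixture prefers.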

## The setup

This is work done in collaboration with Toby Ord, Owen Cotton-Barratt, and Will MacAskill. We had some slightly different emphases during that process. In this post, I'll present my preferred version, while adding the more general approach at the end.

We will need the structure described in this post:

1. A finite set $\mathbb{S}$ of deterministic strategies the agent can take.
2. A set $\mathbb{U}$ of utility classes the agent might follow.
3. A distribution $p$ over $\mathbb{U}$, reflecting the agent's uncertainty over its own utility functions.
4. Let $\mathbb{U}_p \subset \mathbb{U}$ be the subset to which $p$ assigns a non-zero weight. We'll assume $p$ puts no weight on trivial, constant utility functions.

We'll assume here that $p$ never gets updated, that is, that the agent never sees any evidence that changes its values. The issue of updating is analysed in the sections on reward learning agents.

We'll be assuming that there is some function $f$ that takes in $\mathbb{S}$ and $p$ and outputs a single utility class $f(\mathbb{S}, p) \in \mathbb{U}$ reflecting the agent's values.

## Basic axioms

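1. Relevant data: If the utility classes $[u]$ and $[v]$ have the same values on all of $\mathbb{S}$, then they are interchangeable from $f$'s perspective. Thus, in the terminology of this post, we can identify $\mathbb{U}$ with the set of utility classes on $\mathbb{S}$, i.e. $\mathbb{R}^{\mathbb{S}}$ modulo positive affine transformations.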

This gives $\mathbb{U}$ the structure of $S \cup \{0\}$, where $S$ is a sphere, and $\{0\}$ corresponds to the trivial utility that is equal on all $\mathbb{S}$. The topology of $\mathbb{U}$ is the standard topology on $S$, and the only open set containing $\{0\}$ is the whole of $\mathbb{U}$.
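A short derivation of this sphere structure, as a sketch. It writes $\mathbb{S}$ for the strategy set and $\mathbb{U}$ for the set of utility classes, and assumes $\mathbb{S}$ has $n \geq 2$ elements, so that a utility function on $\mathbb{S}$ is a vector in $\mathbb{R}^n$; the symbols $n$, $c$ and $r$ are introduced here for illustration.

```latex
\begin{align*}
\{\text{utility functions on } \mathbb{S}\} &\cong \mathbb{R}^{n} \\
\mathbb{R}^{n} / \{\text{translations } u \mapsto u + c\} &\cong \mathbb{R}^{n-1} \\
\left(\mathbb{R}^{n-1} \setminus \{0\}\right) / \{\text{scalings } u \mapsto ru,\ r > 0\} &\cong S^{n-2} \\
\mathbb{U} &\cong S^{n-2} \cup \{0\}
\end{align*}
```

Here the point $\{0\}$ is the image of the constant utility functions, which positive scaling maps to themselves rather than to a point on the sphere.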

Then, with a reasonable topology on the space of probability distributions on $\mathbb{U}$ (possibly the weak topology), this leads to the next axiom:

2. Continuity: the function $f(\mathbb{S}, p)$ is continuous in $p$.

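3. Individual normalisation: there is a function $g$ that maps $\mathbb{U}$ to individual utility functions, such that $f(\mathbb{S}, p) = \left[\int_{\mathbb{U}} g([u])\, dp([u])\right]$ (using $p$ as a measure on $\mathbb{U}$).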

The previous axiom means that all utility classes get normalised individually, then added together according to their weight in $p$.
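As an illustration of "normalise individually, then add according to weight", here is a minimal Python sketch. The post does not say which normalisation to use; the range normalisation below (minimum 0, maximum 1) is just one hypothetical choice, and the function names and utility values are made up for the example.

```python
def range_normalise(u):
    """One hypothetical normalisation: rescale u so its minimum is 0 and maximum is 1.

    Assumes u is not constant (the post assumes no weight on trivial utilities).
    """
    lo, hi = min(u), max(u)
    return [(x - lo) / (hi - lo) for x in u]


def aggregate(weighted_utilities, normalise=range_normalise):
    """Normalise each utility individually, then sum using the given weights.

    weighted_utilities: list of (weight, utility) pairs, where each utility
    is a list of values, one per strategy.
    """
    n = len(weighted_utilities[0][1])
    total = [0.0] * n
    for weight, u in weighted_utilities:
        for i, x in enumerate(normalise(u)):
            total[i] += weight * x
    return total


u = [1.0, 0.0, 0.5]
v = [0.0, 1.0, 0.8]
a = aggregate([(0.5, u), (0.5, v)])

# Replacing u by another member of its utility class (5u + 3) changes nothing,
# because the normalisation picks the same representative of the class.
b = aggregate([(0.5, [5.0 * x + 3.0 for x in u]), (0.5, v)])
```

Because each utility is normalised before mixing, the result depends only on the utility classes and their weights, which is what makes the sum meaningful.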

4. Symmetry: If $\rho$ is a stable permutation of $\mathbb{S}$, then $f(\mathbb{S}, p) \circ \rho = f(\mathbb{S}, p \circ \rho)$.

Symmetry essentially means that the labels of $\mathbb{S}$, or the details of how the strategies are implemented, do not matter.

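5. Utility reflection: $f(\mathbb{S}, -p) = -f(\mathbb{S}, p)$, where $-p$ is the distribution obtained from $p$ by replacing each $[u]$ with $[-u]$.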

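6. Cloning indifference: If there exist $s, s' \in \mathbb{S}$ such that for all $[u]$ in $\mathbb{U}$ on which $p$ is non-zero, $u(s) = u(s')$, then $f(\mathbb{S}, p) = f(\mathbb{S} - \{s'\}, p)$.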

Cloning indifference means that the normalisation procedure does not care about multiple strategies that are equivalent on all possible utilities: it treats these strategies as if they were a single strategy.
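To make cloning indifference concrete, the sketch below compares two hypothetical normalisations, neither of which is taken from the post: range normalisation (minimum 0, maximum 1) and variance normalisation (mean 0, variance 1, counting every strategy equally). All numbers are invented, and comparing the top-ranked strategy is only a coarse check of the axiom.

```python
import statistics


def range_normalise(u):
    """Rescale u to minimum 0 and maximum 1 (assumes u is not constant)."""
    lo, hi = min(u), max(u)
    return [(x - lo) / (hi - lo) for x in u]


def variance_normalise(u):
    """Shift u to mean 0 and scale to variance 1, counting every strategy equally."""
    mean = statistics.fmean(u)
    sd = statistics.pstdev(u)
    return [(x - mean) / sd for x in u]


def best(weighted_utilities, normalise):
    """Index of the strategy maximising the weighted sum of normalised utilities."""
    n = len(weighted_utilities[0][1])
    total = [0.0] * n
    for weight, u in weighted_utilities:
        for i, x in enumerate(normalise(u)):
            total[i] += weight * x
    return max(range(n), key=lambda i: total[i])


# Strategies A, B, C (indices 0, 1, 2); all numbers are invented.
u, v = [1.0, 0.0, 0.0], [0.0, 1.0, 0.5]
# The same problem with three extra clones of B appended (indices 3, 4, 5).
u_cloned, v_cloned = u + [u[1]] * 3, v + [v[1]] * 3

original = [(0.52, u), (0.48, v)]
cloned = [(0.52, u_cloned), (0.48, v_cloned)]

print(best(original, range_normalise), best(cloned, range_normalise))  # prints: 0 0
print(best(original, variance_normalise), best(cloned, variance_normalise))  # prints: 1 0
```

In this example the range normalisation ignores the clones, while the variance normalisation switches its preferred strategy from B to A once B is cloned, so it would violate cloning indifference as stated above.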

We might want a stronger result, an independence of irrelevant alternatives. But this clashes with symmetry, so the following axioms attempt to get a weaker version of that requirement.

## Relevance axioms

The above axioms are sufficient for the basics, but, as we'll see, they're compatible with a lot of different ways of combining utilities. The following two axioms attempt to put some sort of limitations on these possibilities.

First of all, we want to define events that are irrelevant. In the terminology of this post, let $h$ be a partial history (ending in an action), with at least two possible observations afterwards: $o$ and $o'$.

Then . Then if there exists a bijection between and such that, for all with , , then the observation versus is irrelevant. See here for more on how to define on in this context.

Thus irrelevance means that the utilities in $\mathbb{U}_p$ really do not 'care' about $o$ versus $o'$, and that the increased strategy set it allows is specious. So if we remove $o$ as a possible observation (substituting $o'$ instead) this should make no difference:

1. Weak irrelevance: If $o$ versus $o'$ given $h$ is irrelevant for $p$, then making $o$ (xor $o'$) impossible does not change $f$.

2. Strong irrelevance: If $o$ versus $o'$ given $h$ is irrelevant for $p$ and there is at least one other possible observation after $h$, then making $o$ (xor $o'$) impossible does not change $f$.

## Full theory

In our full analysis, we considered other approaches and properties, and I'll briefly list them here.

First of all, there is a set $\mathbb{O}$ of prospects/options that may be different from the set of strategies $\mathbb{S}$. This allows you to add other moral considerations, not just strictly consequentialist expected utility reasoning.

In this context, the $f$ defined above was called a 'rating function', which rated the various utilities. With $\mathbb{O}$, there are two other possibilities: the 'choice function', which selects the best option, and the 'permissibility function', which lists the options you are allowed to take.

If we're considering options as outputs, rather than utilities, then we can do things like requiring the options to be Pareto only. We could also consider that the normalisation should stay the same if we remove the non-Pareto options or strategies. We might also consider that it's the space of possible utilities that we should care about; so, for instance, if $\mathbb{S} = \{s_1, s_2, s_3\}$, and $u(s_3) = q\,u(s_1) + (1-q)\,u(s_2)$ for some $q \in [0, 1]$, and similar results hold for all $v$ in $\mathbb{U}_p$, then we may as well drop $s_3$ from the strategy set as its image is in the mixture of the other strategies.

Finally, some of the axioms above were presented in weaker forms (e.g. the individual normalisations) or stronger forms (e.g. independence of irrelevant alternatives).

## Comments

You talk like $p$ is countably supported, but everything you've said generalizes to arbitrary probability measures $p$ over $\mathbb{U}$, if you replace "for all $[u]$ assigned nonzero probability by $p$" with "for all $[u]$ in some set assigned probability $1$ by $p$".

If you endow $\mathbb{U}$ with the quotient topology from $\mathbb{R}^{\mathbb{S}}$, then the only open set containing $\{0\}$ is all of $\mathbb{U}$. This is a funny-looking topology, but I think it is ultimately the best one to use. With this topology, every function to $\mathbb{U}$ is continuous at any point that maps to $\{0\}$. As a consequence, the assumption "if $f(\mathbb{S}, p) \neq \{0\}$" in the continuity axiom is unnecessary. More importantly, what topology on the space of probability distributions did you have in mind? Probably the weak topology?

I find independence of irrelevant alternatives more compelling than symmetry, but as long as we're accepting symmetry instead, it probably makes sense to strengthen the assumption to isomorphism-invariance: If $\rho: \mathbb{S}_1 \to \mathbb{S}_2$ is a bijection, then $f(\mathbb{S}_2, p) \circ \rho = f(\mathbb{S}_1, p \circ \rho)$.

The relevance axioms section is riddled with type errors. only makes sense if , which would make sense if represented a space of outcomes rather than a space of strategies (which seems to me to be a more natural space to pay attention to anyway), or if is fully under the agent's control, whereas makes sense if is fully observable to the agent. If is neither fully under the agent's control nor fully observable to the agent, then I don't think either of these make sense. If we're using instead of , then formalizing irrelevance seems trickier. The best I can come up with is that is supported on of the form , where is the probability of . The weak and strong irrelevance axioms also contain type errors, since the types of the output and second input of depend on its first input, though this can probably be fixed.

I didn't understand any of the full theory section, so if any of that was important, it was too brief.

Yes to your two initial points; I wanted to keep the exposition relatively simple.

Do you disagree with the reasoning presented in the picture-proof? That seems a simple argument against IIA. Isomorphism invariance makes sense, but I wanted to emphasise the inner structure of $\mathbb{S}$.

Updated the irrelevance section to clarify that is fully observed and happens before the agent takes any actions, and that should be read as .

The full theory section is to write up some old ideas, to show that the previous axioms are not set in stone but that other approaches are possible and were considered.

Your picture proof looks correct, but it relies on symmetry, and I was saying that I prefer IIA instead of symmetry. I'm not particularly confident in my endorsement of IIA, but I am fairly confident in my non-endorsement of symmetry. In real situations, strategies/outcomes have a significant amount of internal structure which seems relevant and is not preserved by arbitrary permutations.

You've just replaced a type error with another type error. Elements of $\mathbb{U}$ are just (equivalence classes of) functions $\mathbb{S} \to \mathbb{R}$. Conditioning like that isn't a supported operation.

You're right. I've drawn the set of utility functions too broadly. I'll attempt to fix this in the post.

Ok, I chose the picture proof because it was a particularly simple example of symmetry. What kind of internal structure are you thinking of?

For strategies: This ties back in to the situation where there's an observable event $E$ that you can condition your strategy on, and the strategy space has a product structure $\mathbb{S} = \mathbb{S}_E \times \mathbb{S}_{\neg E}$. This product structure seems important, since you should generally expect utility functions to factor in the sense that $u(s, s') = q\,u_E(s) + (1-q)\,u_{\neg E}(s')$ for some functions $u_E$ and $u_{\neg E}$, where $q$ is the probability of $E$ (I think for the relevance section, you want to assume that whenever there is such a product structure, $p$ is supported on utility functions that factor, and you can define conditional utility for such functions). Arbitrary permutations of $\mathbb{S}$ that do not preserve the product structure don't seem like true symmetries, and I don't think it should be expected that an aggregation rule should be invariant under them. In the real world, there are many observations that people can and do take into account when deciding what to do, so a good model of strategy-space should have a very rich structure.

For outcomes, which is what utility functions should be defined on anyway: Outcomes differ in terms of how achievable they are. I have an intuition that if an outcome is impossible, then removing it from the model shouldn't have much effect. Like, you shouldn't be able to rig the aggregator function in favor of moral theory 1 as opposed to moral theory 2 by having the model take into account all the possible outcomes that could realistically be achieved, and also a bunch of impossible outcomes that theory 2 thinks are either really good or really bad, and theory 1 thinks are close to neutral. A natural counter-argument is that before you know which outcomes are impossible, any Pareto-optimal way of aggregating your possible preference functions must not change based on what turns out to be achievable; I'll have to think about that more. Also, approximate symmetries between people's preferences seem relevant to interpersonal utility comparison in practice, in the sense that two people's preferences tend to look fairly similar to each other in structure, but with each person's utility function centered largely around what happens to themselves instead of the other person, and this seems to help us make comparisons of the form "the difference between outcomes 1 and 2 is more important for person A than for person B"; I'm not sure if this way of describing it is making sense.

OK, got a better formalism: https://agentfoundations.org/item?id=1449

I think I've got something that works; I'll post it tomorrow.