Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Two failed attempts

I initially defined partial preferences in terms of foreground variables $Y$ and background variables $Z$.

Then a partial preference would be defined by $y^+$ and $y^-$ in $Y$, such that, for any $z \in Z$, the world described by $(y^+, z)$ would be better than the world described by $(y^-, z)$. The idea being that, everything else being equal (i.e. the same $z$), a world with $y^+$ was better than a world with $y^-$. The other assumption is that, within mental models, human preferences can be phrased as one or many binary comparisons. So if we have a partial preference like $P_1$: "I prefer a chocolate ice-cream to getting kicked in the groin", then $(y^+, z)$ and $(y^-, z)$ are otherwise identical worlds with a chocolate ice-cream and a groin-kick, respectively.

Note that in this formalism, there are two subsets of the set of worlds, $\{y^+\} \times Z$ and $\{y^-\} \times Z$, and a map $l$ between them (which just sends $(y^+, z)$ to $(y^-, z)$).

In a later post, I realised that such a formalism can't capture seemingly simple preferences, such as $P_2$: "$n+1$ people is better than $n$ people". The problem is that preferences like that don't talk about just two subsets of worlds, but many more.

Thus a partial preference was defined as a preorder. Now, a preorder is certainly rich enough to include preferences like $P_2$, but it allows for far too many different types of structures, needing a complicated energy-minimisation procedure to turn a preorder into a utility function.

This post presents another formalism for partial preferences, one that keeps the initial intuition but can capture preferences like $P_2$.

The formalism

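Let $W$ be the (finite) set of all worlds, seen as universes with their whole history.

Let $X$ be a subset of $W$, and let $l$ be an injective (one-to-one) map from $X$ to $W$. Write $l(X)$ for the image of $l$, and $l^{-1}: l(X) \to X$ for the inverse.

Then the preference is determined by:

  • For all $x \in X$, $x > l(x)$, meaning that the world $x$ is preferred to the world $l(x)$.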

If $X$ and $l(X)$ are disjoint, this just reproduces the original definition, with $X = \{y^+\} \times Z$ and $l(X) = \{y^-\} \times Z$.

But it also allows preferences like $P_2$, defining $l(x)$ as something like "the same world as $x$, but with one less person". In that case, $l$ maps some parts of $X$ to itself[1].
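As a minimal sketch of the two cases above, assume worlds can be modelled as small Python tuples and the injective map as a dictionary sending each world to the world it is preferred to. The names `ice_cream_map`, `population_map` and `is_injective`, and the specific background values and population cap, are hypothetical and chosen only for illustration.

```python
def is_injective(l):
    """True if no two worlds are sent to the same world."""
    return len(set(l.values())) == len(l)

# Ice-cream example: worlds are (foreground, background) pairs.
backgrounds = ["sunny", "rainy"]  # hypothetical background values
ice_cream_map = {("ice-cream", z): ("groin-kick", z) for z in backgrounds}

# Population example: worlds are (background, population) pairs and the
# map removes one person, so it is undefined at population 0.
max_population = 3  # hypothetical cap
population_map = {("village", n): ("village", n - 1)
                  for n in range(1, max_population + 1)}

for name, l in [("ice-cream", ice_cream_map), ("population", population_map)]:
    domain, image = set(l), set(l.values())
    print(name, "| injective:", is_injective(l),
          "| domain and image disjoint:", domain.isdisjoint(image))
```

In this toy setup the ice-cream map has disjoint domain and image, while the population map does not, since a world with two people is both the image of the three-person world and the preimage of the one-person world.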

Then for any element $x \in X$, we can construct its upwards and downwards chain:

  • $\ldots, l^{-2}(x), l^{-1}(x), x, l(x), l^{2}(x), \ldots$

These chains end when they cycle: so there is an $n$ and an $m$ so that $l^{-n}(x) = l^{m}(x)$ (equivalently, $l^{m+n}(x) = x$).

If they don't cycle, the upwards chain ends when there is an $l^{-n}(x)$ which is not an element of $l(X)$ (hence $l^{-1}$ is not defined on it), and the downward chain ends when there is an $l^{m}(x)$ which is not in $X$ (and hence $l$ is not defined on it).

So, for example, for $P_1$, all the chains contain two elements only: $x$ and $l(x)$. For $P_2$, there are no cycles, and the lower chain ends when the population hits zero, while the upper chain ends when the population hits some maximal value.
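The chain and cycle construction can be sketched in Python. This is an illustrative sketch rather than the author's implementation: it assumes the injective map is stored as a dictionary `l` from each world to the world it is preferred to, and the function name `decompose` is hypothetical.

```python
def decompose(l):
    """Split the worlds touched by an injective map into chains and cycles.

    `l` is a dict sending each world to the world it is preferred to.
    Chains are listed best world first; each cycle starts at an arbitrary world.
    """
    if len(set(l.values())) != len(l):
        raise ValueError("the map must be injective")
    image = set(l.values())
    chains, seen = [], set()
    # A chain starts at a world that nothing is mapped onto.
    for start in l:
        if start in image:
            continue
        chain = [start]
        while chain[-1] in l:            # follow l downwards until undefined
            chain.append(l[chain[-1]])
        chains.append(chain)
        seen.update(chain)
    # Any world in the domain not reached from a chain top lies on a cycle.
    cycles = []
    for start in l:
        if start in seen:
            continue
        cycle = [start]
        while l[cycle[-1]] != start:
            cycle.append(l[cycle[-1]])
        cycles.append(cycle)
        seen.update(cycle)
    return chains, cycles


# Population-style example with a hypothetical cap of 3 people.
population_map = {n: n - 1 for n in range(1, 4)}
print(decompose(population_map))                      # ([[3, 2, 1, 0]], [])
print(decompose({"a": "b", "b": "c", "c": "a"}))      # ([], [['a', 'b', 'c']])
```

Because the map is injective and the set of worlds is finite, every world in the domain is either below a unique chain top or on a cycle, which is why two passes over the dictionary are enough.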

Utility differences between clearly comparable worlds

Since the worlds of $X \cup l(X)$ decompose either into chains or cycles via $l$, there is no need for the full machinery for utilities constructed for preorders in the earlier post.

One thing we can define unambiguously is the relative utility between two elements of the same chain/cycle:

  • If $x$ and $y$ are in the same cycle, then $U_P(x) - U_P(y) = 0$.
  • Otherwise, if $x$ and $y = l^{n}(x)$ are in the same chain, then $U_P(x) - U_P(y) = n$.
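The two rules above can be sketched as a small function. This is a hedged illustration under the assumption that the injective map is a dictionary `l` from each world to the world it is preferred to; `orbit` and `relative_utility` are hypothetical names, and the function returns `None` for pairs of worlds that the partial preference does not compare. It covers only the unnormalised differences.

```python
def orbit(l, x):
    """Worlds reached from x by applying l repeatedly (x first), stopping at a repeat."""
    out, seen = [x], {x}
    while out[-1] in l and l[out[-1]] not in seen:
        out.append(l[out[-1]])
        seen.add(out[-1])
    return out


def relative_utility(l, x, y):
    """Unnormalised utility of x minus utility of y, or None if not comparable.

    Assumes `l` is an injective dict sending each world to a less-preferred world.
    """
    down_x, down_y = orbit(l, x), orbit(l, y)
    # With an injective map, an orbit can only loop back to its own start.
    on_cycle = down_x[-1] in l and l[down_x[-1]] == x
    if y in down_x:
        return 0 if on_cycle else down_x.index(y)
    if x in down_y:
        return -down_y.index(x)
    return None


population_map = {n: n - 1 for n in range(1, 4)}          # 3 -> 2 -> 1 -> 0
print(relative_utility(population_map, 3, 1))             # 2
print(relative_utility(population_map, 1, 3))             # -2
print(relative_utility({"a": "b", "b": "a"}, "a", "b"))   # 0
```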

For now, let's normalise these relative utilities to , by normalising each chain individually; note that if every world in the chain is reachable, this is the same as the mean-max normalisation on each chain:

  • If and are in the same cycle, then .
  • Otherwise, if and are in the same chain with total elements in the chain, then .

We could try to extend $U_P$ to a global utility function which compares different chains and compares values in chains with values outside of $X \cup l(X)$. But as we shall see in the next post, this doesn't work when combining different partial preferences.

Interpretation of $l$

The interpretation of $l$ is something like "this is the key difference in features that causes the difference in world-rankings". So, for $P_1$, the $l$ switches out a chocolate ice-cream and substitutes a groin-kick. While for $P_2$, the $l$ simply removes one person from the world.

This means that, locally, we can express $X \cup l(X)$ in the same formalism as in the first post, as a product $Y \times Z$. Here the $Z$ are the background variables, while $Y$ is a discrete variable that $l$ operates on.

We cannot necessarily express this product globally. Consider, for $P_2$, a situation where $z_1$ is an idyllic village, $z_2$ is an Earthbound human population, and $z_3$ a star-spanning civilization with extensive use of human uploads.

And if $n$ denotes the number of people in each world, it's clear that $n$ hits a low maximum for $z_1$ (thousands?), can rise much higher for $z_2$ (trillions?), and even higher for $z_3$ (need to use scientific notation). So though $(n, z_3)$ makes sense for a very large $n$, $(n, z_1)$ with the same $n$ is nonsense. So there is no global decomposition of these worlds as $Y \times Z$.
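A toy illustration of why the product fails globally, using deliberately tiny, hypothetical population caps in place of the thousands, trillions and larger figures mentioned above. The setting names and the dictionary `max_population` are assumptions made for this sketch only.

```python
# Hypothetical population caps for three background settings.
max_population = {"village": 5, "earthbound": 50, "star-spanning": 500}

# Worlds that actually make sense: each setting paired with a feasible population.
worlds = {(z, n) for z, cap in max_population.items() for n in range(cap + 1)}

# A global product would pair every setting with every population level.
populations = {n for _, n in worlds}
full_product = {(z, n) for z in max_population for n in populations}

# Pairs demanded by the product that do not correspond to any sensible world.
nonsense = full_product - worlds
print(len(worlds), len(full_product), ("village", 500) in nonsense)
```

Within a single setting the worlds do form a simple chain indexed by population, which is the local decomposition; only the attempt to use one population range for every setting produces the nonsensical pairs.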

  1. Note that there is a similarity with CP-nets, if we consider this as expressing a preference over population size while keeping other variables constant. ↩︎


Comments

I really like the refinement of the formalization, with the explanations of what to keep and what was missing.

That said, I feel like the final formalization could be defined directly as a special type of preorder, one composed only of disjoint chains and cycles. Because as I understand the rest of the post, that is what you use when computing the utility function. This formalization would also be more direct, with one less layer of abstraction.

Is there any reason to prefer the "injective function" definition to the "special preorder" one?

This felt more intuitive to me (and it's a minor result that injective function->special preorder) and closer to what humans actually seem to have in their minds.

That said, since it's equivalent, there is nothing wrong with starting from either approach.

I am studying along your Research Agenda, and I am very excited about your whole plan.

As for this one, I am puzzled about how to formalize preferences like "I prefer apples to both bananas and peaches" together, since the function l is one-to-one here. In contrast, the model you proposed in "Toy model piece #1: Partial preferences revisited" deals with this quite easily.

Does this imply I prefer X apples to Y bananas and Z pears, where Y+Z=X?

If it's just for a single fruit, I'd decompose that preference into two separate ones? Apple vs Banana, Apple vs Pear.

Sorry for my vague expressions here. What I was trying to say is "I prefer apples to bananas and I prefer apples to peaches". My original thought was this: if the statement is formalized in a single world, then since it is not clear whether I prefer bananas to peaches, it seems that the function l has to map apples to bananas and peaches at the same time, which violates its one-to-one property.

But maybe I also asked a bad question: I mistook the definition of partial preferences for any simple statement about preferences, and tried to apply the model proposed in this post to the "composite" preferences, which actually expressed two preferences.