Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Toy model piece #4: partial preferences, re-re-visited

## Comments

**adamShimi:**

I really like the refinement of the formalization, with the explanations of what to keep and what was missing.

That said, I feel like the final formalization could be defined directly as a special type of preorder, one composed only of disjoint chains and cycles, because, as I understand the rest of the post, that is what you use when computing the utility function. This formalization would also be more direct, with one less layer of abstraction.

Is there any reason to prefer the "injective function" definition to the "special preorder" one?

**Stuart_Armstrong:**

This felt more intuitive to me (and it's a minor result that injective function implies special preorder), and closer to what humans actually seem to have in their minds.

That said, since the two are equivalent, there is nothing wrong with starting from either approach.

**Acorn:**

I am studying along your Research Agenda, and I am very excited about your whole plan.

As for this one, I am puzzled about how to formalize preferences like "I prefer apples to both bananas and peaches", since the function l is one-to-one here. In contrast, the model you proposed in "*Toy model piece #1: Partial preferences revisited*" deals with this quite easily.

**Stuart_Armstrong:**

Does this imply you prefer X apples to Y bananas and Z pears, where Y+Z=X?

If it's just for a single fruit, I'd decompose that preference into two separate ones: Apple vs Banana, Apple vs Pear.

**Acorn:**

Sorry for my vague expression; what I was trying to say is "I prefer apples to bananas and I prefer apples to peaches". My original thought was: if this statement is formalized in a single world, then since it is not clear whether I prefer bananas to peaches, it seems that the function l has to map apples to both bananas and peaches at the same time, which violates its one-to-one property.

But maybe I asked a bad question: I mistook the definition of partial preferences for any simple statement about preferences, and tried to apply the model proposed in this post to a "composite" preference, which actually expresses two preferences.

## Two failed attempts

I initially defined partial preferences in terms of foreground variables Y and background variables Z.

Then a partial preference would be defined by y+ and y− in Y, such that, for any z∈Z, the world described by (y+,z) would be better than the world described by (y−,z). The idea is that, everything else being equal (i.e. the same z), a world with y+ is better than a world with y−. The other assumption is that, within mental models, human preferences can be phrased as one or many binary comparisons. So if we have a partial preference like P1: "I prefer a chocolate ice-cream to getting kicked in the groin", then (y+,z) and (y−,z) are otherwise identical worlds with a chocolate ice-cream and a groin-kick, respectively.

Note that in this formalism, there are two subsets of the set of worlds, y+×Z and y−×Z, and a map l between them (which just sends (y+,z) to (y−,z)).

In a later post, I realised that such a formalism can't capture seemingly simple preferences, such as P2: "n+1 people is better than n people". The problem is that preferences like that don't talk about just two subsets of worlds, but about many more.

Thus a partial preference was defined as a preorder. Now, a preorder is certainly rich enough to include preferences like P2, but it allows far too many different types of structures, requiring a complicated energy-minimisation procedure to turn a preorder into a utility function.

This post presents another formalism for partial preferences, one that keeps the initial intuition but can also capture preferences like P2.

## The formalism

Let W be the (finite) set of all worlds, seen as universes with their whole history.

Let X be a subset of W, and let l be an injective (one-to-one) map from X to W. Define Y=l(X), the image of l, and l^−1:Y→X as its inverse.

Then the preference is determined by:

for all x∈X, the world x is preferred to the world l(x).

If X and Y are disjoint, this just reproduces the original definition, with X=y+×Z and Y=y−×Z.
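As a concrete sketch of this disjoint case, P1 can be encoded as a finite injective map between sets of worlds. The world representation below (feature/background pairs and the names `Z`, `l_inv`) is invented for illustration, not taken from the post:

```python
# Hypothetical encoding of P1: worlds are (foreground feature, background) pairs,
# and l swaps the ice-cream for a groin-kick while keeping z fixed.
Z = ["z0", "z1", "z2"]                     # background variables, "everything else"

X = [("ice-cream", z) for z in Z]          # the preferred worlds, y+ x Z
l = {x: ("groin-kick", x[1]) for x in X}   # the injective map l: X -> W

Y = set(l.values())                        # the image l(X) = y- x Z
l_inv = {w: x for x, w in l.items()}       # inverse on Y, well-defined since l is injective

assert len(Y) == len(X)                    # injectivity: no two worlds share an image
assert set(X).isdisjoint(Y)                # X and Y disjoint, as in the original definition
```

Each pair (x, l(x)) states that x is preferred to l(x), with the same background z on both sides.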

But it also allows preferences like P2, defining l(x) as something like "the same world as x, but with one less person". In that case, l maps some parts of X to itself[1].

Then for any element x∈X, we can construct its upwards and downwards chains:

…, l^−2(x), l^−1(x), x, l(x), l^2(x), …

These chains end when they cycle: so there are an n and an m such that l^−n(x)=l^m(x) (equivalently, l^{m+n}(x)=x).

If they don't cycle, the upwards chain ends when there is an l^−n(x) which is not an element of Y (hence l^−1 is not defined on it), and the downwards chain ends when there is an l^n(x) which is not in X (and hence l is not defined on it).

So, for example, for P1, all the chains contain two elements only: x and l(x). For P2, there are no cycles, and the lower chain ends when the population hits zero, while the upper chain ends when the population hits some maximal value.

## Utilities difference between clearly comparable worlds

Since the worlds of X∪Y decompose, via l, into chains or cycles, there is no need for the full energy-minimisation machinery used to turn a general preorder into a utility function.

One thing we can define unambiguously is the relative utility between two elements of the same chain/cycle:

U_l(x) − U_l(l^n(x)) = n,

so each application of l corresponds to the loss of one unit of utility. (On a cycle, where l^{m+n}(x)=x, consistency forces all elements to have the same relative utility.)

For now, let's normalise these relative utilities to ˆU_l by normalising each chain individually; note that if every world in the chain is reachable, this is the same as the mean-max normalisation on each chain.
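A self-contained sketch of this per-chain normalisation. I am reading "mean-max normalisation" as: mean 0 and maximum 1 across the chain — that convention, and the helper names, are assumptions on my part, since the exact formula is not spelled out in this excerpt:

```python
def relative_utilities(chain):
    """Raw relative utilities along a chain (best world first):
    each application of l drops utility by exactly 1."""
    k = len(chain)
    return {w: float(k - 1 - i) for i, w in enumerate(chain)}

def mean_max_normalise(chain):
    """Normalise a chain's utilities so the mean is 0 and the max is 1.
    (One reading of 'mean-max normalisation'; assumes the chain has
    at least two elements, so max > mean.)"""
    U = relative_utilities(chain)
    mean = sum(U.values()) / len(U)
    top = max(U.values())
    return {w: (u - mean) / (top - mean) for w, u in U.items()}

# P2-style chain, populations 4 down to 0:
chain = [4, 3, 2, 1, 0]
U_hat = mean_max_normalise(chain)
# raw utilities are {4: 4, 3: 3, 2: 2, 1: 1, 0: 0}, mean 2, max 4,
# so U_hat == {4: 1.0, 3: 0.5, 2: 0.0, 1: -0.5, 0: -1.0}
```

Each chain is normalised separately, which is exactly why comparing utilities *across* chains remains undefined at this point.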

We could try to extend ˆU_l to a global utility function, one that compares different chains, and compares values in chains with values outside of X∪Y. But as we shall see in the next post, this doesn't work when combining different partial preferences.

## Interpretation of l

The interpretation of l is something like "this is the key difference in features that causes the difference in world-rankings". So, for P1, the l switches out a chocolate ice-cream and substitutes a groin-kick. While for P2, the l simply removes one person from the world.

This means that, *locally*, we can express X∪Y in the same Y×Z formalism as in the first post. Here the Z are the background variables, while Y is a discrete variable that l operates on.

We cannot necessarily express this Y×Z product *globally*. Consider, for P2, a situation where z0 is an idyllic village, z1 is an Earthbound human population, and z2 a star-spanning civilization with extensive use of human uploads. If Y denotes the number of people in each world, it's clear that Y hits a low maximum for z0 (thousands?), can rise much higher for z1 (trillions?), and even higher for z2 (need to use scientific notation). So though (10^20, z2) makes sense, (10^20, z0) is nonsense. So there is no global decomposition of these worlds as Y×Z.

[1] Note that there is a similarity with CP-nets, if we consider this as expressing a preference over population size while keeping other variables constant.