Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

My previous approach to combining preferences went like this: from the partial preferences , create a normalised utility function that is defined over all worlds (and which is indifferent to the information that didn't appear in the partial model). Then simply add these utilities, weighted according to the weight/strength of the preference.

But this method fails. Consider, for example, the following partial preferences, all weighted with the same weight of :

• .
• .
• .
• .
• .

If we follow the standard normalisation approach, then the normalised utility will be defined[1] as:

• , , and otherwise .

Then adding together all five utility functions would give:

• .

There are several problems with this utility. Firstly, the utility of and the utility of are the same, even though in the only case where there is a direct comparison between them, is ranked higher. We might say that we are missing the comparisons between and and , and could elicit these preferences using one-step hypotheticals. But what if comparing to is a complex preference, and all that happens is that the agent combines and ? If we added another partial preference that said , then would end up ranked above !

Another, more subtle point is that the difference between and is too large. Simply having and would give . Adding in moves this difference to . But note that is already implicit in and , so adding it shouldn't make the difference larger.

In fact, if the difference in utility between and were larger than , adding in should make the difference between and smaller: because having weighted at means that the agent's preference of over is not that strong.
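
A small numerical sketch of this second problem, under assumptions that the text above does not spell out: each two-outcome preference of weight w is taken to normalise to -w/2 on the lower outcome and +w/2 on the higher one (zero elsewhere), and the outcome names `a`, `b`, `c` are hypothetical stand-ins rather than the outcomes of the example.

```python
from collections import defaultdict

def naive_sum(preferences):
    """Add up per-preference normalised utilities.

    Each preference is (weight, lower, higher). Assumed normalisation:
    -weight/2 on the lower outcome, +weight/2 on the higher, 0 elsewhere.
    """
    total = defaultdict(float)
    for weight, lower, higher in preferences:
        total[lower] -= weight / 2
        total[higher] += weight / 2
    return dict(total)

# Hypothetical chain a < b < c, then the same chain plus the implied a < c.
chain = [(1.0, "a", "b"), (1.0, "b", "c")]
with_shortcut = chain + [(1.0, "a", "c")]

u_chain = naive_sum(chain)
u_shortcut = naive_sum(with_shortcut)
gap_chain = u_chain["c"] - u_chain["a"]
gap_shortcut = u_shortcut["c"] - u_shortcut["a"]
print(gap_chain, gap_shortcut)  # prints: 1.0 2.0
```

Under these assumptions, adding a preference that the chain already implies doubles the gap between `a` and `c`, which is the behaviour objected to above.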

Energy minimising between utilities

So, how should we combine these preferences otherwise? Well, if I have a preference , of weight , that ranks outcome below outcome (write this as ), then, if these outcomes appear nowhere else in any partial preference, will be .

So in a sense, that partial preference is trying to set the distance between those two outcomes to . Call this the energy-minimising condition for .

Then for a utility function , we can define the energy of , as compared with the (partially defined) normalised utility corresponding to . It is:

• .

This is the difference between the weighted distance between the outcomes that , and the one that actually gives.

Because different partial preferences have different numbers of elements to compare, we can compute the average energy of :

• .

Global energy minimising condition

But weights have another role to play here; they measure not only how much is preferred to , but how important it is to reach that preference. So, for humans, " with weight " means both:

• is not much preferred to .
• The human isn't too fussed about the ordering of and .

For general agents, these two could be separate phenomena; but for humans, they generally seem to be the same thing. So we can reuse the weights to compute the global energy for as compared to all partial preferences, which is just the weighted sum of its average energy for each partial preference:

• .

Then the actual ideal is defined to be the that minimises this energy term.
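
The minimisation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the energy of a pair is taken to be the squared difference between the preference's weight and the utility gap that the candidate utility assigns (a quadratic form, consistent with the convexity discussed in the next section, but not confirmed by the formulas as displayed here), and the outcome names and the helper names `global_energy` and `minimise_energy` are hypothetical.

```python
def global_energy(utility, preferences):
    """Assumed global energy: for each preference, weight times the mean
    over its pairs of (weight - (U(higher) - U(lower)))**2."""
    total = 0.0
    for weight, pairs in preferences:
        errors = [(weight - (utility[hi] - utility[lo])) ** 2 for lo, hi in pairs]
        total += weight * sum(errors) / len(errors)
    return total

def minimise_energy(outcomes, preferences, anchor, sweeps=200):
    """Coordinate descent on the quadratic energy, holding U(anchor) at 0
    to remove the freedom to translate the whole utility function."""
    utility = {o: 0.0 for o in outcomes}
    for _ in range(sweeps):
        for o in outcomes:
            if o == anchor:
                continue
            num = den = 0.0
            for weight, pairs in preferences:
                coeff = weight / len(pairs)
                for lo, hi in pairs:
                    if o == hi:    # this pair wants U(o) = U(lo) + weight
                        num += coeff * (utility[lo] + weight)
                        den += coeff
                    elif o == lo:  # this pair wants U(o) = U(hi) - weight
                        num += coeff * (utility[hi] - weight)
                        den += coeff
            if den > 0:
                utility[o] = num / den  # weighted mean of the wanted values
    return utility

# Hypothetical example: a < b, b < c and a < c, each with weight 1.
prefs = [(1.0, [("a", "b")]), (1.0, [("b", "c")]), (1.0, [("a", "c")])]
U = minimise_energy(["a", "b", "c"], prefs, anchor="a")
print(U, global_energy(U, prefs))
```

In this toy case the minimiser puts `b` at 2/3 and `c` at 4/3, so the gap between `a` and `c` lands between the 1 that the direct preference alone would give and the 2 that the chain alone would give.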

Solutions

Now, it's clear this expression is convex. But it need not be strictly convex (which would imply a single solution): for example, if () and () were the only partial preferences, then there would be no conditions on the relative utilities of , and .

Say that is linked to , by defining a link as "there exists a with or ", and then making this definition transitive and reflexive (it's automatically symmetric). In the example above, with , , all of are linked.

Being linked is an equivalence relation. And within a class of linked worlds, if we fix the utility of one world, then the energy minimisation equation becomes strictly convex (and hence has a single solution). Thus, within a class of linked worlds, the energy minimisation equation has a single solution, up to translation.

So if we want a single , translate the solution for each linked class so that the average utility in that class is equal to the average of every other linked class. And this would then define uniquely (up to translation).

For example, if we only had () and (), this could set to be:

Here, the average utility in each linked class (, and ) is .
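
A minimal sketch of the two steps just described: grouping outcomes into linked classes, then translating each class to a common average. The outcome names, the raw utility values and the helper names `linked_classes` and `recentre` are all hypothetical, and the common average is set to zero purely for illustration.

```python
def linked_classes(outcomes, pairs):
    """Group outcomes into linked classes with a small union-find.

    `pairs` lists every (lower, higher) comparison made by any preference.
    """
    parent = {o: o for o in outcomes}

    def find(o):
        while parent[o] != o:
            parent[o] = parent[parent[o]]  # path halving
            o = parent[o]
        return o

    for lower, higher in pairs:
        parent[find(lower)] = find(higher)

    classes = {}
    for o in outcomes:
        classes.setdefault(find(o), []).append(o)
    return list(classes.values())

def recentre(utility, classes, target=0.0):
    """Translate each class so that its average utility equals `target`."""
    shifted = dict(utility)
    for members in classes:
        mean = sum(utility[o] for o in members) / len(members)
        for o in members:
            shifted[o] = utility[o] - mean + target
    return shifted

# Two unrelated comparisons, a < b and c < d, give two linked classes.
outcomes = ["a", "b", "c", "d"]
pairs = [("a", "b"), ("c", "d")]
classes = linked_classes(outcomes, pairs)

# Hypothetical per-class solutions, each found only up to translation.
raw = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 6.0}
centred = recentre(raw, classes)
print(classes, centred)
```

The recentring changes no utility difference inside a class, so it leaves the energy untouched; it only removes the arbitrary offset between classes.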

Applying this to the example

So, applying this approach to the full set of the , above (and fixing ), we'd get:

• .

Here is in the middle of and , as it should be, while the utilities of and are defined by their distance from only. The distance between and is . This is between (which would be given by and only) and (which would be given by only).

1. I've divided the normalisation from that post by , to fit better with the methods of this post. Dividing everything in a sum by the same constant gives the same equivalence class of utility functions.