Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I'm working towards a toy model that will illustrate all the steps in the research agenda. It will start with some algorithmic stand-in for the "human", and proceed to create the synthesised utility function $U_H$, following all the steps in that research agenda. So I'll be posting a series of "toy model pieces" that will ultimately be combined into a full toy model. Along the way, I hope to get a better understanding of how to do the research agenda in practice, and maybe even modify that agenda based on insights from making the toy model.

For this post, I'll look in more detail into how to combine different types of (partial) preferences.

Short-distance, long-distance, and other preferences

I normally use population ethics as my go-to example for a tension between different types of preferences. You can get a lot of mileage by contrasting the repugnance of the repugnant conclusion with the seeming intuitiveness of the mere addition argument.

However, many people who read this will have strong opinions about population ethics, or at least some opinions. Since I'm not trying to convince anyone of my particular population ethics here, I thought it best to shift to another setting where we could see similar tensions at work, without the baggage.

Living in a world of smiles

Suppose you have two somewhat contradictory ethical intuitions. Or rather, in the formulation of my research agenda, two somewhat contradictory partial preferences.

The first is that any world would be better if people smiled more ($P_1$). The second is that if almost everyone smiles all the time, it gets really creepy ($P_2$).

Now, the proper way of resolving those preferences is to appeal to meta-preferences, or to cut them up into their web of connotations: why do we value smiles? Is it because people are happy? Why do we find universal smiling creepy? Is it because we fear that something unnatural is making them smile that way? That's the proper way of resolving those preferences.

However, let's pretend there are no meta-preferences, and no connotations, and just try to combine the preferences as given.

Smiles and worlds

Fix the population to a hundred people, and let $W$ be the set of worlds. This set will contain one hundred and one different worlds, described by $w_i$, where $i$ is an integer between $0$ and $100$, denoting the number of people smiling in that world.

We can formalise the preferences as follows:

  • $P_1 = \{w_i < w_{i+1}\}$: adding one more smile always improves the world.
  • $P_2 = \{w_j < w_i \mid i < 90 \text{ and } j \geq 90\}$: any world where fewer than $90$ people smile is better than any world where $90$ or more do.

These give rise to the following utility functions (for simplicity of the formulas, I've translated the definition of $u_2$; translations don't matter when combining utilities; I've also written $u_{P_i}$ as $u_i$):

  • $u_1(w_i) = i$.
  • $u_2(w_i) = -1$ if $i \geq 90$, and $u_2(w_i) = 0$ otherwise.
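As a concrete anchor for the calculations below, here is a minimal sketch of this setup in Python (`u1` and `u2` simply mirror the definitions of $u_1$ and $u_2$ above):

```python
# Worlds w_0 ... w_100, indexed by the number of people smiling.
WORLDS = range(101)

def u1(i):
    """u_1: every extra smile makes the world better."""
    return i

def u2(i):
    """u_2: worlds where 90 or more people smile get a flat penalty."""
    return -1 if i >= 90 else 0
```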

But before being combined, these preferences have to be normalised. There are multiple ways we could do this, and I'll somewhat arbitrarily choose the "mean-max" method, which scales each utility so that the difference between the top world and the average world is one[1].

Given that normalisation, we have:

  • $\max(u_1) - \text{mean}(u_1) = 100 - 50 = 50$.
  • $\max(u_2) - \text{mean}(u_2) = 0 - (-11/101) = 11/101$.

Thus we send the $u_i$ to their normalised counterparts $\hat{u}_i$:

  • $u_1 \to \hat{u}_1 = u_1/50$.
  • $u_2 \to \hat{u}_2 = (101/11)\, u_2$.
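Here is a short self-contained sketch of the mean-max step itself, which recovers the constants above (`mean_max_gap` and the hatted names are just illustrative):

```python
from fractions import Fraction
from statistics import mean

WORLDS = range(101)
u1 = lambda i: Fraction(i)                     # more smiles are better
u2 = lambda i: Fraction(-1 if i >= 90 else 0)  # 90+ smiles is creepy

def mean_max_gap(u):
    """Utility gap between the top world and the average world."""
    values = [u(i) for i in WORLDS]
    return max(values) - mean(values)

print(mean_max_gap(u1))  # 50
print(mean_max_gap(u2))  # 11/101

u1_hat = lambda i: u1(i) / mean_max_gap(u1)  # i/50
u2_hat = lambda i: u2(i) / mean_max_gap(u2)  # -101/11 if i >= 90, else 0
```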

Now consider what happens when we do the weighted sum of these utilities, weighted by the intensity of the human feeling on the subject:

  • $U = \alpha_1 \hat{u}_1 + \alpha_2 \hat{u}_2$.

If the weights $\alpha_1$ and $\alpha_2$ are equal, the utility of the world grows slowly with the number of smiles, until it reaches its maximum at $w_{89}$, and then drops precipitously.
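Numerically, the shape can be checked with a short self-contained sketch (the normalised utilities are hard-coded from the values above; `U` and `best` are just illustrative names):

```python
u1_hat = lambda i: i / 50                          # normalised u_1
u2_hat = lambda i: -101 / 11 if i >= 90 else 0.0   # normalised u_2
U = lambda i: u1_hat(i) + u2_hat(i)                # equal weights

best = max(range(101), key=U)
print(best, round(U(best), 2))  # 89 1.78
print(round(U(90), 2))          # -7.38: the drop at the creepiness threshold
```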

Thus $u_1$ is dominant most of the time when comparing worlds, but $u_2$ is very strong on the few worlds it really wants to avoid.

But what if $u_2$ (a seemingly odd choice) is weighted less than $u_1$ (a more "natural" choice)?

Well, setting $\alpha_1 = 1$ for the moment, if $\alpha_2 = 121/5050 \approx 0.024$, then the utilities of the best world with fewer than $90$ smiles ($w_{89}$) and of the best world for $u_1$ ($w_{100}$) are the same:

$U(w_{89}) = \frac{89}{50} = \frac{100}{50} - \frac{121}{5050}\cdot\frac{101}{11} = U(w_{100})$.

Thus if $\alpha_2 > 121/5050$, $u_2$ will force the optimal world to be one with fewer than $90$ smiles (and $u_1$ will select $w_{89}$ from these options). If $\alpha_2 < 121/5050$, then $u_1$ will dominate completely, setting the optimum to $w_{100}$.
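To double-check the break-even weight, here is a small self-contained sketch using exact arithmetic (with $\alpha_1 = 1$; the variable `a2` stands for $\alpha_2$):

```python
from fractions import Fraction

u1_hat = lambda i: Fraction(i, 50)                                 # normalised u_1
u2_hat = lambda i: Fraction(-101, 11) if i >= 90 else Fraction(0)  # normalised u_2

# Weight on u_2 at which the best non-creepy world w_89 ties with w_100:
a2 = (u1_hat(100) - u1_hat(89)) / -u2_hat(100)
print(a2)  # 121/5050, roughly 0.024

U = lambda i: u1_hat(i) + a2 * u2_hat(i)
print(U(89) == U(100) == max(U(i) for i in range(101)))  # True
```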

This seems like it could be extended to address population ethics considerations in various ways (where $u_1$ might be total utilitarianism, with $u_2$ average utilitarianism or just a dislike of worlds with everyone at very low utility). To go back to my old post about differential versus integral ethics, $P_1$ is a differential constraint, $P_2$ is an integral one, and $w_{89}$ is the compromise point between them.

Inverting the utilities

If we invert the utilities, things behave differently. Suppose we had $-u_1$ (smiles are bad) and $-u_2$ (only lots of smiles are good) instead[2]. In mean-max, the norms of these would be:

  • $\max(-u_1) - \text{mean}(-u_1) = 0 - (-50) = 50$.
  • $\max(-u_2) - \text{mean}(-u_2) = 1 - 11/101 = 90/101$.

So the normalised version of $-u_1$ is just $-\hat{u}_1 = -u_1/50$, but the normalised version of $-u_2$ is $(101/90)(-u_2)$, which is different from $-\hat{u}_2$.

Then, at equal weights, the combined utility declines from $w_0$ to $w_{89}$, jumps up at $w_{90}$, and then declines again, without ever getting back to its value at $w_0$. Thus $-u_2$ fails at having any influence, and $w_0$ is the optimum.

To get the break-even point, we need $\alpha_2 = 162/101 \approx 1.6$ (keeping $\alpha_1 = 1$), where $w_0$ and $w_{90}$ are equally valued.

For $\alpha_2$ greater than that, $-u_2$ dominates completely, and forces $w_{90}$ as the optimum.
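The inverted case can be checked the same way (a self-contained sketch; `v1`, `v2` and `normalise` are my names for $-u_1$, $-u_2$ and the mean-max rescaling):

```python
from fractions import Fraction
from statistics import mean

WORLDS = range(101)
v1 = lambda i: Fraction(-i)                    # -u_1: smiles are bad
v2 = lambda i: Fraction(1 if i >= 90 else 0)   # -u_2: only lots of smiles are good

def normalise(u):
    """Mean-max normalisation: divide by max(u) - mean(u)."""
    gap = max(u(i) for i in WORLDS) - mean(u(i) for i in WORLDS)
    return lambda i: u(i) / gap

v1_hat, v2_hat = normalise(v1), normalise(v2)  # -i/50 and (101/90)*(-u_2)

# Equal weights: w_0 wins, and the high-smile worlds never recover.
print(max(WORLDS, key=lambda i: v1_hat(i) + v2_hat(i)))  # 0

# Break-even weight on -u_2, at which w_0 and w_90 tie:
a2 = -v1_hat(90) / v2_hat(90)
print(a2)  # 162/101, roughly 1.6
print(v1_hat(0) + a2 * v2_hat(0) == v1_hat(90) + a2 * v2_hat(90))  # True
```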

It's clear that $u_1$ and $u_2$ are less "antagonistic" than $-u_1$ and $-u_2$ are (compare the single dominant peak, at $w_{89}$, in the first case, with the two separated peaks, at $w_0$ and $w_{90}$, in the second).


  1. Why choose the mean-max normalisation? Well, it has some nice formal properties, as the intertheoretic utility comparison post demonstrates. But it also, to some extent, boosts utility functions to the extent that they do not interfere much with other functions.

    What do I mean by this? Well, consider two utility functions over $n$ different worlds. The first one, $v_1$, ranks one world ($w_+$) above all others (the other ones being equal). The second one, $v_2$, ranks one world ($w_-$) below all others (the other ones being equal).

    Under the mean-max normalisation, $v_1(w_+) = n/(n-1)$ and $v_1 = 0$ for the other worlds. Under the same normalisation, $v_2(w_-) = -n$ while $v_2 = 0$ for the other worlds.

    Thus $v_2$ has a much wider "spread" than $v_1$, meaning that, in a normalised sum of utilities, $v_2$ affects the outcome much more strongly than $v_1$ ("outcome" meaning the outcome of maximising the summed utility). This is acceptable, even desirable: $v_2$ dominating the outcome just rules out one universe ($w_-$), while $v_1$ dominating the outcome rules out all-but-one universe (everything except $w_+$). So, in a sense, their ability to focus the outcome is comparable: $v_1$ almost never focuses the outcome, but when it does, it narrows things down to a single universe; while $v_2$ almost always focuses the outcome, but barely narrows it down. (See the short numerical check after these footnotes.) ↩︎

  2. There is no point having the pairs be $(u_1, -u_2)$ or $(-u_1, u_2)$, since those pairs agree on the ordering of the worlds, up to ties. ↩︎
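Here is the short numerical check of the spreads discussed in footnote 1 (a self-contained sketch with $n = 101$ worlds; `reward` and `punish` stand in for the two utility functions):

```python
from fractions import Fraction
from statistics import mean

n = 101  # number of worlds; any n > 1 gives the same pattern
reward = [Fraction(1 if i == 0 else 0) for i in range(n)]   # v_1: one world better
punish = [Fraction(-1 if i == 0 else 0) for i in range(n)]  # v_2: one world worse

def spread(values):
    """max - min of the values after mean-max normalisation."""
    gap = max(values) - mean(values)
    return (max(values) - min(values)) / gap

print(spread(reward))  # 101/100  (n/(n-1): barely above 1)
print(spread(punish))  # 101      (n: much wider)
```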
