Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I'm working towards a toy model that will illustrate all the steps in the research agenda. It will start with some algorithmic stand-in for the "human", and proceed to create the UH, following all the steps in that research agenda. So I'll be posting a series of "toy model pieces", that will then be ultimately combined in a full toy model. Along the way, I hope to get a better understanding of how to do the research agenda in practice, and maybe even modify that agenda based on insights making the toy model.

For this post, I'll look in more detail into how to combine different types of (partial) preferences.

## Short-distance, long-distance, and other preferences

I normally use population ethics as my go-to example for a tension between different types of preferences. You can get a lot of mileage by contrasting the repugnance of the repugnant conclusion with the seeming intuitiveness of the mere addition argument.

However, many people who read this will have strong opinions about population ethics, or at least some opinions. Since I'm not trying to convince anyone of my particular population ethics here, I thought it best to shift to another setting where we could see similar tensions at work, without the baggage.

## Living in a world of smiles

Suppose you have two somewhat contradictory ethical intuitions. Or rather, in the formulation of my research agenda, two somewhat contradictory partial preferences.

The first is that any world would be better if people smiled more (P1). The second is that if almost everyone smiles all the time, it gets really creepy (P2).

Now, the proper way of resolving those preferences is to appeal to meta-preferences, or to cut them up into their web of connotations: why do we value smiles? Is it because people are happy? Why do we find universal smiling creepy? Is it because we fear that something unnatural is making them smile that way? That's the proper way of resolving those preferences.

However, let's pretend there are no meta-preferences, and no connotations, and just try to combine the preferences as given.

## Smiles and worlds

Fix the population to a hundred people, and let W be the set of worlds. This set will contain one hundred and one different worlds, described by w(n), where 0≤n≤100 is an integer denoting the number of people smiling in that world.

We can formalise the preferences as follows:

P1={w(n)≤1w(m)∣n≤m}.

P2={w(n)≤2w(m)∣n≥95 and n≥m}.

These give rise to the following utility functions (for simplicity of the formula, I've translated the definition of U2; translations don't matter when combining utilities; I've also written Ui(w(n)) as Ui(n)):

U1(n)=2n−100.

U2(n)=2×min(94−n,0).
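A quick numeric sanity check (a Python sketch, with U1 and U2 as defined above) that these utility functions really do represent the two partial preferences:

```python
# The 101 worlds w(0)..w(100), indexed by the number of smilers n.
def U1(n):
    return 2 * n - 100

def U2(n):
    return 2 * min(94 - n, 0)

# P1: w(n) <=_1 w(m) whenever n <= m (more smiles is never worse under U1).
assert all(U1(n) <= U1(m) for n in range(101) for m in range(n, 101))

# P2: w(n) <=_2 w(m) whenever n >= 95 and n >= m
# (every extra smile past 94 makes the world worse under U2).
assert all(U2(n) <= U2(m) for n in range(95, 101) for m in range(0, n + 1))
```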

But before being combined, these preferences have to be normalised. There are multiple ways we could do this, and I'll somewhat arbitrarily choose the "mean-max" method, which normalises the utility difference between the top world and the average world^{[1]}.

Given that normalisation, we have:

||U1||mema=100−0=100.

||U2||mema=0−(−42/101)=42/101≈0.42.

Thus we send the Ui to their normalised counterparts:

U1(n)→ˆU1(n)=n/50−1.

U2(n)→ˆU2(n)=(101/21)min(94−n,0).
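These norms and normalised utilities can be verified numerically (a sketch using exact rational arithmetic; the mean-max norm is the maximum minus the mean over the 101 worlds):

```python
from fractions import Fraction

def U1(n):
    return Fraction(2 * n - 100)

def U2(n):
    return Fraction(2 * min(94 - n, 0))

def mean_max_norm(U, worlds=range(101)):
    values = [U(n) for n in worlds]
    return max(values) - sum(values) / len(values)

assert mean_max_norm(U1) == 100
assert mean_max_norm(U2) == Fraction(42, 101)

# The normalised utilities match the closed forms above.
hatU1 = lambda n: U1(n) / mean_max_norm(U1)  # = n/50 - 1
hatU2 = lambda n: U2(n) / mean_max_norm(U2)  # = (101/21) * min(94 - n, 0)
assert all(hatU1(n) == Fraction(n, 50) - 1 for n in range(101))
assert all(hatU2(n) == Fraction(101, 21) * min(94 - n, 0) for n in range(101))
```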

Now consider what happens when we do the weighted sum of these utilities, weighted by the intensity of the human feeling on the subject:

U=w1ˆU1+w2ˆU2.

If the weights w1 and w2 are equal, we get the following, where the utility of the world grows slowly with the number of smiles, until it reaches the maximum at n=94 and then drops precipitously:

Thus U1 is dominant most of the time when comparing worlds, but U2 is very strong on the few worlds it really wants to avoid.

But what if U2 (a seemingly odd choice) is weighted less than U1 (a more "natural" choice)?

Well, setting w1=1 for the moment, if w2=21/5050, then the utility for all worlds with n≥94 is the same:


Thus if w2>21/5050, ˆU2 will force the optimal n to be n≤94 (and ˆU1 will select n=94 from these options). If w2<21/5050, then ˆU1 will dominate completely, setting n=100.
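This threshold behaviour can be checked numerically (a sketch using the normalised utilities above, with w1=1):

```python
from fractions import Fraction

def hatU1(n):
    return Fraction(n, 50) - 1

def hatU2(n):
    return Fraction(101, 21) * min(94 - n, 0)

def best_world(w2):
    """Argmax of the weighted sum hatU1 + w2 * hatU2 over the 101 worlds."""
    return max(range(101), key=lambda n: hatU1(n) + w2 * hatU2(n))

crit = Fraction(21, 5050)
assert best_world(2 * crit) == 94   # above the threshold: hatU2 caps n at 94
assert best_world(crit / 2) == 100  # below it: hatU1 dominates, n = 100

# Exact indifference between n=94 and n=100 at the critical weight.
assert hatU1(94) == hatU1(100) + crit * hatU2(100)
```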

This seems like it could be extended to solve population ethics considerations in various ways (where U1 might be total utilitarianism, with U2 average utilitarianism or just a dislike of worlds with everyone at very low utility). To go back to my old post about differential versus integral ethics, U1 is a differential constraint, U2 is an integral one, and n=94 is the compromise point between them.

## Inverting the utilities

Things behave differently if we invert the utilities: suppose we had −U1 (smiles are bad) and −U2 (only lots of smiles are good) instead^{[2]}. In mean-max, the norm of these would be:

||−U1||mema=100−0=100.

||−U2||mema=12−(42/101)=1170/101≈11.58.

So the normalised version of −U1 is just −ˆU1, but the normalised version of −U2 is different from −ˆU2.

Then, at equal weights, we get the following graph for U:

Thus −U2 fails at having any influence, and n=0 is optimum.

To get the break-even point, we need w2=585/303, where n=0 and n=100 are equally valued:

For w2 greater than that, −U2 dominates completely, and forces n=100.
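The inverted case can be verified the same way (a sketch; w1=1 throughout):

```python
from fractions import Fraction

def negU1(n):
    return Fraction(100 - 2 * n)

def negU2(n):
    return Fraction(-2 * min(94 - n, 0))  # = 2 * max(n - 94, 0)

def mean_max_norm(U, worlds=range(101)):
    values = [U(n) for n in worlds]
    return max(values) - sum(values) / len(values)

assert mean_max_norm(negU1) == 100
assert mean_max_norm(negU2) == Fraction(1170, 101)

hat1 = lambda n: negU1(n) / 100
hat2 = lambda n: negU2(n) / Fraction(1170, 101)

# At equal weights, the normalised -U2 has no influence: n=0 is optimal.
assert max(range(101), key=lambda n: hat1(n) + hat2(n)) == 0

# At the break-even weight w2 = 585/303, n=0 and n=100 are valued equally.
w2 = Fraction(585, 303)
assert hat1(0) + w2 * hat2(0) == hat1(100) + w2 * hat2(100)
```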

It's clear that U1 and U2 are less "antagonistic" than −U1 and −U2 are (compare the single peak in the graph in the first case, with the two peaks in the second).

Why choose the mean-max normalisation? It has some nice formal properties, as the intertheoretic utility comparison post demonstrates. But it also, to some extent, boosts utility functions to the extent that they do not interfere much with other functions.

What do I mean by this? Well, consider two utility functions over n+1 different worlds. The first one, V1, ranks one world (W1) as above all others (the other ones being equal). The second one, V2, ranks one world (W2) as below all others (the other ones being equal).

Under the mean-max normalisation, V1(W1)=1 and V1(W)=−1/n for other W. Under the same normalisation, V2(W2)=−n while V2(W)=1 for other W.
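These values hold for any n; here is a small sketch checking them (n=10 is an illustrative choice, and each function is shifted so its mean is zero, which is harmless since translations don't matter):

```python
from fractions import Fraction

n = 10  # n+1 = 11 worlds; world index 0 plays the role of W1 (or W2)

def mean_max_normalise(values):
    """Shift to mean zero, then scale so max minus mean equals 1."""
    mean = Fraction(sum(values), len(values))
    norm = max(values) - mean
    return [(v - mean) / norm for v in values]

# V1: one world strictly above all the (equal) others.
V1 = mean_max_normalise([1] + [0] * n)
assert V1[0] == 1 and all(v == Fraction(-1, n) for v in V1[1:])

# V2: one world strictly below all the (equal) others.
V2 = mean_max_normalise([-1] + [0] * n)
assert V2[0] == -n and all(v == 1 for v in V2[1:])
```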

Thus V2 has a much wider "spread" than V1, meaning that, in a normalised sum of utilities, V2 affects the outcome much more strongly than V1 ("outcome" meaning the outcome of maximising the summed utility). This is acceptable, even desirable: V2 dominating the outcome just rules out one universe (W2), while V1 dominating the outcome rules out all-but-one universe (W1). So, in a sense, their ability to focus the outcome is comparable: V1 almost never focuses the outcome, but when it does, it narrows down to a single universe. While V2 almost always focuses the outcome, but barely narrows it down. ↩︎

There is no point having the pairs being (U1,−U2) or (−U1,U2), since those pairs agree on the ordering of the worlds, up to ties. ↩︎
