Simple and composite partial preferences

by Stuart_Armstrong, 9th Sep 2019 (1 min read)



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Partial preferences are preferences that exist within a human's internal model, e.g. "this painting is better than that one", "I don't like getting punched", "that was embarrassing", "I'd rather fund this charity than that one", and so on.

To elicit these partial preferences, we can use one-step hypotheticals: brief descriptions of a hypothetical situation that cause the human to model it and reach a preference.

But sometimes our reactions to a hypothetical are more complex. Consider:

During the Notre-Dame Cathedral fire, a relic said to be part of the True Cross was threatened by the flames. Ten firefighters could go in and save it; however, there is a chance that all of them might die. Is it better that they go in or not?

When confronted with something like that, I might reason thusly:

Well, it's certainly not part of the True Cross, even if that existed. However, it is an important medieval artefact with probably a lot of history (I know that because it was in the most important cathedral in France). I value things like that quite highly, and don't want them to be destroyed*. On the other hand, beware status quo bias**: I don't particularly miss historical items that were destroyed in the past*; but this artefact would be appreciated by probably millions in the future, and that matters to me*. Lots of religious people, French people, and people with a respect for history would not want it destroyed, and I value their satisfaction to some extent*. A chance of ten deaths should be considered similar, in my estimation, to 1 death with certainty** (however, without the connotations of "this person specifically must die"**). Firefighters are professionals, who specifically accept the risks in these kinds of situations, so they're not like "innocent victims" to whom I would give extra weight*. They are likely pretty young, and I care about the amount of life they could lose*.

So, would I pay one life for a very-but-not-fantastically valuable artefact that would be valued and enjoyed by millions? But those people would enjoy the cathedral and its art even without this particular relic. Estimating how much value such an artefact could create, compared with the value remaining in a human life (this is a relevant comparison for me**), I'd guess that it's not worth saving in this circumstance* (but I would change my mind if the risk of death were lower or the artefact more valuable/appreciated).

Now, this is very far from a single model with a single instinctive answer! Instead, it's the combination of many different simple partial preferences; I've tried to mark the base-level partial preferences in that reasoning with a *, and the partial meta-preferences with a **.

So I don't think we should call this a partial preference. Instead, we should call it a collection of partial preferences, tied together by a reasoning process which itself is a partial meta-preference (in what it considers and what it ignores).

Thus that hypothetical can elicit a lot of different partial preferences in its answer; but the answer itself should not be considered a partial preference, as it doesn't exist as a preference in a single model.

Simple partial preferences

In contrast, simple partial preferences would have taken a form such as one of these:

It's never worth sacrificing a human life to save a mere object with only sentimental value.


Millions have worshipped this relic; one life or ten lives are a cheap price to pay for its survival.

In these cases, a single model is created in the brain, and a single comparison is made.
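To make the distinction concrete, here is a minimal sketch of how one might represent the two kinds of answer as data. All the names here (`PartialPreference`, `CompositeAnswer`, `is_simple`) are hypothetical and chosen purely for illustration; this is not a formalism proposed in the post, just one way of writing down "a single comparison in a single model" versus "a collection of partial preferences tied together by reasoning".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class PartialPreference:
    """A single comparison made inside a single mental model (illustrative)."""
    statement: str
    is_meta: bool = False  # True for partial meta-preferences (the ** items)


@dataclass
class CompositeAnswer:
    """An answer assembled from many partial preferences.

    Under this sketch it is deliberately NOT a PartialPreference itself.
    """
    conclusion: str
    components: List[PartialPreference] = field(default_factory=list)

    def base_level(self) -> List[PartialPreference]:
        # The * items: ordinary partial preferences.
        return [p for p in self.components if not p.is_meta]

    def meta_level(self) -> List[PartialPreference]:
        # The ** items: partial meta-preferences.
        return [p for p in self.components if p.is_meta]


def is_simple(answer: object) -> bool:
    """An answer counts as a simple partial preference only if it is one comparison."""
    return isinstance(answer, PartialPreference)


# A simple answer: one model, one comparison.
simple = PartialPreference(
    "It's never worth sacrificing a human life to save a mere object "
    "with only sentimental value."
)

# A composite answer: a few of the components from the reasoning above.
composite = CompositeAnswer(
    conclusion="Not worth saving in this circumstance",
    components=[
        PartialPreference("I value historical artefacts and don't want them destroyed"),
        PartialPreference("Beware status quo bias", is_meta=True),
        PartialPreference("I care about the amount of life the firefighters could lose"),
    ],
)
```

In this sketch the same hypothetical can return either kind of object; only the first kind would be fed directly into whatever process collects partial preferences, while the second would first be unpacked into its components.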


