Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Do I contradict myself? Very well then I contradict myself, (I am large, I contain multitudes.)

Walt Whitman

It is good for our decision processes to be time-consistent, transitive, and independent.

Or as Steve Omohundro apparently put it:

If you prefer being in Berkeley to being in San Francisco; prefer being in San Jose to being in Berkeley; and prefer being in San Francisco to being in San Jose; then you're going to waste a lot of time on taxi rides.

But actually, if you live in Berkeley, work in San Francisco, and like to take weekends in San Jose, then going back and forth continuously makes perfect sense, even if it costs you time and money.

When I'm hungry, I want to eat; after lunch, not so much. Chocolate fondues have their time and place in my diet, as do vegetables. And nothing about that feels particularly inconsistent, even though my preferences are seemingly flipping all over the place as time goes on.

Choosing consistency

Of course, it is possible to have inconsistent eating preferences; diet-overeat-fast cycles, for instance. But more consistent eating behaviours look quite similar to this: indulging more in some circumstances, being stricter in others, and maybe adding the occasional fasting. There is no bright line dividing the inconsistent behaviour from the consistent one.

To resolve this, we can posit a mixture of "true" underlying preferences, such as hedonic enjoyment of eating, social connection, energy, health, weight, and so on, and see the fluctuations in behaviour as merely instrumental changes in service of these stable underlying preferences, coupled with a dose of irrationality. Human preferences are very underdefined, so figuring out what the "true" preferences are is a tricky process.

To pick one of those example underlying preferences, suppose I say that "I desire a certain level of social interactions, on average, in a given week". There are three easy ways to categorise this desire:

  1. Bias or error: this is an inconsistent preference: I either desire social interactions, or I don't. So this should be collapsed to "I want social interactions", "I want to avoid social interactions", or deleted.
  2. True preference: there is nothing wrong with defining a preference over an average quantity of social interactions (or anything else, for that matter). We could define a variable x = "average amount of social interaction in the last week", and prefer to have x in a certain range.
  3. Side effect of true preference: here we'd posit that there's some underlying true variable ("feeling of connectedness", or "life satisfaction"), and that the level of x influences it. This makes the level of x into a purely instrumental goal, one that can be done away with if we manage to shortcut to the underlying variable (maybe through virtual friends, television, or social-feeling drugs).

Do too much of 1., and we lose all our preferences entirely. Do too much of 2., and every minor bias, mood swing or quirk becomes a "true preference". Do too much of 3., and we wirehead ourselves to some simple variables.
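As a rough illustration of how the three readings come apart, here is a minimal Python sketch. Call the weekly average of social interaction x. Every function name, the preferred range of 1 to 3 hours, the 0.4 coefficient, and the `substitute` parameter are hypothetical choices made for this example; nothing here is claimed to model real human preferences.

```python
from statistics import mean


def weekly_average(hours_per_day):
    """x: average daily hours of social interaction over the last 7 days."""
    return mean(hours_per_day[-7:])


# Reading 1 (bias or error): the "certain level" is discarded and the
# preference is collapsed to "more social interaction is always better".
def utility_collapsed(x):
    return x


# Reading 2 (true preference): x itself should sit in a preferred range.
# The range [low, high] is an assumed illustration, not taken from the text.
def utility_range(x, low=1.0, high=3.0):
    if low <= x <= high:
        return 0.0
    # Penalise by the distance to the nearest edge of the range.
    return -min(abs(x - low), abs(x - high))


# Reading 3 (side effect): only an underlying variable matters, here a
# hypothetical "connectedness" score capped at 1.0. `substitute` stands in
# for shortcuts such as virtual friends or television.
def connectedness(x, substitute=0.0):
    return min(1.0, 0.4 * x + substitute)


def utility_instrumental(x, substitute=0.0):
    return connectedness(x, substitute)


if __name__ == "__main__":
    for x in (0.0, 2.0, 6.0):
        print(x, utility_collapsed(x), utility_range(x), utility_instrumental(x))
    # Under reading 3, a full substitute makes x irrelevant:
    print(utility_instrumental(0.0, substitute=1.0))
```

Under these assumptions, reading 1 always prefers a busier week, reading 2 penalises both isolation and overload, and reading 3 is indifferent to x once the substitute saturates the underlying variable, which is the wireheading failure mode described above.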

The failure modes of 1. and 3. are very similar: both involve collapsing to a small, over-simple set of variables. This makes the other variables "fungible", that is, interchangeable. This is the world where I could forgo all friendships as long as I got to eat delicious food all the time, for example[1].

Consistency can't really guide us here; inconsistent preferences can be made into time- or variable-dependent consistent preferences. Consistency is a retrospective judgement we can make once we've determined our "true" preferences: the other parts of our desires are classified as biased or inconsistent.
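The Berkeley, San Francisco and San Jose example from the start of the post can make this concrete. The sketch below is a hypothetical illustration in Python: the helper `has_consistent_ranking` and the three context labels are assumptions introduced here, loosely based on the "live in Berkeley, work in San Francisco, weekends in San Jose" reading. It checks whether any strict ranking of the cities agrees with a set of pairwise preferences.

```python
from itertools import permutations


def has_consistent_ranking(options, prefers):
    """Return True if some strict ranking of `options` agrees with every
    (better, worse) pair in `prefers`. Brute force; fine for a few options."""
    for order in permutations(options):
        rank = {option: i for i, option in enumerate(order)}  # lower = better
        if all(rank[better] < rank[worse] for better, worse in prefers):
            return True
    return False


cities = ["Berkeley", "San Francisco", "San Jose"]

# Context-free reading of the quoted preferences: a cycle.
context_free = {
    ("Berkeley", "San Francisco"),
    ("San Jose", "Berkeley"),
    ("San Francisco", "San Jose"),
}

# Context-dependent reading (contexts are assumed for illustration):
# each context has its own, perfectly transitive, preferences.
by_context = {
    "weeknight": {("Berkeley", "San Francisco"), ("Berkeley", "San Jose")},
    "workday": {("San Francisco", "Berkeley"), ("San Francisco", "San Jose")},
    "weekend": {("San Jose", "Berkeley"), ("San Jose", "San Francisco")},
}

if __name__ == "__main__":
    print("context-free consistent:", has_consistent_ranking(cities, context_free))
    for context, prefs in by_context.items():
        print(context, "consistent:", has_consistent_ranking(cities, prefs))
```

The same observed travel pattern is cyclic when the context is ignored and consistent once the context is included in the description of the options. Which description is the right one is not something the code can settle, which is the point of the paragraph above.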

We use our meta-preferences to determine our "true" preferences; thus these meta-preferences determine what counts as "consistency" for our values.

Thus, observing behaviour alone is not enough to say whether someone is consistent, inconsistent, or kinda-consistent. Even knowing people's first-order preferences is not enough. We need to dive deeper into their meta-preferences as well.

For example, what's my favourite movie? Well, it depends on my mood. Would I like to get rid of my mood changes? No. Are my mood changes perfect currently? No, I would like more control, and maybe to eliminate some very negative moods entirely. In that case, why do I want to preserve any mood swings or changes at all? Because I'd prefer it that way.

We need a formalisation of preferences that can cope with that level of complexity in preferences and meta-preferences.

  1. That's one reason why so many experiments on inconsistent preferences involve money: money is taken to be the ultimate fungible commodity. £1 is £1, no matter how the subject of the experiment came by it.

    Like many assumptions, this can be false: a coin collector, a cleanliness fanatic, or someone who attaches stories to their coins, may not see £1 coins as really interchangeable. So we are using assumptions about the preferences of others - or at least about the preferences of "most people in a psychological experiment" - to decide that money is close enough to fungible (and liquid) that we can draw conclusions from the experiment.
