In this penultimate post in the "learning human values" series, I just want to address some human values/preferences/rewards that don't fit neatly into the (p, R) model, where p is the planning algorithm and R is the actual reward.

Preferences over preferences and knowledge

Most people have preferences over their own preferences - and over those of others. For example, consider someone who has an incorrect religious faith. They might believe something like:

"I want to always continue believing. I flinch away from certain sceptical arguments, but I'm sure my deity would protect me from doubt if I ever decided to look into them".

I hope this doesn't sound completely implausible. Here they have beliefs, preferences over their future beliefs, and beliefs over their future beliefs. This does not seem easy to capture in the (p, R) framework. We can also see that asking them the equivalent questions "Do you want to doubt your deity?" and "Do you want to learn the truth?" will get very different answers.

But it's not just theism, an example which is too easy to pick on. I have preferences over knowledge, as do most people. I would prefer that people had accurate information, for instance. I would also prefer that, when choosing between possible formalisations of preferences, people went with the less destructive and less self-destructive options. These are not overwhelmingly strong preferences, but they certainly exist.


Consider the following scenario: someone believes that roller-coasters are perfectly safe, but enjoys riding them for the feeling of danger they give. It's clear that the challenge here is not to reconcile the belief of safety with the alief of danger (which is simple: roller-coasters are safe), but to somehow transform the feeling of danger into another form that keeps the initial enjoyment.

Tribalism and signalling

The theism argument might suggest that tribalism will be a major problem, as various groups pressure adherents to conform to certain beliefs and preferences.

But actually that need not be such a problem. It's clear that there is a strong desire to remain part of that group (or, sometimes, just of a group). Once that desire is identified, all the rest becomes instrumental - the human will either do the actions that are needed to remain part of the group, without needing to change their beliefs or preferences (just because evolution doesn't allow us to separate those two easily, doesn't mean an AI can't help us do it), or will rationally sacrifice beliefs and preferences to the cause of remaining part of the group.

Most signalling cases can be dealt with in the same way. So, though tribalism is a major reason people can end up with contingent preferences, it doesn't in itself pose problems to the (p, R) model.

Personal identity

The problem of personal identity is a tricky one. I would like to remain alive, happy, curious, having interesting experiences, doing worthwhile and varied activities, etc.

Now, this is partially preferences about future preferences, but there's the implicit identity: I want this to happen to me. Even when I'm being altruistic, I want these experiences to happen to someone, not just to happen in some abstract sense.

But the concept of personal identity is a complicated one, and it's not clear if it can be collapsed easily into the (p, R) format.

"You're not the boss of me!"

Finally, even if personal identity is defined, it remains the case that people can judge a situation differently depending on how that situation is achieved. Being forced or manipulated into a situation will make them resent it much more than if they reach it through "natural" means. Of course, what counts as acceptable and unacceptable manipulation changes over time, and is filled with biases, inconsistencies, and incorrect beliefs (in my experience, far too many people think themselves immune to advertising, for instance).
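To make the point concrete, here is a minimal sketch, not a claim about how such preferences should actually be modelled. It contrasts a reward that looks only at the final situation with one that also looks at how the situation was reached. The `Outcome` class, the `RESENTMENT` penalties, and all the numbers are hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """A final situation together with the means by which it was reached."""
    state_value: float  # how good the situation is, judged in isolation
    means: str          # "natural", "manipulated" or "forced"


# Hypothetical resentment penalties; the numbers are illustrative assumptions.
RESENTMENT = {"natural": 0.0, "manipulated": 3.0, "forced": 5.0}


def state_only_reward(outcome: Outcome) -> float:
    """Reward that depends only on where the person ends up."""
    return outcome.state_value


def history_sensitive_reward(outcome: Outcome) -> float:
    """Reward that also penalises how the person got there."""
    return outcome.state_value - RESENTMENT[outcome.means]


chosen = Outcome(state_value=10.0, means="natural")
imposed = Outcome(state_value=10.0, means="forced")

# The state-only reward cannot tell the two apart; the other one can.
print(state_only_reward(chosen), state_only_reward(imposed))
print(history_sensitive_reward(chosen), history_sensitive_reward(imposed))
```

The sketch assumes a fixed, known penalty table, which is exactly what the biases and inconsistencies mentioned above make unrealistic.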

Caring about derivatives rather than positions

People react strongly to situations getting worse or better, not so much to the absolute quality of the situation.
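As a rough illustration, and assuming a situation can be summarised by a single quality number per time step (a strong simplification), the sketch below compares a "positional" reward with a "derivative" reward on two made-up trajectories. The function names and the trajectories are hypothetical.

```python
from typing import Sequence


def positional_reward(trajectory: Sequence[float]) -> float:
    """Total reward when each step is valued by its absolute quality."""
    return sum(trajectory)


def derivative_reward(trajectory: Sequence[float]) -> float:
    """Total reward when each step is valued by the change since the last step.

    With this plain sum the changes telescope to (last - first).
    """
    return sum(later - earlier for earlier, later in zip(trajectory, trajectory[1:]))


# Hypothetical quality-of-situation trajectories.
steady_good = [10.0, 10.0, 10.0, 10.0]   # consistently good, never changing
improving_bad = [1.0, 2.0, 3.0, 4.0]     # much worse in absolute terms, but improving

print(positional_reward(steady_good), positional_reward(improving_bad))
print(derivative_reward(steady_good), derivative_reward(improving_bad))
```

The two rewards rank the trajectories in opposite orders: the positional reward prefers the steady good life, while the derivative reward prefers the improving one.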

Values that don't make sense out of context

AIs would radically reshape the world and society. And yet humans have deeply held values that only make sense in narrow contexts - sometimes, they already no longer make sense. For instance, in my opinion, of the five categories in moral foundations theory, one no longer makes sense and three only make partial sense (and it seems to me that having these values in a world where it's literally impossible to satisfy them is part of the problem people have with the modern world):

  • Care: cherishing and protecting others. This seems to me the strongest foundation; care remains well defined today, most especially in the negative "protect people from harm" sense.
  • Purity: abhorrence for disgusting things, foods, actions. This seems the weakest foundation. Our ancestral instincts of disgust for food and people are no longer correlated with actual danger, or with anything much. Disgust is the easiest value to argue against, and the hardest to defend, because it provokes such strong feelings but the boundaries drawn around the objects of disgust make no sense.
  • Fairness: rendering justice according to shared rules. Fairness and equality make only partial sense in today's world. It seems impossible to ensure that every interaction is fair, and that everyone gets their just deserts (whatever that means) or gets the same opportunities. But two subcategories do exist: legal rights fairness/equality, and financial fairness/equality. Modern societies achieve the first to some extent, and make attempts at the second.
  • Authority: submitting to tradition and legitimate authority. This also makes partial sense. Tradition is a poor guide in many situations, and the source of authority doesn't simplify real problems or guarantee solutions (which is the main reason that dictators are not generally any better at solving problems). As with fairness, the subcategory of legal authority is used extensively in the world today.
  • Loyalty: standing with your group, family, nation. This value is weak, and may end up weakening further, down to the level of purity. There are basically too many positive sum interactions in today's world. The benefits of trade and of interacting with those outside your ingroup are huge. Legally, most of loyalty is actually forbidden - we don't have laws encouraging nepotism, rather the opposite.

This can be seen as a subset of the whole "underdefined human values" problem, but it could also be seen as an argument for preserving or recreating certain contexts in which these values make sense.

A more complex format needed

These are just some of the challenges to the (p, R) format, and there are certainly others. It's not clear how much that format needs to be complicated in order to usefully model all these extra types of preferences.

