Defining the ways human values are messy

by Stuart_Armstrong, 27th Mar 2018


In many of my posts, I've been using phrases like "human values are contradictory, underdefined, changeable, and manipulable". I also tend to slide between calling things preferences, values, morals, rewards, and utilities. This post will clarify some of this terminology.

I say that human values are contradictory when humans hold firm and strong opinions that are in conflict. For instance, respect for human rights versus a desire to reduce harm, when those two come into conflict (more broadly, deontological versus utilitarian conflicts). Or enjoying food (or wanting to be someone who enjoys food) versus wanting to get thin (or wanting to be someone who gets thin). Or family loyalty versus more universal values.

I say that human values are underdefined when humans don't have a strong opinion on something, and where their opinion can differ greatly depending on how that something is phrased. This includes how the issue is framed (lives saved versus lives lost), or how people interpret moral choices (such as abortion or international press freedom) depending on what category they put that choice in. New technologies often open up new areas where old values don't apply, forcing people to define new values in that space (often by analogy to old values).

Notice that there is no clear distinction between contradictory and underdefined: as the values in conflict or potential conflict get firmer, this moves from underdefined to contradictory.

I say that human values are changeable, because of the way that values shift, often in predictable ways, depending on such things as social pressure, tribalism, changes in life-roles or positions, or new information (fictional as well as factual information). I suspect that most of these shifts are undetectable to the subject, just as most belief changes are.

I say that human values are manipulable, in that capable humans, and potentially advanced AIs, can exploit the vulnerabilities of human cognition to push values in a particular direction. This is a subset of changeable, but with a different emphasis.

Rewards/values/preferences...

At the object level, I see values, preferences, and morals as the same thing. All express the fact that a certain state of the world, or a certain course of action, is better than another one.

At the meta level, humans tend to distinguish between them, seeing values and morals as fundamental, wrapped up with identity, and universalisable, and preferences as more personal and contingent. Since I'll be dealing with preferences and meta-preferences, however, I don't have a need to distinguish between the concepts, letting the meta-preferences do that automatically.

Reward functions and utility functions rank outcomes in much the same way that preferences do, so I'll generally slip between the three unless the difference is relevant (reward functions and utility functions induce a total ordering of outcomes, up to ties, while preferences need not; rewards are generally defined over observations, utilities over world-states...).
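To make the total-order point concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the outcome names, the utility numbers, and the helper functions are hypothetical and not taken from any library. It shows that a ranking derived from a utility function is always complete and transitive, whereas a raw preference relation can have gaps (loosely echoing "underdefined") or cycles (loosely echoing "contradictory").

```python
from itertools import combinations

# Hypothetical outcomes, chosen only for illustration.
OUTCOMES = ["cake", "salad", "fasting"]


def relation_from_utility(utility):
    """Weak preference relation {(a, b) : utility[a] >= utility[b]}."""
    return {(a, b) for a in utility for b in utility if utility[a] >= utility[b]}


def is_complete(relation, outcomes):
    """True if every pair of distinct outcomes is ranked in some direction."""
    return all(
        (a, b) in relation or (b, a) in relation
        for a, b in combinations(outcomes, 2)
    )


def is_transitive(relation):
    """True if (a, b) and (b, d) in the relation always imply (a, d)."""
    return all(
        (a, d) in relation
        for (a, b) in relation
        for (c, d) in relation
        if b == c
    )


# A utility function (made-up numbers) ranks every pair of outcomes.
utility = {"cake": 2.0, "salad": 1.0, "fasting": 0.0}
from_utility = relation_from_utility(utility)

# Raw preferences can have gaps: no opinion on salad versus fasting.
partial = {("cake", "salad"), ("cake", "fasting")}

# Or they can cycle: cake over salad, salad over fasting, fasting over cake.
cyclic = {("cake", "salad"), ("salad", "fasting"), ("fasting", "cake")}

for name, rel in [("utility", from_utility), ("partial", partial), ("cyclic", cyclic)]:
    print(name, is_complete(rel, OUTCOMES), is_transitive(rel))
```

Neither the partial nor the cyclic relation can be represented by a single utility function without either filling in the missing comparison or breaking the cycle, which is the sort of work that meta-preferences would have to do.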

Finally, there's the issue of hedonism: the fact that human pleasure and enjoyment don't match up perfectly with preferences. I'll generally be treating enjoyment the same as preferences (in that certain world states have higher enjoyment than others), with meta-preferences distinguishing it from standard preferences, and choosing the extent to which to endorse hedonism.
