Note: This essay was originally posted on EA Forum here on Sept 16. I’d welcome comments from LessWrong readers and AI Alignment Forum experts. I have posted some related essays on EA Forum about the importance for AI alignment of considering corporal/body values, religious values, and the diversity of values across individuals. (I might revise those and post them here soon, if people seem interested.)
I argue that AI ‘alignment with human values’ needs to take more seriously the fact that there are many different types of human values that work in different ways, that have been studied for many decades by diverse behavioral and social sciences, and that need to be explicitly considered when we’re trying to develop alignment strategies that could actually handle the full diversity and heterogeneity of human values.
I worry that a lot of AI alignment research seems to rely on a dangerously simplistic view of human values, and that this will undermine our ability to safely align AI systems with human values.
The simplistic view seems to arise from taking expected utility theory too seriously as a model of human values and preferences. It’s true that we can often describe human decisions, post hoc, at a rather abstract and generic level, using the language of utility theory, Bayesian rationality, and statistical decision theory. This rather abstract and generic way of modeling human values has often been useful in the fields of economics, game theory, rational choice, consequentialist moral philosophy, and reinforcement learning theory.
However, within standard utility theory, there’s no fundamental difference between a consumer’s food preference for a certain flavor of jelly bean and a Muslim’s sacred taboo against eating pork. Utility theory doesn’t distinguish very well between someone who’s a vegan for health reasons, someone who’s a vegan for ethical reasons, and someone who has food allergies to animal proteins. Utility theory doesn’t distinguish very well between someone who’s polyamorous based on libertarian principles, someone who’s polyamorous so they can conform to their Burning Man peer group, and someone who just happened to inherit genes for a high degree of ‘sociosexuality’. Utility theory can’t even distinguish very well between deontologists, consequentialists, virtue ethicists, and religious fundamentalists.
With all due respect to utility theory as a normative theory, if we take it as a descriptive account of human psychology, it seems blind to the heterogeneity of human values types. It can’t model the complex architecture of human values. It can’t model the differences between values that are implemented in different psychological mechanisms such as reflexes, emotions, motivations, cognitive biases, learning biases, conscious preferences, implicit preferences, virtue signals, social norms, political attitudes, sacred values, and taboos. It can’t understand that specific human values fit into categories of value types that have different implications for learning, generalization, inference, and decision-making.
If these differences in types of values matter, at all, in any way, then AI alignment might need a richer model of human values than standard utility theory can offer. Or it might not. We can’t really tell until we think seriously about the heterogeneity of human value types – the whole range of different types of preferences, emotions, motivations, norms, and goals that we might want AI systems to become aligned with. (Note that I’m not talking about heterogeneity of values across individuals, groups, cultures, or historical eras; I’m talking about heterogeneity of different types of values within each human individual – e.g. the difference between a food preference, a religious taboo, and a sexual kink.)
To get a sense of that range of human value types, it might be helpful to start by considering the range of academic disciplines that study different kinds of human values.
How many kinds of human values are there?
Ever since Herbert Spencer published The Principles of Psychology in 1855, psychologists have studied human values, preferences, emotions, and motivations. We’ve had more than 160 years of research on these topics, with especially fruitful eras around 1870-1920 (after Darwin, before Behaviorism), and 1970 to now (after Behaviorism gave way to richer new fields like cognitive science, emotion research, and evolutionary psychology).
Currently, at least 30 subfields of psychology study different types of values, in different domains, that vary in different ways across individuals, groups, and contexts.
Each of these fields includes thousands of researchers, tens of thousands of journal papers, hundreds of academic books, and dozens of textbooks. Each is taught in thousands of psychology courses around the world. And each discovers new things about distinctive types of values.
Beyond psychology, many other academic fields study human values, preferences, and motivations. These include social sciences such as anthropology, economics, political science, and sociology. They include humanities such as philosophy, history, literature, art history, and ethnic studies. They include professional schools such as law schools, medical schools, business schools, and religious schools. Each of these also includes thousands of researchers, classes, and books. In fact, very few academic fields have nothing to say about human values, and nothing to contribute to our understanding of human values.
Do AI alignment researchers really need to learn about the heterogeneity of human value types?
If you’re an AI safety researcher, you might be thinking, dude, do you really expect us to master 30 subfields of psychology and dozens of other academic disciplines just to make sure that our AI systems can align with this alleged variety of human value types? Is this just a stratagem for enforcing a Long Reflection, a grand detour through the social sciences and humanities, that would delay AI research by a couple of centuries?
You might also be thinking, sure, academic fields often split into more specialized subfields so people can publish stuff and teach new courses to get tenure. There are academic incentives to hype the distinctiveness of one’s field, and to make it seem relevant to students and funders by emphasizing its relevance to understanding human values and concerns.
Why should we take all this value-scholarship and value-science seriously in AI alignment concerning human values? Why do we need anything beyond the two standard theoretical foundations of Effective Altruism -- normative consequentialist moral philosophy and expected utility theory -- to descriptively understand the variety of human values?
Couldn’t AI systems just extract heterogeneous value types from human behavioral data?
Let me steel-man the case against AI alignment research needing to pay any attention to previous research on human values. Maybe a sufficiently powerful algorithm can reinvent everything that academics have learned about human value types over the last few centuries.
Imagine AI engineers develop a ‘deep value learning’ algorithm. You can feed it a firehose of data about human preferences and behaviors. You feed the system every book ever written in any language. Every EA blog. Every 80,000 Hours podcast. Every movie. Every YouTube video. Every PornHub video. Every surveillance video. Every social media post. Every Google query. Every consumer purchase. Every election vote. Every political speech and religious sermon ever recorded.
Call this the ‘digital value corpus’ – a few thousand exabytes of data that will serve as the input for value learning.
The algorithm does some kind of colossally powerful unsupervised learning that can statistically extract the complex architecture of human values given this value corpus. Maybe it doesn’t need any supervised learning or reinforcement learning. Maybe the heterogeneity of human value types is all there, latent in the data, ready to be extracted, modelled, and used for alignment.
I suspect that, if human values have causal effects on human behavior, communication, and interaction, then any sufficiently large and rich ‘digital value corpus’ of human behavior, communication, and interaction will include enough latent patterning that a superintelligent AI could – in principle – extract and model the full architecture and heterogeneity of our human values. (This would be functionally equivalent to superintelligent aliens inferring the entire value architecture of humanity just by observing everything that happens on our planet – including all visible behavior and all electronic traffic.)
Maybe the AI can reinvent every insight into human values that has ever emerged from scholars and scientists over the last few millennia. So, maybe we can ignore all of their work. After all, the scholars and scientists were just observing and abstracting about human values given the tiny slices of human behavioral data that they could access (including their fallible introspections), given their cultural biases, ideologies, and top-down models. Why would we trust their human insights more than an AI’s statistical model abstracted from a much larger and more comprehensive value corpus? When modelling the heterogeneity of human values, couldn’t an AI out-perform human scientists, in the same way that AlphaGo, fed with a huge historical corpus of previously played Go games, could outperform human Go masters?
One problem is, how would we know whether the AI had modelled human values accurately? Could the AI explain our value categories and value architecture to us in a way that we would understand? Would its models of our values be intelligible and interpretable? Could it really fold those values into its own decision-making systems in a way that we could trust? Or would we need to give it a lot more feedback through supervision or reinforcement learning, and a lot more tests to make sure its value architecture was both well-modelled and well-aligned with ours?
Why would heterogeneous human value types matter to an AI?
To think more clearly about whether we need to pay attention to the heterogeneity of human value types in AI alignment, we could ask, what computational difference would value type heterogeneity really make to the AI system? Why put specific values into categories that correspond to our human value types?
Maybe the AI just needs to know how much importance or weight we attach to each value or preference, and that’s all it needs to make decisions aligned with our preferences, using standard expected utility calculations. Apart from the decision weight attached to the value, why would it matter what type of value it is – e.g. whether the value is a food preference, a religious taboo, a sexual kink, an aesthetic taste, or a career aspiration?
One simple thought experiment is to imagine a working mother trying to teach a new domestic AI system her values and preferences. She makes verbal statements about things she likes and doesn’t like, and the AI listens and learns. She likes chocolate croissants, she likes cobalt blue silk dresses, she likes for her baby to be safe, she likes promotions at work, she likes shibari ropes, she likes criminal justice reform, and she likes going to Mass on Sundays. Now, would it add any useful information if she said – or if the AI inferred – that these ‘likes’ represent different types of values – specifically, a food preference, a fashion preference, a parental safety preference, a career ambition, a sexual kink, a political value, and a religious value?
In other words, would knowing that a specific human preference falls into a particular value type help an AI do more effective learning, generalization, inference, and decision-making? Well, 60 years of cognitive psychology research suggest that doing better learning, generalization, inference, and decision-making is the entire point of putting things into mental categories. Categorization helps computation. Therefore, good categorization of value types should help AI alignment. That’s the general argument for why AI systems should pay attention to value types, and why AI alignment researchers should too.
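As a toy illustration of how value-type categories could aid inference (all names and numbers below are invented for the sketch, not drawn from any real dataset), an AI that knows which type a preference belongs to can decide how confidently to generalize from it into other domains:

```python
# A toy sketch of type-aware generalization: some value-type pairs are
# strongly linked (religious values constrain food choices), others barely
# at all. Knowing the type tells the AI when a confident inference is licensed.

# Hypothetical correlation strengths between value-type pairs.
TYPE_CORRELATION = {
    ("religious_values", "food_preferences"): 0.8,   # e.g. kosher, halal
    ("religious_values", "movie_preferences"): 0.1,
    ("visual_aesthetics", "music_preferences"): 0.15,
}

def inference_confidence(known_type, target_type, threshold=0.5):
    """Return (may_infer, strength): whether knowing a value of known_type
    licenses a confident inference about values of target_type."""
    r = TYPE_CORRELATION.get((known_type, target_type), 0.0)
    return r >= threshold, r

print(inference_confidence("religious_values", "food_preferences"))   # (True, 0.8)
print(inference_confidence("religious_values", "movie_preferences"))  # (False, 0.1)
```

A type-blind system with only per-item decision weights has no principled way to make this distinction; the categories are what carry the generalization structure.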
But that’s pretty vague. Are there any more specific arguments for why AI systems would work better if they understood different value types? Here are a few specific ways that value types differ from each other in ways that AI systems might find computationally relevant.
Computationally relevant differences across value types
Tradeoffs. People tend to treat some value types (e.g. religious commandments, wedding vows) as ironclad deontological imperatives that are not open to cost/benefit reasoning or tradeoffs, whereas they treat other value types (e.g. Netflix movie choices, hotel preferences) as relatively trivial, transient, and superficial, and fully subject to tradeoffs against other values. If an AI understands the typical degrees of tradeoff flexibility within each value type, it’s likely to be more aligned with human values.
Correlations across value types. Some value types allow stronger inference about other values in other categories. For example, many traditional religions have surprisingly strong food taboos that create bridges between religious values and food preferences – so if the AI knows that someone is an Orthodox Jew or a devout Muslim, it can make inferences about their likely food preferences, but not necessarily about their movie preferences. On the other hand, visual aesthetic preferences (e.g. for Art Nouveau architecture) may not correlate very much with musical preferences (e.g. for Nordic folk metal). AI systems that understand the architecture of correlations across value types might make more accurately calibrated inferences across specific values.
Virtue signaling. Some value types tend to involve a high degree of authenticity – a high correlation between stated and revealed preferences, and a low degree of deception, hypocrisy, or virtue signaling. We don’t tend to virtue signal very much about travel logistics, such as whether we prefer a window or an aisle seat on flights, or whether we prefer an ocean or a garden view in hotels. Other value types tend to involve a lot more public signaling of socially rewarded values, but a lot more private hypocrisy. Political and religious virtue-signaling is famously important to humans. AI systems might make better predictions about our values and preferences if they understand the typical degrees, types, and channels of virtue signaling involved in each value type. They might also understand which of our hypocrisies can be quietly noted, but should not be mentioned out loud – lest we feel embarrassed, angry, and outraged at the AI.
Heritability. All values studied so far in behavioral genetics show some degree of heritability. Genetic differences between people within a culture account for some of the phenotypic differences in their values within that culture. Siblings tend to be more similar in their values than cousins do, for partly genetic reasons. But different value types tend to have different heritabilities. If an AI understands this, it can make better generalizations and inferences across relatives about their likely values. If the AI also has access to genomic data for the people it’s interacting with, it could use polygenic scores to infer some of their value types more easily than other value types.
Cultural transmission pattern. Many values are culturally transmitted – ‘vertically’ across generations, and/or ‘horizontally’ within generations. But different value types tend to be transmitted in different ways that allow different predictions about their commonality across people, families, and subcultures, their likely longevity over time, whether people treat them as moral imperatives or whimsical preferences, etc. Compare the spread of food cuisines, art styles, clothing fashions, mating norms, political ideals, and religious rituals. It might help an AI to understand which value types tend to have which kind of cultural transmission dynamics.
Lifespan development. Some value types tend to change quickly as humans grow up, go through different life stages, and get older; others tend to be more stable over time. Aversions to some ‘disgusting’ foods might get locked in by age 10, whereas preferences for certain cuisines might continue to develop throughout middle age. Sexual orientation might become relatively stable by age 20, whereas preferences for specific traits in a mate might change year by year. Religious beliefs might go through a period of instability in adolescence, and then settle down after marriage. It might help an AI to understand which value types are likely to change over each life-stage.
OK. Those are just six kinds of differences across value types that might be computationally (and ethically) relevant to AI systems. There are probably many other differences that could be explored in the future.
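One concrete way an AI system might carry these six differences around is as per-type metadata that its learning and decision modules can consult. This is a hypothetical sketch, not a proposal for the right representation; every field name and number is invented, with the fields simply mirroring the six dimensions discussed above:

```python
# A minimal sketch of per-value-type metadata covering the six dimensions:
# tradeoffs, cross-type correlations, virtue signaling, heritability,
# cultural transmission, and lifespan development. All values are illustrative.

from dataclasses import dataclass, field

@dataclass
class ValueTypeProfile:
    name: str
    tradeoff_flexibility: float   # 0 = deontological imperative, 1 = freely tradeable
    signaling_level: float        # typical gap between stated and revealed preferences
    heritability: float           # share of within-culture variance that is genetic
    transmission: str             # 'vertical', 'horizontal', or 'mixed'
    lifespan_stability: float     # 0 = changes constantly, 1 = locked in early
    correlated_types: dict = field(default_factory=dict)  # type -> correlation

religious_taboo = ValueTypeProfile(
    name="religious_taboo",
    tradeoff_flexibility=0.05,    # treated as near-sacred, not traded off
    signaling_level=0.7,          # heavily signaled in public
    heritability=0.3,
    transmission="vertical",      # passed down across generations
    lifespan_stability=0.8,
    correlated_types={"food_preferences": 0.8},
)

movie_preference = ValueTypeProfile(
    name="movie_preference",
    tradeoff_flexibility=0.95,    # trivially tradeable against other values
    signaling_level=0.2,
    heritability=0.2,
    transmission="horizontal",    # spreads within a generation
    lifespan_stability=0.3,
)
```

A planner could then, for example, treat any value whose type has near-zero tradeoff flexibility as a hard constraint to be satisfied before doing ordinary cost/benefit comparison over the rest.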
When testing AI alignment, we want to make sure that an AI system is aligned across all relevant types of human values. If we don’t explicitly list the value types that matter, we might overlook some important categories. And if value types involve different kinds of tradeoffs, correlations with other values, virtue signals, heritabilities, cultural transmission patterns, and lifespan development patterns, then an AI that looks aligned on some value types might not be aligned on other value types that haven’t been tested yet. Just because we can train an AI system to learn and embody our food preferences and movie preferences does not mean that it can learn and embody our sexual, political, or religious preferences. (This was one motivation for me to write my EA Forum post on religious values.) So we have to make sure we explicitly test performance and safety across all the value types. This is an important, practical, methodological issue in AI safety.
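The methodological point above can be sketched as a per-type test matrix: if alignment tests are organized by an explicit list of value types, then value types with no test coverage become visible gaps rather than silent assumptions. All names here are hypothetical placeholders:

```python
# A sketch of explicit per-value-type alignment testing: untested value
# types are flagged as coverage gaps instead of being assumed aligned.

VALUE_TYPES = ["food_preferences", "aesthetic_tastes", "career_ambitions",
               "sexual_preferences", "political_values", "religious_values"]

def run_alignment_suite(agent, tests_by_type):
    """tests_by_type: value type -> list of test callables returning bool."""
    results, untested = {}, []
    for vtype in VALUE_TYPES:
        tests = tests_by_type.get(vtype, [])
        if not tests:
            untested.append(vtype)   # coverage gap: flag it, don't assume alignment
            continue
        results[vtype] = all(test(agent) for test in tests)
    return results, untested

# An agent that has only been tested on the 'easy' value types:
results, untested = run_alignment_suite(
    agent=None,
    tests_by_type={"food_preferences": [lambda a: True],
                   "aesthetic_tastes": [lambda a: True]},
)
print(untested)  # the four value types with no test coverage at all
```

Passing every test in `results` says nothing about the types listed in `untested` – which is exactly the failure mode the paragraph above warns about.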
For the moment, I hope that this essay has helped make the case that AI alignment needs to take seriously the rich heterogeneity of human value types, the challenges of modeling our value architecture, and the importance of making sure that AI systems are aligned across all the kinds of values that really matter to us.
As someone working on value learning, I'd emphasize that, in what I see as realistic approaches, human values are not learned or stored as a big utility function over a single model of the world; instead, they're learned entangled with information about the world and ways of thinking about the world, and weighing them against each other is a large part of the challenge.
So in large part I agree with you, and I totally agree that an AI will learn about my values faster and better by using different ways of understanding different values.
That said, I think your introduction is pretty bad, and you overrate human programmers.
First, introduction. You worry that an AI using a utility function will be bad because it won't represent a fundamental difference between preference and religious taboo. To the reader, it sounds like either you're misunderstanding utility functions, and think that a utility function can't represent the behavioral consequences of this difference, or you're saying that it's important to you that the AI has a little metadata flag saying "religious taboo" inside of it even if there is no behavioral consequence.
Maybe both of these "bad impressions" are somewhat accurate. But you eventually get around to learning and generalization, where I think the actual benefit is, so that's good :)
Second, your picture of how we get humans' psychology knowledge into the AI is off. You seem to be picturing the programmers studying all these different ways of interpreting human behavior in terms of values and then designing, by hand, representations that the AI can use to learn those sorts of human values. This radically overestimates human programmers. You correctly point out that value learning is hard, and that if the AI learns the representations itself it's hard to tell if it's really capturing what we think is important, but this problem doesn't go away if it's humans doing the work!
Hi Charlie, thanks for your comment.
Just to clarify: I agree that there would be no point in an AI flagging different value types with a little metadata flag saying 'religious taboo' vs 'food preference' unless that metadata was computationally relevant to the kinds of learning, inference, generalization, and decision-making that the AI did. But my larger point was that humans treat these value types very differently in terms of decision-making (especially in social contexts), so true AI alignment would require that AI systems do too.
I wasn't picturing human programmers designing value representations by hand for each value type. I don't yet know how to take the heterogeneity of value types seriously when developing AI systems. I was just arguing that we need to solve that problem somehow, if we actually want the AI to act in accordance with the way that humans treat different types of values differently.