It is clear to most AI safety researchers that the idea of “human values” is underdefined, and this concept should be additionally formalized before it can be used in (mostly mathematical) models of AI alignment. In other words, “AI-conscious” human value theories are those which were created especially in order to help AI alignment research. Most existing psychological theories of human values are verbal, informal and underdefined, so they require some adaptation before they may be applied to AI safety.

In many cases, the theory of human values cannot be distinguished from the way values are expected to be extracted by a future AI, for example, by an approval-directed AI. Some researchers (notably, Armstrong) have presented several different theories; providing a link to the researcher is not intended as a claim that said researcher currently adheres to exactly this theory.

In Figure 1, all current theories of human values are classified according to two main parameters:

1) Complexity of the values’ description (not the complexity of the theory): Some theories of human values assume that human values are very simple, e.g., that there are only two drives: survival and reproduction; or only one desire, that of maximizing pleasure; or only 9 basic emotions. Other theories suggest that human values are very complex, e.g. they are a matrix of association between all concepts and rewards.

2) The level of “behaviorism”: Most theories are attracted to one of two poles: “internalist” theories, which assume that values actually exist but are hidden inside the human brain, and “behaviorist” theories, which assume that values only appear in human behavior (like approval or choice). Behaviorist theories of human values (generally) integrate the values and the method of their extraction (e.g. approval-directed AI). In contrast, internalist theories typically hold that values exist separately from the ways they are learned.

These two parameters are not strict, but serve as a useful instrument to orientate ourselves within the field of all possible theories.

Figure 1. Different real and possible theories about the nature of human values in AI safety

Another way to classify theories about human values is according to the level of abstractness: some theories could be applied to any possible mind and thus don’t take any inputs from human psychology and neurophysiology. Such theories are computationally complex and may still contain hidden assumptions about some properties of human minds, like stability, unity, consistency, etc. Human-centered theories depend on the existing theories of human minds. This distinction is illustrated in Figure 2.

Figure 2. Mind in general and human-mind theories of values


Yudkowsky is a pioneer in the field of AI safety, and among many other things he introduced the idea of the “complexity of values”, which basically means that no short verbal description can capture the complexity of the outcomes that we actually want. He summarized his critique of simple wishes as a correct representation of desirable outcomes in “Complex Value Systems are Required to Realize Valuable Futures”. In the same article, he also introduced the concept of the “fragility of values”: for example, if one digit in a phone number is wrong, the call will go to a completely different person.

Another important contribution from Yudkowsky is the idea of Coherent Extrapolated Volition. He wrote in his “Complex Values” article: “We might try to define normativity not by our immediate current desires but by our reflective equilibria, what we would want in the limit of perfect knowledge, the ability to consider all options and arguments, and perfect self-knowledge without unwanted weakness of will (failure of self-control).” More links to his writings about values can be found at the LessWrong Wiki.

The Arbital page “Value,” which is likely written by Yudkowsky, starts with a definition: “In the context of value alignment as a subject, the word 'value' is a speaker-dependent variable that indicates our ultimate goal - the property or meta-property that the speaker wants or 'should want' to see in the final outcome of Earth-originating intelligent life”. Obviously, most psychologists, and people in general, define “human values” differently, as most people don’t care about the remote future of humanity in their day-to-day preferences.

The Arbital article on values also presents a list of possible points of view on the nature of human values, which, in short, is:

· Reflective equilibrium. What I should want “given more factual knowledge, time to consider more knowledge, better self-awareness, and better self-control”.

· Standard desires. “An object-level view that identifies value with qualities that we currently find very desirable, enjoyable, fun, and preferable, such as Frankena's list of desiderata”

· Immediate goods. “E.g., ‘Cure cancer’”

· Deflationary moral error theory. “This mostly ends up at an 'immediate goods' theory in practice, plus some beliefs relevant to the value selection debate”

· Simple purpose. “Value can easily be identified with X, for some X”.


Bostrom does not offer a preferred theory of human values, but suggests several instruments for AI alignment which do not require a direct definition of human values and, in some sense, are similar to Coherent Extrapolated Volition (CEV). One such idea is the implementation of a virtual parliament of moral theories, where many simulated human minds are held in a simulation until they come to an agreement regarding the most correct representation of human values.


Sotala analyzed human values in “Defining Human Values for Value Learners” (a discussion of which can be found here). In this article, he listed several problems with the simple model of human values as a utility function:

· “The utility function model of value has difficulty dealing with internal conflicts and higher-order preferences”

· “The utility function model of value ignores the person’s internal experience.”

· “The utility function model of value does not model changing values.”

· “The utility function model of value does not give a way to generalize from our existing values to new ones.”

To solve these problems, he suggests the following definition of values: “…human values are concepts which abstract over situations in which we’ve previously received rewards, making those concepts and the situations associated with them valued for their own sake. A further suggestion is that, as humans tend to naturally find various mental concepts to be associated with affect (the subjective experience of a feeling or emotion, experienced as either positive or negative), the value function might be at least partially encoded in the affect of the various concepts.”

At the end of the article, Sotala suggests useful criteria for estimating the validity of any theory of human values. He outlines that such a theory should be:

● Psychologically realistic

● Compatible with individual variation

● Testable

● Integrated with existing theories

● Suited for exhaustively modeling different values

● Suited for modeling internal conflicts and higher order desires

● Suited for modeling changing and evolving values

● Suited for generalizing from existing values to new ones


Stuart Armstrong has written many articles and posts in which he defines the nature of human values and addresses popular misconceptions.

One such contribution is his proof that it is impossible to distinguish between “values” and “policy” based only on observation of behavior: “The observed behavior can, in principle, be decomposed into two components: a reward function and a planning algorithm, both of which have to be inferred from behavior. This paper presents a No Free Lunch theorem, showing that, without making `normative' assumptions beyond the data, nothing about the human reward function can be deduced from human behavior.” However, he has probably assumed that “human values” = “human reward function”, and that a human reward function is approximately the same as a rational agent’s reward function.
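The decomposition problem can be made concrete with a toy sketch. It is not taken from Armstrong's paper; the two planners, the two reward tables, and the action names are all hypothetical. It only shows that a "rational" planner with one reward and an "anti-rational" planner with the opposite reward produce identical behavior, so behavior alone cannot tell them apart.

```python
# Toy illustration (hypothetical names, not Armstrong's formalism):
# two different (planner, reward) pairs that yield the same observed action.

ACTIONS = ["tea", "coffee"]

# Two candidate reward functions over actions; they rank the actions in opposite order.
reward_likes_tea = {"tea": 1.0, "coffee": 0.0}
reward_likes_coffee = {"tea": -1.0, "coffee": 0.0}


def rational_planner(reward):
    """Pick the action with the highest reward."""
    return max(ACTIONS, key=lambda action: reward[action])


def anti_rational_planner(reward):
    """Pick the action with the lowest reward."""
    return min(ACTIONS, key=lambda action: reward[action])


# Both decompositions predict the same behavior.
observed_a = rational_planner(reward_likes_tea)
observed_b = anti_rational_planner(reward_likes_coffee)
print(observed_a, observed_b)
```

An observer who sees only the chosen action has no basis for preferring one pair over the other without some extra assumption about the planner.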

But the problem is even wider than distinguishing between values and policy: we rather arbitrarily call some part of the process in the human mind “values,” which should be preserved, and call other parts “biases,” “policy,” and “knowledge,” concepts which could be freely changed and don’t have any values inside them.

Armstrong looks at the “ontological nature” of human values, that is, the question of whether they actually exist, in “Learning values, or defining them.” In it, he wrote: “Many attempts at learning human values are framed as ‘humans have an underlying true reward R, and here is procedure P for determining it,’” and “[e]ven if the moral realists are right, and there is a true R, thinking about it is still misleading. Because there is, as yet, no satisfactory definition of this true R, and it's very hard to make something converge better onto something you haven't defined.”

Another post by Armstrong on the indefinability of human values is “Humans can be assigned any values whatsoever,” and the article which covers all these ideas is “Occam's razor is insufficient to infer the preferences of irrational agents.” In the post, he sums up his findings: “Humans have no values… nor do any agent. Unless you make strong assumptions about their rationality. And depending on those assumptions, you get humans to have any values.” He then proceeds with a toy model which shows that even Kolmogorov complexity considerations can’t help to find the true values. Also, it seems that Armstrong uses “human values” and “human reward function” interchangeably, which is not necessarily so, unless we redefine “human reward” in a way which is different from activation of the human reward center and the experience of the qualia of pleasure.

Armstrong argues for the unknowability of human values in his “List of three problems and different approaches to solutions”: “Humans don’t know their own values (sub-issue: humans know their values better in retrospect than in prediction)”. That post also offers a nice table of approaches to solving the AI safety problem.

In the blogpost “Resolving human values, completely and adequately,” he suggests the creation of a model of human values which is neither too abstract nor too narrow, but is adequate to escape disastrous outcomes. He suggests the following adequate definition: “This is defined to be all the value/preference/reward statements that H[uman] might agree to, more or less strongly.”

He also introduced the important idea of normative assumption in “Normative assumptions: regret,” that is, in a nutshell, the idea that any model of “human values” has normative assumptions, i.e., assumptions about which part of the human motivational system is more important than the others. He then suggested that a feeling of regret be used to determine true values: “In a previous post, I presented a model of human rationality and reward as pair (p, R)…[n]ormative assumptions are simply assumptions that distinguish between two pairs (p, R) (p', R') that lead to the same policy: p(R)=p'(R').” He continued, “[h]ow do we, as humans, define our own rationality and reward? Well, one strong way seems to be through our feeling of regret...[i]f we take ‘feelings and expressions of regret encode true reward information’ as a normative assumption, then this restricts the number of (p, R) pairs that are compatible with such an assumption.”
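A minimal sketch of how a regret-based normative assumption could narrow down the compatible pairs. Everything here is a hypothetical toy: the planners `greedy` and `contrarian`, the reward tables, and the reading of "regret encodes true reward information" as "a regretted action has lower reward than the alternatives" are illustrative assumptions, not Armstrong's definitions.

```python
# Toy sketch: filtering (planner, reward) pairs with a regret-based
# normative assumption. All names and numbers are hypothetical.

ACTIONS = ["cake", "salad"]
observed_choice = "cake"
regretted = {"cake"}  # assumption: the human later expressed regret about this action


def greedy(reward):
    """Planner that picks the highest-reward action."""
    return max(ACTIONS, key=reward.get)


def contrarian(reward):
    """Planner that picks the lowest-reward action."""
    return min(ACTIONS, key=reward.get)


candidates = [
    ("greedy", greedy, {"cake": 1.0, "salad": 0.0}),
    ("contrarian", contrarian, {"cake": 0.0, "salad": 1.0}),
]


def explains_behavior(planner, reward):
    """A pair is compatible with the data if it reproduces the observed choice."""
    return planner(reward) == observed_choice


def consistent_with_regret(reward):
    """Assumed reading of regret: regretted actions score below all other actions."""
    others = [a for a in ACTIONS if a not in regretted]
    return all(reward[r] < reward[a] for r in regretted for a in others)


compatible = [name for name, p, r in candidates if explains_behavior(p, r)]
after_regret = [
    name for name, p, r in candidates
    if explains_behavior(p, r) and consistent_with_regret(r)
]
print(compatible, after_regret)
```

Both pairs explain the observed choice, but only one survives once the regret assumption is added, which is the sense in which a normative assumption "restricts the number of (p, R) pairs."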

The case about regret could be made stronger if it incorporated concepts from the existing psychological literature, which has probably already explored the relationship between regret and values. It is also possible to imagine a “hyper-regret disorder,” in which a person will regret all of his/her choices; in that case, regret would be non-informative about preferences.

In another post, “Beyond algorithmic equivalence: self-modelling,” Armstrong defines “human reward” as follows: “we can define the reward of H[uman], as the reward that H models itself as having.” However, people may have a value against value extraction and will deny any correctly extracted values. Also, most humans don’t have anything near a complete model of their own values. Most humans are not that introspective.

In the post “Humans are not agents: short vs. long term” he uses logically contradictory preferences about life expectancy as an example of contradictory values.


Sarma and Hay state in their article “Mammalian value systems” that “[an] agent utilizing Inverse Reinforcement Learning or Bayesian Inverse Planning will learn and refine its model of human values by observing our behavior, it must begin with some very rough or approximate initial assumptions about the nature of the values it is trying to learn.” They then suggest the use of the phrase “mammalian value system” as a starting point, taking into account later cultural effects on values evolution. They suggest the description of the basic mammalian value system be based on the work of Panksepp and Biven who “categorize the informal list given above into seven motivational and emotional systems that are common to mammals: seeking, rage, fear, lust, care, panic/grief, and play.” Sarma and Hay then introduce the term “neural correlate of value” which is, according to them, some subcortical areas which serve as a mechanism for these seven basic mammalian emotions. That article concludes: “we argue that what we colloquially refer to as human values can be informally decomposed into 1) mammalian values, 2) human cognition, and 3) several millennia of human social and cultural evolution.”

In “AI Safety and Reproducibility: Establishing Robust Foundations for the Neuropsychology of Human Values,” Sarma and Hay point out that the reproducibility crisis in psychology makes identification of the correct theory of human values difficult, but urgent actions to that end are necessary to ensure AI safety.

The newest (as of 2018) article by Sarma, Safron and Hay is “Integrative Biological Simulation, Neuropsychology, and AI Safety,” in which they suggest the creation of better biological models of animal minds to develop a better understanding of the nature of motivation.


Luke Muehlhauser wrote in the post “The Human's Hidden Utility Function (Maybe)” that the human utility function does not satisfy the axioms of rationality. He also wrote that humans have three value systems: model-based, model-free (repeating a successful action) and Pavlovian (described as more like unconditional reflexes). He suggests that only the first one is “good,” and should be used as a basis for CEV extrapolation.

He also wrote in the post “Do Humans Want Things” that, as Kahneman and Tversky showed, human choices depend on how the options are framed rather than on underlying values: “as far as we can tell, our behavior is often not determined by our wanting a particular state of affairs, but by how our options are framed.”

In the post “The Neuroscience of Desire,” Muehlhauser starts with the observation (supported by a large bibliography) that decision-making in economics and computer science starts with integrating many dimensions into one scalar parameter, and then comparing such parameters for different options. It looks like the human brain does the same: “More than a dozen studies show that the subjective utility of different goods or actions are encoded on a common scale by the ventromedial prefrontal cortex and the striatum in primates (including humans).”

Then he looks at the neuroscience of choice: “Once a common-currency valuation of goods and actions has been performed, how is a choice made between them? Evidence implicates (at least) the lateral prefrontal and parietal cortex in a process that includes neurons encoding probabilistic reasoning. Interestingly, while valuation structures encode absolute (and thus transitive) subjective value, choice-making structures ‘rescale these absolute values so as to maximize the differences between the available options before choice is attempted,’ perhaps via a normalization mechanism like the one discovered in the visual cortex.”

In the post “A Crash Course in the Neuroscience of Human Motivation,” he wrote a rather long review of different theories of human motivation, beginning from “folk theory” and economics. The article starts with a rather bold statement: “But now, neuroscientists are directly measuring the neurons whose firing rates encode value and produce our choices.”

In the article “The singularity and machine ethics,” Muehlhauser and Helm demonstrate that some (and probably all) known moral theories are unsafe if installed in a powerful optimizer. In Section 5.1 of the article they argue that “Humans don’t know their values,” based on an experiment in which participants explained their preference for faces that they had not actually chosen. They state that “[c]ognitive science suggests instead that our knowledge of our own desires is just like our knowledge of others’ desires: inferred and often wrong.”

Further, they argue against the idea that human beings are rational utility maximizers:

“Ever since M. Friedman (1953), economists have insisted that humans only behave “as if” they are utility maximizers, not that humans actually compute expected utility and try to maximize it. It was a surprise, then, when neuroscientists located the neurons in the primate brain that encode (in their firing rates) the expected subjective value for possible actions in the current “choice set.” Several decades of experiments that used brain scanners and single neuron recorders to explore the primate decision-making system have revealed a surprisingly well-understood reduction of economic primitives to neural mechanisms; for a review see Glimcher (2010). To summarize: the inputs to the primate’s choice mechanism are the expected utilities for several possible actions under consideration, and these expected utilities are encoded in the firing rates of particular neurons. Because neuronal firing rates are stochastic, a final economic model of human choice will need to use a notion of “random utility,” as in McFadden (2005) or Gul and Pesendorfer (2006).”
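To make the notion of “random utility” in the passage above more tangible, here is a minimal sketch of one common textbook form: each option has a fixed utility, independent noise is added at choice time, and the option with the largest noisy value is picked. The choice of Gumbel noise (which yields the standard logit formula) and the option names are illustrative assumptions; the cited papers may use different noise models.

```python
import math
import random


def logit_choice_probabilities(utilities):
    """Closed-form choice probabilities when the noise is i.i.d. standard Gumbel."""
    highest = max(utilities.values())
    weights = {option: math.exp(u - highest) for option, u in utilities.items()}
    total = sum(weights.values())
    return {option: w / total for option, w in weights.items()}


def gumbel_noise(rng):
    """Draw one standard Gumbel sample by inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return -math.log(-math.log(u))


def noisy_choice(utilities, rng):
    """Add independent noise to each utility and pick the largest noisy value."""
    noisy = {option: u + gumbel_noise(rng) for option, u in utilities.items()}
    return max(noisy, key=noisy.get)


# Hypothetical utilities for two goods.
utilities = {"apple": 1.0, "banana": 0.0}
rng = random.Random(0)
samples = [noisy_choice(utilities, rng) for _ in range(20000)]
empirical_apple = samples.count("apple") / len(samples)
predicted_apple = logit_choice_probabilities(utilities)["apple"]
print(f"empirical: {empirical_apple:.3f}, logit prediction: {predicted_apple:.3f}")
```

The point of the sketch is only that a stochastic chooser can still be described by stable underlying utilities: the same option is not chosen every time, yet the choice frequencies follow from the utilities.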

They then look at neurological mechanisms of values: “Recent studies reveal the complexity of subjective values in the brain. For example, the neural encoding of human values results from an interaction of both ‘model-free’ and ‘model-based’ valuation processes.”

They explore the complexity of the relation between personal preferences and choices: “...may be that each human being contains something like a ‘hidden’ utility function (within the model-based valuation system) that isn’t consistently expressed in behavior because choice is also partly determined by other systems whose valuations we wouldn’t reflectively endorse because they are ‘blind’ and ‘stupid’ compared to the more sophisticated goal-directed model-based valuation system.”


Sezener, in the article “Inferring human values for safe AGI design,” suggested that human values are an arbitrarily complex reward function.

Sezener’s main idea is to use Solomonoff induction to find the simplest combination of two programs, one of which encodes the agent’s reward function and the other the agent itself, based on an observable sequence of actions and observations. This is similar to Armstrong’s approach of representing a human as (V, P), a value and policy pair, and then using complexity considerations to find the simplest such pair that explains observable behavior.

Sezener criticizes Hibbard’s approach as follows: “However, a shortcoming of this approach is that what human models say they value and what they value can still be different.”

Sezener also wrote about hidden assumptions in inverse reinforcement learning (IRL): “Soares [7] suggests using methods similar to IRL for learning human values. However, the current IRL methods are limited and cannot be used for inferring human values because of their long list of assumptions. For instance, in most IRL methods the environment is usually assumed to be stationary, fully observable, and sometimes known; the policy of the agent is assumed to be stationary and optimal or near-optimal; the reward function is assumed to be stationary as well; and the Markov property is assumed. Such assumptions are reasonable for limited motor control tasks such as grasping and manipulation; however, if our goal is to learn high-level human values, they become unrealistic”.

The main problems with Sezener’s approach are: 1. it assumes that behavior, and only behavior, is the correct representation of a human reward function (what about unconscious parasitic behavior?); 2. it equates the reward function with values; 3. the model ignores internal contradictions; 4. the model is incomputable; 5. it assumes the simplicity of both the reward function and the agent, which provides a “free lunch” in Armstrong’s terms.


John Maxwell, in the post “Friendly AI through Ontology Autogeneration,” wrote: “If an AI is to be Friendly, it must operate based on an ontology that’s capable of expressing our values,” and, “[r]egardless of the ontology autogeneration algorithm that’s chosen, it’s almost certain that the initial autogeneration will either (a) capture human values with insufficient fidelity or (b) contain so many concepts that finding human values among them will be its own project.”


Hibbard wrote in “Avoiding Unintended AI Behaviors” that, in order to evaluate a policy, an agent “can simply ask model human d to express a utility value between 0 and 1 for the policy.”

This could be called a “counterfactual approval by human model,” which evaluates all possible outcomes of actions. In Hibbard’s case, safe AI consists of two levels: the first creates a model of the world (which includes all humans and their ways of behaving or reacting), and the second calculates how humans in the model will react to possible future histories.
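The two-level structure described above can be sketched as follows. This is a loose toy, not Hibbard's formalism: the lookup-table world model, the two model humans, their rating functions, and the use of a simple average to combine ratings are all hypothetical assumptions made for illustration.

```python
# Toy sketch of "counterfactual approval by human model".
# Level 1: a world model predicts the outcome of each policy.
# Level 2: model humans rate that predicted outcome with a utility in [0, 1].
# All names, numbers, and the averaging rule are hypothetical.

WORLD_MODEL = {
    "build_park": {"jobs": 0.2, "greenery": 0.9},
    "build_factory": {"jobs": 0.9, "greenery": 0.1},
}


def rating_alice(outcome):
    """Model human who mostly cares about greenery."""
    return 0.8 * outcome["greenery"] + 0.2 * outcome["jobs"]


def rating_bob(outcome):
    """Model human who weighs both features equally."""
    return 0.5 * outcome["greenery"] + 0.5 * outcome["jobs"]


MODEL_HUMANS = {"alice": rating_alice, "bob": rating_bob}


def evaluate_policy(policy):
    """Predict the outcome, then ask every model human for a utility value."""
    outcome = WORLD_MODEL[policy]
    ratings = [rate(outcome) for rate in MODEL_HUMANS.values()]
    return sum(ratings) / len(ratings)  # assumption: ratings are simply averaged


best_policy = max(WORLD_MODEL, key=evaluate_policy)
print(best_policy, round(evaluate_policy(best_policy), 3))
```

Note that the agent never observes real approval here; it only queries its own models of humans, which is what makes the approval counterfactual.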


Christiano explored many ideas which could be a proxy for human values. One is the use of human approval (or hypothetical approval) of AI actions, summed up in the quote “Estimate the expected rating Hugh [human] would give each action if he considered it at length. Take the action with the highest expected rating” from the post “Approval-directed agents.” In that case, the idea of “human values” is ignored and replaced with the much more measurable “approval.” There is a lengthy discussion about the robustness and scalability of this approach and its vulnerability to edge cases like wireheading.
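The quoted rule is short enough to write down directly. The sketch below assumes, purely for illustration, that the agent already has a predicted distribution over Hugh's ratings for each action; the action names and numbers are hypothetical, and how such predictions would be obtained is exactly the hard part that the sketch leaves out.

```python
# Toy sketch of the approval-directed rule: estimate the expected rating
# for each action, then take the action with the highest expected rating.
# The predicted rating distributions below are hypothetical.

# For each action: a list of (rating, probability) pairs.
predicted_ratings = {
    "send_draft_now": [(0.9, 0.5), (0.3, 0.5)],
    "ask_for_clarification": [(0.7, 1.0)],
    "delete_old_files": [(0.0, 0.9), (1.0, 0.1)],
}


def expected_rating(distribution):
    """Expected value of the rating under the predicted distribution."""
    return sum(rating * probability for rating, probability in distribution)


def approval_directed_action(predictions):
    """Return the action whose expected rating is highest."""
    return max(predictions, key=lambda action: expected_rating(predictions[action]))


chosen = approval_directed_action(predicted_ratings)
print(chosen)
```

Nothing in the rule refers to outcomes or values directly; the only quantity being maximized is the predicted rating.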


In “Multi-agent predictive minds and AI alignment,” Jan Kulveit combines predictive processing and a multi-agent model of cognition to create a model of the human mind, and then uses it to create approaches to AI alignment.

In his framework:

“…how do motivations and “values” arise? The guess is, in many cases something like a “subprogram” is modelling/tracking some variable, “predicting” its desirable state, and creating the need for action by “signalling” prediction error. Note that such subprograms can work on variables on very different hierarchical layers of modelling - e.g. tracking a simple variable like “feeling hungry” vs. tracking a variable like “social status”.

Such sub-systems can be large: for example, tracking “social status” seems to require lot of computation.” Later, Kulveit states that “In this model, it is hardly possible to disentangle “beliefs” and “motivations” (or values).”
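A very rough sketch of the “subprogram” picture in the quote: each subprogram tracks one variable, holds a predicted desirable state for it, and signals a prediction error when the observed state differs. The class, the variable names, the numbers, and the rule that the largest error drives action are hypothetical simplifications, not part of Kulveit's model.

```python
from dataclasses import dataclass


@dataclass
class Subprogram:
    """Tracks one variable and 'predicts' its desirable state (hypothetical toy)."""
    variable: str
    predicted_desirable: float

    def prediction_error(self, observed):
        """Signed gap between the predicted desirable state and the observation."""
        return self.predicted_desirable - observed[self.variable]


# Subprograms can sit on very different levels of the modelling hierarchy.
subprograms = [
    Subprogram("satiety", 0.8),
    Subprogram("social_status", 0.6),
]

observed = {"satiety": 0.3, "social_status": 0.55}

errors = {s.variable: s.prediction_error(observed) for s in subprograms}

# Assumption: the subprogram with the largest error creates the strongest need for action.
most_urgent = max(errors, key=lambda variable: abs(errors[variable]))
print(errors, most_urgent)
```

In this toy the “belief” (the predicted state) and the “motivation” (the error it generates) are the same number seen from two sides, which loosely mirrors the claim that the two are hard to disentangle.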

Kulveit then suggests that the human mind could be modeled by a system consisting of many subagents, larger than Minsky’s small agents, but smaller than a psychologist’s human-like subpersonalities, which may or may not be aligned with each other. The latter results in internally contradictory behavior. This has different implications for different approaches to AI alignment, of which four are listed: 1. Approval (where “only the output of the internal aggregation has some voice”); 2. CIRL-like; 3. “Alignment with the whole system, including the human aggregation process itself” (that is, not just its output); 4. “Adding layers of indirection to the question”.

Kavanagh, Linebarger and Schoots

Sabrina Kavanagh, Erin Linebarger and Nandi Schoots from the “Human preferences” team at the 2018 AI Safety Camp 2 in Prague wrote “Acknowledging Human Preference Types to Support Value Learning.” They start with the idea that internal conflicts in humans could be explained by different preference types, and explore three such types: liking, wanting, and approval. They list all eight combinations of these three preferences and demonstrate that each corresponds to some type of behavior.

They later state: “Liking, wanting and approving are for the most part hidden processes. They are not directly observable, but they influence observable behaviors. As a proxy for liking we propose to use facial expressions, body language or responses to questionnaires. Although a cognitive scan may be the most accurate proxy for liking, there is evidence to suggest both that facial expressions and body language are indicators of pleasure and pain [Algom et al. 1994] and that they can be classified well enough to make them technically feasible proxies [Giorgiana et al. 2012]. The observable proxy of wanting is revealed preferences. We propose to encode a proxy for approval via stated preferences.”
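The eight combinations and the proposed proxies can be laid out in a few lines. The data structure below is only an illustrative encoding: the proxy descriptions paraphrase the quote, and the behaviors that the authors associate with each combination are not reproduced here.

```python
from itertools import product

PREFERENCE_TYPES = ("liking", "wanting", "approving")

# Observable proxies proposed for each hidden preference type (paraphrased).
PROXIES = {
    "liking": "facial expressions, body language, or questionnaire responses",
    "wanting": "revealed preferences",
    "approving": "stated preferences",
}

# Every way the three preference types can be present (True) or absent (False).
combinations = [
    dict(zip(PREFERENCE_TYPES, flags))
    for flags in product([True, False], repeat=len(PREFERENCE_TYPES))
]

for combo in combinations:
    present = [name for name, flag in combo.items() if flag]
    print(", ".join(present) if present else "none of the three")
```

A conflict between preference types then shows up as a combination in which the flags disagree, for example wanting without approving.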


Scott Alexander wrote “To what degree do we have goals,” in which he explores the idea that only conscious personal values should be taken into account, and unconscious ones should be ignored.

In the post “Would Your Real Preferences Please Stand Up?” he argues that in many cases our declared preferences are just “social signaling.”


Drexler, in “Reframing superintelligence: Comprehensive AI Services as General Intelligence,” (p.152) plainly states that: “It seems impossible to define human values in a way that would be generally accepted.”


Yampolskiy is also skeptical that human values could be formalized: “human values are inconsistent and dynamic and so can never be understood/programmed into a machine. Suggestions for overcoming this obstacle require changing humanity into something it is not, and so by definition destroying it” (Roman Yampolskiy on AI Safety Engineering).

In the article “Personal Universes: A Solution to the Multi-Agent Value Alignment Problem” he suggests escaping the difficult value aggregation problem by creating personal universes “which are optimally and dynamically adjusting to align their values and preferences.”


Gordon Worley wrote in “Formally Stating the AI Alignment Problem” that artificial general intelligence (AGI) alignment needs to take into account the mental phenomena of consciousness, and that human values should be described by the instruments of phenomenology.

In personal communication, he clarifies his position: “So my view is that values are inextricably tied to the existence of consciousness because they arise from our self-aware experience. This means I think values have a simple, universal structure and also that values are rich with detail in their content within that simple structure. This view also necessarily suggests values are not fully discoverable via behavioral methods and that there is always a hidden, internal part that may not even be accessible by the agent themselves.”

Conclusions about the different theories of human values put forward by AI safety researchers

There are several internally consistent theories of human values suggested by AI safety researchers in the existing literature. However, despite their internal consistency, these theories are not very compatible with each other and present a very wide set of opinions.

There are also other “possible theories,” that is, approaches which have not yet (to the best of my knowledge) been put forth by any researcher, but which could be generated based on the same principles as other theories. One is just to assume that all human values are products of evolutionary fitness and can be derived from basic evolutionary considerations in the same way as Omohundro’s basic drives. This could explain most basic human drives like survival, sex, status-seeking, exploration instinct, etc. Another such theory is that an AI Oracle should first read the extant psychological literature, choose the best theory of mind, and create its structure of human values based on that theory.

From this, we could conclude that internal consistency, experimental support, and an extensive literature are not enough to provide us with a “correct” theory of human values. We must dig deeper into the question: “do human values exist at all?”

This work was done during AI Safety Camp 2 in Prague (2018) in the “hidden assumptions” team, but all possible errors are my own. The work is part of a longer text about the nature of human values.

3 comments

Looks like the images are not loading for me, even when I try to follow the links to them.

Try these links:

Fig 1: https://i.imgur.com/sef5SgH.jpg

Fig 2: https://i.imgur.com/EOWwz4x.jpg

Nice overview :) One point: the introductory sentences don't seem to match the content.

It is clear to most AI safety researchers that the idea of “human values” is underdefined, and this concept should be additionally formalized before it can be used in (mostly mathematical) models of AI alignment.

In particular, I don't interpret most of the researchers you listed as claiming that "[human values] should be formalized". I think that's a significantly stronger claim than, for example, the claim that we should try to understand human values better.