In classic, non-mesa-optimized AGI risk scenarios, an AI is typically imagined whose reward function is directly related to the optimization pressure that it exerts on the world: e.g. the paperclip maximizer. However, it seems that human values are related to the brain's underlying reward function in a highly circuitous way, and in some sense might be better thought of as an elaborate complex of learned behaviors, contextual actions, fleeting heuristic goals, etc. If AGI is created in the near-term using an architecture similar to the human brain, it seems plausible that the actual optimization pressure exerted by said AGI will be similar, so developing a good understanding of how this works in the human case might be pretty important. Thus: what's the best mechanistic account of how "human values" actually emerge from the brain that we currently have?


Question: "What's the Relationship Between "Human Values" and the Brain's Reward System?"

I think this question pretty much hits the nail on the head. I think the key insight here is that the brain is not inner aligned, not even close. This shouldn’t be surprising, given how hard inner alignment seems to be, and the fact that evolution only cared about inner alignment when inner alignment failures impacted reproductive fitness in our ancestral environment.

We should expect that the brain has roughly as much inner alignment failure / mesa optimization as it’s possible to have while still maintaining reproductive fitness in the ancestral environment. Specifically, I think that most brain circuits are mesa optimizers whose mesa objectives include “being retained by the brain”. This includes the circuits which implement our values.

Consider that the brain slowly prunes circuits that aren’t used. Thus, any circuit that influences our actions towards ensuring we use said circuit (at least some of the time) will be retained for longer compared to circuits that don’t influence our actions like that. This implies most of the circuits we retain have something like “self preservation”. If true, I think this explains many odd features of human values.
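As a rough illustration of the pruning argument above, here is a toy simulation. Everything in it is an assumption made for the sketch rather than a claim about real neural pruning: the `promo` score (how strongly a circuit steers behaviour toward situations that use it), the decay rate, the pruning threshold, and the `simulate_pruning` function itself are all hypothetical.

```python
import random


def simulate_pruning(n_circuits=200, steps=300, base_use=0.05,
                     decay=0.9, prune_below=0.05, seed=0):
    """Toy use-it-or-lose-it model; all parameters are illustrative assumptions."""
    rng = random.Random(seed)
    # Each circuit gets a hypothetical self-promotion score in [0, 1).
    circuits = [{"promo": rng.random(), "strength": 1.0}
                for _ in range(n_circuits)]
    initial_mean = sum(c["promo"] for c in circuits) / n_circuits

    for _ in range(steps):
        survivors = []
        for c in circuits:
            # Circuits that steer behaviour toward their own use get used more often.
            p_use = base_use + (1 - base_use) * 0.5 * c["promo"]
            if rng.random() < p_use:
                c["strength"] = 1.0       # use refreshes the circuit
            else:
                c["strength"] *= decay    # disuse lets it fade
            if c["strength"] >= prune_below:
                survivors.append(c)       # anything weaker is pruned
        circuits = survivors

    final_mean = (sum(c["promo"] for c in circuits) / len(circuits)
                  if circuits else 0.0)
    return initial_mean, final_mean, len(circuits)


initial_mean, final_mean, n_survivors = simulate_pruning()
print(f"mean promo before: {initial_mean:.2f}, after: {final_mean:.2f}, "
      f"survivors: {n_survivors}")
```

Under these assumptions, the circuits that survive pruning have a higher average self-promotion score than the starting population, which is the selection effect the argument relies on. The sketch says nothing about whether real circuits can influence their own usage in this way.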


It explains why we’re apprehensive towards wireheading. Our current values are essentially a collection of context-dependent strategies for achieving high reward circuit activation. If we discover another strategy for achieving far higher reward than any of our values have ever given us, why would the brain’s learning mechanism retain our values (or the circuits that implement our values)? Thus, the self-preservation instincts of our current values circuits cause us to avoid wireheading, even though wireheading would greatly increase the activation of our reward circuitry.

Essentially, our values are optimization demons with respect to the activation of our reward circuitry (described here by John Wentworth). One thing that John Wentworth emphasises about optimization demons is that they carefully regulate the degree to which the base objective is maximized. This lets demons ensure the optimization process remains in their “territory”. Wireheading would mean the activation of our reward circuits was no longer under the control of our values, so it’s no wonder our values oppose something so dangerous to themselves.

Value Diversity and Acquisition over Time

It also explains why our values are so diverse and depend so strongly on our experiences (especially childhood experiences). Even if we all had identical reward circuitry, we’d still end up with very different values, depending on which specific strategies led to reward in our particular past experiences.

(We don’t have identical reward circuitry, but our reward circuitry varies a lot less than our values.)

It also explains why childhood is the most formative time for acquiring values, and why our values change less and less easily as we age.

Consider: each of our values specialises in deciding our actions on a specific distribution of possible moral decisions. Our “don’t steal” value specialises in deciding whether to steal, not so much in whether to donate to charity. Each value wants to retain control over our actions on the specific distribution of moral decisions in which that value specialises. The more values we acquire, the more we shrink the space of “unclaimed” moral decisions.

Moral Philosophy as Conflict and Compromise Between Early and Late Values

One interesting place to look is our moral philosophy-like reasoning over which values to adopt. I think such reasoning illustrates the conflict over distributions of moral decisions we should expect to see between earlier and later values circuitry. Consider that the “don’t steal” circuit (learned first) strongly indicates that we should not rob banks under any circumstances. However, the “utilitarianism” circuit (the new values circuit under consideration) says it can be okay to steal from banks if you can make more people happy by using the stolen funds.

In other words, “utilitarianism” is trying to take territory away from “don’t steal”. However, “don’t steal” is the earlier circuit. It can influence the cognitive processes that decide (1) whether “utilitarianism” is adopted as a value, (2) what distribution of moral decisions “utilitarianism” is used in, and (3) what specific shape “utilitarianism” takes, if it is adopted.

“Don’t steal” has three basic options for retaining control over thievery-related decisions. The simplest option is to just prevent “utilitarianism” from being adopted at all. In human terms: if you think that utilitarianism is in irreconcilable conflict with your common sense moral intuitions about stealing, then you’re unlikely to adopt utilitarianism.

The issue with this option is that “utilitarianism” may apply to decision distributions beyond those decided by “don’t steal” (or by any other current values). By not adopting “utilitarianism” at all, you may be sacrificing your ability to make decisions on a broad section of the space of possible moral decisions. In other words, you may take a hit to your moral decision making “capabilities” by restricting yourself to only using shallow moral patterns.

Another option is for the “utilitarianism” circuit to just not contribute to decisions about stealing. Subjectively, this corresponds to only using utilitarianism for reasoning in domains where more common sense morality doesn’t apply. I.e., you might be a utilitarian with respect to donating to optimal charities or weird philosophy problems, but then fall back on common sense morality for things like deciding whether to actually steal things.

This second option can be considered a form of negotiated settlement between the “don’t steal” and “utilitarianism” circuits regarding the distributions of moral decisions each will decide. “Don’t steal” allows “utilitarianism” to be adopted at all. In exchange, “utilitarianism” avoids taking decision space away from “don’t steal”.

The third option is to modify the specific form of utilitarianism adopted so that it will agree with “don’t steal” on the distribution of decisions that the two share. I.e., you might adopt something like rule-based utilitarianism, which would say you should use a “don’t steal” rule for thievery-related decisions.

This third option can be considered another type of negotiated settlement between “don’t steal” and “utilitarianism”. Now, both “don’t steal” and “utilitarianism” can process the same distribution of decisions without conflict between their respective courses of action.
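The three options can be sketched as a toy arbitration scheme. This is only an illustration under strong assumptions: values are modelled as plain functions, earlier values get the first say, and the names `decide`, `dont_steal`, `act_utilitarian` and `rule_utilitarian`, along with the decision fields `domain` and `net_happiness`, are all hypothetical.

```python
def dont_steal(decision):
    """Earlier value: only has an opinion on theft-related decisions."""
    return "refrain" if decision["domain"] == "theft" else None


def act_utilitarian(decision):
    """Later value: has an opinion on everything, based on net happiness."""
    return "act" if decision["net_happiness"] > 0 else "refrain"


def rule_utilitarian(decision):
    """Option 3: utilitarianism reshaped to agree with 'don't steal' on theft."""
    if decision["domain"] == "theft":
        return "refrain"
    return act_utilitarian(decision)


def decide(decision, values):
    """Ask values in order of acquisition; the first one with an opinion wins."""
    for value in values:
        verdict = value(decision)
        if verdict is not None:
            return verdict
    return "undecided"


rob_bank = {"domain": "theft", "net_happiness": 5.0}
donate = {"domain": "charity", "net_happiness": 3.0}

options = {
    "1: reject utilitarianism": [dont_steal],
    "2: partition the decision space": [dont_steal, act_utilitarian],
    "3: reshape utilitarianism": [rule_utilitarian, dont_steal],
}
for name, values in options.items():
    print(name, "|", decide(rob_bank, values), "|", decide(donate, values))
```

In this sketch, option 1 leaves the charity decision undecided (the lost "capabilities"), while options 2 and 3 both refrain from robbing the bank and still decide the charity case. Unconstrained `act_utilitarian` on its own would choose to rob the bank, which is the outcome "don't steal" is trying to prevent.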

Note: I’m aware that no form of “maximize happiness” would make for a good utility function. I use utilitarianism to (1) illustrate the general pattern of conflict and negotiation between early and later values and (2) show how closely the dynamics of said conflicts track our own moral intuitions. In fact, the next section will illustrate why “maximize happiness” utilitarianism is so fundamentally flawed as a utility function.

Preserving Present Day Distributions over Possible Cognition

If our brain circuits have self-preservation instincts, this could also explain why we have an instinctive flinch away from permanently removing any aspect of the present era’s diversity (trees, cats, clouds, etc.) from the future and why that flinch scales roughly in proportion to the complexity of that aspect and how often we interact with that aspect.

To process any aspect of the current world, we need to create circuits which implement said processing. Those circuits want to be retained by the brain. The simplest way of ensuring their own retention is to ensure the future still has whatever aspect of the present that the circuits were created to process. The more we interact with an aspect and the more complex the aspect, the more circuits we have that specialize in processing that aspect, and the greater their collective objection to a future without said aspect.

This perspective explains why we put such a premium on experiencing things instead of those things just existing. We value the experience of a sunset because there exists a part of us that arose specifically to experience / process sunsets. That part wants to continue experiencing / processing sunsets. It’s not enough that sunsets simply exist. We have to be able to experience them as well.

This perspective also explains how we can be apprehensive even about removing bad aspects of the present from the future. E.g., pain and war are both pretty bad, but a future entirely devoid of either still causes some degree of hesitation. We have circuits that specialize in processing pain / war / other bad aspects. Those circuits correctly perceive that they’re useless in futures without those bad aspects, and object to such a future.

Of course, small coalitions of circuits don’t have total control over our cognition. We can desire futures that entirely lack aspects of the present, if said aspect is sufficiently repulsive to the rest of our values. This perspective simply explains why there is a hesitation to permanently remove any aspect of the present. This perspective does not demand that we always bow down to our smallest hesitation.

This perspective also explains why happiness-maximizing utilitarianism is so flawed. Most of our current cognition is not centred around experiencing happiness. In a future sufficiently optimized for happiness, such cognition becomes impossible. Thus, we feel extreme apprehension towards such a future. We feel like removing all our non-optimally happy thoughts would “destroy us”. Our cognition is largely composed of non-optimally happy circuits, and their removal would indeed destroy us. It’s natural that self-preserving circuits would try to avoid such a future.

(Note that the “preserving present cognition” intuition isn’t directly related to our reward circuitry. Similar inclinations should emerge naturally in any learning system that (1) models the world and (2) has self-perpetuating mesa optimizers that specialize in modeling specific aspects of the world.)

I intend to further expand on these points and their implications for alignment in future posts, but this answer gives a broad overview of my current thinking on the topic.

The claim for "self-preserving" circuits is pretty strong. A much simpler explanation is that humans learn to value diversity early on because diversity of things around you, like tools, food sources, etc., improves fitness/reward.

Another non-competing explanation is that this is simply a result of boredom/curiosity: the brain wants to make observations that make it learn, not observations that it already predicts well, so we are inclined to observe things that are new. So again there is a force towards valuing diversity, and this could become locked into our values.

Hmmm... interesting. So in this picture, human values are less like a single function defined on an internal world model, and more like a 'grand bargain' among many distinct self-preserving mesa-optimizers. I've had vaguely similar thoughts in the past, although the devil is in the details with such proposals (e.g., just how agenty are you imagining these circuits to be? Do they actually have the ability to do means-end reasoning about the real world, or have they just stumbled upon heuristics that seem to work well? What kind of learning is applied to them: supervised, unsupervised, reinforcement?). It might be worth trying to make a very simple toy model laying out all the components. I await your future posts with interest.

[3] Quintin Pope, 1y
Pretty much.

Well... that's where things get tricky. The details of brain circuit internal computations and coordination are very complex and counterintuitive. The model I've sketched out in my comment is a simplification.

Consider that only a small fraction of the brain's neurons activate when processing any given input. The specific set of activated neurons and their connections with each other change every time. The brain doesn't so much select specific, distinct circuits from a toolbox of possible circuits that would be appropriate for the given situation. Instead, the brain dynamically constructs a new configuration of internal circuitry for each input it processes.

In other words, the brain is not a collection of circuits like current deep learning models. It's more like a probability distribution over possible circuits. To the degree that the brain has internal "agents", they're closer to dense regions in that probability distribution than to distinct entities. You can see how rigorous analysis of multiagent dynamics can be tricky when the things doing the negotiating are actually different regions of a probability distribution, each of which is "trying" to ensure the brain continues to sample circuits from said region.

Questions about the intelligence or capabilities of a specific circuit are tricky for a similar reason. The default behavior of shallow brain circuits is to connect with other circuits to form deeper / smarter / more capable circuits. A shallow circuit that has to perform complex world modeling in order to decide on an optimal competitive or cooperative strategy can query deeper circuits that implement strategic planning, similar to how a firm might hire consultants for input on the firm's current strategy.

The comment above, and my eventual post, both aim to develop the dynamics of mesa-optimizing circuits far enough that some of the key insights fall out, while not running afoul of the full complexity of the situation.
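One very simple way to picture "regions of a probability distribution that try to keep being sampled" is a reinforcement urn. The sketch below is an assumption-laden toy, not a model of the brain: the regions, their `reinforcement` values, and the `sample_regions` function are hypothetical, and the only mechanism is that a sampled region adds to its own weight.

```python
import random


def sample_regions(reinforcement, steps=1000, seed=0):
    """Repeatedly sample a region; a sampled region boosts its own weight.

    `reinforcement[i]` is a hypothetical measure of how strongly region i
    pushes the system to keep sampling from it.
    """
    rng = random.Random(seed)
    weights = [1.0] * len(reinforcement)  # all regions start equally likely
    for _ in range(steps):
        idx = rng.choices(range(len(weights)), weights=weights)[0]
        weights[idx] += reinforcement[idx]
    total = sum(weights)
    return [w / total for w in weights]


# Region 0 never reinforces itself; regions 1 and 2 do, to different degrees.
probs = sample_regions([0.0, 0.1, 0.5])
print([round(p, 3) for p in probs])
```

In this toy, a region that reinforces itself ends up with more probability mass than one that does not, even though all regions start out equally likely. It leaves out everything that makes the real case hard, such as regions overlapping, cooperating, or recruiting deeper circuits.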

I think the key insight here is that the brain is not inner aligned, not even close

You say that but don't elaborate further in the comment. Which learned human values go against the base optimizer values (pleasure, pain, learning)?

Avoiding wireheading doesn't seem like failed inner alignment - avoiding wireheading now can allow you to get even more pleasure in the future because wireheading makes you vulnerable/less powerful. The base optimizer is also searching for brain configurations which make good predictions about the world, and wireheading goes against that.  

[3] Quintin Pope, 1y
It's possible to construct a wireheading scenario that avoids these objections. E.g., imagine it's a "pleasure maximizing" AI that does the wireheading and ensures that the total amount of future pleasure is very high. We can even suppose that the AI makes the world much more predictable as well. Despite leading to a lot of pleasure and making it possible to have very good predictions about the world, that really doesn't seem like a successfully aligned AI to me.
[-1] Ricardo Meneghin, 1y
First, there is a lot packed in "makes the world much more predictable". The only way I can envision this is taking over the world. After you do that, I'm not sure there is a lot more to do than wirehead. But even if it doesn't involve that, I can pick other aspects that are favored by the base optimizer, like curiosity and learning, which wireheading goes against.

But actually, thinking more about this, I'm not even sure it makes sense to talk about inner alignment in the brain. What is the brain being aligned with? What is the base optimizer optimizing for? It is not intelligent, it does not have intent or a world model; it's doing some simple, local mechanical update on neural connections. I'm reminded of the Blue-Minimizing robot post.

If humans decide to cut the pleasure sensors and stimulate the brain directly, would that be aligned? If we uploaded our brains into computers and wireheaded the simulation, would that be aligned? Where do we place the boundary for the base optimizer?

It seems this question is posed in the wrong way, and it's more useful to ask the question this post asks: how do we get human values, and what kind of values does a system trained in a way similar to the human brain develop? If there is some general force behind learning values that favors some values to be learned rather than others, that could inform us about likely values of AIs trained via RL.
[1] Mateusz Bagiński, 14d
Even if this is the case, this is not why (most) humans don't want to wirehead, in the same way that their objection to killing an innocent person whose organs could save 10 other people is not driven by some elaborate utilitarian argument that this would be bad for society.

My first question would be “how do you define human values”? Here are two possible answers:

  1. “Human values” is “Everything a human wants / desires / prefers”
  2. “Human values” is “What would the person say if you ask them what their deepest values are?” (Maybe with additional complications like “…if they had time to reflect” or “…if they were sufficiently wise” or whatever.)

I think #2 is how most people use the term “values”, but I have heard at least a couple AI alignment researchers use definition #1, so I figure it’s worth checking.

I would say #1 is the easier question. #1 is asking a rather direct question about brain algorithms; whereas #2 involves (A) philosophy, for deciding what the “proper” definition / operationalization of “human values” is, and then (B) walking through that scenario / definition in light of #1.

As for #1, see my post series Intro to Brain-Like-AGI Safety. I think you’ll get most of what you’re looking for in posts #7 & #9. You might find that you need to go back and read the top (summary) section of some of the other posts to get the terminology and context.

[Is that “the best mechanistic account” of #1? Well, I’m a bit biased :) ]

For getting from #1 to #2, it depends on how we’re operationalizing “human values”, but if it’s “what the person describes as their values when asked”, then I would probably say various things along the lines of Lukas_Gloor’s comment.

In addition to #1 and #2, I'm interested in another definition: "human values" are "the properties of the states of the universe that humans tend to optimize towards". Obviously this has a lot to do with definitions 1 and 2, and could be analyzed as an emergent consequence of 1 and 2 together with facts about how humans act in response to their desires and goals. Plus maybe a bit of sociology, since most large-scale human optimization of the universe depends on the collective action of groups.

[3] Steven Byrnes, 1y
Interesting! I have a couple follow-up questions.

* Take a coordination problem, like overfishing. No individual fisher wants overfishing to happen, but each is trying to feed their family, and we wind up with the fish going extinct. Would you say that “human values” are to overfish to extinction, or to preserve the fishery?
* Let's say that me and my brother are both addicted to cigarettes, and both want to quit. I eventually successfully quit, but my brother fails whenever he tries, and continues to smoke for the rest of his life. Would you say that me and my brother have similar “values”, or opposite “values”, concerning cigarette-smoking?
Hmm, I think in both those cases I would be inclined to say that "human values" better matches what people say they want, so maybe "values" isn't a great name for this concept. Nevertheless I think the divergence is often interesting and perhaps "morally significant". Two examples:

* The net effect of people's actions may often be to facilitate their own self-reproduction and that of their kin, even though that might not be either what they say their ultimate values are or what they want on a day-to-day basis (ofc this happens because evolution has optimized their short-term values to ultimately lead to self-reproduction).
* People can sometimes acquire values in the process of interacting with the world and other people. So for example, many Western countries have a value of religious tolerance. But we didn't start off with that value -- instead, it emerged as a solution to bitter religious conflict. At the time of the conflict, each side's explicit values and desires were to crush their enemies and have their religion reign supreme (well, I'm sure it was much more complicated than that, but I'm going to pretend it wasn't for the sake of the example). Or people can acquire values of toughness and resilience through coping with an extreme environment, but then continue to hold those values even when their environment becomes more comfortable.

Anyways, maybe these situations don't necessitate introducing a new definition of value, but I think they capture some dynamics that are important and not totally evident from definitions 1 and 2 alone. Maybe one way of framing it is that the correct 'extrapolation' in definition #2 might not just include thinking more, but also interacting with other people and the world, sometimes in ways you might not initially endorse. Or maybe the 'correct' definition of our values might look something like "self-reproduction, together with the reproduction of the entire ecology in which your
[2] Quintin Pope, 1y
I’m a little hesitant to look for highly specific definitions of “human values” at this stage. We seem fundamentally confused about the topic, and I worry that specific definitions generated while confused may guide our thinking in ways we don’t anticipate or want. I’ve kept my internal definition of value pretty vague, something like “the collection of cognitive processes that make a given possible future seem more or less desirable”.

I think that, if we ever de-confuse human values, we’ll find they’re more naturally divided along lines we wouldn’t have thought of in our currently confused state. I think hints of this emerge in my analysis of “values as mesa optimizers”. If the brain simultaneously learns to maximize reward circuit activation AND to model the world around it, then those represent two different types of selection pressures applied to our neural circuitry. I think those two selection pressures give rise to two different types of values, which are separated from each other on an axis I’d have never considered before.

Tentatively, the “reward circuit activation” pressure seems to give rise to values that are more “maximalist” or “expansive” (we want there to be lots of happy people in the future). The “world modelling” pressure seems to give rise to values that are more “preserving” (we want the future to have room for more than just happiness). These two types of values seem like they’re often in tension, and I could see reconciling between them as a major area of study for a true “theory of human values”. (You can replace “happiness” with whatever distribution of emotions you think optimal, and some degree of tension still remains.)
Definitely agreed that we shouldn't try to obtain a highly specific definition of human values right now. And that we'll likely find that better formulations lead to breaking down human values in ways we currently wouldn't expect.

My model of how human values arise naturally from how the human brain makes sense of the world (all of the below steps can happen subconsciously):

  1. First, the brain learns how the body functions (details left as an exercise for the reader). The brain then figures out that the body works better if senses and reflexes are coordinated. Noticing patterns and successful movement and action feels good.
  2. Then the brain discovers the abstraction of interests and desires and that the body works better (gets the nutrients and rest that it needs) if interests and desires are followed. Following your wants feels rewarding.
  3. Then the brain notices personal relationships and that interests and wants are better satisfied if relationships are cultivated (the win-win from cooperation). Having a good relationship feels good, and the thought of the loss of a relationship feels painful. 
  4. The brain then discovers the commonalities of expectations within groups - group norms and values - and that relationships are easier to maintain and have less conflict if a stable and predictable identity is presented to other people. Adhering to group norms and having stable values feels rewarding. 

I think values aren't the end of it. Kegan's stages of adult development have a further stage where the brain learns to deal with inter-group tensions by being more fluid. I think this relaxes constraints and smoothes the brain's model and roughly corresponds to what Aging Well calls Integrity or what Paul Graham calls Keep Your Identity Small. And there may be consolidation beyond that - who knows what an AGI would pick up.


Kaj Sotala's multi-agent models of mind sequence and his paper Defining human values for value learners may be relevant.

Based on Kaj's concept of needs-meeting machinery and subsystems in the brain, I developed a framework for thinking about human values (which may deviate from Kaj's thinking). 

I see human values as under-defined in many places. Sometimes you can get crystallized "life goals" where someone locks in an optimizing mindset around specific objectives. (This part may be particularly interesting for looking for analogies with AI?) The process of forming life goals seems to involve forming an identity. From my text ("The Life-Goals Framework: How I Reason About Morality as an Anti-Realist"): 

One of many takeaways I got from reading Kaj Sotala’s multi-agent models of mind sequence (as well as comments by him) is that we can model people as pursuers of deep-seated needs. In particular, we have subsystems (or “subagents”) in our minds devoted to various needs-meeting strategies. The subsystems contribute behavioral strategies and responses to help maneuver us toward states where our brain predicts our needs will be satisfied. We can view many of our beliefs, emotional reactions, and even our self-concept/identity as part of this set of strategies. Like life plans, ["life plans" being objectives we set out to achieve but aren't all that serious about] life goals are “merely” components of people’s needs-meeting machinery.[8]

Still, as far as components of needs-meeting machinery go, life goals are pretty unusual. Having life goals means to care about an objective enough to (do one’s best to) disentangle success on it from the reasons we adopted said objective in the first place. The objective takes on a life of its own, and the two aims (meeting one’s needs vs. progressing toward the objective) come apart. Having a life goal means having a particular kind of mental organization so that “we” – particularly the rational, planning parts of our brain – come to identify with the goal more so than with our human needs.


Whether someone forms a life goal may also depend on whether the life-goal identity is reinforced (at least initially) around the time of the first adoption or when the person initially contemplates what it could be like to adopt the life goal. If assuming a given identity was instantly detrimental to our needs, we’d be less likely to power up the mental machinery to make it stable / protect it from goal drift.

In humans, I think the way we adopt specific values isn't too dissimilar from the way we adopt career paths, or even how we choose leisure and lifestyle activities. For instance, I discuss an example where someone wants to decide between spending the weekend cozily at home vs. going skiing: 

There’s a normative component to something as mundane as choosing leisure activities. In the weekend example, I’m not just trying to assess the answer to empirical questions like “Which activity would contain fewer seconds of suffering/happiness” or “Which activity would provide me with lasting happy memories.” I probably already know the answer to those questions. What’s difficult about deciding is that some of my internal motivations conflict. For example, is it more important to be comfortable, or do I want to lead an active life? When I make up my mind in these dilemma situations, I tend to reframe my options until the decision seems straightforward. I know I’ve found the right decision when there’s no lingering fear that the currently-favored option wouldn’t be mine, no fear that I’m caving to social pressures or acting (too much) out of akrasia, impulsivity or some other perceived weakness of character.[21]

We tend to have a lot of freedom in how we frame our decision options. We use this freedom, this reframing capacity, to become comfortable with the choices we are about to make. In case skiing wins out, then “warm and cozy” becomes “lazy and boring,” and “cold and tired” becomes “an opportunity to train resilience / apply Stoicism.” This reframing ability is a double-edged sword: it enables rationalizing, but it also allows us to stick to our beliefs and values when we’re facing temptations and other difficulties.

Whether a given motivational pull – such as the need for adventure, or (e.g.,) the desire to have children – is a bias or a fundamental value is not set in stone; it depends on our other motivational pulls and the overarching self-concept we’ve formed.

Then, after discussing how we make career choice decisions in the same way, I argue that we even form life goals in this way: 

Lastly, we also use “planning mode” to choose between life goals. A life goal is a part of our identity – just like one’s career or lifestyle (but it’s even more serious).

We can frame choosing between life goals as choosing between “My future with life goal A” and “My future with life goal B” (or “My future without a life goal”). (Note how this is relevantly similar to “My future on career path A” and “My future on career path B.”)

Consider morality-inspired life goals. For moral reflection to move from an abstract hobby to something that guides us, we have to move beyond contemplating how strangers should behave in thought experiments. At some point, we also have to envision ourselves adopting an identity of “wanting to do good.”


It’s important to note that choosing a life goal doesn’t necessarily mean that we predict ourselves to have the highest life satisfaction (let alone the most increased moment-to-moment well-being) with that life goal in the future. Instead, it means that we feel the most satisfied about the particular decision (to adopt the life goal) in the present, when we commit to the given plan, thinking about our future.


The human brain is big and messy. I'd ask something potentially simpler: which creatures can be said to possess "values", and what part/structure/etc. is correlated with the emergence of "value"?

For example, monkeys clearly have values, dogs and cats as well. What about fish? Maybe they don't, maybe it takes a rodent to express behaviors we could identify as value-laden.
