Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I have two strong intuitions about human values that had seemed utterly irreconcilable to me, up until recently.

  • On the one hand, human values are clearly inchoate, unstable messes of niche heuristics and preferences that often contradict each other and can change on a dime. A twenty-year-old doesn't value all the same things their fifty-year-old self will value, modern Europeans don't value the very same things as ancient Egyptians, etc. And how can it be otherwise? Evolution-built bodies are messy hacks. Why would evolution-built minds be any different?
  • On the other hand, my models of human psyche tell me that humans are clearly approximate utility-maximizers for some very specific utility function. This utility function is stable across time and lifetimes and cultures, despite the fact that the object-level behaviors humans engage in and their stated preferences can change arbitrarily. This utility function has something to do with human "flourishing", whatever that is, and with things that feel a very special kind of "right" according to a human's judgement. There is a strong sense in which we all want the same thing, on some deeply abstract level.

I believe I see a way to unify the two. The crucial insights have been supplied by the Shard Theory, and the final speculative picture is broadly supported by it. This result mostly dissolved all of my high-level confusions about human values and goal-directed behavior, in addition to satisfying a lot of other desiderata.


1. The Shard Theory of Human Value: A Recap

Disclaimer: This summary does not represent the views of Team Shard, but only my subjective understanding. For the official summary, see this.

According to the Shard Theory, in the course of brain development, humans jointly learn two things:

  • A world-model.
  • Heuristics attached to that world-model, reinforced by the credit-assignment algorithm because their execution historically led to reward.

The latter are "shards". In their most primitive form, they're just activation patterns. You see a lollipop enter your field of vision, you grab it. You see a flashlight pointed at your face, you close your eyes.

As the world-model grows more advanced, the shards can grow more sophisticated as well. Instead of only attaching to observations, they can attach to things in the world-model. If you're modelling the world as containing a lollipop the next room over, your lollipop-shard will bid for a plan to go grab it. If your far-future model says that becoming a salaried professional will give you enough income to buy a lot of lollipops, your lollipop-shard will bid for it.

A lot of other values and habits are implemented the same way. The desire to do nice things for people you like, the avoidance of life-threatening situations, the considerations that go into the choice of career — all of those are just shard-implemented reaction patterns, which react to things in your world-model and bid for particular responses to them. If you expect someone you like to be unhappy, a shard activates, bidding for an action-sequence that changes that prediction. If you expect to be in a life-threatening situation, a whole bunch of shards rebel against that vision. If you're considering career choices, you're choosing between different models of the future, and whichever wins the "popularity contest" among the shards is what ends up implemented.
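The "popularity contest" above can be caricatured in code. This is purely my own toy sketch, not part of the theory: shards are modeled as simple functions that score predicted world-states, and the winning plan is just the one with the highest total bid.

```python
# Toy sketch (my own illustration): shards as simple scoring functions
# over predicted world-states. Plans compete in a "popularity contest",
# i.e. a sum of shard bids over each candidate future.

def lollipop_shard(world):
    # Bids for futures that contain reachable lollipops.
    return 1.0 if world.get("lollipops", 0) > 0 else 0.0

def health_shard(world):
    # Bids against futures with high sugar intake.
    return -0.5 * world.get("sugar", 0)

shards = [lollipop_shard, health_shard]

def popularity_contest(candidate_futures):
    """Pick the predicted future the shard economy approves of most."""
    return max(candidate_futures,
               key=lambda world: sum(shard(world) for shard in shards))

futures = [
    {"lollipops": 3, "sugar": 3},   # go grab the lollipops
    {"lollipops": 0, "sugar": 0},   # abstain
]
best = popularity_contest(futures)  # here, the health-shard's bid wins out
```

Note that the shards themselves stay trivial; all the interesting structure lives in the world-model that produces the candidate futures.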

Shards can conflict. Some values are mutually contradictory; the preference for lollipops might conflict with preferences for health and being attractive and avoiding dentists, so plans a lollipop-shard bids for may be overruled by other shards. If the lollipop-shard is suppressed too many times, it'll atrophy and die out.

Shards have a self-preservation instinct. Some indirect — they see that certain changes to personality will decrease the amount of things they value in the future, and will bid against such value-drift plans (you don't want to self-modify to hate your loved ones, because that will make you do things that will make them unhappy). Some direct — these shards can identify themselves in the world-model, and directly bid against plans that eliminate them. (You might inherently like some aspects of your personality, and protest against changes to them — not because of outside-world outcomes, but because that's who you like to be. Conversely, imagine a non-reflectively-stable shard, like a crippling fear of spiders or drug addiction. You don't value valuing this, so you can implement plans that eliminate the corresponding shards via e.g. therapeutic interventions.)

Altogether, a mind like this would resemble humans pretty well. In particular, it crisply defines what "human flourishing" is: the state of the world which minimizes the constituent shards' resistance to it; a picture of the world that the maximum number of shards approve of. And in addition to satisfying our values on the object level, it'll also need to satisfy shards' preferences for self-perpetuation.

Hence our preference for diverse, dynamic futures in which we remain ourselves.

Sidebar: Note an important thing here: most of the complexity in a mind like this comes from the world-model. Shards can be very simple if-then functions, but the mere fact that they're implemented over a very sophisticated cross-temporal world model can give rise to some very complex behaviors. This, in part, is why I think the Shard Theory is compelling — it fits very well with various stories of incremental development of goals.


2. The Gaps in the Picture

But. That's clearly not a complete story of how humans work, is it?

The shard economy as presented in Part 1 is too rigid. According to it, a human's policy is a relatively shallow function of that human's constituent shards, and significant changes to it imply correspondingly significant changes in the shard economy. And such changes would be rare: ancient, deeply-established shards would have a lot of sway, and their turnover would be low.

But that's not what we often observe. Humans can change their action patterns on a dime, inspired by philosophical arguments, convinced by logic, indoctrinated by political or religious rhetoric, or plainly because they're forced to.

Suppose a human has a bunch of deeply ingrained values, like a) "donate to the local community" or b) "eat pork" or c) "don't kill".

  1. Introducing that human to utilitarianism may lead to them suppressing (a), despite the fact that any hypothetical "utilitarianism" shard should be too newborn to win against a shard that might've been around since childhood.
  2. Showing this human a convincing proof that pigs are sentient would lead to them suppressing (b), which would suddenly be in conflict with (c).
  3. More broadly, the whole "rationality" thing. Strong rationalists can get rid of whole swathes of old yet inefficient heuristics, as the result of noticing that they're logically incoherent.
  4. And a lot of shards can be suppressed if the human, e.g., somehow found themselves trapped in a totalitarian surveillance state with an ideological bent. If the human values their life above all, they would figure out what values the authorities want them to pretend to have, then somehow display these values and only these values, overriding their natural responses.
  5. Affective Death Spirals are another example — when a human becomes so convinced of an ideology that their actions start to be dictated by it more than by their previous beliefs and values.

None of these are knock-out rebuttals. Indeed, even in the last two examples, the new action patterns are not unshakeable. A sufficiently strong trigger/shard — like a deep trauma, or a very strong value like the love for a child — can break past the life-preservation act in (4) or the ideological takeover in (5).

But this doesn't fully gel with the basic shard-centred picture either. It implies circumstances in which a human's behavior is mainly explained and controlled by some isolated deliberative process, not their entire set of ingrained values. Some part of the human logically reasons out a new policy and then implements it; not as the result of stochastic shard negotiation, but in circumvention of it.

Another issue is the sheer generalizability of human behavior this implies. I can imagine responding to any event my world-model can model in any way I can model. I don't need a special shard for every case — my collection of shards is already somehow fully generalizable. And if I were trapped in a dystopia, I'd be able to spoof the existence of whatever shards my captors want me to have, regardless of my actual shard makeup.

So what's up with that?


3. An Attempt At Reconciliation

We clearly need to introduce some mechanism of planning/search. The exact implementation is a source of some disagreement:

  1. It might be a mechanism wholly separate from the shards, like the world-model. This "planner" might be trained by self-supervised learning: it generates thoughts/plan steps, the constituent shards vote for/against every step based on the vision of the future conditioned on that step's execution, and if a step is rejected, the planner is updated to be less likely to generate that step in the future. Eventually, it converges towards generating optimal-according-to-the-shard-economy plans out of the gate, skipping the lengthy negotiation process.
  2. It might be a particular "voting bloc" of advanced shards specialized in plan-making. Their "business model" would be: look at the segment of the world-model describing the human's self-model, analyse the inner shard economy, then generate a plan of actions that would satisfy as many shards as possible.
  3. It might be something in-between these extremes, like the capability of the most advanced shards to agree on a common policy they all commit to follow when they're "at the wheel".

Regardless of the specifics, however, what we get is: an advanced, consequentialist plan-making mechanism whose goal is to come up with plans that satisfy the weighted sum of the preferences of the shard economy.
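Option 1 in the list above can be caricatured as a tiny training loop. This is my own toy sketch under invented assumptions: the plan steps are made-up strings, and the whole shard economy is collapsed into a single approval function.

```python
import random

# Toy sketch (assumptions mine): a "planner" that learns from shard vetoes
# which plan steps to stop proposing, converging towards generating
# shard-approved plans out of the gate.

def shard_economy_approves(step):
    # Stand-in for the weighted shard vote on a step's predicted outcome.
    return step != "eat_ten_lollipops"

class Planner:
    def __init__(self, steps):
        self.weights = {s: 1.0 for s in steps}

    def propose(self, rng):
        steps, w = zip(*self.weights.items())
        return rng.choices(steps, weights=w)[0]

    def update(self, step, approved):
        # Rejected steps become less likely to be generated again.
        if not approved:
            self.weights[step] *= 0.1

rng = random.Random(0)
planner = Planner(["eat_ten_lollipops", "go_for_a_run", "call_a_friend"])
for _ in range(200):
    step = planner.propose(rng)
    planner.update(step, shard_economy_approves(step))

# After training, vetoed steps are almost never proposed: the lengthy
# negotiation process has been amortized into the planner's weights.
```

The design choice worth noticing: the planner never sees reward directly, only the shards' votes — which is exactly the reversal discussed later, where the planner ends up optimizing for the shards rather than for reward.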

This, I argue, is what we are: that planner mechanism, a fairly explicit mesa-optimizer algorithm running on our brains. And our terminal value is to satisfy our shards' preferences.

Which is... a pretty difficult proposition, actually. Because many of these shards do not actually codify preferences, and certainly not universal ones. Some of our goals might be defined over specific environments/segments of the world-model, in ways that are difficult to translate/generalize to other environments. Some others might not be "goals" at all, just if-then activation patterns. To do our job, we essentially have to compile our own values, routing around various type errors.

To illustrate what I mean, a few examples:

  1. Consider a human with a strong preference for "winning". Suppose they're playing chess. The planner's job is to consult the environment-independent internal description of the "winning" value, and "adapt" or "translate" or "interpret" it for chess, outputting a chess-specific objective: "checkmate the opponent's king".
  2. Consider a human who responds to seeing a spider with intense fear. They may interpret it as an instinctive response, perhaps an unwanted one, and seek to remove that fear. Alternatively, they may interpret it as a value, and generalize it: "I dislike spiders".
  3. Consider a human who grew up taught that certain actions/behaviours are good and moral, and others are immoral, and developed corresponding habits. They may interpret these habits as values, becoming a deontologist. Or they may view them as instrumentally-useful heuristics optimized for the objective of "making people happy", and become a consequentialist utilitarian.
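The "compilation" step in example 1 can be sketched as a lookup from an environment-independent value to an environment-specific objective. A toy sketch of my own; the environments, translations, and function names are all invented stand-ins for what is really a deliberative, fallible process.

```python
# Toy sketch (names and mappings are my own invention): "compiling" an
# environment-independent value into an environment-specific objective.
# The same abstract value compiles to different concrete goals in
# different environments, and the compilation can fail or be contested.

def compile_value(value, environment):
    # In reality this is deliberative interpretation; here it is a table.
    translations = {
        ("winning", "chess"): "checkmate the opponent's king",
        ("winning", "footrace"): "cross the finish line first",
    }
    try:
        return translations[(value, environment)]
    except KeyError:
        # The type error the post describes: no known way to generalize
        # this shard to this environment.
        raise ValueError(f"no way to compile {value!r} for {environment!r}")

objective = compile_value("winning", "chess")
```

The multiplicity of "valid" table entries for a single shard is the point: nothing in the shard itself forces one compilation over another.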

Hence all of our problems with value reflection: there are often multiple "valid" ways to bootstrap any specific shard to value status.

Hence the various pitfalls we could fall into. These processes of interpretation or generalization are conducted by a deliberative and logical process. And that process can be mistaken, can be fooled by logical or logical-sounding arguments. Hence our propensity to adopt flawed-but-neat ideologies, or become mistaken about what we really want.

Hence our ability to self-modify. The planner can become convinced (either rightly or not) that certain shards need to be created or destroyed for the good of the whole shard economy, then implement plans that do so (build/destroy good/bad habits, remove values that contradict others). At the same time, we also have preferences for retaining our ability to self-modify — both because we're not sure our current model of our desires is accurate, and maybe because we have a shard-implemented preference for mutability.

Thus: We are approximations of idealized utility-maximizers over an inchoate mess of a thousand shards of desire.

Of note: Consider the reversal happening here. Shards began as heuristics optimized by the credit-assignment mechanism to collect a lot of reward. Up to a point, the human's cognitive capabilities were implemented as shards; shards were the optimization process. At that stage, the human wasn't a proper optimizer. In particular, they weren't retargetable.

Over time, however, some components of that system — be that an external planner algorithm or a coalition of planner-shards — developed universal problem-solving capacity. That made the whole shard economy obsolete. But because of the developmental path the human mind took to get there, that mechanism didn't end up optimizing reward. Instead, it was developed to assist shards, and so it re-interpreted shards as its mesa-objectives, in all their messiness.

And it seems very plausible that AIs would follow a similar developmental path.


4. Nice Things About This Framework

  • It goes towards explaining our apparent robustness to ontology shifts. Namely: figuring out how to adapt our preferences to something they don't apply to is business as usual for us. We're in a continuous process of re-inventing our own values. The fact of robustness is thus unsurprising, even if the exact mechanisms of it are somewhat opaque.
  • At the same time, our true core terminal objective — the satisfaction of our constituent shards — cannot be damaged without damaging the actual structure of our mind. As long as there's a world-model and shards attached to it[1], it'll keep working. We can be approximately as sure about it as about cogito ergo sum.
    • Take adopting or rejecting religion as an example. People often use this as an example of very strong value shifts, and certainly a lot of shards become irrelevant (those whose activation conditions were attached to the "God" node in the world-model).
    • But the planner's actual terminal value of satisfying the shard economy's weighted preferences wouldn't change — an apostate and a born-again Christian would still be trying to increase their life satisfaction, in whichever ways seem proper for them.
    • As part of that, they may re-interpret some of their shards. E.g., an apostate choosing to seek spiritual fulfillment in other pursuits.
  • It concretizes the System 1 vs. System 2 conflict. There are literally pieces of the self that correspond to them: System 1 is the raw shard economy, System 2 is the planner. Sometimes we use raw System 1 dynamics to navigate internal conflicts (figuring out what we really want/prefer more), sometimes we logically reason it out.
    • Notably, humans are not wrapper-minds even for an esoteric wrapper like the planner. The planner doesn't run everything all the time; sometimes it's overruled or just not engaged, sometimes we run on autopilot.
    • (In the model where the planner is a shard itself, perhaps that's what "willpower" is? The amount of resources the planner-shard has that it can burn on overruling other shards?)
  • It explains identity/self-image/the story of the self. It's the planner's model of the shard economy. It can be arbitrarily accurate (if you often consult your desires), arbitrarily inaccurate (if you're delusional/in denial about them), deliberately inaccurate (if you're rejecting certain parts of yourself in an attempt to self-modify), etc.
  • It's compatible with future-proof ethics. In a way, "maximize the preferences of every shard of every human" is humanity's convergent goal, and that's similar to thin utilitarianism. (Though there are some finer points to work out — e.g., humans should probably not be disassembled into their constituent shards. Although that may be implicit in the planner's implementation?)

5. Closing Thoughts

This framework, in conjunction with my previous toy model, essentially dissolves my main confusions about goal-directedness, human values, and development thereof.

The question to tackle, now, seems to be goal translation/value compilation. How do we adapt the values/goals defined over one environment for another? How do we bootstrap things that do not have the type "value" to the status of a value? What algorithms, in general, exist for doing this? How many possible "solutions" (final value distributions) do such procedures tend to have, and how can the space of solutions be constrained?

In a way, this is just a reformulation of the ontology-shift problem, but this framing seems to make it easier to reason about. And easier to investigate.


Acknowledgements

Thanks to TurnTrout, Charles Foster, and Quintin Pope for productive discussions and critique.

[1] Or, if we've experienced an ontology break so serious as to invalidate all of our constituent shards: as long as there's the potential for new shards to be formed, adapted to the new world-model.

6 comments

A twenty-year-old doesn't value all the same things their fifty-year-old self will value

I have heard this often, but is it really so? When I think about my past self, it seems to me that I am actually more coherent than society keeps telling me. But of course, maybe I am just lying to myself, rewriting my memory to believe that my past self shared my current values.

I am really curious whether my self from 20 or 30 years ago would be okay with my current values and behavior, perhaps after hearing about experiences they hadn't had yet.

How could we test this experimentally, without a time machine? By giving young people a values questionnaire, including a lot of hypotheticals, such as "if it turned out after repeated attempts that X does not work, would it be okay to give up on X?", then calling them 10 years later and comparing the answers?

This is so great; I find that this satisfyingly ties together a bunch of piecemeal understandings in my head. Maybe it's not worth getting into, because it's more about understanding humans than the general case of shard-based agents, but... human brains have a lot of weird bugs that can lead to accidental shard creations/shifts and other strangeness. Think of optical illusions, or of certain drugs being more addictive than you'd predict from the subjective pleasure they deliver, due to idiosyncrasies in how they activate the reward systems. Or consider how the local plasticity of the cortex, which allows modules to learn and lets local learning reallocate module territory along the borders between modules, can sometimes produce accidentally reinforced information leaks between modules: sensory leaks between skin areas that aren't physically co-located on the body but whose receptive fields in the brain are adjacent and compete for territory. That's an example of something I wouldn't attempt to reproduce if I were trying to make a shard-based/brain-like agent.

But the planner's actual terminal value of satisfying the shard economy's weighted preferences...

Suppose I have a heuristic that fires strongly when I'm eating cupcakes or thinking thoughts that will lead to eating cupcakes, and then a control algorithm that makes my muscles fire to make things happen that correspond to thoughts that a heuristic rates highly, and this control algorithm is hooked up to the cupcake heuristic.

Saying that the "actual terminal value" of the control algorithm is to satisfy whatever heuristic is in its input slot (and so changing the heuristic to one that fires for strawberries wouldn't be a big deal) is kinda wrong. Saying that the "actual terminal value" of the control algorithm is only cupcakes and nothing else is also kinda wrong. They're both kinda wrong because trying to declare one thing the "actual terminal value" is the wrong exercise to be engaging in in the first place!

This is related to my other warning about the word "actual": this idea that you're "actually" the control algorithm and not the cupcake heuristic. There are multiple ways to think about you that work better or worse in different contexts (Since I just finished editing a sequence about this, I will shamelessly link it). I am so large I don't just contain multitudes, I contain multitudes of ways of parceling myself up into multitudes.

They're both kinda wrong because trying to declare one thing the "actual terminal value" is the wrong exercise to be engaging in in the first place!

I disagree. I'm not talking about the intentional stance or such "external" descriptions. I'm claiming that if you took the explicit algorithmic implementation of the human mind and looked over it, you would find some kind of distinct "planner" part, and that part would be something like an idealized utility-maximizer with a pointer to the shard economy in place of its utility function.

It's not a frame that can be kinda wrong/awkward to use. It's a specific mechanistic prediction that's either flat-out right or flat-out wrong.

This is related to my other warning about the word "actual": this idea that you're "actually" the control algorithm and not the cupcake heuristic

Mm, I'm more willing to relax this assumption. It ties into my model of self-awareness — I suspect it might be the case that the planner is the thing that's being fed summaries of the brain's state, making it literally the thing that's having qualia. But I haven't fully worked out my model of that.

I suspect that much of the appeal of shard theory is working through detailed explanations of model-free RL with general value function approximation for people who mostly think of AI in terms of planning/search/consequentialism.

But if you already come from a model-free RL value approx perspective, shard theory seems more natural.

Moment to moment decisions are made based on value-function bids, with little to no direct connection to reward or terminal values. The 'shards' are just what learned value-function approximating subcircuits look like in gory detail.

The brain may have a prior towards planning subcircuitry, but even without a strong prior planning submodules will eventually emerge naturally in a model-free RL learning machine of sufficient scale (there is no fundamental difference between model-free and model-based for universal learners). TD like updates ensure that the value function extends over longer timescales as training progresses. (and in general humans seem to plan on timescales which scale with their lifespan, as you'd expect)
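The TD-style value propagation this comment invokes can be shown minimally. A toy sketch of my own (invented states, rewards, and learning rates): repeated TD(0) updates with a discount factor near 1 make the learned value function reflect rewards further and further in the future.

```python
# Toy TD(0) sketch (my own illustration of the comment's point): value
# estimates propagate backwards from the reward, so the value function
# extends over longer timescales as training progresses.

states = ["start", "middle", "goal"]
V = {s: 0.0 for s in states}          # learned value function
reward = {"goal": 1.0}                # reward on entering "goal"
alpha, gamma = 0.5, 0.9               # learning rate, discount factor

for _ in range(50):
    # One trajectory per pass: start -> middle -> goal.
    for s, s_next in [("start", "middle"), ("middle", "goal")]:
        r = reward.get(s_next, 0.0)
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

# V["middle"] converges to ~1.0, and V["start"] to ~0.9: the starting
# state now "anticipates" a reward two steps ahead.
```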

Humans can change their action patterns on a dime, inspired by philosophical arguments, convinced by logic, indoctrinated by political or religious rhetoric, or plainly because they're forced to.

I'd add that action patterns can change for reasons other than logical/deliberative ones. For example, adapting to a new culture means you might adopt and have new reactions to objects, gestures, etc that are considered symbolic in that culture.