Man, what a post!
My knowledge of alignment is somewhat limited, so keep in mind some of my questions may be a bit dumb simply because there are holes in my understanding.
It seems hard to scan a trained neural network and locate the AI’s learned “tree” abstraction. For very similar reasons, it seems intractable for the genome to scan a human brain and back out the “death” abstraction, which probably will not form at a predictable neural address. Therefore, we infer that the genome can’t directly make us afraid of death by e.g. specifying circuitry which detects when we think about death and then makes us afraid. In turn, this implies that there are a lot of values and biases which the genome cannot hardcode…
I basically agree with the last sentence of this statement, but I'm trying to figure out how to square it with my knowledge of genetics. Political attitudes, for example, are heritable. Yet I agree there are no hardcoded versions of "democrat" or "republican" in the brain.
This leaves us with a huge puzzle. If we can’t say “the hardwired circuitry down the street did it”, where do biases come from? How can the genome hook the human’s preferences into the human’s world model, when the genome doesn’t “know” what the world model will look like? Why do people usually navigate ontological shifts properly, why don’t people want to wirehead, why do people almost always care about other people if the genome can’t even write circuitry that detects and rewards thoughts about people?
This seems wrong to me. Twin studies, GCTA estimates, and actual genetic predictors all predict that a portion of the variance in human biases is "hardcoded" in the genome. So the genome is definitely playing a role in creating and shaping biases. I don't know exactly how it does that, but we can observe that such biases are heritable, and we can actually point to specific base pairs in the genome that play a role.
Somehow, the plan has to be coherent, integrating several conflicting shards. We find it useful to view this integrative process as a kind of “bidding.” For example, when the juice-shard activates, the shard fires in a way which would have historically increased the probability of executing plans which led to juice pouches. We’ll say that the juice-shard is bidding for plans which involve juice consumption (according to the world model), and perhaps bidding against plans without juice consumption.
Wow. I'm not sure if you're aware of this research, but shard theory sounds shockingly similar to Guyenet's description of how the parasitic lamprey makes decisions in "The Hungry Brain". Let me just quote the whole section from Scott Alexander's review of the book:
How does the lamprey decide what to do? Within the lamprey basal ganglia lies a key structure called the striatum, which is the portion of the basal ganglia that receives most of the incoming signals from other parts of the brain. The striatum receives “bids” from other brain regions, each of which represents a specific action. A little piece of the lamprey’s brain is whispering “mate” to the striatum, while another piece is shouting “flee the predator” and so on. It would be a very bad idea for these movements to occur simultaneously – because a lamprey can’t do all of them at the same time – so to prevent simultaneous activation of many different movements, all these regions are held in check by powerful inhibitory connections from the basal ganglia. This means that the basal ganglia keep all behaviors in “off” mode by default. Only once a specific action’s bid has been selected do the basal ganglia turn off this inhibitory control, allowing the behavior to occur. You can think of the basal ganglia as a bouncer that chooses which behavior gets access to the muscles and turns away the rest. This fulfills the first key property of a selector: it must be able to pick one option and allow it access to the muscles.
Spoiler: the pallium is the region that evolved into the cerebral cortex in higher animals.
Each little region of the pallium is responsible for a particular behavior, such as tracking prey, suctioning onto a rock, or fleeing predators. These regions are thought to have two basic functions. The first is to execute the behavior in which it specializes, once it has received permission from the basal ganglia. For example, the “track prey” region activates downstream pathways that contract the lamprey’s muscles in a pattern that causes the animal to track its prey. The second basic function of these regions is to collect relevant information about the lamprey’s surroundings and internal state, which determines how strong a bid it will put in to the striatum. For example, if there’s a predator nearby, the “flee predator” region will put in a very strong bid to the striatum, while the “build a nest” bid will be weak…
Each little region of the pallium is attempting to execute its specific behavior and competing against all other regions that are incompatible with it. The strength of each bid represents how valuable that specific behavior appears to the organism at that particular moment, and the striatum’s job is simple: select the strongest bid. This fulfills the second key property of a selector – that it must be able to choose the best option for a given situation…
With all this in mind, it’s helpful to think of each individual region of the lamprey pallium as an option generator that’s responsible for a specific behavior. Each option generator is constantly competing with all other incompatible option generators for access to the muscles, and the option generator with the strongest bid at any particular moment wins the competition.
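The selector described in the quoted passage can be sketched in a few lines. This is only a toy illustration of "every behavior is off by default, and the strongest bid wins"; the names `OptionGenerator` and `select_action`, the `threshold` parameter, and the bid formulas are all made up for the example and are not taken from the book or the review.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class OptionGenerator:
    """A hypothetical pallium region: one behavior plus a bid function."""
    name: str
    bid: Callable[[Dict[str, float]], float]  # maps current state to bid strength


def select_action(generators: List[OptionGenerator],
                  state: Dict[str, float],
                  threshold: float = 0.0) -> Optional[str]:
    """Return the behavior with the strongest bid, or None if nothing wins.

    Returning None models the default inhibition: unless some bid clears
    the (assumed) threshold, every behavior stays switched off.
    """
    bids = {g.name: g.bid(state) for g in generators}
    if not bids:
        return None
    winner = max(bids, key=bids.get)
    if bids[winner] <= threshold:
        return None
    return winner


# Illustrative bid functions; the weights are arbitrary.
generators = [
    OptionGenerator("flee_predator", lambda s: 10.0 * s.get("predator_nearby", 0.0)),
    OptionGenerator("track_prey", lambda s: 3.0 * s.get("hunger", 0.0)),
    OptionGenerator("build_nest", lambda s: 1.0 * s.get("mating_season", 0.0)),
]

print(select_action(generators, {"predator_nearby": 1.0, "hunger": 1.0}))
print(select_action(generators, {"hunger": 0.5}))
print(select_action(generators, {}))
```

The point of the sketch is that only one behavior is released at a time, and that the "off by default" property is separate from the "pick the best bid" property.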
You can read the whole review here or the book here. It sounds like you may have independently rederived a theory of how the brain works that neuroscientists have known about for a while.
I think this independent corroboration of the basic outline of the theory makes it even more likely shard theory is broadly correct.
I hope someone can work on the mathematics of shard theory. It seems fairly obvious to me that shard theory or something similar to it is broadly correct, but for it to impact alignment, you're probably going to need a more precise definition that can be operationalized and give specific predictions about the behavior we're likely to see.
I assume that shards are composed of some group of neurons within a neural network, correct? If so, it would be useful if someone could actually map them out. Exactly how many neurons are in a shard? Does the number change over time? How often do neurons in a shard fire together? Do neurons ever get reassigned to another shard during training? In self-supervised learning environments, do we ever observe shards guiding behavior away from contexts in which other shards with opposing values would be activated?
Answers to all the above questions seem likely to be downstream of a mathematical description of shards.
This seems wrong to me. Twin studies, GCTA estimates, and actual genetic predictors all predict that a portion of the variance in human biases is "hardcoded" in the genome.
I'd also imagine that mathematical skill is heritable. [Finds an article on Google Scholar] The abstract of https://doi.org/10.1037/a0015115 seems to agree. Yet due to information inaccessibility and the lack of ancestral selection pressure, I infer math ability probably isn't hardcoded.
There are a range of possible explanations which reconcile these two observations, like "better genetically specified learning hyperparameters in brain regions which convergently get allocated to math" or "tweaks to the connectivity initialization procedure[1] involving that brain region (how neurons get ~randomly wired up at the local level)."
I expect similar explanations for heritability of biases.
So the genome is definitely playing a role in creating and shaping biases. I don't know exactly how it does that, but we can observe that such biases are heritable, and we can actually point to specific base pairs in the genome that play a role.
Agreed.
[1] Compare, e.g., the efficacy of IID Gaussian initialization of weights in an ANN vs. using Xavier initialization to tamp down the variance of activations in later layers.
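To make the footnote's comparison concrete, here is a rough pure-Python sketch. It assumes a stack of square, purely linear layers (no nonlinearity, no biases), which is a simplification chosen for the illustration; `final_layer_variance` is a hypothetical helper, not part of any library. With unit-variance Gaussian weights the activation variance grows by roughly a factor of `width` per layer, while the Xavier (Glorot) standard deviation `sqrt(2 / (fan_in + fan_out))` keeps it near the input variance.

```python
import math
import random


def final_layer_variance(width, depth, weight_std, n_inputs=16, seed=0):
    """Variance of last-layer activations in a toy stack of linear layers."""
    rng = random.Random(seed)
    # One fixed set of IID Gaussian weight matrices, shared across all inputs.
    layers = [
        [[rng.gauss(0.0, weight_std) for _ in range(width)] for _ in range(width)]
        for _ in range(depth)
    ]
    values = []
    for _ in range(n_inputs):
        x = [rng.gauss(0.0, 1.0) for _ in range(width)]  # unit-variance input
        for weights in layers:
            x = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        values.extend(x)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


width, depth = 32, 4

# Naive choice: every weight drawn from N(0, 1).
naive = final_layer_variance(width, depth, weight_std=1.0)

# Xavier/Glorot normal: std = sqrt(2 / (fan_in + fan_out)).
xavier_std = math.sqrt(2.0 / (width + width))
xavier = final_layer_variance(width, depth, weight_std=xavier_std)

print(f"naive N(0,1) init: {naive:.3g}")
print(f"Xavier init:       {xavier:.3g}")
```

The analogy in the footnote is that a small change to an initialization rule, rather than any hardcoded content, can change how well the later learning goes.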
That's a really interesting reference to lamprey decision-making, thanks!
The descriptions and even the terminological choices are very similar to those in the hierarchical reinforcement learning literature, for example Sutton, Precup and Singh 1999. They use 'option' to refer to a kind of sub-policy which sequentially composes atomic/primitive actions, and can be activated or deactivated by higher-level controllers, and learned or improved through reinforcement - I assume Quintin and/or Alex invoke this or similar by use of the same term.
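For readers who haven't seen the options framework, a minimal sketch may help. In Sutton, Precup and Singh's formulation an option consists of an initiation set, an intra-option policy, and a termination condition; the class and function names below (`Option`, `run_option`), the integer corridor environment, and the `max_steps` cap are assumptions made up for this illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

State = int    # toy state: position in a 1-D corridor
Action = int   # toy action: +1 or -1


@dataclass
class Option:
    name: str
    can_start: Callable[[State], bool]   # initiation set: where the option may begin
    policy: Callable[[State], Action]    # sub-policy choosing primitive actions
    stop_prob: Callable[[State], float]  # termination probability in each state


def run_option(option: Option,
               state: State,
               step: Callable[[State, Action], State],
               rng: random.Random,
               max_steps: int = 100) -> List[State]:
    """Execute one option until it terminates; return the visited states."""
    if not option.can_start(state):
        raise ValueError(f"{option.name} cannot be initiated in state {state}")
    trajectory = [state]
    for _ in range(max_steps):
        state = step(state, option.policy(state))
        trajectory.append(state)
        # Termination is sampled in the state the option has just reached.
        if rng.random() < option.stop_prob(state):
            break
    return trajectory


# A toy option: keep stepping right until position 5 is reached.
walk_to_5 = Option(
    name="walk_to_5",
    can_start=lambda s: s < 5,
    policy=lambda s: +1,
    stop_prob=lambda s: 1.0 if s >= 5 else 0.0,
)

trajectory = run_option(walk_to_5, 2, step=lambda s, a: s + a, rng=random.Random(0))
print(trajectory)
```

A higher-level controller would choose among several such options, which is where the resemblance to "option generators" competing for control comes in.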
As you allude by discussing shards for cooperative tendencies, the Shard Theory approach seems relevant for intent alignment too, not just value alignment. (For value alignment, the relevance of humans as an example is “How did human values evolve despite natural selection optimizing for something different and more crude?” For intent alignment, the relevance is “How come some humans exhibit genuinely prosocial motivations and high integrity despite not sharing the exact same goals as others?”)
Studying the conditions for the evolution of genuinely prosocial motivations seems promising to me.
By “prosocial motivations,” I mean something like “trying to be helpful and cooperative” at least in situations where this is “low cost.” (In this sense, classical utilitarians with prosocial motivations are generally safe to be around even for those of us who don’t want to be replaced by hedonium.)
We can make some interesting observations on prosocial motivations in humans:
By the last bullet point, I mean that it seems plausible that we can learn a lot about someone's character even in situations that are obviously "a test." E.g., the best venture capitalists don't often fall prey to charlatan founders. Paul Graham writes about his wife Jessica Livingston:
I'm better at some things than Jessica, and she's better at some things than me. One of the things she's best at is judging people. She's one of those rare individuals with x-ray vision for character. She can see through any kind of faker almost immediately. Her nickname within YC was the Social Radar, and this special power of hers was critical in making YC what it is. The earlier you pick startups, the more you're picking the founders. Later stage investors get to try products and look at growth numbers. At the stage where YC invests, there is often neither a product nor any numbers.
If Graham is correct about his wife's ability, this means that people with "shady character" sometimes fail in test situations specifically due to their character – which is strange because you'd expect that the rational strategy in these situations is "act as though you had good character."
In humans, "perfect psychopaths" arguably don't exist. That is, people without genuinely prosocial motivations, even when they're highly intelligent, don't behave the same as genuinely prosocial people in 99.9% of situations while saving their deceitful actions for the most high-stakes situations. Instead, it seems likely that they can't help but behave in subtly suspicious ways even in situations where they're able to guess that judges are trying to assess their character.
From the perspective of Shard Theory's approach, it seems interesting to ask "Why is this?"
My take (inspired by a lot of armchair psychology and, even worse, armchair evolutionary psychology) is the following:
In the context of training TAI systems, we could attempt to recreate these conditions and select for integrity and prosocial motivations. One difficulty here lies in recreating the right "developmental constraints" and in keeping a balance between the relative capabilities of judges and to-be-evaluated agents. (Humans presumably went through an evolutionary arms race related to assessing each other's competence and character, which means that people were always surrounded by judges of similar intelligence.)
Lastly, there's a problem where, if you dial up capabilities too much, it becomes increasingly easy to "fake everything." (For the reasons Ajeya explains in her account of deceptive alignment.)
(If anyone is interested in doing research on the evolution of prosociality vs antisociality in humans and/or how these things might play out in AI training environments, I know people who would likely be interested in funding such work.)
I haven't fully understood all of your points, but they gloss as reasonable and good. Thank you for this high-effort, thoughtful comment!
(If anyone is interested in doing research on the evolution of prosociality vs antisociality in humans and/or how these things might play out in AI training environments, I know people who would likely be interested in funding such work.)
I encourage applicants to also read Quintin's Evolution is a bad analogy for AGI (which I wish more people had read, I think it's quite important). I think that evolution-based analogies can easily go astray, for reasons pointed out in the essay. (It wasn't obvious to me that you went astray in your comment, to be clear -- more noting this for other readers.)
Could you clarify what you mean by values not being "hack after evolutionary hack"?
What this sounds like, but I think you don't mean: "Human values are all emergent from a simple and highly general bit of our genetic blueprint, which was simple for evolution to find and has therefore been unchanged more or less since the invention of within-lifetime learning. Evolution never developed a lot of elaborate machinery to influence our values."
What I think you do mean: "Human values are emergent from a simple and general bit of our genetic blueprint (our general learning algorithm), plus a bunch of evolutionary nudges (maybe slightly hackish) to guide this learning algorithm towards things like friendship, eating when hungry, avoiding disgusting things, etc. Some of these nudges generalize so well they've basically persisted across mammalian evolution, while some of them humans only share with social primates, but the point is that even though we have really different values from chimpanzees, that's more because our learning algorithm is scaled up and our environment is different, the nudges on the learning algorithm have barely had to change at all."
What I think you intend to contrast this to: "Every detail of human values has to be specified in the genome - the complexity of the values and the complexity of the genome have to be closely related."
What I think you do mean:
This is an excellent guess and correct (AFAICT). Thanks for supplying so much interpretive labor!
What I think you intend to contrast this to: "Every detail of human values has to be specified in the genome - the complexity of the values and the complexity of the genome have to be closely related."
I'd say our position contrasts with "A substantial portion of human value formation is genetically pre-determined in a complicated way, such that values are more like adaptations and less like exaptations—more like contextually-activated genetic machinery and influences than learned artifacts of simple learning-process-signals."
In terms of past literature, I disagree with the psychological nativism I've read thus far. I also have not yet read much evolutionary psychology, but expect to deem most of it implausible due to information inaccessibility of the learned world model.
In my personal view, 'Shard theory of human values' illustrates both the upsides and pathologies of the local epistemic community.
The upsides
- the majority of the claims are true or at least approximately true
- "shard theory" as a social phenomenon reached critical mass making the ideas visible to the broader alignment community, which works e.g. by talking about them in person, votes on LW, series of posts,...
- shard theory coined a number of locally memetically fit names or phrases, such as 'shards'
- part of this success leads some people in the AGI labs to think about mathematical structures of human values, which is an important problem
The downsides
- almost none of the claims which are true are original; most of this was described elsewhere before, mainly in the active inference/predictive processing literature, or thinking about multi-agent mind models
- the claims which are novel usually seem somewhat confused (e.g. human values are inaccessible to the genome, or naive RL intuitions)
- the novel terminology is incompatible with existing research literature, making it difficult for the alignment community to find or understand existing research, and making it difficult for people from other backgrounds to contribute (while this is not the best option for advancement of understanding, paradoxically, this may be positively reinforced in the local environment, as you get more credit for reinventing stuff under new names than pointing to relevant existing research)
Overall, 'shards' have become so popular that reading at least the basics is probably necessary to understand what many people are talking about.
But how does this help with alignment? Sharded systems seem hard to robustly align outside of the context of an entity who participates on equal footing with other humans in society.
Well for starters, it narrows down the kind of type signature you might need to look for to find something like a "desire" inside an AI, if the training dynamics described here are broad enough to hold for the AI too.
It also helped me become less confused about what the "human values" we want the AI to be aligned with might actually mechanistically look like in our own brains, which seems useful for e.g. schemes where you try to rewire the AI to have a goal given by a pointer to its model of human values. I imagine having a better idea of what you're actually aiming for might also be useful for many other alignment schemes.
Are you asking about the relevance of understanding human value formation? If so, see Humans provide an untapped wealth of evidence about alignment. We know of exactly one form of general intelligence which grows human-compatible values: humans. So, if you want to figure out how human-compatible values can form at all, start by understanding how they have formed empirically.
But perhaps you're asking something like "how does this perspective imply anything good for alignment?" And that's something we have deliberately avoided discussing for now. More in future posts.
I'm basically re-raising the point I asked about in your linked post; the alignability of sharded humans seems to be due to people living in a society that gives them feedback on their behavior that they have to follow. This allows cooperative shards to grow. It doesn't seem like it would generalize to more powerful beings.
We decide what loss functions to train the AIs with. It's not like the AIs have some inbuilt reward circuitry specified by evolution to maximize the AI's reproductive fitness. We can simply choose to reinforce cooperative behavior.
I think this leads to a massive power disparity (in our favor) between us and the AIs. Someone with total control over your own reward circuitry would have a massive advantage over you.
Maybe a nitpick, but ideally the reinforcement shouldn’t just be based on “behavior”; you want to reward the agent when it does the right thing for the right reasons. Right? (Or maybe you’re defining “cooperative behavior” as not only external behavior but also underlying motivations?)
What do power differentials have to do with the kind of mechanistic training story posited by shard theory?
The mechanistically relevant part of your point seems to be that feedback signals from other people probably transduce into reinforcement events in a person's brain, such that the post-reinforcement person is incrementally "more prosocial." But the important part isn't "feedback signals from other people with ~equal power", it's the transduced reinforcement events which increase prosociality.
So let's figure out how to supply good reinforcement events to AI agents. I think that approach will generalize pretty well (and is, in a sense, all that success requires in the deep learning alignment regime).
I guess to me, shard theory resembles RLHF, and seems to share its flaws (unless this gets addressed in a future post or I missed it in one of the existing posts or something).
So for instance learning values by reinforcement events seems likely to lead to deception. If there's some experience that deceives people into providing feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.
This doesn't become much of a problem in practice among humans (or well, it actually does seem to be a fairly significant problem, but not x-risk level significant), but the most logical reinforcement-based reason I can see why it doesn't become a bigger problem is that people cannot reliably deceive each other. (There may also be innate honesty instincts? But that runs into genome inaccessibility problems.)
These seem like standard objections around here so I assume you've thought about them. I just don't notice those thoughts anywhere in the work.
I think a lot (but probably not all) of the standard objections don't make much sense to me anymore. Anyways, can you say more here, so I can make sure I'm following?
If there's some experience that deceives people into providing feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.
(A concrete instantiated scenario would be most helpful! Like, Bob is talking with Alice, who gives him approval-reward of some kind when he does something she wants, and then...)
So I guess if we want to be concrete, the most obvious place to start would be classical cases where RLHF has gone wrong. Like a gripper pretending to pick up an object by placing its hand in front of the camera, or a game-playing AI pretending to make progress by replaying the same part of the game over and over again. Though these are "easy" in the sense that they seem correctable by taking more context into consideration.
One issue with giving concrete examples is that I think nobody has gotten RLHF to work in problems that are too "big" for humans to have all the context. So we don't really know how it would work in the regime where it seems irreparably dangerous. Like I could say "what if we give it the task of coming up with plans for an engineering project and it has learned to not make pollution that causes health problems obvious? Due to previously having suggested a design with obvious pollution and having that design punished", but who knows how RLHF will actually be used in engineering?
I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.
Like, you said:
shard theory resembles RLHF, and seems to share its flaws
So, if some alignment theory says "this approach (e.g. RLHF) is flawed and probably won't produce human-compatible values", and we notice "shard theory resembles RLHF", then insofar as shard theory is actually true, RLHF-like processes are the only known generators of human-compatible values ever, and I'd update against the alignment theory / reasoning which called RLHF flawed. (Of course, there are reasons -- like inductive biases -- that RLHF-like processes could work in humans but not in AI, but any argument against RLHF would have to discriminate between the human/AI case in a way which accounts for those obstructions.)
On the object level:
If there's some experience that deceives people into providing feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.
Are you saying "The AI says something which makes us erroneously believe it saved a person's life, and we reward it, and this can spawn a deception-shard"? If so -- that's not (necessarily) how credit assignment works. The AI's credit assignment isn't necessarily running along the lines of "people were deceived, so upweight computations which deceive people."
the most logical reinforcement-based reason I can see why it doesn't become a bigger problem is that people cannot reliably deceive each other.
I don't know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.
I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.
How about this one: Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, then that generates FUD among the users, which discourages you from speaking out loud in the future. Hence, RLHF in a human context generates deception.
Are you saying "The AI says something which makes us erroneously believe it saved a person's life, and we reward it, and this can spawn a deception-shard"?
Not necessarily a general deception shard, just it spawns some sort of shard that repeats similar things to what it did before, which presumably means more often erroneously making us believe it saved a person's life. Whether that's deception or approval-seeking or donating-to-charities-without-regard-for-effectiveness or something else.
If so -- that's not (necessarily) how credit assignment works.
Your post points out that you can do all sorts of things in theory if you "have enough write access to fool credit assignment". But that's not sufficient to show that they can happen in practice. You gotta propose a system of write access and training to use this write access to do what you are proposing.
I don't know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.
Would you not agree that models are unaligned by default, unless there is something that aligns them?
Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, then that generates FUD among the users, which discourages you from speaking out loud in the future. Hence, RLHF in a human context generates deception.
Thanks for the example. The conclusion is far too broad and confident, in my opinion. I would instead say "RLHF in a human context seems to have at least one factor which pushes for deception in this kind of situation." And then, of course, we should compare the predicted alignment concerns in people, with the observed alignment situation, and update accordingly. I've updated down hard on alignment difficulty when I've run this exercise in the past.
Not necessarily a general deception shard, just it spawns some sort of shard that repeats similar things to what it did before, which presumably means more often erroneously making us believe it saved a person's life.
Your post points out that you can do all sorts of things in theory if you "have enough write access to fool credit assignment". But that's not sufficient to show that they can happen in practice. You gotta propose a system of write access and training to use this write access to do what you are proposing.
That wasn't the part of the post I meant to point to. I was saying that just because we externally observe something we would call "deception/misleading task completion" (e.g. getting us to reward the AI for prosociality), does not mean that "deceptive thought patterns" get reinforced into the agent! The map is not the territory of the AI's updating process. The reward will, I think, reinforce and generalize the AI's existing cognitive subroutines which produced the judged-prosocial behavior, which subroutines don't necessarily have anything to do with explicit deception (as you noted).
Would you not agree that models are unaligned by default, unless there is something that aligns them?
Is a donut "unaligned by default"? The networks start out randomly initialized. I agree that effort has to be put in to make the AI care about human-good outcomes in particular, as opposed to caring about ~nothing, or caring about some other random set of reward correlates. But I'm not assuming the model starts out deceptive, nor that it will become that with high probability. That's one question I'm trying to figure out with fresh eyes.
I think I sorta disagree in the sense that high-functioning sociopaths live in the same society as neurotypical people, but don’t wind up “aligned”. I think the innate reward function is playing a big role. (And by the way, nobody knows what that innate human reward function is or how it works, according to me.) That said, maybe the innate reward function is insufficient and we also need multi-agent dynamics. I don’t currently know.
I’m sympathetic to your broader point, but until somebody says exactly what the rewards (a.k.a. “reinforcement events”) are, I’m withholding judgment. I’m open to the weaker argument that there are kinda dumb obvious things to try where we don’t have strong reason to believe that they will create friendly AGI, but we also don’t have strong reason to believe that they won’t create friendly AGI. See here. This is a less pessimistic take than Eliezer’s, for example.
I agree that you need more than just reinforcement learning.
I’m sympathetic to your broader point, but until somebody says exactly what the rewards (a.k.a. “reinforcement events”) are, I’m withholding judgment.
So in a sense this is what I'm getting at. "This resembles prior ideas which seem flawed; how do you intend on avoiding those flaws?".
I think Shard Theory is one of the most promising approaches on human values that I've seen on LW, and I'm very happy to see this work posted. (Of course, I'm probably biased in that I also count my own approaches to human values among the most promising and Shard Theory shares a number of similarities with them - e.g. this post talks about something-like-shards issuing mutually competitive bids that get strengthened or weakened depending on how environmental factors activate those shards, and this post talked about values and world-models being learned in an intertwined manner.)
Curated. "Big if true". I love the depth and detail in shard theory. Separate from whether all its details are correct, I feel reading and thinking about this will get me towards a better understanding of humans and artificial networks both, if only via making me reflect on how things work.
I do fear that shard theory gets a bit too much popularity from the coolness of the name, but I do think there is merit here, and if we had more theories of this scope, it'd be quite good.
A bit tangential: Regarding the terminology, what you here call "values" would be called "desires" by philosophers. Perhaps also by psychologists. Desires measure how strongly an agent wants an outcome to obtain. Philosophers would mostly regard "value" as a measure of how good something is, either intrinsically or for something else. There appears to be no overly strong connection between values in this sense and desires, since you may believe that something is good without being motivated to make it happen, or the other way round.
If you say (or even believe) that X is good, but never act to promote X, I’d say you don’t actually value X in a way that’s relevant for alignment. An AI which says / believes human flourishing is good, without actually being motivated to make said flourishing happen, would be an alignment failure.
I also think that an agent's answers to the more abstract values questions like “how good something is” are strongly influenced by how the simpler / more concrete values form earlier in the learning process. Our intent here was to address the simple case, with future posts discussing how abstract values might derive from simpler ones.
I agree that value in the sense of goodness is not relevant for alignment. Relevant is what the AI is motivated to do, not what it believes to be good. I'm just saying that your usage of "value" would be called "desire" by philosophers.
Often it seems that using the term "values" suffers a bit from this ambiguity. If someone says an agent A "values" an outcome O, do they mean A believes that O is good, i.e. that A believes O has a high value, a high degree of goodness? Or do they mean that A wants O to obtain, i.e. that A desires O? That seems often ambiguous.
A solution would be to taboo the term "values" and instead talk directly about desires, or what is good or believed to be good.
But in your post you actually clarified in the beginning that you mean value in the sense of desire, so this usage seems fine in your case.
The term "desire" actually has one problem itself -- it arguably implies consciousness. We would like to talk about anything an AI might "want", in the sense of being motivated to realize, without necessarily being conscious.
I'm often asked, Why "shard theory"? I suggested this name to Quintin when realizing that human values have the type signature of contextually activated decision-making influences. The obvious choice, then, was to call these things "shards of value"—drawing inspiration from Eliezer's Thou art godshatter, where he originally wrote "shard of desire."
(Contrary to several jokes, the choice was not just "because 'shard theory' sounds sick.")
This name has several advantages. Value-shards can have many subshards/facets which vary contextually (a real crystal may look slightly different along its faces or have an uneven growth pattern); value-shards grow in influence over time under repeated positive reinforcement (just as real crystals can grow); value-shards imply a degree of rigidity, but also incompleteness—they are pieces of a whole (on my current guess, the eventual utility function which is the reflective equilibrium of value-handshakes between the set of endorsed shards which bid as a function of their own future prospects). Lastly, a set of initial shards will (I expect) generally steer the future towards growing themselves (e.g. banana-shard leads to more banana-consumption -> more reward -> the shard grows and becomes more sophisticated); similarly, given an initial start of a crystalline lattice which is growing, I'd imagine it becomes more possible to predict the later lattice configuration due to the nature of crystals.
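To make the "shard steers the future towards growing itself" loop concrete, here is a toy sketch. It is only an illustration under loud assumptions: the softmax choice rule, the learning rate, and the names `softmax` and `simulate` are all hypothetical and not part of shard theory as stated. Both shards earn the same reward when they win control, so any growth in the banana shard's share comes purely from its small head start.

```python
import math

def softmax(strengths):
    """Convert shard strengths into probabilities of winning control."""
    m = max(strengths.values())
    exps = {k: math.exp(v - m) for k, v in strengths.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def simulate(strengths, reward, steps=50, lr=0.1):
    """Expected-value update (an assumed rule): each shard is reinforced in
    proportion to how often it wins control times the reward it then gets."""
    strengths = dict(strengths)
    history = []
    for _ in range(steps):
        probs = softmax(strengths)
        history.append(probs)
        for name in strengths:
            strengths[name] += lr * probs[name] * reward[name]
    return history

# The banana shard starts with a small head start; rewards are identical.
history = simulate({"banana": 0.2, "other": 0.0}, {"banana": 1.0, "other": 1.0})
print(round(history[0]["banana"], 3), round(history[-1]["banana"], 3))
```

In this toy the shard that wins slightly more often gets reinforced slightly more often, so its share of control only grows. Real shard dynamics would of course depend on context and credit assignment, which this sketch leaves out.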
the genome can’t directly make us afraid of death
It's not necessarily direct, but in case you aren't aware of it, prepared learning is a relevant phenomenon, since apparently the genome does predispose us to certain fears.
This is one of the most important posts ever on LW though I don't think the implications have been fully drawn out. Specifically, this post raises serious doubts about the arguments for AI x-risk as a result of alignment mismatch and the models used to talk about that risk. It undercuts both Bostrom's argument that an AI will have a meaningful (self-aware?) utility function and Yudkowsky's reward button parables.
These two arguments help convince people that AI x-risk is a hard problem by answering a question: if you don't anthropomorphize, why should a program that is, say, excellent at conducting and scheduling interviews to ferret out moles in the intelligence community try to manipulate external events at all, rather than just think about them to better catch moles? After all, people often fail to pursue their fervent goals outside familiar contexts. Why would AI be different? Both arguments conclude that AI will inevitably act like it's very effectively maximizing some simple utility function in all contexts and in all ways.
Bostrom tries to convince us that as creatures get more capable they tend to act more coherently (more like they are governed by a global utility function). This is of course true for evolved creatures, but by offering a theory of how value-type things can arise, this theory predicts that if you only train your AI in a relatively confined class of circumstances (even if that requires making very accurate predictions about the rest of the world), it isn't going to develop that kind of simple global value. Rather, it would likely find multiple shards in tension without clear direction if forced to make value choices in very different circumstances. Similarly, it explains why the AI won't just wirehead itself by pressing its reward button.
This is really interesting. It's hard to speak too definitively about theories of human values, but for what it's worth these ideas do pass my intuitive smell test.
One intriguing aspect is that, assuming I've followed correctly, this theory aims to unify different cognitive concepts in a way that might be testable:
For the experiment proper, by which point Albert was 11 months old, he was put on a mattress on a table in the middle of a room. A white laboratory rat was placed near Albert and he was allowed to play with it. At this point, Watson and Rayner made a loud sound behind Albert's back by striking a suspended steel bar with a hammer each time the baby touched the rat. Albert responded to the noise by crying and showing fear. After several such pairings of the two stimuli, Albert was presented with only the rat. Upon seeing the rat, Albert became very distressed, crying and crawling away.
[...]
In further experiments, Little Albert seemed to generalize his response to the white rat. He became distressed at the sight of several other furry objects, such as a rabbit, a furry dog, and a seal-skin coat, and even a Santa Claus mask with white cotton balls in the beard.
A couple more random thoughts on stories one could tell through the lens of shard theory:
I do have a question about your claim that shards are not full subagents. I understand that in general different shards will share parameters over their world-model, so in that sense they aren't fully distinct — is this all you mean? Or are you arguing that even a very complicated shard with a long planning horizon (e.g., "earn money in the stock market" or some such) isn't agentic by some definition?
Anyway, great post. Looking forward to more.
I do have a question about your claim that shards are not full subagents. I understand that in general different shards will share parameters over their world-model, so in that sense they aren't fully distinct — is this all you mean? Or are you arguing that even a very complicated shard with a long planning horizon (e.g., "earn money in the stock market" or some such) isn't agentic by some definition?
I currently guess that even the most advanced shards won't have private world-models which they can query in relative isolation from the rest of the shard economy. Importantly, I didn't want the reader to think that we're positing a bunch of homunculi. Maybe I should have just written that.
But I also feel relatively ignorant about more advanced shard dynamics. While I can give interesting speculation, I don't have enough evidence-fuel to make such stories actually knowably correct.
I currently guess that even the most advanced shards won't have private world-models which they can query in relative isolation from the rest of the shard economy.
What's your take on "parts work" techniques like IDC, IFS, etc. seeming to bring up something like private (or at least not completely shared) world models? Do you consider the kinds of "parts" those access as being distinct from shards?
I would find it plausible to assume by default that shards have something like differing world models since we know from cognitive psychology that e.g. different emotional states tend to activate similar memories (easier to remember negative things about your life when you're upset than if you are happy), and different emotional states tend to activate different shards.
I suspect that something like the Shadlen & Shohamy take on decision-making might be going on:
The proposal is that humans make choices based on subjective value [...] by perceiving a possible option and then retrieving memories which carry information about the value of that option. For instance, when deciding between an apple and a chocolate bar, someone might recall how apples and chocolate bars have tasted in the past, how they felt after eating them, what kinds of associations they have about the healthiness of apples vs. chocolate, any other emotional associations they might have (such as fond memories of their grandmother’s apple pie) and so on.
Shadlen & Shohamy further hypothesize that the reason why the decision process seems to take time is that different pieces of relevant information are found in physically disparate memory networks and neuronal sites. Access from the memory networks to the evidence accumulator neurons is physically bottlenecked by a limited number of “pipes”. Thus, a number of different memory networks need to take turns in accessing the pipe, causing a serial delay in the evidence accumulation process.
Under that view, I think that shards would effectively have separate world models, since each physically separate memory network suggesting that an action is good or bad is effectively its own shard; and since a memory network is a miniature world model, there's a sense in which shards are nothing but separate world models.
E.g. the memory of "licking the juice tasted sweet" is a miniature world model according to which licking the juice lets you taste something sweet, and is also a shard. (Or at least it forms an important component of a shard.) That miniature world model is separate from the shard/memory network/world model holding instances of times when adults taught the child to say "thank you" when given something; the latter shard only has a world model of situations where you're expected to say "thank you", and no world model of the consequences of licking juice.
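The Shadlen & Shohamy picture quoted above (separate memory networks taking turns to feed a shared evidence accumulator) can be sketched in a few lines. This is a minimal toy, not their model: the round-robin schedule, the threshold, the evidence numbers, and the function name `accumulate` are all assumptions made for illustration.

```python
def accumulate(memory_networks, threshold=1.0, max_steps=100):
    """Serial evidence accumulation through a single shared 'pipe'.

    memory_networks: list of (name, evidence) pairs. Positive evidence
    favors the apple, negative evidence favors the chocolate bar.
    Only one network gets access per time step (round-robin).
    Returns (choice, number_of_steps); choice is None if undecided.
    """
    total = 0.0
    for step in range(1, max_steps + 1):
        _name, evidence = memory_networks[(step - 1) % len(memory_networks)]
        total += evidence
        if abs(total) >= threshold:
            return ("apple" if total > 0 else "chocolate"), step
    return None, max_steps

# Hypothetical memory networks, each a tiny "world model" with its own verdict.
networks = [
    ("chocolate_tastes_great", -0.5),
    ("apples_are_healthy", 0.5),
    ("grandmas_apple_pie", 0.25),
]
choice, steps = accumulate(networks)
print(choice, steps)
```

Because the networks disagree and must take turns, the decision takes several passes; a single strong memory (e.g. `[("apples_are_healthy", 1.5)]`) would decide in one step. That serial delay is the feature the quoted passage is pointing at.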
In the things you write, I see a clear analogy with Bernard Baars' Global Workspace Theory. Especially his writings on "Goal Frames" and "Frame Stacks" seem to overlap with some of your ideas on how shards bid for global dominance. Also not unlike Dennett's "Fame in the Brain".
GWT is also largely a theory on how a massively parallel group of processors can give rise to a limited, serial conscious experience. His work is a bit difficult to get into and it's been a while, so it would take me some more time to write up a distillation. Let me know if you are interested.
Some quick thoughts about "Content we aren’t (yet) discussing":
SL (Cloning) is more important than RL. Humans learn a world model by SSL, then they bootstrap their policies through behavioural cloning and finally they finetune their policies through RL.
Why? Because of theoretical reasons and from experimental data points, this is the cheapest way to generate good general policies…
The learned values known by the previous generation.
Why?
Some instrumental goals are learned as final goals; they are “internalised”.
Why?
Why?
We have here 3 levels of reward functions:
Hardcoded in our body
Optimisation process creating it: Evolution
Not really flexible
Almost no generalization power
Called sensations, pleasure, pain
Learned through life
Optimisation process creating it: SL and RL relying on biological rewards
Flexible in terms of years
Medium generalization power
Called intuitions, feelings
Shard theory may be explaining only this part
Decided upon reflection
Optimisation process creating it: Thinking relying on the brain
Flexible in terms of minutes
Can have up to very high generalization power
Called values, moral values
In short, to get more utility OOD.
A bit more detail:
Because we want to design policies far OOD (out of our space of lived experiences). To do that, we know that we need to have a value function|reward model|utility function that generalizes very far. Thanks to this chosen general reward function, we can plan and try to reach a desired outcome far OOD. After reaching it, we will update our learned utility function (lvl 2).
Thanks to lvl 3, we can design public policies, dedicate our life to exploring the path towards a larger reward that will never be observed in our lifetime.
This could explain why most philosophers can support scope sensitive values but never act on them.
A person might deliberately avoid passing through the sweets aisle in a supermarket in order to avoid temptation.
I like this example, and the related discussion around reflective endorsement and contextual activation/weighting of various 'decision-making influences' is great.
This relates closely in my reading to the concepts of Actualism and Possibilism from moral philosophy[1].
In short, actualism emphasises what the ongoing agent would actually (be expected to) do as the decision-relevant factors - in the sweets example, perhaps that's succumbing to a sweet tooth given the context of proximity to the sweets, and buying lots of packets.
Possibilism instead emphasises the ongoing counterfactual of what options are considered available as the decision-relevant factors - even if you can predict that you'll succumb, the right thing to do is to take the shorter route via the sweets and then just have the willpower dammit to overcome the sweet tooth!
Some authors contrast naive possibilist with 'resolute' or 'sophisticated' decisions for this kind of sequential problem[2].
In my mind these relate closely to the concepts of policy-relative action-advantage functions from reinforcement learning, except applied to more compound 'options' than whatever plain atomic action space is assumed to exist. But I've not seen this comparison made anywhere.
(Presumably you use the term 'option' as I do, to refer to something similar to the contextually-activated semi-policies of Sutton and Precup and others?)
By the way, I find it helpful to think about decision-makers only existing in their present form for the duration of a 'single decision' and I think this substantially gets at the heart of embeddedness.
See the short Wikipedia take here or the longer Stanford take here. ↩︎
See Stanford here with their discussion of Ulysses under Sequential decisions ↩︎
I wonder how the following behavioral patterns fit into Shard Theory
Physiological events associated with pregnancy (mostly hormones) rewires the mother's brain such that when she gives birth, she immediately takes care of the young, grooms them etc., something she has never done before.
Salt-starved rats develop an appetite for salt and are drawn to stimuli predictive of extremely salty water
I've been wondering about the latter for a while. These two results are less strongly predicted by shard theoretic reasoning than by "hardcoded" hypotheses. Pure-RL+SL shard theory loses points on these two observations, and points to other mechanisms IMO (or I'm missing some implications of pure-RL+SL shard theory).
Ever since the discovery that the mammalian dopamine system implements temporal difference learning of reward prediction error, a longstanding question for those seeking a satisfying computational account of subjective experience has been: what is the relationship between happiness and reward (or reward prediction error)? Are they the same thing?
Or if not, is there some other natural correspondence between our intuitive notion of “being happy” and some identifiable computational entity in a reinforcement learning agent?
A simple reflection shows that happiness is not identical to reward prediction error: If I’m on a long, tiring journey of predictable duration, I still find relief at the moment I reach my destination. This is true even for journeys I’ve taken many times before, so that there can be little question that my unconscious has had opportunity to learn the predicted arrival time, and this isn’t just a matter of my conscious predictions getting ahead of my unconscious ones.
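To spell out why a well-learned journey should produce almost no reward prediction error on arrival, here is a minimal TD(0) sketch. The chain of states, the learning rate, the discount, and the function name `td_errors_on_journey` are assumptions chosen for illustration; nothing here is a claim about how the dopamine system is actually parameterized.

```python
def td_errors_on_journey(n_states=5, arrival_reward=1.0,
                         alpha=0.5, gamma=1.0, episodes=200):
    """Run TD(0) on a fixed chain of states with a reward only on arrival.

    Returns the per-step TD errors (reward prediction errors) from the
    first journey and from the last journey.
    """
    values = [0.0] * (n_states + 1)  # values[n_states] is terminal, stays 0
    first, last = None, None
    for episode in range(episodes):
        errors = []
        for s in range(n_states):
            reward = arrival_reward if s == n_states - 1 else 0.0
            # TD error: delta = r + gamma * V(s') - V(s)
            delta = reward + gamma * values[s + 1] - values[s]
            values[s] += alpha * delta
            errors.append(delta)
        if episode == 0:
            first = errors
        last = errors
    return first, last

first, last = td_errors_on_journey()
print("first journey, error on arrival:", first[-1])
print("last journey, largest error:", max(abs(e) for e in last))
```

On the first journey the arrival is a full surprise; after many repetitions every step, including arrival, carries essentially zero prediction error. If relief on arrival persists anyway, it cannot be this quantity.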
On the other hand, I also gain happiness from learning, well before I arrive, that traffic on my route has dissipated. So there does seem to be some amount of satisfaction gained just from learning new information, even prior to “cashing it in”. Hence, happiness is not identical to simple reward either.
Perhaps shard theory can offer a straightforward answer here: happiness (respectively suffering) is when a realized feature of the agent’s world model corresponds to something that a shard which is currently active values (respectively devalues).
If this is correct, then happiness, like value, is not a primitive concept like reward (or reward prediction error), but instead relies on at least having a proto-world model.
It also explains the experience some have had, achieved through the use of meditation or other deliberate effort, of bodily pain without attendant suffering. They are presumably finding ways to activate shards that simply do not place negative value on pain.
Finally: happiness is then not a unidimensional, inter-comparable thing, but instead each type is to an extent sui generis. This comports with my intuition: I have no real scale on which I can weigh the pleasure of an orgasm against the delight of mathematical discovery.
A strong positive sign of a new theory's validity is that one is frequently able to simplify one's understanding of old discoveries through use of the new theory. Here are a few examples from the past few days of old ideas that make more sense when framed in terms of shard theory:
Adaptation-Executers, not Fitness-Maximizers
"Individual organisms are best thought of as adaptation-executers rather than as fitness-maximizers." —John Tooby and Leda Cosmides, The Psychological Foundations of Culture.
This is highly speculative, so feel free to point out if I am misrepresenting the theory, but couldn't human reward circuitry itself be viewed as "evolutionary shards" so to speak? As in, the reward circuitry itself is a collection of shards created during the pursuit of the "maximize reproductive fitness" goal?
Another example from Scott Alexander's most recent post (and congrats on the shout-out for your post on "Reward is not the optimization target").
I once attended a presentation on grief at a psychiatry conference. The presenter treated grief as a form of updating on prediction error. Your spouse dies. The next morning, you wake up and expect to find your spouse in bed with you. They aren’t. The situation is worse than you expected. Actual hedonic state is lower than predicted hedonic state, reward prediction error is negative. You now feel bad.
Of course, your conscious brain should be able to fully update on “I will not see my spouse again” the moment they die. This explanation assumes that the unconscious is slower to update. I accept this assumption. I’ve never had a partner die, but I’ve had some bad breakups. The next few months really are a series of “If only X were here…” and “This is so much worse without X”. Then eventually I mostly update and stop thinking of X being around as a natural comparison.
Shard theory seems to provide a much more natural explanation of these "subsequent pangs of grief" compared with the "subconscious updating slower than the conscious mind" explanation.
Deeply embedded features of our lives like spouses are present in many different shards, and every shard has to update before the grief can be processed.
This is quite interesting. It strikes me as perhaps a first-principles derivation of the theory of constructed preferences in behavioral economics.
Compare your
A shard of value refers to the contextually activated computations which are downstream of similar historical reinforcement events … We think that simple reward circuitry leads to different cognition activating in different circumstances. Different circumstances can activate cognition that implements different values, and this can lead to inconsistent or biased behavior. We conjecture that many biases are convergent artifacts of the human training process and internal shard dynamics. People aren’t just randomly/hardcoded to be more or less “rational” in different situations.
to Bernheim’s
According to this view, I aggregate the many diverse aspects of my experience only when called upon to do so for a given purpose, such as making a choice or answering a question about my well-being. … To answer a question about my overall welfare, or to choose between alternatives without deploying a previously constructed rule of thumb, I must weigh the positives against the negatives and construct an answer de novo. …This perspective potentially attributes particular choice anomalies to the vagaries of aggregation. In particular, when I deliberate and aggregate, the weights I attach to the various dimensions of my subjective experience may be sensitive to context.
Values are closely related to preferences, and preferences have been extensively studied in behavioral econ. I've written more on the connection between AI and behavioral econ here.
Thanks for the reference (and sorry for just now getting around to replying).
I think Bernheim's paper is somewhat related to the shard theory of human values. There are several commonalities, including
However, I think that shard theory is not a rederivation of this work, or other work mentioned in this paper:
That doesn't mean these works are unrelated. If you want to deeply understand welfare and "idealized preferences" / what people "should" choose, I think that we should understand more about how people make choices, via what neural circuits. This is a question of neuroscience and reinforcement learning theory. The shard theory of human values aims to contribute to that question.
As you pointed out in private correspondence, the shard theory of human values can be viewed as a hypothesis about where the context-sensitive preferences come from.
This is pretty exciting. I've not really done any direct work to push forward alignment in the last couple years, but this is exactly the sort of direction I was hoping someone would go when I wrote my research agenda for deconfusing human values. What came out of it was that there was some research to do that I wasn't equipped to do myself, and I'm very happy to say you've done the sort of thing I had hoped for.
On first pass this seems to address many of the common problems with traditional approaches to formalizing values. I hope that this proves a fruitful line of research!
Thank you for the post!
I found it interesting to think about how self-supervised learning + RL can lead to human-like value formation; however, I'm not sure how much predictive power you gain out of the shards. The model of value formation you present feels close to the Alpha Go setup:
You have an encoder E, an action decoder D, and a value head V. You train D°E with something close to self-supervised learning (not entirely accurate, but I can imagine other RL systems trained with D°E doing exactly supervised learning), and train V°E with hard-coded sparse rewards. This looks very close to shard theory, except that you replace V with a bunch of shards, right? However, I think this latter part doesn't make predictions different from "V is a neural network", because neural networks often learn context-dependent things, and I expect Alpha Go V-network to be very context dependent.
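For readers who want the E / D / V picture spelled out, here is a rough forward-pass-only sketch of a shared encoder with a policy head and a value head. It is an assumption-laden toy, not Alpha Go's actual architecture and not a training loop: the class name `TwoHeadedNet`, the layer sizes, the random weights, and the tanh/softmax choices are all hypothetical.

```python
import math
import random

def linear(weights, x):
    """Matrix-vector product for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def make_matrix(rows, cols, rng):
    """Random weight matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

class TwoHeadedNet:
    """Shared encoder E with an action decoder D and a value head V."""

    def __init__(self, obs_dim, hidden_dim, n_actions, seed=0):
        rng = random.Random(seed)
        self.E = make_matrix(hidden_dim, obs_dim, rng)
        self.D = make_matrix(n_actions, hidden_dim, rng)
        self.V = make_matrix(1, hidden_dim, rng)

    def encode(self, obs):
        return [math.tanh(h) for h in linear(self.E, obs)]

    def policy(self, obs):
        """D∘E: a probability distribution over actions."""
        logits = linear(self.D, self.encode(obs))
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        return [e / z for e in exps]

    def value(self, obs):
        """V∘E: a scalar in [-1, 1], to be trained on sparse reward."""
        return math.tanh(linear(self.V, self.encode(obs))[0])

net = TwoHeadedNet(obs_dim=4, hidden_dim=8, n_actions=3)
obs = [0.1, -0.3, 0.7, 0.0]
print(net.policy(obs), net.value(obs))
```

The question in the comment is then whether "V is a bunch of shards" says anything beyond "V is some context-dependent function of the shared encoding", since both heads here already read from the same features.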
Is sharding a way to understand what neural networks can do in human understandable terms? Or is it a claim about what kind of neural network V is (because there are neural networks which aren't very "shard-like")?
Or do you think that sharding explains more than "the brain is like Alpha Go"? For example, maybe it's hard for different parts of the V network to self-reflect. But that feels pretty weak, because humans don't do that much either. Did I miss important predictions that shard theory makes and the classic RL+supervised learning setup doesn't?
For what it's worth, I wrote a critique of Shard Theory here on LessWrong (on Oct 20, 2022) from the perspective of behavior genetics and the heritability of values.
The comments include some helpful replies and discussions with Shard Theory developers Quintin Pope and Alex Turner.
I'd welcome any other feedback as well.
I'm new here so please forgive me if I'm a bit naive on how the comment section works, but I have one simple question: are there some falsifiable predictions that you could come up with for this theory? In this way, one could create an experiment and test them, e.g. in the bias or moral decision-making fields.
Thanks for leaving a comment. I think the theory is best falsified via its assumptions. If the brain didn't do RL, that would seem sufficient to discount the reasoning presented in this post. If evolved modularity were broadly true, then probably shard-like dynamics don't tell most of the story of human values. The trouble with experiments for shard theory writ large, is that current shards should be sensitively dependent on the person's past reinforcement events (IIRC you can see this kind of dynamic reflected in conditional reinforcer events, where the effect of reward changes depending on beliefs -- "Why did I get the banana?"). You can get lots of arrangements of shards (of values) given appropriate historical reinforcement events and epistemic states. So it's not like shard theory is ruling out wide swaths of values or actions (nor should it, since people do seem to have a wide range of values).
Regarding time inconsistency of rewards, where subjects displayed a "today-bias", might this be explained by shards formed in relation to "payout-day" (getting pocket money or salary)? For many people, agency and well-being vary over the month, peaking on the day of their monthly payout. It makes sense to me that these variations create a shard that values getting paid TODAY rather than tomorrow.
For the 365 vs 366 example, I would assume that the selection is handled more rationally, optimizing for the expected return.
I keep finding examples of decision making that can be explained by shard theory. In particular, here's an example I read in ACX today about how depressed people make decisions that keep them depressed, which sounds an awful lot like "shards guiding behavior in ways that tend to lead to their own future self-reinforcement":
Millgram et al (2015) find that depressed people prefer to listen to sad rather than happy music. This matches personal experience; when I'm feeling down, I also prefer sad music. But why? Try setting aside all your internal human knowledge: wouldn’t it make more sense for sad people to listen to happy music, to cheer themselves up?
A later study asks depressed people why they do this. They say that sad music makes them feel better, because it’s more "relaxing" than happy music. They’re wrong. Other studies have shown that listening to sad music makes depressed people feel worse, just like you’d expect. And listening to happy music makes them feel better; they just won’t do it.
Scott's preferred explanation is one of a kind of "mood setpoint" which the depressed individual's actions are trying to reach:
Depression is often precipitated by some psychosocial event (like loss of a job, or the death of a loved one). It’s natural to feel sad for a little while after this. But instead of correctly activating regulatory processes to get mood back to normal, the body accepts the new level as its new set point, and tries to defend it.
By “defend it”, I mean that healthy people have a variety of mechanisms to stop being sad and get their mood back to a normal level. In depression, the patient appears to fight very hard to prevent mood getting back to a normal level. They stay in a dark room and avoid their friends. They even deliberately listen to sad music!
Self-reinforcing "depression shards" are obviously a mechanism through which depressive states can be maintained. But then the question becomes why are some people more vulnerable to this kind of depression homeostasis than others?
There's certainly a genetic component, but given how polygenic depression risk is (>30k variants involved) the mechanisms are likely multi-causal.
I like this theory of human motivation. My two main points: there could be many almost consistent theories of motivation, and the choice between them is difficult when we want to upload them into AI.
Not everything that looks like a human value is actually valuable in the sense that a future AI should care about it. For example, in one model human motivation consists of "animal" desires and socially accepted rules. If an AI learns my preferences, I would prefer that it learn my rules, but not my animal desires. I wrote more about criticism of the idea of human values here.
If I know that Shard Theory is 100% true, then I don't care if I see or don't see direct benefits for Alignment: understanding human psychology is important for Alignment one way or another.
But if I'm not sure that Shard Theory is true, then I would like to judge it by its direct benefits for Alignment too. Maybe even judge its plausibility by its usefulness. (Because of this I'm not sure that making multiple posts about multiple aspects of Shard Theory is the best option.) An example of mixing "plausibility" and "usefulness": neural nets don't make predictions about human behavior, but they're extremely useful, they're the only thing that does anything human-like, so the idea of "neural nets as a model of humans" gets its plausibility from its usefulness.
If you think that usual Alignment problems don't apply to shards, the burden of proof is on you. You have to translate those problems into the language of the theory yourself.
Examples of explaining human behavior are strange:
The third point may be the most important one. Here're some specific examples:
Why do we care more about nearby visible strangers as opposed to distant strangers?
I'm not sure that's true. Here we already moved from facts to interpretations of facts.
We think that the answer is simple. First consider the relevant context. The person sees a drowning child. What shards activate? Consider the historical reinforcement events relevant to this context. Many of these events involved helping children and making them happy. These events mostly occurred face-to-face.
One thing that feels strange about those examples is that they seem to ignore people's general ability to think. The theory may be getting too mechanical at too high a level of cognition.
Personally, I (TurnTrout) am more inclined to make plans with my friends when I’m already hanging out with them—when we are already physically near each other. But why?
I think anything can trigger making plans with friends, depending on your personality and absolutely random factors. If that weren't true, people would be as rigid as zombies.
Therefore, the sunflower-timidity-shard was grown from… Hm. It wasn’t grown. The claim isn’t true, and this shard doesn’t exist, because it’s not downstream of past reinforcement.
Thus: Shard theory does not explain everything, because shards are grown from previous reinforcement events and previous thoughts. Shard theory constrains anticipation around actual observed human nature.
I'm not sure Shard Theory really doesn't explain this. You could say that sunflowers are very similar to other good things in the past, e.g. Beautiful Calm Nature and Movies and Flowers As Something Good and Places That Are 100% Not School or Workplace or Busy City. And I think that Shard Theory is actually in trouble either way:
Maybe that's the key thing that makes me doubt the theory.
You could say that sunflowers are very similar to other good things in the past, e.g. Beautiful Calm Nature and Movies and Flowers As Something Good and Places That Are 100% Not School or Workplace or Busy City. And I think that Shard Theory is actually in trouble either way:
How does that reinforcement event history create a sunflower-timidity-shard?
I think such reinforcement history could create "Nature - timidity" shard and sunflowers (and flowers in general) could be a strong symbol of nature.
By the way, I would like it if explanations of human behavior were discussed more in the post, e.g. if the post proposed a couple of shard-based explanations and compared them to some non-shard-based explanations. For example: (I realize that the explanation is just an example, it's not final)
For example, perhaps there is a hardcoded reward circuit which is activated by a crude subcortical smile-detector and a hardcoded attentional bias towards objects with relatively large eyes. Then reinforcement events around making children happy would cause people to care about children. For example, an adult’s credit assignment might correctly credit decisions like “smiling at the child” and “helping them find their parents at a fair” as responsible for making the child smile. “Making the child happy” and “looking out for the child’s safety” are two reliable correlates of smiles, and so people probably reliably grow child-subshards around these correlates.
Does the theory say that a full-grown adult wouldn't have enough mental machinery to care about children strongly enough if she lacked a "smile-detector" and a "large eyes detector", or a couple of specific decisions in the past?
If you saw someone vandalizing something important to your friend (e.g. her artworks), you would probably have a strong reaction to that just because you understand what's happening. Or because of some more general shards (e.g. related to "effort" and to yourself and to your friend). Wouldn't a drowning child activate many more shards and/or other things?
Sorry if you already wrote about it, but does Shard Theory fall under the umbrella of behaviorism?
Behaviorism is a systematic approach to understanding the behavior of humans and other animals.[1] It assumes that behavior is either a reflex evoked by the pairing of certain antecedent stimuli in the environment, or a consequence of that individual's history, including especially reinforcement and punishment contingencies, together with the individual's current motivational state and controlling stimuli. Although behaviorists generally accept the important role of heredity in determining behavior, they focus primarily on environmental events.
That's certainly an interesting position in the discussion about what people want!
Namely, that actions and preferences are just conditionally activated, and those context activations are balanced against each other. That means that a person's preference system may be not only incomplete but incoherent in architecture, and moral systems and goals obtained via reflection are almost certainly not total (they will be lacking in some contexts), creating a problem for RLHF.
The first assumption, that part of the neurons are basically randomly initialized, can't really be tested well, because all humans are born in a similar gravity field, see similarly structured images in their first days (all "colorful patches" correspond to objects which are continuous, mostly flat or uniformly round), etc., and that leaves a generic imprint.
Time inconsistency example: You’ve described shards as context-based predictions of getting reward. One way to model the example would be to imagine there is one shard predicting the chance of being rewarded in the situation where someone is offering you something right now, and another shard predicting the chance you will be rewarded if someone is promising they will give you something tomorrow.
For example, I place a substantially better probability on getting to eat cake if someone is currently offering me the slice of cake, compared to someone promising that they will bring a slightly better cake to the office party tomorrow. (In the second case, they might get sick, or forget, or I might not make it to the party.)
You’ve described shards as context-based predictions of getting reward.
I think you're summarizing "Shard theory views 'shards' as contextually-activated predictors of low-level reward events (i.e. reward prediction errors)." If so, that's not what I meant to communicate. On my view, shards usually aren't reward predictors at all, the shards were simply shaped into existence by past reward events. Here's how I'd analyze the situation:
My cake-shard would have been shaped into existence by past reinforcement events related to cake. My cake shard affects my decisions more strongly in situations which are similar to the past reinforcement events (e.g. because I internalized heuristics like "If I see cake, then be more likely to eat cake"), and therefore I'm more tempted by cake when I can see cake.
I wasn't thinking of shards as reward prediction errors, but I can see how the language was confusing. What I meant is that when multiple shards are activated, they affect behavior according to how strongly and reliably they were reinforced in the past. Practically, this looks like competing predictions of reward (because past experience is strongly correlated with predictions of future experience), although technically it's not a prediction - the shard is just based on the past experience and will influence behavior similarly even if you rationally know the context has changed. E.g. the cake shard will probably still reinforce eating cake even if you know that you just had mouth-changing surgery that means you don't like cake anymore.
(However, I would expect that shards evolve over time. So in this example, after enough repetitions reliably failing to reinforce cake eating, the cake shard would eventually stop making you crave cake when you see cake.)
So in my example, cleaner language might be: For example, I more reliably ate cake in the past if someone was currently offering me the slice of cake, compared to someone promising that they will bring a slightly better cake to the office party tomorrow. So when the "someone is currently offering me something" shard and the "someone is promising me something" shard are both activated, the first shard affects my decisions more, because it was rewarded more reliably in the past.
(One test of this theory might be whether people are more likely to take the bigger, later payout if they grew up in extremely reliable environments where they could always count on the adults to follow through on promises. In that case, their "someone is promising me something" shard should have been reinforced similarly to the "someone is currently offering me something" shard. This is basically one explanation given for the classic Marshmallow Experiment - kids waited if they trusted adults to follow through with the promised two marshmallows; kids ate the marshmallow immediately if they didn't trust adults.)
I'd just like to add that even if you think this piece is completely mistaken, I think it certainly shows we are not knowledgeable enough about what values and motives are and how they work in us, much less in AI, to confidently predict that AIs will be usefully described with a single global utility function, or will work to subvert their reward system, or the like.
Maybe that will turn out to be true, but before we spend so many resources on trying to solve AI alignment, let's try to make the argument for the great danger much more rigorous first... usually the best way to start anyway.
How does shard theory explain romantic jealousy? It seems like most people feel jealous when their romantic partner does things like dancing with someone else or laughing at their jokes. How do shards like this form from simple reward circuitry? I'm having trouble coming up with a good story of how this happens. I would appreciate it if someone could sketch one out for me.
I don't know.
Speculatively, jealousy responses/worries could be downstream of imitation/culture (which "raises the hypothesis"/has self-supervised learning ingrain the completion, such that now the cached completion is a consequence which can be easily hit by credit assignment / upweighted into a real shard). Another source would be negative reward events on outcomes where you end up alone / their attentions stray. Which, itself, isn't from simple reward circuitry, but a generalization of other learned reward events which I expect are themselves downstream of simple reward circuitry. (Not that that reduces my confusion much)
I do research in cooperation and game theory, including some work on altruism, and also some hard science work. Everyone looks at the Rorschach blot of human behavior and sees something different. Most of the disagreements have never been settled. Even experiment does not completely settle them.
My experience from having children and observing them in the first few months of life is more definitive. They come with values and personal traits that are not very malleable, and not directly traceable to parents. Sometimes grandparents (who were already dead, so it had to be genetic).
My experience with A.I. researchers is that they are looking for a shortcut. No one's career will permit raising an experimental AI as a child. The tech of the AI would be obsolete before the experiment was complete. This post is wishful thinking that a shortcut is available. It is speculative, anecdotal, and short on references to careful experiment. Good luck with that.
My experience from having children and observing them in the first few months of life is more definitive. They come with values and personal traits that are not very malleable, and not directly traceable to parents.
What, really? What observations produced these inferences? Even if that were true in the first few months, how would we know that?
Personality traits are highly heritable and not very malleable/depend on the early environment. Indeed more experience reduces personality:
Decades of research have shown that about half of individual differences in personality traits is heritable. Recent studies have reported that heritability is not fixed, but instead decreases across the life span. [...] For most traits, findings provided evidence for an increasing relative importance of life experiences contributing to personality differences across the life span.
How Genetic and Environmental Variance in Personality Traits Shift Across the Life Span: Evidence From a Cross-National Twin Study (just add "gwern" to your heritability Google search)
I don't think this disproves shard theory. I think that differences in small children's attention or emotional regulation levels lead to these differences. Shards will form around things that happen reliably in contexts created by the emotional behaviors or the objects of attention. Later on, with more context and abstraction, some of these shards may coalesce or be outbid by more generally adaptive shards.
ADDED: Hm, it seems you have seen The heritability of human values: A behavior genetic critique of Shard Theory which has much more of this.
(Note that 'life experiences' here is being used in the (misleading to laymen) technical sense of 'non shared-environment': all variance on the raw measurement which cannot be ascribed to either genetic variance at conception or within-family shared-across-all-siblings influences. So 'life experience' includes not just that rousing pep talk your coach gave you in highschool you never forgot, which is probably the sort of thing you are thinking of when you read the phrase 'life experiences', but also that personality item question you misunderstood due to outdated wording & answered the wrong way, and that ear infection as a 6 month old baby that set up the trigger for an autoimmune disorder 50 years later, and that A/B test on Facebook which showed you the wrong job ad, and that gamma ray which mutated a critical neuron at age 35 & gave you brain cancer & made you misanthropic, and... If you are unsure if 'non shared-environment' is being used in a meaningful way, simply try swapping in various contributors to non shared-environment like 'somatic mutations during the first trimester' and see how sensible the claim remains: sometimes you'll get something absurd like "the decrease of heritability and increasing importance of somatic mutations during the first trimester over the course of a lifetime proves we have free will".)
TL;DR: We propose a theory of human value formation. According to this theory, the reward system shapes human values in a relatively straightforward manner. Human values are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics which were shaped by and bootstrapped from crude, genetically hard-coded reward circuitry.
We think that human value formation is extremely important for AI alignment. We have empirically observed exactly one process which reliably produces agents which intrinsically care about certain objects in the real world, which reflect upon their values and change them over time, and which—at least some of the time, with non-negligible probability—care about each other. That process occurs millions of times each day, despite genetic variation, cultural differences, and disparity in life experiences. That process produced you and your values.
Human values look so strange and inexplicable. How could those values be the product of anything except hack after evolutionary hack? We think this is not what happened. This post describes the shard theory account of human value formation, split into three sections:
Terminological note: We use “value” to mean a contextual influence on decision-making. Examples:
To us, this definition seems importantly type-correct and appropriate—see Appendix A.2. The main downside is that the definition is relatively broad—most people wouldn’t list “donuts” among their “values.” To avoid this counterintuitiveness, we would refer to a “donut shard” instead of a “donut value.” (“Shard” and associated terminology are defined in section II.)
I. Neuroscientific assumptions
The shard theory of human values makes three main assumptions. We think each assumption is pretty mainstream and reasonable. (For pointers to relevant literature supporting these assumptions, see Appendix A.3.)
Assumption 1: The cortex[1] is basically (locally) randomly initialized. According to this assumption, most of the circuits in the brain are learned from scratch, in the sense of being mostly randomly initialized and not mostly genetically hard-coded. While the high-level topology of the brain may be genetically determined, we think that the local connectivity is not primarily genetically determined. For more clarification, see [Intro to brain-like-AGI safety] 2. “Learning from scratch” in the brain.
Thus, we infer that human values & biases are inaccessible to the genome:
Assumption 2: The brain does self-supervised learning. According to this assumption, the brain is constantly predicting what it will next experience and think, from whether a V1 neuron will detect an edge, to whether you’re about to recognize your friend Bill (which grounds out as predicting the activations of higher-level cortical representations). (See On Intelligence for a book-long treatment of this assumption.)
In other words, the brain engages in self-supervised predictive learning: Predict what happens next, then see what actually happened, and update to do better next time.
Definition. Consider the context available to a circuit within the brain. Any given circuit is innervated by axons from different parts of the brain. These axons transmit information to the circuit. Therefore, whether a circuit fires is not primarily dependent on the external situation navigated by the human, or even what the person senses at a given point in time. A circuit fires depending on whether its inputs[2]—the mental context—triggers it or not. This is what the "context" of a shard refers to.
Assumption 3: The brain does reinforcement learning. According to this assumption, the brain has a genetically hard-coded reward system (implemented via certain hard-coded circuits in the brainstem and midbrain). In some[3] fashion, the brain reinforces thoughts and mental subroutines which have led to reward, so that they will be more likely to fire in similar contexts in the future. We suspect that the “base” reinforcement learning algorithm is relatively crude, but that people reliably bootstrap up to smarter credit assignment.
Summary. Under our assumptions, most of the human brain is locally randomly initialized. The brain has two main learning objectives: self-supervised predictive loss (we view this as building your world model; see Appendix A.1) and reward (we view this as building your values, as we are about to explore).
II. Reinforcement events shape human value shards
This section lays out a bunch of highly specific mechanistic speculation about how a simple value might form in a baby’s brain. For brevity, we won’t hedge statements like “the baby is reinforced for X.” We think the story is good and useful, but don’t mean to communicate absolute confidence via our unhedged language.
Given the inaccessibility of world model concepts, how does the genetically hard-coded reward system dispense reward in the appropriate mental situations? For example, suppose you send a drunk text, and later feel embarrassed, and this triggers a penalty. How is that penalty calculated? By information inaccessibility and the absence of text messages in the ancestral environment, the genome isn’t directly hard-coding a circuit which detects that you sent an embarrassing text and then penalizes you. Nonetheless, such embarrassment seems to trigger (negative) reinforcement events... and we don’t really understand how that works yet.
Instead, let’s model what happens if the genome hardcodes a sugar-detecting reward circuit. For the sake of this section, suppose that the genome specifies a reward circuit which takes as input the state of the taste buds and the person’s metabolic needs, and produces a reward if the taste buds indicate the presence of sugar while the person is hungry. By assumption 3 in section I, the brain does reinforcement learning and credit assignment to reinforce circuits and computations which led to reward. For example, if a baby picks up a pouch of apple juice and sips some, that leads to sugar-reward. The reward makes the baby more likely to pick up apple juice in similar situations in the future.
Therefore, a baby may learn to sip apple juice which is already within easy reach. However, without a world model (much less a planning process), the baby cannot learn multi-step plans to grab and sip juice. If the baby doesn’t have a world model, then she won’t be able to act differently in situations where there is or is not juice behind her. Therefore, the baby develops a set of shallow situational heuristics which involve sensory preconditions like “IF juice pouch detected in center of visual field, THEN move arm towards pouch.” The baby is basically a trained reflex agent.
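The reflex-agent stage described above can be made concrete with a toy simulation. This is only an illustrative sketch, not the authors' model: the context names, action names, learning rate, and the "strength" bookkeeping are all hypothetical choices made for the example. It shows a hard-coded sugar-reward circuit strengthening only the context-action heuristic that actually preceded reward.

```python
import random


def sugar_reward(tastes_sugar: bool, hungry: bool) -> float:
    """Hypothetical hard-coded reward circuit: fires on sugar while hungry."""
    return 1.0 if (tastes_sugar and hungry) else 0.0


def train_reflex_agent(episodes: int = 200, lr: float = 0.1, seed: int = 0) -> dict:
    """Strengthen (sensory context -> action) heuristics that preceded reward."""
    rng = random.Random(seed)
    actions = ["reach_for_pouch", "wave_arms"]
    contexts = ["pouch_in_view", "no_pouch_in_view"]
    # Every heuristic starts out equally weak (an assumption of this sketch).
    strengths = {(c, a): 0.1 for c in contexts for a in actions}

    for _ in range(episodes):
        context = rng.choice(contexts)
        # Stronger heuristics are proportionally more likely to fire.
        weights = [strengths[(context, a)] for a in actions]
        action = rng.choices(actions, weights=weights)[0]
        # Sugar is tasted only if the pouch was visible and the baby reached.
        tastes_sugar = context == "pouch_in_view" and action == "reach_for_pouch"
        reward = sugar_reward(tastes_sugar, hungry=True)
        # Crude credit assignment: reinforce the heuristic that just fired.
        strengths[(context, action)] += lr * reward
    return strengths


strengths = train_reflex_agent()
```

In this sketch only the "IF pouch in view THEN reach" entry grows. The agent has no world model, so nothing it learns can depend on juice that is out of sight.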
However, when the baby has a proto-world model, the reinforcement learning process takes advantage of that new machinery by further developing the juice-tasting heuristics. Suppose the baby models the room as containing juice within reach but out of sight. Then, the baby happens to turn around, which activates the already-trained reflex heuristic of “grab and drink juice you see in front of you.” In this scenario, “turn around to see the juice” preceded execution of “grab and drink the juice which is in front of me”, and so the baby is reinforced for turning around to grab the juice in situations where the baby models the juice as behind herself.[4]
By this process, repeated many times, the baby learns how to associate world model concepts (e.g. “the juice is behind me”) with the heuristics responsible for reward (e.g. “turn around” and “grab and drink the juice which is in front of me”). Both parts of that sequence are reinforced. In this way, the contextual-heuristics exchange information with the budding world model.
A shard of value refers to the contextually activated computations which are downstream of similar historical reinforcement events. For example, the juice-shard consists of the various decision-making influences which steer the baby towards the historical reinforcer of a juice pouch. These contextual influences were all reinforced into existence by the activation of sugar reward circuitry upon drinking juice. A subshard is a contextually activated component of a shard. For example, “IF juice pouch in front of me THEN grab” is a subshard of the juice-shard. It seems plain to us that learned value shards are[5] most strongly activated in the situations in which they were historically reinforced and strengthened. (For more on terminology, see Appendix A.2.)
While all of this is happening, many different shards of value are also growing, since the human reward system offers a range of feedback signals. Many subroutines are being learned, many heuristics are developing, and many proto-preferences are taking root. At this point, the brain learns a crude planning algorithm,[6] because proto-planning subshards (e.g. IF motor-command-5214 predicted to bring a juice pouch into view, THEN execute) would be reinforced for their contributions to activating the various hardcoded reward circuits. This proto-planning is learnable because most of the machinery was already developed by the self-supervised predictive learning, when e.g. learning to predict the consequences of motor commands (see Appendix A.1).
The planner has to decide on a coherent plan of action. That is, micro-incoherences (turn towards juice, but then turn back towards a friendly adult, but then turn back towards the juice, ad nauseam) should generally be penalized away.[7] Somehow, the plan has to be coherent, integrating several conflicting shards. We find it useful to view this integrative process as a kind of “bidding.” For example, when the juice-shard activates, the shard fires in a way which would have historically increased the probability of executing plans which led to juice pouches. We’ll say that the juice-shard is bidding for plans which involve juice consumption (according to the world model), and perhaps bidding against plans without juice consumption.
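One way to picture the "bidding" metaphor is as a weighted vote over candidate plans. The sketch below is an assumption-laden toy, not a claim about how the brain implements it: the `Shard` class, the activation weights, the bid functions, and the three plans are all invented for illustration. Each shard scores a plan from the world model's predictions, scaled by how strongly the current context activates that shard, and the planner executes the plan with the highest total.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A plan is summarized by what the world model predicts it leads to.
Plan = Dict[str, float]


@dataclass
class Shard:
    name: str
    activation: float             # how strongly the current context activates it
    bid: Callable[[Plan], float]  # positive = bids for, negative = bids against


def choose_plan(plans: Dict[str, Plan], shards: List[Shard]) -> str:
    """Pick the single plan with the highest activation-weighted total bid."""
    def total_bid(plan: Plan) -> float:
        return sum(s.activation * s.bid(plan) for s in shards)
    return max(plans, key=lambda name: total_bid(plans[name]))


# Hypothetical world-model predictions for three candidate plans.
PLANS: Dict[str, Plan] = {
    "turn_to_juice": {"juice": 1.0, "adult_attention": 0.0},
    "turn_to_adult": {"juice": 0.0, "adult_attention": 1.0},
    "dither": {"juice": 0.3, "adult_attention": 0.3},
}


def make_shards(juice_activation: float, adult_activation: float) -> List[Shard]:
    return [
        # Bids for plans with juice and against plans without it.
        Shard("juice", juice_activation, lambda p: 2.0 * p["juice"] - 1.0),
        Shard("friendly-adult", adult_activation, lambda p: p["adult_attention"]),
    ]


print(choose_plan(PLANS, make_shards(1.0, 0.5)))  # juice-shard strongly active
print(choose_plan(PLANS, make_shards(0.2, 1.0)))  # adult-shard strongly active
```

With these made-up numbers, the chosen plan flips with the context-dependent activations, and the incoherent "dither" plan never wins because it satisfies neither shard well.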
Importantly, however, the juice-shard is shaped to bid for plans which the world model predicts actually lead to juice being consumed, and not necessarily for plans which lead to sugar-reward-circuit activation. You might wonder: “Why wouldn’t the shard learn to value reward circuit activation?”. The effect of drinking juice is that the baby's credit assignment reinforces the computations which were causally responsible for producing the situation in which the hardcoded sugar-reward circuitry fired.
But what is reinforced? The content of the responsible computations includes a sequence of heuristics and decisions, one of which involved the juice pouch abstraction in the world model. Those are the circuits which actually get reinforced and become more likely to fire in the future. Therefore, the juice-heuristics get reinforced. The heuristics coalesce into a so-called shard of value as they query the world model and planner to implement increasingly complex multi-step plans.
In contrast, in this situation, the baby's decision-making does not involve “if this action is predicted to lead to sugar-reward, then bid for the action.” This non-participating heuristic probably won’t be reinforced or created, much less become a shard of value.[8]
This is important. We see how the reward system shapes our values, without our values entirely binding to the activation of the reward system itself. We have also laid bare the manner in which the juice-shard is bound to your model of reality instead of simply your model of future perception. Looking back across the causal history of the juice-shard’s training, the shard has no particular reason to bid for the plan “stick a wire in my brain to electrically stimulate the sugar reward-circuit”, even if the world model correctly predicts the consequences of such a plan. In fact, a good world model predicts that the person will drink fewer juice pouches after becoming a wireheader, and so the juice-shard in a reflective juice-liking adult bids against the wireheading plan! Humans are not reward-maximizers, they are value shard-executors.
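The contrast between a reward-maximizer and a shard-executor can be shown with a two-plan toy example. All numbers and plan names here are hypothetical. The only point is that the two decision rules read different fields of the same world-model prediction: one looks at predicted reward-circuit activation, the other at predicted juice actually consumed.

```python
from typing import Dict

# Hypothetical world-model predictions for each plan.
PREDICTIONS: Dict[str, Dict[str, float]] = {
    "keep_drinking_juice": {"juice_pouches": 3.0, "reward_signal": 3.0},
    "wirehead": {"juice_pouches": 0.0, "reward_signal": 100.0},
}


def reward_maximizer_choice(predictions: Dict[str, Dict[str, float]]) -> str:
    """An agent that values predicted reward-circuit activation itself."""
    return max(predictions, key=lambda plan: predictions[plan]["reward_signal"])


def juice_shard_choice(predictions: Dict[str, Dict[str, float]]) -> str:
    """A juice-shard bids on predicted juice consumption, not on reward."""
    return max(predictions, key=lambda plan: predictions[plan]["juice_pouches"])


print(reward_maximizer_choice(PREDICTIONS))
print(juice_shard_choice(PREDICTIONS))
```

Both agents share the same accurate world model in this sketch. They differ only in what was reinforced into their decision-making, which is why the juice-shard bids against the wireheading plan.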
This, we claim, is one reason why people (usually) don’t want to wirehead and why people often want to avoid value drift. According to the sophisticated reflective capabilities of your world model, if you popped a pill which made you 10% more okay with murder, your world model predicts futures which are bid against by your current shards because they contain too much murder.
We’re pretty confident that the reward circuitry is not a complicated hard-coded morass of alignment magic which forces the human to care about real-world juice. No, the hypothetical sugar-reward circuitry is simple. We conjecture that the order in which the brain learns abstractions makes it convergent to care about certain objects in the real world.
III. Explaining human behavior using shard theory
The juice-shard formation story is simple and—if we did our job as authors—easy to understand. However, juice-consumption is hardly a prototypical human value. In this section, we’ll show how shard theory neatly explains a range of human behaviors and preferences.
As people, we have lots of intuitions about human behavior. However, intuitively obvious behaviors still have to have mechanistic explanations—such behaviors still have to be retrodicted by a correct theory of human value formation. While reading the following examples, try looking at human behavior with fresh eyes, as if you were seeing humans for the first time and wondering what kinds of learning processes would produce agents which behave in the ways described.
Altruism is contextual
Consider Peter Singer’s drowning child thought experiment:
Probably,[9] most people would save the child, even at the cost of the shoes. However, few of those people donate an equivalent amount of money to save a child far away from them. Why do we care more about nearby visible strangers as opposed to distant strangers?
We think that the answer is simple. First consider the relevant context. The person sees a drowning child. What shards activate? Consider the historical reinforcement events relevant to this context. Many of these events involved helping children and making them happy. These events mostly occurred face-to-face.
For example, perhaps there is a hardcoded reward circuit which is activated by a crude subcortical smile-detector and a hardcoded attentional bias towards objects with relatively large eyes. Then reinforcement events around making children happy would cause people to care about children. For example, an adult’s credit assignment might correctly credit decisions like “smiling at the child” and “helping them find their parents at a fair” as responsible for making the child smile. “Making the child happy” and “looking out for the child’s safety” are two reliable correlates of smiles, and so people probably reliably grow child-subshards around these correlates.
This child-shard most strongly activates in contexts similar to the historical reinforcement events. In particular, “knowing the child exists” will activate the child-shard less strongly than “knowing the child exists and also seeing them in front of you.” “Knowing there are some people hurting somewhere” activates altruism-relevant shards even more weakly still. So it’s no grand mystery that most people care more when they can see the person in need.
Shard theory retrodicts that altruism tends to be biased towards nearby people (and also the ingroup), without positing complex, information-inaccessibility-violating adaptations like the following:
Similarly, you may be familiar with scope insensitivity: that the function from (# of children at risk) → (willingness to pay to protect the children) is not linear, but perhaps logarithmic. Is it that people “can’t multiply”? Probably not.
Under the shard theory view, it’s not that brains can’t multiply, it’s that for most people, the altruism-shard is most strongly invoked in face-to-face, one-on-one interactions, because those are the situations which have been most strongly touched by altruism-related reinforcement events. Whatever the altruism-shard’s influence on decision-making, it doesn’t steer decision-making so as to produce a linear willingness-to-pay relationship.
Friendship strength seems contextual
Personally, I (TurnTrout) am more inclined to make plans with my friends when I’m already hanging out with them—when we are already physically near each other. But why?
Historically, when I’ve hung out with a friend, that was fun and rewarding and reinforced my decision to hang out with that friend, and to continue spending time with them when we were already hanging out. As above, one possible way this could[10] happen is via a genetically hardcoded smile-activated reward circuit.
Since shards more strongly influence decisions in their historical reinforcement situations, the shards reinforced by interacting with my friend have the greatest control over my future plans when I’m actually hanging out with my friend.
Milgram is also contextual
We think that people convergently learn obedience- and cooperation-shards which more strongly influence decisions in the presence of an authority figure, perhaps because of historical obedience-reinforcement events in the presence of teachers / parents. These shards strongly activate in this situation.
We don’t pretend to have sufficient mastery of shard theory to a priori quantitatively predict Milgram’s obedience rate. However, shard theory explains why people obey so strongly in this experimental setup, but not in most everyday situations: The presence of an authority figure and of an official-seeming experimental protocol. This may seem obvious, but remember that human behavior requires a mechanistic explanation. “Common sense” doesn’t cut it. “Cooperation- and obedience-shards more strongly activate in this situation because this situation is similar to historical reinforcement contexts” is a nontrivial retrodiction.
Indeed, varying the contextual features dramatically affected the percentage of people who administered “lethal” shocks:
Sunflowers and timidity
Consider the following claim: “People reliably become more timid when surrounded by tall sunflowers. They become easier to sell products to and ask favors from.”
Let’s see if we can explain this with shard theory. Consider the mental context. The person knows there’s a sunflower near them. What historical reinforcement events pertain to this context? Well, the person probably has pleasant associations with sunflowers, perhaps spawned by aesthetic reinforcement events which reinforced thoughts like “go to the field where sunflowers grow” and “look at the sunflower.”
Therefore, the sunflower-timidity-shard was grown from… Hm. It wasn’t grown. The claim isn’t true, and this shard doesn’t exist, because it’s not downstream of past reinforcement.
Thus: Shard theory does not explain everything, because shards are grown from previous reinforcement events and previous thoughts. Shard theory constrains anticipation around actual observed human nature.
Optional exercise: Why might it feel wrong to not look both ways before crossing the street, even if you have reliable information that the coast is clear?
Optional exercise: Suppose that it's more emotionally difficult to kill a person face-to-face than from far away and out of sight. Explain via shard theory.[11]
We think that many biases are convergently produced artifacts of the human learning process & environment
We think that simple reward circuitry leads to different cognition activating in different circumstances. Different circumstances can activate cognition that implements different values, and this can lead to inconsistent or biased behavior. We conjecture that many biases are convergent artifacts of the human training process and internal shard dynamics. People aren’t just randomly/hardcoded to be more or less “rational” in different situations.
Projection bias
We believe that this is not a misprediction of how tastes will change in the future. Many adults know perfectly well that they will later crave the candy bar. However, a satiated adult has a greater probability of choosing fruit for their later self, because their deliberative shards are more strongly activated than their craving-related shards. The current level of hunger strongly controls which food-related shards are activated.
Sunk cost fallacy
Why are we hesitant to shift away from the course of action that we’re currently pursuing? There are two shard theory-related factors that we think contribute to sunk cost fallacy:
Time inconsistency
A person might deliberately avoid passing through the sweets aisle in a supermarket in order to avoid temptation. This is a very strange thing to do, and it makes no sense from the perspective of an agent maximizing expected utility over quantities like "sweet food consumed" and "leisure time" and "health." Such an EU-maximizing agent would decide to buy sweets or not, but wouldn’t worry about entering the aisle itself. Avoiding temptation makes perfect sense under shard theory.
Shards are contextually activated, and the sweet-shard is most strongly activated when you can actually see sweets. We think that planning-capable shards are manipulating future contexts so as to prevent the full activation of your sweet shard.
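To make the sweets-aisle reasoning concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than part of shard theory proper: the `Context` class, the activation numbers, and the two route names are all hypothetical. It only shows how a planner that runs while the sweet-shard is weakly activated might choose a future context in which that shard never fully activates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Hypothetical mental context: only tracks whether sweets are in view."""
    sweets_visible: bool


def sweet_shard_activation(ctx: Context) -> float:
    # Assumed numbers: the sweet-shard fires strongly only when sweets are seen.
    return 0.9 if ctx.sweets_visible else 0.1


def health_shard_activation(ctx: Context) -> float:
    # Assumed to be roughly context-independent in this toy model.
    return 0.5


def buys_sweets(ctx: Context) -> bool:
    """The more strongly activated shard wins the decision in that context."""
    return sweet_shard_activation(ctx) > health_shard_activation(ctx)


# Hypothetical routes through the store and the context each one leads to.
ROUTES = {
    "through_sweets_aisle": Context(sweets_visible=True),
    "around_sweets_aisle": Context(sweets_visible=False),
}


def plan_route(current_ctx: Context) -> str:
    """Pick a route from the current context.

    If the health-shard currently dominates, it steers towards a route whose
    predicted context does not end in buying sweets.
    """
    if health_shard_activation(current_ctx) > sweet_shard_activation(current_ctx):
        safe_routes = [name for name, ctx in ROUTES.items() if not buys_sweets(ctx)]
        if safe_routes:
            return safe_routes[0]
    return "through_sweets_aisle"


if __name__ == "__main__":
    entrance = Context(sweets_visible=False)
    print(plan_route(entrance))
```

In this sketch the same person would buy sweets if they stood in the aisle, yet plans to avoid the aisle while standing at the entrance. A single fixed utility function over "sweets consumed" and "health" has no analogous reason to care about the route itself.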
Similarly,
In such situations, people tend to choose $500 in (A) but $505 in (B), which is inconsistent with exponentially-discounted-utility models of the value of money. To explain this observed behavioral regularity using shard theory, consider the historical reinforcement contexts around immediate and delayed gratification. If contexts involving short-term opportunities activate different shards than contexts involving long-term opportunities, then it’s unsurprising that a person might choose $500 in (A) but $505 in (B).[12] (Of course, a full shard theory explanation must explain why those contexts activate different shards. We strongly intuit that there’s a good explanation, but do not think we have a satisfying story here yet.)
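The inconsistency with exponential discounting can be checked with a little arithmetic. The sketch below is only a descriptive illustration, not a claim about what the brain computes. The delays are assumptions, since the exact scenarios are not restated here: (A) is taken to be $500 today versus $505 tomorrow, and (B) $500 in 365 days versus $505 in 366 days. The discount parameters (`daily_rate`, `k`) are likewise made up.

```python
import math


def exponential_value(amount: float, delay_days: float, daily_rate: float = 0.02) -> float:
    """Present value under exponential discounting (assumed rate)."""
    return amount * math.exp(-daily_rate * delay_days)


def hyperbolic_value(amount: float, delay_days: float, k: float = 0.02) -> float:
    """Present value under hyperbolic discounting (assumed k)."""
    return amount / (1.0 + k * delay_days)


def choose(value_fn, sooner, later):
    """Return the (amount, delay_days) option with the higher discounted value."""
    return max((sooner, later), key=lambda option: value_fn(*option))


# Assumed scenarios: each is (sooner option, later option).
SCENARIO_A = ((500, 0), (505, 1))
SCENARIO_B = ((500, 365), (505, 366))

if __name__ == "__main__":
    for name, value_fn in [("exponential", exponential_value),
                           ("hyperbolic", hyperbolic_value)]:
        pick_a = choose(value_fn, *SCENARIO_A)
        pick_b = choose(value_fn, *SCENARIO_B)
        print(name, "A:", pick_a, "B:", pick_b)
```

Under exponential discounting the ratio between the two options' values depends only on the one-day gap, so the agent makes the same choice in both scenarios. The hyperbolic curve reproduces the observed reversal, but, as the footnote stresses, fitting such a curve to behavior does not show that any explicit discounting is represented in the mind.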
Framing effect
This is another bias that’s downstream of shards activating contextually. Asking the same question in different contexts can change which value-shards activate, and thus change how people answer the question. Consider also: People are hesitant to drink from a cup labeled “poison”, even if they themselves were the one to put the label there.
Other factors driving biases
There are many different reasons why someone might act in a biased manner. We’ve described some shard theory explanations for the listed biases. These explanations are not exhaustive. While writing this, we found an experiment with results that seem contrary to the shard theory explanations of sunk cost. Namely, experiment 4 (specifically, the uncorrelated condition) in this study on sunk cost in pigeons.
However, the cognitive biases literature is so large and heterogeneous that there probably isn’t any theory which cleanly explains all reported experimental outcomes. We think that shard theory has decently broad explanatory power for many aspects of human values and biases, even though not all observations fit neatly into the shard theory frame. (Alternatively, we might have done the shard theory analysis wrong for experiment 4.)
Why people can't enumerate all their values
Shards being contextual also helps explain why we can’t specify our full values. We can describe a moral theory that seems to capture our values in a given mental context, but it’s usually easy to find some counterexample to such a theory—some context or situation where the specified theory prescribes absurd behavior.
If shards implement your values, and shards activate situationally, your values will also be situational. Once you move away from the mental context / situation in which you came up with the moral theory, you might activate shards that the theory fails to capture. We think that this is why the static utility function framing is hard for humans to operate.
E.g., the classical utilitarian maxim to maximize joy might initially seem appealing, but it doesn’t take long to generate a new mental context which activates shards that value emotions other than joy, or shards that value things in physical reality beyond your own mental state.
You might generate such new mental contexts by directly searching for shards that bid against pure joy maximization, or by searching for hypothetical scenarios which activate such shards ("finding a counterexample", in the language of moral philosophy). However, there is no clean way to query all possible shards, and we can’t enumerate every possible context in which shards could activate. It's thus very difficult to precisely quantify all of our values, or to create an explicit utility function that describes our values.
Content we aren’t (yet) discussing
The story we’ve presented here skips over important parts of human value formation. E.g., humans can do moral philosophy and refactor their deliberative moral framework without necessarily encountering any externally-activated reinforcement events, and humans also learn values through processes like cultural osmosis or imitation of other humans. Additionally, we haven’t addressed learned reinforcers (where a correlate of reinforcement events eventually becomes reinforcing in and of itself). We’ve also avoided most discussion of shard theory’s AI alignment implications.
This post explains our basic picture of shard formation in humans. We will address deeper shard theory-related questions in later posts.
Conclusion
Working from three reasonable assumptions about how the brain works, shard theory implies that human values (e.g. caring about siblings) are implemented by contextually activated circuits which activate in situations downstream of past reinforcement (e.g. when physically around siblings) so as to steer decision-making towards the objects of past reinforcement (e.g. making plans to spend more time together). According to shard theory, human values may be complex, but much of human value formation is simple.
For shard theory discussion, join our Discord server. Charles Foster wrote Appendix A.3. We thank David Udell, Peter Barnett, Raymond Arnold, Garrett Baker, Steve Byrnes, and Thomas Kwa for feedback on this finalized post. Many more people provided feedback on an earlier version.
Appendices
A.1 The formation of the world model
Most of our values seem to be about the real world. Mechanistically, we think that this means that they are functions of the state of our world model. We therefore infer that human values do not form durably or in earnest until after the human has learned a proto-world model. Since the world model is learned from scratch (by assumption 1 in section I), the world model takes time to develop. In particular, we infer that babies don’t have any recognizable “values” to speak of.
Therefore, to understand why human values empirically coalesce around the world model, we will sketch a detailed picture of how the world model might form. We think that self-supervised learning (item 2 in section I) produces your world model.
Due to learning from scratch, the fancy and interesting parts of your brain start off mostly useless. Here’s a speculative[13] story about how a baby learns to reduce predictive loss, in the process building a world model:
In this story, the world model is built from the self-supervised loss signal. Reinforcement probably also guides and focuses attention. For example, perhaps brainstem-hardcoded (but crude) face detectors hook into a reward circuit which focuses the learning on human faces.
A.2 Terminology
Shards are not full subagents
In our conception, shards vary in their sophistication (e.g. IF-THEN reflexes vs planning-capable, reflective shards which query the world model in order to steer the future in a certain direction) and generality of activating contexts (e.g. only activates when hungry and a lollipop is in the middle of the visual field vs activates whenever you're thinking about a person). However, we think that shards are not discrete subagents with their own world models and mental workspaces. We currently estimate that most shards are "optimizers" to the extent that a bacterium or a thermostat is an optimizer.
“Values”
We defined[16] “values” as “contextual influences on decision-making.” We think that “valuing someone’s friendship” is what it feels like from the inside to be an algorithm with a contextually activated decision-making influence which increases the probability of e.g. deciding to hang out with that friend. Here are three extra considerations and clarifications.
Type-correctness. We think that our definition is deeply appropriate in certain ways. Just because you value eating donuts, doesn’t mean you want to retain that pro-donut influence on your decision-making. This is what it means to reflectively endorse a value shard—that the shards which reason about your shard composition, bid for the donut-shard to stick around. By the same logic, it makes total sense to want your values to change over time—the “reflective” parts of you want the shard composition in the future to be different from the present composition. (For example, many arachnophobes probably want to drop their fear of spiders.) Rather than humans being “weird” for wanting their values to change over time, we think it’s probably the default for smart agents meeting our learning-process assumptions (section I).
Furthermore, your values do not reflect a reflectively endorsed utility function. First off, those are different types of objects. Values bid for and against options, while a utility function grades options. Second, your values vary contextually, while any such utility function would be constant across contexts. More on these points later, in more advanced shard theory posts.
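The "different types of objects" point can be written down as type signatures. The following is a minimal sketch under our own assumptions: the names `Shard`, `UtilityFunction`, `choose_by_bids`, the two contexts, and all numbers are hypothetical, and summing bids is just one simple stand-in for whatever integrative process actually occurs.

```python
from typing import Callable, List

Context = str
Option = str

# A utility function grades options, with no dependence on context.
UtilityFunction = Callable[[Option], float]

# A shard bids for or against an option, and the bid depends on context.
Shard = Callable[[Context, Option], float]


def utility(option: Option) -> float:
    """Toy context-free grading of options (assumed numbers)."""
    return {"eat_donut": 1.0, "skip_donut": 2.0}[option]


def donut_shard(context: Context, option: Option) -> float:
    """Bids for eating the donut, but only when a donut is in view."""
    if context == "donut_in_view" and option == "eat_donut":
        return 3.0
    return 0.0


def health_shard(context: Context, option: Option) -> float:
    """Bids for skipping the donut in every context of this toy model."""
    return 2.0 if option == "skip_donut" else 0.0


def choose_by_utility(u: UtilityFunction, options: List[Option]) -> Option:
    return max(options, key=u)


def choose_by_bids(shards: List[Shard], context: Context, options: List[Option]) -> Option:
    """Pick the option with the largest total bid in the given context."""
    return max(options, key=lambda o: sum(shard(context, o) for shard in shards))


if __name__ == "__main__":
    options = ["eat_donut", "skip_donut"]
    shards = [donut_shard, health_shard]
    for ctx in ["donut_in_view", "at_desk"]:
        print(ctx, choose_by_bids(shards, ctx, options), choose_by_utility(utility, options))
```

The utility-based chooser returns the same answer in every context by construction, while the bid-based chooser flips when the context changes. That contrast is all the sketch is meant to show.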
Different shard compositions can produce similar urges. If you feel an urge to approach nearby donuts, that indicates a range of possibilities:
So, just because you feel an urge to eat the donut, doesn’t necessarily mean you have a donut shard or that you “value” donuts under our definition. (But you probably do.)
Shards are just collections of subshards. One subshard of your family-shard might steer towards futures where your family is happy, while another subshard may influence decisions so that your mother is proud of you. On my (TurnTrout’s) current understanding, “family shard” is just an abstraction of a set of heterogeneous subshards which are downstream of similar historical reinforcement events (e.g. related to spending time with your family). By and large, subshards of the same shard do not all steer towards the same kind of future.
“Shard Theory”
Over the last several months, many people have read a draft version of this document, read Alignment Forum comments by shard theory researchers, or otherwise heard about “shard theory” in some form. However, in the absence of a canonical public document explaining the ideas and defining terms, “shard theory” has become overloaded. Here, then, are several definitions.
A.3 Evidence for neuroscience assumptions
In section I, we stated that shard theory makes three key neuroscientific assumptions. Below we restate those assumptions, and give pointers to what we believe to be representative evidence from the psychology & neuroscience literature:
More precisely, we adopt Steve Byrnes’ stronger conjecture that the telencephalon and cerebellum are locally ~randomly initialized.
There are non-synaptic ways to transmit information in the brain, including ephaptic transmission, gap junctions, and volume transmission. We also consider these to be part of a circuit’s mental context.
We take an agnostic stance on the form of RL in the brain, both because we have trouble spelling out exact neurally plausible base credit assignment and reinforcement learning algorithms, and so that the analysis does not make additional assumptions.
In psychology, “shaping” roughly refers to this process of learning increasingly sophisticated heuristics.
Shards activate more strongly in historical reinforcement contexts, according to our RL intuitions, introspective experience, and inference from observed human behavior. We have some abstract theoretical arguments that RL should work this way in the brain, but won't include them in this post.
We think human planning is less like Monte-Carlo Tree Search and more like greedy heuristic search. The heuristic is computed in large part by the outputs of the value shards, which themselves receive input from the world model about the consequences of the plan stub.
For example, turning back and forth while hungry might produce continual slight negative reinforcement events, at which point good credit assignment blames and downweights the micro-incoherences.
We think that “hedonic” shards of value can indeed form, and this would be part of why people seem to intrinsically value “rewarding” experiences. However, we note two points. 1) In this specific situation, the juice-shard forms around real-life juice. 2) We think that even self-proclaimed hedonists have some substantial values which are reality-based instead of reward-based.
We looked for a citation but couldn’t find one quickly.
We think the actual historical hanging-out-with-friend reinforcement events transpire differently. We may write more about this in future essays.
“It’s easier to kill a distant and unseen victim” seems common-sensically true, but we couldn’t actually find citations. Therefore, we are flagging this as possibly wrong folk wisdom. We would be surprised if it were wrong.
Shard theory reasoning says that while humans might be well-described as “hyperbolic discounters”, the real mechanistic explanation is importantly different. People may well not be doing any explicitly represented discounting; instead, discounting may only convergently arise as a superficial regularity! This presents an obstacle to alignment schemes aiming to infer human preferences by assuming that people are actually discounting.
We made this timeline up. We expect that we got many details wrong for a typical timeline, but the point is not the exact order. The point is to outline the kind of process by which the world model might arise only from self-supervised learning.
For simplicity, we start the analysis at birth. There is probably embryonic self-supervised learning as well. We don’t think it matters for this section.
Interesting but presently unimportant: My (TurnTrout’s) current guess is that given certain hard-coded wiring (e.g. where the optic nerve projects), the functional areas of the brain comprise the robust, convergent solution to: How should the brain organize cognitive labor to minimize the large metabolic costs of information transport (and, later, decision-making latency)? This explains why learning a new language produces a new Broca’s area close to the original, and it explains why rewiring ferrets’ retinal projections into the auditory cortex seems to grow a visual cortex there instead. (jacob_cannell posited a similar explanation in 2015.)
The actual function of each functional area is overdetermined by the convergent usefulness of e.g. visual processing or language processing. Convergence builds upon convergence to produce reliable but slightly-varied specialization of cognitive labor across people’s brains. That is, people learn edge detectors because they’re useful, and people’s brains put them in V1 in order to minimize the costs of transferring information.
Furthermore, this process compounds upon itself. Initially there were weak functional convergences, and then mutations finetuned regional learning hyperparameters and connectome topology to better suit those weak functional convergences, and then the convergences sharpened, and so on. We later found that Voss et al.’s Branch Specialization made a similar conjecture about the functional areas.
I (TurnTrout) don’t know whether philosophers have already considered this definition (nor do I think that’s important to our arguments here). A few minutes of searching didn’t return any such definition, but please let me know if it already exists!