Generated as part of SERI MATS, Team Shard's research, under John Wentworth.
Many thanks to Quintin Pope, Alex Turner, Charles Foster, Steve Byrnes, and Logan Smith for feedback, and to everyone else I've discussed this with recently! All mistakes are my own.
Shard theory is a research program aimed at explaining the systematic relationships between the reinforcement schedules and learned values of reinforcement-learning agents. It consists of a basic ontology of reinforcement learners, their internal computations, and their relationship to their environment. It makes several predictions about a range of RL systems, both RL models and humans. Indeed, shard theory can be thought of as simply applying the modern ML lens to the question of value learning under reinforcement in artificial and natural neural networks!
Some of shard theory's confident predictions can be tested immediately in modern RL agents. Less confident predictions about i.i.d.-trained language models can also be tested now. Shard theory also has numerous retrodictions about human psychological phenomena that are otherwise mysterious from the viewpoint of expected-utility (EU) maximization alone, with no further substantive mechanistic account of human learned values. Finally, shard theory fails some retrodictions in humans; on further inspection, these lingering confusions might well falsify the theory.
If shard theory captures the essential dynamic relating reinforcement schedules and learned values, then we'll be able to carry out a steady stream of further experiments yielding a lot of information about how to reliably instill more of the values we want in our RL agents and fewer of those we don't. Shard theory's framework implies that alignment success is substantially continuous, and that even very limited alignment successes can still mean enormous quantities of value preserved for future humanity's ends. If shard theory is true, then further shard science will progressively yield better and better alignment results.
The remainder of this post will be an overview of the basic claims of shard theory. Future posts will detail experiments and preregister predictions, and look at the balance of existing evidence for and against shard theory from humans.
Reinforcement Strengthens Select Computations
A reinforcement learner is an ML model trained via a reinforcement schedule, a pairing of world states and reinforcement events. We-the-devs choose when to dole out these reinforcement events, either handwriting a simple reinforcement algorithm to do it for us or having human overseers give out reinforcement. The reinforcement learner itself is a neural network whose computations are reinforced or anti-reinforced by reinforcement events. After sufficient training, reinforcement often manages to reinforce those computations that are good at our desired task. We'll henceforth focus on deep reinforcement learners: RL models specifically composed of multi-layered neural networks.
Deep RL can be seen as a supercategory of many deep learning tasks. In deep RL, the model you're training receives feedback, and this feedback fixes how the model is updated afterwards via SGD. In many RL setups, because the model's outputs influence its future observations, the model exercises some control over what it will see and be updated on in the future. RL wherein the model's outputs don't affect the distribution of its future observations is called supervised learning. So a general theory of deep RL models may well have implications for supervised learning models too. This is important for the experimental tractability of a theory of RL agents, as appreciably complicated RL setups are a huge pain in the ass to get working, while supervised learning is well established and far less finicky.
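To make the distinction concrete, here is a minimal toy sketch. The two-state environment, the function names, and the rule that the action simply becomes the next observation are all illustrative assumptions, not part of shard theory itself.

```python
import random
from typing import Callable, List

Policy = Callable[[int], int]  # maps an observation (0 or 1) to an output (0 or 1)


def supervised_stream(policy: Policy, steps: int, seed: int) -> List[int]:
    """Observations are drawn i.i.d.; the model's outputs never influence them."""
    rng = random.Random(seed)
    observations = []
    for _ in range(steps):
        obs = rng.randrange(2)  # the data source ignores the model entirely
        policy(obs)             # the output is computed but changes nothing downstream
        observations.append(obs)
    return observations


def rl_stream(policy: Policy, steps: int) -> List[int]:
    """Hypothetical dynamics: the model's output becomes its next observation."""
    observations = []
    obs = 0
    for _ in range(steps):
        observations.append(obs)
        obs = policy(obs)       # the model steers what it will see next
    return observations


def always_one(obs: int) -> int:
    return 1


def always_zero(obs: int) -> int:
    return 0
```

In this sketch, `supervised_stream` returns the same observations for `always_one` and `always_zero` given the same seed, while `rl_stream` returns different observation histories for the two policies. That dependence of future training data on past outputs is the feature the paragraph above singles out.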
Humans are the only extant example of generally intelligent RL agents. Your subcortex contains your hardwired reinforcement circuitry, while your neocortex comprises much of your trained RL model. So you can coarsely model the neocortex as an RL agent being fed observations by the external world and reinforcement events by subcortical circuitry, and ask about what learned values this human develops as you vary those two parameters. Shard theory boils down to using the ML lens to understand all intelligent deep systems in this way, and using this ML lens to build up a mechanistic model of value learning.
As a newborn baby, you start life with subcortical hardwired reinforcement circuitry (fixed by your genome) and a randomly initialized neocortex. The computations initialized in your neocortex are random, and so the actions initially output by your neocortex are random. (Your brainstem additionally hardcodes some rote reflexes and simple automatic body functions, though, accounting for babies' innate behaviors.) Eventually, your baby-self manages to get a lollipop onto your tongue, and the sugar molecules touching your tastebuds fire your hardwired reinforcement circuitry. Via a primitive, hardwired credit assignment algorithm -- say, single out whatever computations are different from the computations active a moment ago -- your reinforcement circuitry gives all those singled-out computations more staying power. This means that these select computations will henceforth be more likely to fire, conditional on their initiating cognitive inputs being present, executing their computation and returning a motor sequence. If this path to a reinforcement event wasn't a fluke, those contextually activated computations will go on to accrue yet more staying power by steering into more future reinforcement events. Your bright-red-disk-in-the-central-visual-field-activated computations will activate again when bright red lollipops are clearly visible, and will plausibly succeed at getting future visible lollipops to your tastebuds.
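The primitive credit assignment rule described above ("single out whatever computations are different from the computations active a moment ago") can be sketched in a few lines. Representing computations as named strings, staying power as a number, and the `bonus` parameter are all simplifying assumptions made for illustration.

```python
from typing import Dict, Set


def reinforce_new_computations(
    staying_power: Dict[str, float],
    active_now: Set[str],
    active_a_moment_ago: Set[str],
    bonus: float = 1.0,  # hypothetical size of one reinforcement event
) -> Dict[str, float]:
    """Give more staying power to computations that just became active."""
    singled_out = active_now - active_a_moment_ago  # set difference: newly active only
    updated = dict(staying_power)                   # leave the input unmodified
    for name in singled_out:
        updated[name] = updated.get(name, 0.0) + bonus
    return updated


# Toy reinforcement event: sugar hits the tongue right after a new computation fires.
before = {"breathe": 0.0, "reach_for_red_disk": 0.0}
after = reinforce_new_computations(
    before,
    active_now={"breathe", "reach_for_red_disk"},
    active_a_moment_ago={"breathe"},
)
```

Here only `reach_for_red_disk` gains staying power, because `breathe` was already running a moment ago and so is not singled out.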
Contextually activated computations can chain with one another, becoming responsive to a wider range of cognitive inputs in the process: if randomly crying at the top of your lungs gets a bright-red disk close enough to activate your bright-red-disk-sensitive computation, then credit assignment will reinforce both the contextual crying and contextual eating computations. These contextually activated computations that steer behavior are called shards in shard theory. A simple shard, like the reach-for-visible-red-disks circuit, is a subshard. Typical shards are chained aggregations of many subshards, resulting in a sophisticated, contextually activated, behavior-steering circuit in a reinforcement learner.
Shards are subcircuits of deep neural networks, and so can potentially run sophisticated feature detection. Whatever cognitive inputs a shard activates for, it will have to have feature detection for -- you simply can't have a shard sensitive to an alien concept you don't represent anywhere in your neural net. Because shards are subcircuits in a large neural network, it's possible for them to be hooked up into each other and share feature detectors, or to be informationally isolated from each other. To whatever extent your shards' feature detectors are all shared, you will have a single world-model that acts as input into all shards. To whatever extent your shards keep their feature detectors to themselves, they'll have their own ontology that only guides your behavior after that shard has been activated.
Subcortical reinforcement circuits, though, hail from a distinct informational world. Your hardwired reinforcement circuits don't do any sophisticated feature detection (at most picking up on simple regular patterns in retinal stimulation and the like), and so have to reinforce computations "blindly," relying only on simple sensory proxies.
Finally, for the most intelligent RL systems, some kind of additional self-supervised training loop will have to be run along with RL. Reinforcement alone is just too sparse a signal to train a randomly initialized model up to significant capabilities in an appreciably complex environment. For a human, this might look something like trying to predict what's in your visual periphery before focusing your vision on it, sometimes suffering perceptual surprise. Plausibly, some kind of self-supervised loop like this is training all of the computations in the brain, testing them against cached ground truths. This additional source of feedback from self-supervision will both make RL models more capable than they would otherwise be and alter inter-shard dynamics (as we'll briefly discuss later).
When You're Dumb, Continuously Blended Tasks Mean Continuous, Broadening Values
Early in life, when your shards only activate in select cognitive contexts and you are thus largely wandering blindly into reinforcement events, only shards that your reinforcement circuits can pinpoint can be cemented. Because your subcortical reward circuitry was hardwired by your genome, it's going to be quite bad at accurately assigning credit to shards. Here's an example of an algorithm your reinforcement circuitry could plausibly be implementing: reinforce all the diffs of all the computations running over the last 30 seconds, minus the computations running just before that. This algorithm is sloppy, but is also tractable for the primeval subcortex. In contrast, finding lollipops out in the real world involves a lot of computational work. As tasks are distributed in the world in extremely complex patterns and are always found blended together, again in a variety of setups, shards are going to have to cope with a continuously shifting flux of cognitive inputs that vary with the environment. When your baby self wanders into a real-world lollipop, many computations will have been active in steering behavior in that direction over the past 30 seconds, so credit assignment will reinforce all of them. The more you train the baby, the wider a range of proxies he internalizes via this reinforcement algorithm. In the face of enough reinforcement events, every representable proxy for the reinforcement event that marginally garners some additional reinforcement will come to hold some staying power. And because of this, simply getting the baby to internalize a particular target proxy at all isn't that hard -- just make sure that that target proxy further contributes to reinforcement in-distribution.
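The example algorithm above (reinforce everything running over the last 30 seconds, minus what was running just before that) might look like the following sketch. The timestamped activity log, the computation names, and the choice of a 30-second baseline for "just before that" are assumptions made for illustration.

```python
from typing import List, Set, Tuple


def windowed_credit(
    activity_log: List[Tuple[float, str]],  # (time in seconds, computation name)
    reward_time: float,
    window: float = 30.0,    # "the last 30 seconds"
    baseline: float = 30.0,  # assumed length of the "just before that" period
) -> Set[str]:
    """Return the computations this sloppy rule would reinforce."""
    window_start = reward_time - window
    recent = {name for t, name in activity_log if window_start <= t <= reward_time}
    earlier = {
        name
        for t, name in activity_log
        if window_start - baseline <= t < window_start
    }
    return recent - earlier  # the diff: active recently, but not just before


log = [
    (0.0, "breathe"),
    (35.0, "breathe"),
    (40.0, "crawl_toward_red_disk"),
    (55.0, "kick_left_leg"),
    (58.0, "grasp_red_disk"),
]
credited = windowed_credit(log, reward_time=60.0)
```

In this toy log, `credited` contains `crawl_toward_red_disk`, `grasp_red_disk`, and also the incidental `kick_left_leg`, while the always-running `breathe` is subtracted out. The rule is cheap but imprecise, which is the point of the paragraph above.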
Many computations that were active while reinforcement was distributed will be random jitters. So the splash damage from hardcoded credit assignment will reinforce these jitter-inducing computations as well. But because jitters aren't decent proxies for reinforcement even in distribution, they won't be steadily reinforced and will just as plausibly steer into anti-reinforcement events. What jitters do accrete will look more like conditionally activated rote tics than widely activated shards with numerous subroutines, because the jitter computations didn't backchain reinforcement reliably enough to accrete a surrounding body of subshards. Shard theory thus predicts that dumb RL agents internalize lots of representable-by-them in-distribution proxies for reinforcement as shards, as a straightforward consequence of reinforcement events being gated behind a complex conditional distribution of task blends.
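A small simulation can illustrate why jitters fail to accumulate staying power while decent proxies do. The probabilities below (a proxy steering into reinforcement 90% of the time, a jitter doing so 50% of the time) and the plus-or-minus-one update are assumed numbers chosen for illustration, not empirical estimates.

```python
import random
from typing import Dict


def simulate_staying_power(
    n_events: int,
    p_reinforced: Dict[str, float],  # chance each computation's firing is reinforced
    seed: int = 0,
) -> Dict[str, int]:
    """Each firing is either reinforced (+1) or anti-reinforced (-1)."""
    rng = random.Random(seed)
    strength = {name: 0 for name in p_reinforced}
    for _ in range(n_events):
        for name, p in p_reinforced.items():
            strength[name] += 1 if rng.random() < p else -1
    return strength


result = simulate_staying_power(
    2000,
    {"reach_for_red_disk": 0.9, "random_jitter": 0.5},
)
```

Under these assumptions the proxy computation's strength grows roughly linearly with the number of events, while the jitter's strength performs an unbiased random walk around zero, so it never accretes the steady reinforcement that the proxy does.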
When You're Smart, Internal Game Theory Explains the Tapestry of Your Values
Subshards and smaller aggregate shards are potentially quite stupid. At a minimum, a shard is just a circuit that triggers given a particular conceptually chunked input, and outputs a rote behavioral sequence sufficient to garner more reinforcement. This circuit will not be well modeled as an intelligent planner; instead, it's perfectly adequate to think of it as just an observationally activated behavioral sequence. But as all your shards collectively comprise all of (or most of?) your neocortex, large shards can get quite smart.
Shards are all stuck inside of a single skull with one another, and only have (1) their interconnections with each other, (2) your motor outputs, and (3) your self-supervised training loop with which to causally influence anything. Game theoretically, your large intelligent shards can be fruitfully modeled as playing a negotiation game together: shards can interact with each other via their few output channels, and interactions all blend both zero-sum conflict and pure coordination. Agentic shards will completely route around smaller, non-agentic shards if they have conflicting ends. The interactions played out between your agentic shards then generate a complicated panoply of behaviors.
By the time our baby has grown up, he will have accreted larger shards equipped with richer world-models, activated by a wider range of cognitive inputs, specifying more and more complex behaviors. Where his shards once just passively dealt with the consequences effected by other shards via their shared motor output channel, they are now intelligent enough to scheme at the other shards. Say that you're considering whether to go off to big-law school, and are concerned about that environment exacerbating the egoistic streak you see and dislike in yourself. You don't want to grow up to be more of an egotist, so you turn down your top-ranked big-law-school offer, even though the compensation from practicing prestigious big-shot law would further your other goals. On the (unreconstructed) standard agent model, this behavior is mysterious. Your utility function is fixed, no? Money is instrumentally useful; jobs and education are just paths through state space; your terminal values are almost orthogonal to your merely instrumental choice of career. On shard theory, though, this phenomenon of guarding against value drift isn't at all mysterious. Your egotistical shard would be steered into many reinforcement events were you to go off to the biggest of big-law schools, so your remaining shards use their collective steering control to avoid going down that path now, while they still have a veto. Similarly, despite knowing that heroin massively activates reinforcement circuitry, not all that many people do heroin all the time. What's going on is that people reason now about what would happen after massively reinforcing a druggie shard, and see that their other values would not be serviced at all in a post-heroin world. They reason that they should carefully avoid that reinforcement event. On the view that reinforcement is the optimization target of trained reinforcement learners, this is inexplicable; on shard theory, it's straightforward internal game theory.
Shards shouldn't be thought of as an alternative to utility functions, but as what utility functions look like for bounded trained agents. Your "utility function" (an ordering over possible worlds, subject to some consistency conditions) is far too big for your brain to represent. But a utility function can be lossily projected down into a bounded computational object by factoring it into a few shards, each representing a term in the utility function, each term conceptually chunked out of perceptual input. In the limit of perfect negotiation between your constituent shards, what your shards collectively pursue would (boundedly) resemble blended utility-function maximization! At lesser levels of negotiation competence between your shards, you'd observe many of the pathologies we see in human behavior. You might see agents who, e.g., flip back and forth between binge drinking and carefully avoiding the bar. Shard theory might explain this as a coalition of shards keeping an alcoholic shard in check by staying away from alcohol-related conceptual inputs, but the alcoholic shard being activated and taking over once an alcohol-related cognitive input materializes.
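One hedged way to picture the binge-drinking example is a toy model in which each shard scores actions only to the extent that its context is present, and behavior follows the blended score. The `Shard` class, the weights, the activation functions, and the action names are all hypothetical; this is a sketch of the idea, not shard theory's formalism.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, float]


@dataclass
class Shard:
    name: str
    weight: float                            # assumed share of steering control
    activation: Callable[[Context], float]   # how strongly the context triggers it (0 to 1)
    value: Callable[[str], float]            # how much the shard likes each action


def blended_choice(shards: List[Shard], context: Context, actions: List[str]) -> str:
    """Pick the action with the highest activation-weighted blended score."""
    def score(action: str) -> float:
        return sum(s.weight * s.activation(context) * s.value(action) for s in shards)
    return max(actions, key=score)


health = Shard(
    name="health_coalition",
    weight=1.0,
    activation=lambda ctx: 1.0,  # always somewhat active
    value=lambda a: {"drink": -1.0, "abstain": 1.0}[a],
)
alcohol = Shard(
    name="alcoholic_shard",
    weight=3.0,
    activation=lambda ctx: ctx.get("alcohol_cue", 0.0),  # silent without a cue
    value=lambda a: {"drink": 1.0, "abstain": 0.0}[a],
)

actions = ["drink", "abstain"]
away_from_bar = blended_choice([health, alcohol], {"alcohol_cue": 0.0}, actions)
at_the_bar = blended_choice([health, alcohol], {"alcohol_cue": 1.0}, actions)
```

With these made-up numbers the agent abstains when no alcohol cue is present and drinks once the cue appears, so the coalition's best available strategy is to keep the cue from materializing at all. If every shard were active in every context with fixed weights, the same scoring rule would reduce to maximizing a single weighted sum, which is the sense in which well-negotiated shards would resemble a blended utility function in this sketch.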
Shard theory's account of internal game theory also supplies a theory of value reflection or moral philosophizing. When shards are relatively good at negotiating outcomes with one another, one thing that they might do is to try to find a common policy that they all consistently follow whenever they are the currently activated shards. This common policy will have to be satisfactory to all the capable shards in a person, inside the contexts in which each of those shards is defined. But what the policy does outside of those contexts is completely undetermined. So the shards will hunt for a single common moral rule that gets them each what they want inside their domains, to harvest gains from trade; what happens off of all of their domains is undetermined and unimportant. This looks an awful lot like testing various moral philosophies against your various moral intuitions, and trying (so far, in vain) to find a moral philosophy that behaves exactly as your various intuitions ask in all the cases where your intuitions have something to say.
There are some human phenomena that shard theory doesn't have a tidy story about. The largest is probably the apparent phenomenon of credit assignment improving over a lifetime. When you're older and wiser, you're better at noticing which of your past actions were bad and learning from your mistakes. Possibly, this happens a long time after the fact, without any anti-reinforcement event occurring. But an improved conceptual understanding ought to be inaccessible to your subcortical reinforcement circuitry -- on shard theory, being wiser shouldn't mean your shards are reinforced or anti-reinforced any differently.
One thing that might be going on here is that your shards are better at loading and keeping chosen training data in your self-supervised learning loop buffer, and so steadily reinforcing or anti-reinforcing themselves or their enemy shards, respectively. This might look like trying not to think certain thoughts, so those thoughts can't be rewarded for accurately forecasting your observations. But this is underexplained, and shard theory in general doesn't have a good account of credit assignment improving in-lifetime.
Relevance to Alignment Success
The relevance to alignment is that (1) if shard theory is true, meaningful partial alignment successes are possible, and (2) we have a theoretical road to follow to steadily better alignment successes in RL agents. If we can get RL agents to internalize some human-value shards, alongside a lot of other random alien nonsense shards, then those human-value shards will be our representatives on the inside, and will intelligently bargain for what we care about, after all the shards in the RL agent get smarter. Even if the human shards only win a small fraction of the blended utility function, a small fraction of our lightcone is quite a lot. And we can improve our expected fraction by studying the systematic relationships between reinforcement schedules and learned values, in both present RL systems and humans.
Shard theory is a research program: it's a proposed basic ontology of how agents made of neural networks work, most especially those agents with path-dependent control over the reinforcement events they later steer into. Shard theory aims to be a comprehensive theory of which values neural networks learn conditional on different reinforcement schedules. All that shard science is yet to be done, and, on priors, shard theory's attempt is probably not going to pan out. But the relationship between reinforcement schedules and learned values is currently relatively unexplored, and competitor theories are relatively unsupported accounts of learned values (e.g., that a single random in-distribution proxy will be the learned value, or that reinforcement is always the optimization target). Shard theory is trying to work out this relationship and then be able to demonstrably predict, ahead of time, specific learned values given reinforcement parameters.