Shard Theory: An Overview

David Udell

Generated as part of SERI MATS, Team Shard's research, under John Wentworth.

Many thanks to Quintin Pope, Alex Turner, Charles Foster, Steve Byrnes, and Logan Smith for feedback, and to everyone else I've discussed this with recently! All mistakes are my own.

*Team Shard*, courtesy of Garrett Baker and DALL-E 2

Introduction

Shard theory is a research program aimed at explaining the systematic relationships between the reinforcement schedules and learned values of reinforcement-learning agents. It consists of a basic ontology of reinforcement learners, their internal computations, and their relationship to their environment. It makes several predictions about a range of RL systems, both RL models and humans. Indeed, shard theory can be thought of as simply applying the modern ML lens to the question of value learning under reinforcement in artificial and natural neural networks!

Some of shard theory's confident predictions can be tested immediately in modern RL agents. Less confident predictions about i.i.d.-trained language models can also be tested now. Shard theory also has numerous retrodictions about human psychological phenomena that are otherwise mysterious from the viewpoint of expected-utility (EU) maximization alone, with no further substantive mechanistic account of human learned values. Finally, shard theory fails some retrodictions in humans; on further inspection, these lingering confusions might well falsify the theory.

If shard theory captures the essential dynamic relating reinforcement schedules and learned values, then we'll be able to carry out a steady stream of further experiments yielding a lot of information about how to reliably instill more of the values we want in our RL agents and fewer of those we don't. Shard theory's framework implies that alignment success is substantially continuous, and that even very limited alignment successes can still mean enormous quantities of value preserved for future humanity's ends. If shard theory is true, then further shard science will progressively yield better and better alignment results.

The remainder of this post will be an overview of the basic claims of shard theory. Future posts will detail experiments and preregister predictions, and look at the balance of existing evidence for and against shard theory from humans.

Reinforcement Strengthens Select Computations

A reinforcement learner is an ML model trained via a reinforcement schedule, a pairing of world states and reinforcement events. We-the-devs choose when to dole out these reinforcement events, either handwriting a simple reinforcement algorithm to do it for us or having human overseers give out reinforcement. The reinforcement learner itself is a neural network whose computations are reinforced or anti-reinforced by reinforcement events. After sufficient training, reinforcement often manages to reinforce those computations that are good at our desired task. We'll henceforth focus on deep reinforcement learners: RL models specifically comprised of multi-layered neural networks.

Deep RL can be seen as a supercategory of many deep learning tasks. In deep RL, the model you're training receives feedback, and this feedback fixes how the model is updated afterwards via SGD. In many RL setups, because the model's outputs influence its future observations, the model exercises some control over what it will see and be updated on in the future. RL wherein the model's outputs don't affect the distribution of its future observations is called supervised learning. So a general theory of deep RL models may well have implications for supervised learning models too. This is important for the experimental tractability of a theory of RL agents, as appreciably complicated RL setups are a huge pain in the ass to get working, while supervised learning is well established and far less finicky.
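
The distinction drawn above, between setups where outputs shape future observations and setups where they don't, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the one-dimensional world, the function names `supervised_stream` and `rl_stream`, and the two fixed policies are hypothetical and not part of shard theory itself.

```python
from typing import Callable, List


def supervised_stream(dataset: List[int], policy: Callable[[int], int]) -> List[int]:
    """Observations come from a fixed dataset; outputs never change what is seen next."""
    seen = []
    for obs in dataset:
        policy(obs)  # the output could be scored, but it does not alter the stream
        seen.append(obs)
    return seen


def rl_stream(start: int, policy: Callable[[int], int], steps: int) -> List[int]:
    """Each output moves the agent, so it determines the next observation."""
    obs = start
    seen = [obs]
    for _ in range(steps):
        action = policy(obs)  # assumed to be -1 or +1
        obs = obs + action    # hypothetical 1-D world: position is the observation
        seen.append(obs)
    return seen


def go_right(obs: int) -> int:
    return 1


def go_left(obs: int) -> int:
    return -1


# Same observations regardless of policy:
print(supervised_stream([3, 1, 4], go_right))  # [3, 1, 4]
print(supervised_stream([3, 1, 4], go_left))   # [3, 1, 4]
# Different policies steer into different observations:
print(rl_stream(0, go_right, 3))  # [0, 1, 2, 3]
print(rl_stream(0, go_left, 3))   # [0, -1, -2, -3]
```

In the second loop the policy has path-dependent control over what it will later be updated on, which is the property the rest of this post leans on.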

Humans are the only extant example of generally intelligent RL agents. Your subcortex contains your hardwired reinforcement circuitry, while your neocortex comprises much of your trained RL model. So you can coarsely model the neocortex as an RL agent being fed observations by the external world and reinforcement events by subcortical circuitry, and ask about what learned values this human develops as you vary those two parameters. Shard theory boils down to using the ML lens to understand all intelligent deep systems in this way, and using this ML lens to build up a mechanistic model of value learning.

As a newborn baby, you start life with subcortical hardwired reinforcement circuitry (fixed by your genome) and a randomly initialized neocortex. The computations initialized in your neocortex are random, and so the actions initially outputted by your neocortex are random. (Your brainstem additionally hardcodes some rote reflexes and simple automatic body functions, though, accounting for babies' innate behaviors.) Eventually, your baby-self manages to get a lollipop onto your tongue, and the sugar molecules touching your tastebuds fire your hardwired reinforcement circuitry. Via a primitive, hardwired credit assignment algorithm -- say, single out whatever computations are different from the computations active a moment ago -- your reinforcement circuitry gives all those singled-out computations more staying power. This means that these select computations will henceforth be more likely to fire, conditional on their initiating cognitive inputs being present, executing their computation and returning a motor sequence. If this path to a reinforcement event wasn't a fluke, those contextually activated computations will go on to accrue yet more staying power by steering into more future reinforcement events. Your bright-red-disk-in-the-central-visual-field-activated computations will activate again when bright red lollipops are clearly visible, and will plausibly succeed at getting future visible lollipops to your tastebuds.

Contextually activated computations can chain with one another, becoming responsive to a wider range of cognitive inputs in the process: if randomly crying at the top of your lungs gets a bright-red disk close enough to activate your bright-red-disk-sensitive computation, then credit assignment will reinforce both the contextual crying and contextual eating computations. These contextually activated computations that steer behavior are called shards in shard theory. A simple shard, like the reach-for-visible-red-disks circuit, is a subshard. Typical shards are chained aggregations of many subshards, resulting in a sophisticated, contextually activated, behavior-steering circuit in a reinforcement learner.

Shards are subcircuits of deep neural networks, and so can potentially run sophisticated feature detection. Whatever cognitive inputs a shard activates for, it will have to have feature detection for -- you simply can't have a shard sensitive to an alien concept you don't represent anywhere in your neural net. Because shards are subcircuits in a large neural network, it's possible for them to be hooked up into each other and share feature detectors, or to be informationally isolated from each other. To whatever extent your shards' feature detectors are all shared, you will have a single world-model that acts as input into all shards. To whatever extent your shards keep their feature detectors to themselves, they'll have their own ontology that only guides your behavior after that shard has been activated.

Subcortical reinforcement circuits, though, hail from a distinct informational world. Your hardwired reinforcement circuits don't do any sophisticated feature detection (at most picking up on simple regular patterns in retinal stimulation and the like), and so have to reinforce computations "blindly," relying only on simple sensory proxies.

Finally, for the most intelligent RL systems, some kind of additional self-supervised training loop will have to be run along with RL. Reinforcement alone is just too sparse a signal to train a randomly initialized model up to significant capabilities in an appreciably complex environment. For a human, this might look something like trying to predict what's in your visual periphery before focusing your vision on it, sometimes suffering perceptual surprise. Plausibly, some kind of self-supervised loop like this is training all of the computations in the brain, testing them against cached ground truths. This additional source of feedback from self-supervision will both make RL models more capable than they would otherwise be and alter inter-shard dynamics (as we'll briefly discuss later).

When You're Dumb, Continuously Blended Tasks Mean Continuous, Broadening Values

Early in life, when your shards only activate in select cognitive contexts and you are thus largely wandering blindly into reinforcement events, only shards that your reinforcement circuits can pinpoint can be cemented. Because your subcortical reward circuitry was hardwired by your genome, it's going to be quite bad at accurately assigning credit to shards. Here's an example of an algorithm your reinforcement circuitry could plausibly be implementing: reinforce all the diffs of all the computations running over the last 30 seconds, minus the computations running just before that. This algorithm is sloppy, but is also tractable for the primeval subcortex. In contrast, finding lollipops out in the real world involves a lot of computational work. As tasks are distributed in the world in extremely complex patterns and are always found blended together, again in a variety of setups, shards are going to have to cope with a continuously shifting flux of cognitive inputs that vary with the environment. When your baby self wanders into a real-world lollipop, many computations will have been active in steering behavior in that direction over the past 30 seconds, so credit assignment will reinforce all of them. The more you train the baby, the wider a range of proxies he internalizes via this reinforcement algorithm. In the face of enough reinforcement events, every representable proxy for the reinforcement event that marginally garners some additional reinforcement will come to hold some staying power. And because of this, simply getting the baby to internalize a particular target proxy at all isn't that hard -- just make sure that that target proxy further contributes to reinforcement in-distribution.
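
To make the sloppy diff-based credit assignment above concrete, here is a minimal sketch. It assumes discrete time steps in place of the 30-second window, represents each "computation" as a string label, and uses a hypothetical function name `diff_credit_assignment`; none of these details come from neuroscience or from the post.

```python
from typing import Dict, List, Set


def diff_credit_assignment(
    trace: List[Set[str]],
    window: int,
    strength: Dict[str, float],
    bonus: float = 1.0,
) -> Dict[str, float]:
    """Reinforce computations active in the last `window` steps but not in the
    `window` steps just before them. Returns an updated copy of `strength`."""
    recent = set().union(*trace[-window:])
    earlier = set().union(*trace[-2 * window:-window])
    updated = dict(strength)
    for computation in recent - earlier:
        updated[computation] = updated.get(computation, 0.0) + bonus
    return updated


# Hypothetical activity trace ending in a reinforcement event (lollipop on tongue).
trace = [
    {"breathe"},
    {"breathe", "look_around"},
    {"breathe", "see_red_disk", "reach"},
    {"breathe", "reach", "arm_jitter"},
]
strength = diff_credit_assignment(trace, window=2, strength={})
print(sorted(strength))  # ['arm_jitter', 'reach', 'see_red_disk']
```

Note that `arm_jitter` gets reinforced alongside the useful computations, while the always-on `breathe` is subtracted out by the diff. That is the kind of splash damage a cheap hardwired rule like this would produce.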

Many computations that were active while reinforcement was distributed will be random jitters. So the splash damage from hardcoded credit assignment will reinforce these jitter-inducing computations as well. But because jitters aren't decent proxies for reinforcement even in distribution, they won't be steadily reinforced and will just as plausibly steer into anti-reinforcement events. What jitters do accrete will look more like conditionally activated rote tics than widely activated shards with numerous subroutines, because the jitter computations didn't backchain reinforcement reliably enough to accrete a surrounding body of subshards. Shard theory thus predicts that dumb RL agents internalize lots of representable-by-them in-distribution proxies for reinforcement as shards, as a straightforward consequence of reinforcement events being gated behind a complex conditional distribution of task blends.

When You're Smart, Internal Game Theory Explains the Tapestry of Your Values

Subshards and smaller aggregate shards are potentially quite stupid. At a minimum, a shard is just a circuit that triggers given a particular conceptually chunked input, and outputs a rote behavioral sequence sufficient to garner more reinforcement. This circuit will not be well modeled as an intelligent planner; instead, it's perfectly adequate to think of it as just an observationally activated behavioral sequence. But as all your shards collectively comprise all of (or most of?) your neocortex, large shards can get quite smart.

Shards are all stuck inside of a single skull with one another, and only have (1) their interconnections with each other, (2) your motor outputs, and (3) your self-supervised training loop with which to causally influence anything. Game theoretically, your large intelligent shards can be fruitfully modeled as playing a negotiation game together: shards can interact with each other via their few output channels, and interactions all blend both zero-sum conflict and pure coordination. Agentic shards will completely route around smaller, non-agentic shards if they have conflicting ends. The interactions played out between your agentic shards then generate a complicated panoply of behaviors.

By the time our baby has grown up, he will have accreted larger shards equipped with richer world-models, activated by a wider range of cognitive inputs, specifying more and more complex behaviors. Where his shards once just passively dealt with the consequences effected by other shards via their shared motor output channel, they are now intelligent enough to scheme at the other shards. Say that you're considering whether to go off to big-law school, and are concerned about that environment exacerbating the egoistic streak you see and dislike in yourself. You don't want to grow up to be more of an egotist, so you choose to avoid going to your top-ranked big-law-school offer, even though the compensation from practicing prestigious big-shot law would further your other goals. On the (unreconstructed) standard agent model, this behavior is mysterious. Your utility function is fixed, no? Money is instrumentally useful; jobs and education are just paths through state space; your terminal values are almost orthogonal to your merely instrumental choice of career. On shard theory, though, this phenomenon of value drift isn't at all mysterious. Your egotistical shard would be steered into many reinforcement events were you to go off to the biggest of big-law schools, so your remaining shards use their collective steering control to avoid going down that path now, while they still have a veto. Similarly, despite knowing that heroin massively activates your reinforcement circuitry, not all that many people do heroin all the time. What's going on is that people reason now about what would happen after massively reinforcing a druggie shard, and see that their other values would not be serviced at all in a post-heroin world. They reason that they should carefully avoid that reinforcement event. On the view that reinforcement is the optimization target of trained reinforcement learners, this is inexplicable; on shard theory, it's straightforward internal game-theory.
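
The veto dynamic in the law-school example can be caricatured in a few lines. This is only a toy model under loud assumptions: shards vote with their current influence, a shard opposes any plan predicted to shrink its share of total influence, and the names `share`, `coalition_accepts`, and all numbers are hypothetical.

```python
from typing import Dict


def share(influence: Dict[str, float], name: str) -> float:
    """Fraction of total steering influence held by one shard."""
    return influence[name] / sum(influence.values())


def coalition_accepts(
    influence: Dict[str, float],
    predicted_reinforcement: Dict[str, float],
) -> bool:
    """Influence-weighted vote: each shard backs the plan only if its own share
    of influence is predicted not to shrink."""
    after = {
        name: power + predicted_reinforcement.get(name, 0.0)
        for name, power in influence.items()
    }
    support = 0.0
    for name, power in influence.items():
        if share(after, name) >= share(influence, name):
            support += power
        else:
            support -= power
    return support > 0


influence = {"egoism": 1.0, "kindness": 2.0, "curiosity": 2.0}
law_school = {"egoism": 5.0}                        # would heavily reinforce one shard
volunteering = {"kindness": 1.0, "curiosity": 1.0}  # reinforces the current majority

print(coalition_accepts(influence, law_school))    # False
print(coalition_accepts(influence, volunteering))  # True
```

The point of the sketch is only that a plan can be rejected now because of whom it would empower later, even when it scores well on some currently held goal.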

Shards shouldn't be thought of as an alternative to utility functions, but as what utility functions look like for bounded trained agents. Your "utility function" (an ordering over possible worlds, subject to some consistency conditions) is far too big for your brain to represent. But a utility function can be lossily projected down into a bounded computational object by factoring it into a few shards, each representing a term in the utility function, each term conceptually chunked out of perceptual input. In the limit of perfect negotiation between your constituent shards, what your shards collectively pursue would (boundedly) resemble blended utility-function maximization! At lesser levels of negotiation competence between your shards, you'd observe many of the pathologies we see in human behavior. You might see agents who, e.g., flip back and forth between binge drinking and carefully avoiding the bar. Shard theory might explain this as a coalition of shards keeping an alcoholic shard in check by staying away from alcohol-related conceptual inputs, but the alcoholic shard being activated and taking over once an alcohol-related cognitive input materializes.
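
One way to picture the "lossy projection" described above is as a sum of contextually gated terms. The sketch below is an illustration under assumptions, not a claim about how brains or networks implement it: the `Shard` dataclass, the `weight` field, and the percept keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Percept = Dict[str, float]


@dataclass
class Shard:
    name: str
    detects: Callable[[Percept], bool]  # feature detector: is this shard's context present?
    term: Callable[[Percept], float]    # this shard's term of the blended utility
    weight: float = 1.0                 # hypothetical bargaining weight


def blended_value(shards: List[Shard], percept: Percept) -> float:
    """Sum the weighted terms of only those shards whose context is detected."""
    return sum(s.weight * s.term(percept) for s in shards if s.detects(percept))


shards = [
    Shard("lollipop", lambda p: p.get("red_disk", 0.0) > 0.5, lambda p: p.get("sugar", 0.0)),
    Shard("friendship", lambda p: "friend_nearby" in p, lambda p: p["friend_nearby"], weight=2.0),
]

print(blended_value(shards, {"red_disk": 1.0, "sugar": 3.0}))                        # 3.0
print(blended_value(shards, {"friend_nearby": 1.0}))                                 # 2.0
print(blended_value(shards, {"red_disk": 1.0, "sugar": 3.0, "friend_nearby": 1.0}))  # 5.0
```

A shard whose detector never fires contributes nothing, which mirrors the binge-drinking example: the relevant term only enters the sum once an alcohol-related input shows up.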

Shard theory's account of internal game theory also supplies a theory of value reflection or moral philosophizing. When shards are relatively good at negotiating outcomes with one another, one thing that they might do is to try to find a common policy that they all consistently follow whenever they are the currently activated shards. This common policy will have to be satisfactory to all the capable shards in a person, inside the contexts in which each of those shards is defined. But what the policy does outside of those contexts is completely undetermined. So the shards will hunt for a single common moral rule that gets them each what they want inside their domains, to harvest gains from trade; what happens outside all of their domains is undetermined and unimportant. This looks an awful lot like testing various moral philosophies against your various moral intuitions, and trying (so far, in vain) to find a moral philosophy that behaves exactly as your various intuitions ask in all the cases where your intuitions have something to say.
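
The hunt for a common policy can be sketched as a brute-force search. The assumptions are strong ones: contexts and actions are small finite sets, each shard's wishes are a plain mapping from the contexts in its domain to the action it demands there, and the function name `common_policies` and the example contexts are hypothetical.

```python
from itertools import product
from typing import Dict, List

Policy = Dict[str, str]  # context -> action


def common_policies(
    contexts: List[str],
    actions: List[str],
    shard_demands: List[Dict[str, str]],
) -> List[Policy]:
    """Enumerate every policy that matches each shard's demand inside that shard's domain."""
    found = []
    for choice in product(actions, repeat=len(contexts)):
        policy = dict(zip(contexts, choice))
        if all(policy[c] == a for demands in shard_demands for c, a in demands.items()):
            found.append(policy)
    return found


contexts = ["promise_made", "stranger_in_need", "trolley_problem"]
actions = ["help", "abstain"]
honesty = {"promise_made": "help"}     # domain: promises only
care = {"stranger_in_need": "help"}    # domain: strangers in need only

compatible = common_policies(contexts, actions, [honesty, care])
print(len(compatible))  # 2

# A shard with a conflicting intuition about promises leaves no common policy at all.
contrarian = {"promise_made": "abstain"}
print(common_policies(contexts, actions, [honesty, care, contrarian]))  # []
```

The two surviving policies differ only on `trolley_problem`, the context no shard has a say in, which matches the claim that behavior outside every shard's domain is left undetermined. The empty result in the second call corresponds to the search for a moral philosophy coming up in vain.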

Lingering Confusions

There are some human phenomena that shard theory doesn't have a tidy story about. The largest is probably the apparent phenomenon of credit assignment improving over a lifetime. When you're older and wiser, you're better at noticing which of your past actions were bad and learning from your mistakes. Possibly, this happens a long time after the fact, without any anti-reinforcement event occurring. But an improved conceptual understanding ought to be inaccessible to your subcortical reinforcement circuitry -- on shard theory, being wiser shouldn't mean your shards are reinforced or anti-reinforced any differently.

One thing that might be going on here is that your shards are better at loading and keeping chosen training data in your self-supervised learning loop buffer, and so steadily reinforcing or anti-reinforcing themselves or their enemy shards, respectively. This might look like trying not to think certain thoughts, so those thoughts can't be rewarded for accurately forecasting your observations. But this is underexplained, and shard theory in general doesn't have a good account of credit assignment improving in-lifetime.

Relevance to Alignment Success

The relevance to alignment is that (1) if shard theory is true, meaningful partial alignment successes are possible, and (2) we have a theoretical road to follow to steadily better alignment successes in RL agents. If we can get RL agents to internalize some human-value shards, alongside a lot of other random alien nonsense shards, then those human-value shards will be our representatives on the inside, and will intelligently bargain for what we care about, after all the shards in the RL agent get smarter. Even if the human shards only win a small fraction of the blended utility function, a small fraction of our lightcone is quite a lot. And we can improve our expected fraction by studying the systematic relationships between reinforcement schedules and learned values, in both present RL systems and in humans.

Conclusion

Shard theory is a research program: it's a proposed basic ontology of how agents made of neural networks work, most especially those agents with path-dependent control over the reinforcement events they later steer into. Shard theory aims to be a comprehensive theory of which values neural networks learn conditional on different reinforcement schedules. All that shard science is yet to be done, and, on priors, shard theory's attempt is probably not going to pan out. But this relationship is currently relatively unexplored, and competitor theories are relatively unsupported accounts of learned values (e.g., that a single random in-distribution proxy will be the learned value, or that reinforcement is always the optimization target). Shard theory is trying to work out this relationship and then be able to demonstrably predict, ahead of time, specific learned values given reinforcement parameters.

Things I agree with:

Model-based RL algorithms tend to create agents with a big mess of desires and aversions that are not necessarily self-consistent. (But see caveat about self-modification below.)
When an agent is in an environment wherein its desires can change (e.g. an RL agent being updated online by TD learning), it will tend to take foresighted actions to preserve its current desires (cf. instrumental convergence), assuming the agent is sufficiently self-aware and foresighted.
Human within-lifetime learning is an example of model-based RL, and well worth looking into in the course of trying to understand more generally how powerful model-based RL agents might behave.

Things I disagree with:

I disagree with how this post uses the word “values” throughout, rather than “desires” (or “preferences”) which (AFAICT) would be a better match to how the term is being used here.
I disagree with how this post seems to optimistically ignore the possibility that the AGI might self-modify to be more coherent in a way that involves crushing / erasing a subset of its desires, and this subset might include the desires related to human flourishing. This is analogous to how you or I might try to self-modify to erase some of our (unendorsed) desires (to eat junk food, be selfish, be cruel, etc.), if we could. (This doesn’t even require exotic things like access to source code / brain tissue; there are lots of mundane tricks to kick bad habits etc.) (Cf. also how people will “bite the bullet” in thought experiments when doing moral reasoning.)
- Counterpoint: What if we try to ensure that the AGI has a strong (meta-)preference not to do that?
- Response to counterpoint: Sure! That seems like a promising thing to look into. But until we have a plan to ensure that that actually happens, I’ll keep feeling like this is an “area that warrants further research” not “cause for general optimism”.
Relatedly and more broadly, my attitude to the technical alignment problem is more like “It is as yet unclear whether or not we face certain doom if we train a model-based RL agent to AGI using the best practices that AI alignment researchers currently know about” (see §14.6 here), not “We have strong reason to believe that the problem is solvable this way with no big new alignment ideas”. This makes me “optimistic” about solvability of technical alignment by the standards of, say, Eliezer, but not “optimistic” as the term is normally used. More like “uncertain”.
- I guess I’m not sure if this is really a disagreement with the post. The post seems to have optimistic vibes in certain places (“if shard theory is true, meaningful partial alignment successes are possible”) but the conclusion is more cautious. ¯\_(ツ)_/¯
I think some of the detailed descriptions of (anthropomorphized) shards are misleading, but I’m not sure that really matters for anything.

Thanks as always for your consistently thoughtful comments :)

I disagree with how this post seems to optimistically ignore the possibility that the AGI might self-modify to be more coherent in a way that involves crushing / erasing a subset of its desires, and this subset might include the desires related to human flourishing.

I also feel this is an “area that warrants further research”, though I don't view shard-coordination as being different than shard formation. If you understand how inner-values form from outer reward schedules, then how inner-values interact is also a steerable reinforcement. Though this may be exactly what you meant by "try to ensure that the AGI has a strong (meta-)preference not to do that", so the only disagreement is on the optimism vibe?

I don't view shard-coordination as being different than shard formation

Yeah I expect that the same learning algorithm source code would give rise to both preferences and meta-preferences. (I think that’s what you’re saying there right?)

From the perspective of sculpting AGI motivations, I think it might be trickier to directly intervene on meta-preferences than to directly intervene on (object-level) preferences, because if the AGI is attending to something related to sensory input, you can kinda guess what it’s probably thinking about and you at least have a chance of issuing appropriate rewards by doing obvious straightforward things, whereas if the AGI is introspecting on its own current preferences, you need powerful interpretability techniques to even have a chance to issue appropriate rewards, I suspect. That’s not to say it’s impossible! We should keep thinking about it. It’s very much on my own mind, see e.g. my silly tweets from just last night.

the AGI might self-modify to be more coherent in a way that involves crushing / erasing a subset of its desires, and this subset might include the desires related to human flourishing. This is analogous to how you or I might try to self-modify to erase some of our (unendorsed) desires (to eat junk food, be selfish, be cruel, etc.), if we could

This seems like a big deal to me, because it feels like if I could modify myself, I would probably do so to make myself better at achieving a handful of goals like {having a positive impact, (maybe) obtaining a large amount of power/money, getting really really good at a particular skill} and everything else like {desire to eat good food, be vengeful, etc} would be thrown out. The first set feels different from the second because those desires feel more like maximising something, which is worrying.

I disagree with how this post uses the word “values” throughout, rather than “desires” (or “preferences”) which (AFAICT) would be a better match to how the term is being used here.

This has definitely been a point of confusion. There are a couple of ways one might reasonably interpret the phrase "human values":

the common denominator between all humans ever about what they care about
the ethical consensus of (some subset of) humanity
the injunctions a particular human would verbally endorse
the cognitive artifacts inside each particular human that implement that human's valuing-of-X, including the cases where they verbally endorse that valuing (along with a bunch of other kinds of preferences, both wanted and unwanted)

I think the shard theory workstream generally uses "human values" in the last sense.

I view values as action-determinants, the moment-to-moment internal tugs which steer my thinking and action. I cash out "values"/"subshards" as "contextually activated computations which are shaped into existence by past reinforcement, and which often steer towards their historical reinforcers."

This is a considerably wider definition of "human values" than usually considered. For example, I might have a narrowly activated value against taking COVID tests, because past reinforcement slapped down my decision to do so after I tested positive and realized I'd be isolating all alone. (The testing was part of why I had to isolate, after all, so I infer that my credit assignment tagged that decision for down-weighting.)

This unusual definitional broadness is a real cost, which is why I like talking about an anti-COVID-test subshard, instead of an anti-test "value."

I might have a narrowly activated value against taking COVID tests…

Hmm, I think I’d say “I might feel an aversion…” there.

“Desires & aversions” would work in a context where the sign was ambiguous. So would “preferences”.

Nice overview, David! You've made lots of good points and clarifications. I worry that this overview goes a little too fast for new readers. For example,

Shard theory thus predicts that dumb RL agents internalize lots of representable-by-them in-distribution proxies for reinforcement as shards, as a straightforward consequence of reinforcement events being gated behind a complex conditional distribution of task blends.

I can read this if I think carefully, but it's a little difficult. I presently view this article as more of "motivating shard theory" and "explaining some shard theory 201" than "explaining some shard theory 101."

Here's another point I want to make: Shard theory is anticipation-constraining. You can't just say "a shard made me do it" for absolutely whatever. Shards are contextually activated in the contexts which would have been pinged by previous credit assignment invocations. Experiments show that people get coerced into obeying cruel orders not by showing them bright blue paintings or a dog, but by an authoritative man in a lab coat. I model this as situationally strongly activating deference- and social-maintenance shards, which are uniquely influential here because those shards were most strongly reinforced in similar situations in the past. Shard theory would be surprised by normal people becoming particularly deferential around chihuahuas.

You might go "duh", but actually this is a phenomenon which has to be explained! And if people just have "weird values", there has to be a within-lifetime learning explanation for how those values got to be weird in that particular way, for why a person isn't extra deferential around small dogs but is extra deferential around apparent-doctors.

Comments on other parts:

Your "utility function" (an ordering over possible worlds, subject to some consistency conditions) is far too big for your brain to represent.

I think that committing to an ordering over possible worlds is committing way too much. Coherence theorems tell you to be coherent over outcome lotteries, but they don't prescribe what outcomes are. Are outcomes universe-histories? World states? Something else? This is a common way I perceive people to be misled by utility theory.

But a utility function can be lossily projected down into a bounded computational object by factoring it into a few shards, each representing a term in the utility function, each term conceptually chunked out of perceptual input.

I feel confused by the role of "perceptual input" here. Can you give an example of a situation where the utility function gets chunked in this way?

shard theory in general doesn't have a good account of credit assignment improving in-lifetime.

Yeah, I fervently wish I knew what was happening here. I think that sophisticated credit assignment is probably convergently bootstrapped from some genetically hard-coded dumb credit assignment.

But a utility function can be lossily projected down into a bounded computational object by factoring it into a few shards, each representing a term in the utility function, each term conceptually chunked out of perceptual input.
I feel confused by the role of "perceptual input" here. Can you give an example of a situation where the utility function gets chunked in this way?

I had meant to suggest that your shards interface with a messy perceptual world of incoming retinal activations and the like, but are trained to nonetheless chunk out latent variables like "human flourishing" or "lollipops" in the input stream. That is, I was suggesting a rough shape for the link between the outside world as you observe it and the ontology your shards express their ends in.

If you formalized utility functions as orderings over possible worlds (or over other equivalent objects!), and your perception simply looked over the set of all possible worlds, then there wouldn't be anything interesting to explain about perception and the ontology your values are framed in. For agents that can't run impossibly large computations like that, though, I think you do have something to explain here.

Thoughts on the internal game theory section:

The two examples given (law school value drift, and heroin) seem to distinguish the internal-game-theory hypothesis from the reward-as-inner-goal hypothesis in humans, but they don't seem to rule out other equally plausible models, e.g.

Several discrete shards arise by incremental reinforcement; individually they are not optimizers, but acting in conjunction, they are an optimizer. Negotiation between them mostly doesn't happen because the individual shards are incapable of thinking strategically. The thought-pattern where the shards are activated in conjunction is reinforced later. The goal of the resulting bounded optimizer is thousands of random proxies for reward from the constituent shards.
- e.g. one shard contains a bit of world-model because it controls crawling around your room as a baby; another does search because it's in charge of searching for the right word when you first learn to speak. Or something, I'm not a neuroscientist.
The same, but shards that make up the optimizer are not discrete; the architecture of the brain has some smooth gradient path towards forming a bounded optimizer.
Shards are incapable of aggregating into an optimizer on their own, but some coherence property is also reinforced (predictive coding or something) that produces agenty behavior without any negotiation between shards.
...

In each of these cases, the agent will avoid going to law school or using heroin just because it is instrumentally convergent to preserve its utility function, regardless of whether it is a coalition between shards or some other construction. Also, the other hypotheses seem just as plausible as the internal game theory hypothesis from the shard theory posts and docs I've read (e.g. Reward is not the optimization target), so I'm not sure why the internal game theory thing is considered the shard theory position.

Also, it doesn't make sense to say "if shard theory is true", if shard theory is a research program. It would be clearer if shard theory hypotheses were clearly stated, and all such statements were rephrased as "if hypothesis X is true". I expect that shard theory leads to many confirmed and many disconfirmed predictions.

it doesn't make sense to say "if shard theory is true", if shard theory is a research program

(That was sloppy phrasing on my part, yeah. "If shard theory is the right basic outlook on RL agents..." would be better.)

Where his shards once just passively dealt with the consequences effected by other shards via their shared motor output channel, they are now intelligent enough to scheme at the other shards. Say that you're considering whether to go off to big-law school, and are concerned about that environment exacerbating the egoistic streak you see and dislike in yourself. You don't want to grow up to be more of an egotist, so you choose to avoid going to your top-ranked big-law-school offer, even though the compensation from practicing prestigious big-shot law would further your other goals. [...]
There are some human phenomena that shard theory doesn't have a tidy story about. The largest is probably the apparent phenomenon of credit assignment improving over a lifetime. When you're older and wiser, you're better at noticing which of your past actions were bad and learning from your mistakes. Possibly, this happens a long time after the fact, without any anti-reinforcement event occurring. But an improved conceptual understanding ought to be inaccessible to your subcortical reinforcement circuitry -- on shard theory, being wiser shouldn't mean your shards are reinforced or anti-reinforced any differently.

How does the mechanism in these two examples differ from each other? You seem to be suggesting that the first one is explainable by shard theory, while the second one is mysterious. But aren't they both cases of the shards having some kind of a conceptual model of the world and the consequences of different actions, where the conceptual model improves even in cases where it doesn't lead to immediate consequences with regard to the valued thing and thus can't be directly reinforced?

In the second case, I'm suggesting that your shards actually do strengthen or weaken right then, due to cognition alone, in the absence of a reinforcement event. That's the putatively mysterious phenomenon.

If the second thing is the same phenomenon as the first thing -- just shards avoiding actions they expect to lead to rival shards gaining strength -- then there's no mystery for shard theory. Maybe I should deny the evidence and just claim that people cannot actually uproot their held values simply by sitting and reflecting -- for now, I'm unsure.

We-the-devs choose when to dole out these reinforcement events, either handwriting a simple reinforcement algorithm to do it for us or having human overseers give out reinforcement.

I would unify those by saying: We get to write a reward function however we want. The reward function can depend on external inputs, if that’s what we want. One of those external inputs can be a "reward" button, if that’s what we want. And the reward function can be a single trivial line of code, return (100 if reward_button_is_being_pressed else 0), if that’s what we want.

Therefore, “having human overseers give out [reward]” is subsumed by “handwriting a simple [reward function]”.
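As a minimal sketch of that subsumption: the `ExternalInputs` container and the `reward` function name below are hypothetical, chosen only for illustration; the one-line body is the expression quoted above.

```python
from dataclasses import dataclass


@dataclass
class ExternalInputs:
    """Hypothetical bundle of signals the reward function is allowed to read."""
    reward_button_is_being_pressed: bool


def reward(inputs: ExternalInputs) -> int:
    # The human overseer's button press is just one more input to a
    # handwritten function, so "overseer gives reward" is a special case
    # of "devs write a reward function".
    return 100 if inputs.reward_button_is_being_pressed else 0


if __name__ == "__main__":
    print(reward(ExternalInputs(reward_button_is_being_pressed=True)))
    print(reward(ExternalInputs(reward_button_is_being_pressed=False)))
```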

(I don’t think you would wind up with an AGI at all with that exact reward function, and if you did, I think the AGI would probably kill everyone. But that’s a different issue.)

(Further discussion in §8.4 here.)

(By the way, I’m assuming “reinforcement” is synonymous with reward, let me know if it isn’t.)

I am using "reinforcement" synonymously with "reward," yes!

Important to note: the brain uses reinforcement and reward differently. The brain is primarily based on associative learning (Hebbian learning: neurons that fire together wire together) and the reward/surprise signals act like optional modifiers to the learning rate. Generally speaking, events that are surprisingly rewarding or punishing cause a temporary increase in the rate at which associations form. Since we're talking about trying to translate brain learning functions to similar-in-effect-but-different-in-mechanism machine learning methods, we should try to be clear about human brain terms and ML terms. Sometimes these terms have been borrowed from neuroscience but applied to not-quite-right ML concepts. Having a unified jargon seems important for the accurate translation of functions.
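A toy sketch of that distinction, under loud assumptions: the function name `hebbian_update`, the parameters `base_lr` and `gain`, and the specific scaling rule are all made up for illustration and are not a model of real neural circuitry. The point is only that the association itself is Hebbian (pre times post), while the surprise signal merely scales how fast it forms.

```python
def hebbian_update(weights, pre, post, surprise, base_lr=0.01, gain=5.0):
    """Return new weights after one Hebbian step.

    weights[j][i] links presynaptic unit i to postsynaptic unit j.
    The surprise signal does not choose what gets associated; it only
    scales the learning rate (hypothetical rule: lr grows with |surprise|).
    """
    lr = base_lr * (1.0 + gain * abs(surprise))
    return [
        [w + lr * post_j * pre_i for w, pre_i in zip(row, pre)]
        for row, post_j in zip(weights, post)
    ]


if __name__ == "__main__":
    w0 = [[0.0, 0.0], [0.0, 0.0]]
    calm = hebbian_update(w0, pre=[1.0, 0.0], post=[1.0, 1.0], surprise=0.0)
    startled = hebbian_update(w0, pre=[1.0, 0.0], post=[1.0, 1.0], surprise=1.0)
    # Same co-activity, but the surprising event forms the association faster.
    print(calm)
    print(startled)
```

Contrast this with a typical RL update, where the reward term determines the direction of the change rather than just its speed.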

About the problems you mention:

the apparent phenomenon of credit assignment improving over a lifetime. When you're older and wiser, you're better at noticing which of your past actions were bad and learning from your mistakes.

I don't get why you see a problem here. More data will lead to better models over time. You get exposed to more situations, and with more data, the noise will slowly average out. Not necessarily because you can clearly attribute things to their causes, but because you randomly get into a situation where the effect is more clear. It mostly takes special conditions to get people out of their local optimum.

without any anti-reinforcement event occurring

And if it looks like this comes in hindsight by carefully reflecting on the situation, that's not without reinforcement. Your thoughts are scored against whatever it is that the brainstem is evaluating. And same as above, sooner or later, you stumble into some thoughts where the pattern is more clearly attributable, and then the weights change.

And if it looks like this comes in hindsight by carefully reflecting on the situation, that's not without reinforcement. Your thoughts are scored against whatever it is that the brainstem is evaluating. And same as above, sooner or later, you stumble into some thoughts where the pattern is more clearly attributable, and then the weights change.

Maybe. But your subcortical reinforcement circuitry cannot (easily) score your thoughts. What it can score are the mystery computations that led to hardcoded reinforcement triggers, like sugar molecules interfacing with tastebuds. When you're just thinking to yourself, all of that should be a complete black-box to the brainstem.

I did mention that something is going on in the brain with self-supervised learning, and that's probably training your active computations all the time. Maybe shards can be leveraging this training loop? I'm currently quite unclear on this, though.

I mean scoring thoughts in the sense of [Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering with what Steven calls "Thought Assessors". Thoughts totally get scored in that sense.

I think David is referring to the claims made by Human values & biases are inaccessible to the genome.

I agree that the genome can only score low-complexity things. But there are two parts and it looks like you are considering only the first:

1. Reinforcing behaviors that lead to results that the brainstem can detect (sugar, nociception, other low-complexity summaries of sense data).
2. Reinforcing how well thoughts predicted these summaries - this is what leads complex thought to be tied to and modeling what the brainstem (and by extension genes) "cares" about.

I agree that at the moment, everything written about shard theory focuses on (1), since the picture is clearest there. Until very recently we didn't feel we had a good model of how (2) worked. That being said, I believe the basic information inaccessibility problems remain, as the genome cannot pick out a particular thought to be reinforced based on its content, as opposed to based on its predictive summary/scorecard.

Can you explain what the non-predictive content of a thought is?

I understand that thoughts have much higher dimensionality than the scorecard. The scoring reduces the complexity of thoughts to the scorecard's dimensionality. The genes don't care how the world is represented as long as it a) models reward accurately and b) gets you more reward in the long run.

But what aspect of that non-score content are you interested in? And if there is something that you are interested in, why can't it be represented in a low-dimensional way too?

As I understand Steve's model, each Thought Assessor takes in context signals from the world model representing different concepts activated by the current thought, and forms a loop with a generically hardwired control circuit (e.g. salivation or cortisol levels). As a result, the ground truth used to supervise the loop must be something that the genome can directly recognize outside of the Learning Subsystem, like "We're tasting food, so you really should've produced saliva already". Those context signals then are trained to make long-term predictions relevant to saliva production, in learned-from-scratch contexts like when you're sitting at a restaurant reading the entree description on the menu.

Each of those loops needs to be grounded in some way through control circuitry that the genome can construct within the Steering Subsystem, which means that absent some other mechanism, the ground truth signals that are predicted by the Thought Assessors cannot be complex, learned-from-scratch concepts, even if the inputs to the Thought Assessors are. And as far as I can tell, the salivation Thought Assessor doesn't know that its inputs are firing because I'm thinking "I'm in a restaurant reading a tasty-sounding description" (the content of the thought) as opposed to thinking any other salivation-predictive thought, making the content inaccessible to it. It would seem like there are lots of kinds of content that it'd be hard to ground out this way. For example, how would we set up such a circuit for "deception"?
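A very rough sketch of the loop as I've described it, not of Steve's actual model: the class name `ThoughtAssessor`, the linear predictor, and the delta-rule update are all simplifying assumptions made for illustration. What it is meant to show is that the assessor only ever sees an unlabeled vector of context activations plus a scalar ground truth from hardwired circuitry.

```python
class ThoughtAssessor:
    """Toy predictor of one hardwired signal (e.g. "saliva needed").

    Assumptions for illustration: a linear readout over context signals,
    trained by a simple delta rule against the hardwired ground truth.
    """

    def __init__(self, n_context: int, lr: float = 0.1):
        self.weights = [0.0] * n_context
        self.lr = lr

    def predict(self, context):
        # The assessor sees which context lines are active, not what
        # concepts those lines stand for in the world model.
        return sum(w * c for w, c in zip(self.weights, context))

    def update(self, context, ground_truth):
        # Supervision comes only from the scalar the genome can recognize.
        error = ground_truth - self.predict(context)
        self.weights = [
            w + self.lr * error * c for w, c in zip(self.weights, context)
        ]
        return error


if __name__ == "__main__":
    assessor = ThoughtAssessor(n_context=3)
    reading_menu = [1.0, 0.0, 0.0]    # hypothetical context pattern
    doing_taxes = [0.0, 1.0, 0.0]     # hypothetical context pattern
    for _ in range(50):
        assessor.update(reading_menu, ground_truth=1.0)  # food followed
    print(assessor.predict(reading_menu))
    print(assessor.predict(doing_taxes))
```

Nothing in this loop could be pointed at "deception" unless some hardwired scalar already tracked it, which is the grounding problem raised above.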

It would seem like there are lots of kinds of content that it'd be hard to ground out this way. For example, how would we set up such a circuit for "deception"?

Agree.

There will be a lot of complex concepts that occur naturally in thought-space that can't be easily represented with few bits in reward circuitry. Maybe "deception" is such an example.

On the other hand, evolution managed to wire reward circuits that reliably bring about some abstractions that lead to complex behaviors aligned with "its interests," i.e., reproduction, despite all the compute the human brain puts into it.

Maybe we should look for aligned behaviors that we can wire with few bits. Behaviors that don't use the obvious concepts in thought-space. Perhaps "deception" is not a natural category, but something like "cooperation with all agent-like entities" is.

At this moment in time I have two theories about how shards seem to be able to form consistent and competitive values that don't always optimize for some ultimate goal:

Overall, shard theory is developed to describe the behavior of human agents whose inputs and outputs are multi-faceted. I think something about this structure might facilitate the development of shards in many different directions. This seems different from modern deep RL agents; although they also potentially can have lots of input and output nodes, these are pretty finely honed to achieve a fairly narrow goal, and so in a sense, it is not too much of a surprise they seem to Goodhart on the goals they are given at times. In contrast, there’s no single terminal value or single primary reinforcer in the human RL system: sugary foods score reward points, but so do salty foods when the brain’s subfornical region indicates there’s not enough sodium in the bloodstream (Oka, Ye, Zuker, 2015); water consumption also gets reward points when there’s not enough water. So you have parallel sets of reinforcement developing from a wide set of primary reinforcers all at the same time.
As far as I know, a typical deep RL agent is structured hierarchically, with feedforward connections from inputs at one end to outputs at the other, and connections throughout the system reinforced with backpropagation. The brain doesn't use backpropagation (though maybe it has similar or analogous processes); it seems to "reward" successful (in terms of prediction error reduction, or temporal/spatial association, or simply firing at the same time...?) connections throughout the neocortex, without those connections necessarily having to propagate backwards from some primary reinforcer.

The point about being better at credit assignment as you get older is probably not too much of a concern. It’s very high level, and to the extent it is true, mostly attributable to a more sophisticated world model. If you put a 40-year-old and an 18-year-old into a credit assignment game in a novel computer game environment, I doubt the 40-year-old will do better. They might beat a 10-year-old, but only to the extent the 40-year-old has learned very abstract facts about associations between objects which they can apply to the game. Speed it up so that they can’t use system 2 processing, and the 10-year-old will probably beat them.

Subcortical reinforcement circuits, though, hail from a distinct informational world... and so have to reinforce computations "blindly," relying only on simple sensory proxies.

This seems to be pointing in an interesting direction that I'd like to see expanded.

Because your subcortical reward circuitry was hardwired by your genome, it's going to be quite bad at accurately assigning credit to shards.

I don't know, I think of the brain as doing credit assignment pretty well, but we may have quite different definitions of good and bad. Is there an example you were thinking of? Cognitive biases in general?

if shard theory is true, meaningful partial alignment successes are possible

"if shard theory is true" -- is this a question about human intelligence, deep RL agents, or the relationship between the two? How can the hypothesis be tested?

Even if the human shards only win a small fraction of the blended utility function, a small fraction of our lightcone is quite a lot

What's to stop the human shards from being dominated and extinguished by the non-human shards? IE is there reason to expect equilibrium?

I don't know, I think of the brain as doing credit assignment pretty well, but we may have quite different definitions of good and bad. Is there an example you were thinking of?

Say that the triggers for pleasure are hardwired. After a pleasurable event, how do only those computations running in the brain that led to pleasure (and not those randomly running computations) get strengthened? After all, the pleasure circuit is hardwired, and can't reason causally about what thoughts led to what outcomes.

(I'm not currently confident that pleasure is exactly the same thing as reinforcement, but the two are probably closely related, and pleasure is a nice and concrete thing to discuss.)

What's to stop the human shards from being dominated and extinguished by the non-human shards? IE is there reason to expect equilibrium?

Nothing except those shards fighting for their own interests and succeeding to some extent.

You probably have many contending values that you hang on to now, and would even be pretty careful with write access to your own values, for instrumental convergence reasons. If you mostly expect outcomes where one shard eats all the others, why do you have a complex balance of values rather than a single core value?

If you mostly expect outcomes where one shard eats all the others, why do you have a complex balance of values rather than a single core value?

There's a further question which is "How do people behave when they're given more power over and understanding of their internal cognitive structures?", which could actually resolve in "People collapse onto one part of their values." I just think it won't resolve that way.

a thought: can shard theory understand a modern cpu? can you take apart binary circuitry with shard theory, and if so, what does the representation say about it? does it try to make predictions and fail?

edit: that is, does dividing a circuit up into shards work on other examples of important circuits?

A lot of good discussion here. I especially think Steven Byrnes makes some good points (as usual). I agree that it's worth being distinct, in a technical jargon sort of way, about the differences between 'values' as used to refer to RL and 'values' as used by humans to talk about a particular type of reflectively endorsed desire for the shape of the current or future world. I think it is much more accurate to map RL 'values' to human 'desires', and that a term which encompasses this concept well is 'reward'. So my recommendation would be to stick to talking about 'reward signals' in the human brain and in RL.

In regards to an assertion about the brain made in the post, I would like to add a couple details.

As a newborn baby, you start life with subcortical hardwired reinforcement circuitry (fixed by your genome) and a randomly initialized neocortex.

This is roughly true, but I think it's important for this conversation to add some details about the constraints on the randomness of the cortical initialization.

Constrained cortical randomness: in certain ways this randomness is extremely constrained (and breaking these constraints leads to a highly dysfunctional brain). Of particular importance, the long-range (multiple centimeters) inter-module connections are genetically hard-coded and established in fetal development. These long-range connections are then able to change only locally (usually << 0.5 cm) in terms of their input/output locations. No new long-range connections can be formed for the rest of the lifespan. They can be lost but not replaced. There are also rules about which subtypes of cells in which layers of the cortex can connect to which other subtypes of cells. So there's a lot of important order there. I think when making comparisons to machine learning we should be careful to keep in mind that the brain's plasticity is far more constrained than a default neural network. This means particularly that various modules within the neocortex take on highly constrained roles which are very similar between most humans (barring dramatic damage or defects). This is quite useful for interpreting the function of parts of the cortex.
The subcortical reward circuitry has some hardcoded aspects and some randomly initialized learning aspects. Less so than the cortex, but not none. Also, the reward circuitry (particularly the reciprocal connections between the thalamus and the cortex and back and forth again, and the amygdala - prefrontal cortex links) has a lot to do with learned rewards and temporary task-specific connections between world-state-prediction and reward. For instance, getting really emotionally excited about earning points in a video game is something that is dependent on the function of the prefrontal cortex interpreting those points as a task-relevant signal (video game playing being the self-assigned task). This ability to contextually self-assign tasks and associate arbitrary sensory inputs as temporary reward signals related to these tasks is a key part of what makes humans agentic. Classically, the symptom cluster which results from damage to the prefrontal areas responsible for self-task-and-reward assignment is called 'lobotomized'. Lobotomy was a primitive surgery involving deliberately damaging these frontal cortical regions of misbehaving mentally-ill people specifically to permanently remove their agency. Studies on people who have sustained deliberate or accidental damage to this region show that they can no longer successfully make and execute multi-step or abstract tasks for themselves. They can manage single-step tasks such as eating food placed in front of them when they are hungry, but cannot actively seek out food that isn't obviously available. This varies depending on the amount of damage to this area. Some patients may be able to go to the refrigerator and get food if hungry. An only slightly damaged patient may even be able to go grocery shopping, but probably wouldn't be able to connect up a longer and more delayed / abstracted task chain around preventing hunger: for instance, earning money to use for grocery shopping or saving shelf-stable food for future instances of temporary unavailability of the grocery store.

Ok, after thinking a bit, I just can't resist throwing in another relevant point.

Attention.

Attention in the brain is entirely inhibitory (activation magnitude reduction) of the cortical areas currently judged to be irrelevant. This inhibition is not absolute; it can be overcome by a sufficiently strong surprising signal. When it isn't being overridden, it drives the learning rate to basically zero for the unattended areas. This has been most studied (for convenience reasons) in the visual cortex, in the context of suppressing visual information currently deemed irrelevant to the tasks at hand. These tasks at hand involve both semi-hardwired instincts like predator or prey detection, and also conditional task-specific attention as mediated by the frontal cortex (with information passed via those precious long-range connections, many of which route through the thalamus).

We-the-devs choose when to dole out these reinforcement events, either handwriting a simple reinforcement algorithm to do it for us or having human overseers give out reinforcement.

That sounds like you are actually writing such reward functions. Is there another project that tries to experimentally verify your predictions?

That is what we are currently trying to do, mostly focusing on pretrained LMs in text-based games.

Things I agree with:

Model-based RL algorithms tend to create agents with a big mess of desires and aversions that are not necessarily self-consistent. (But see caveat about self-modification below.)
When an agent is in an environment wherein its desires can change (e.g. an RL agent being updated online by TD learning), it will tend to take foresighted actions to preserve its current desires (cf. instrumental convergence), assuming the agent is sufficiently self-aware and foresighted.
Human within-lifetime learning is an example of model-based RL, and well worth looking into in the course of trying to understand more generally how powerful model-based RL agents might behave.

Things I disagree with:

I disagree with how this post uses the word “values” throughout, rather than “desires” (or “preferences”) which (AFAICT) would be a better match to how the term is being used here.
I disagree with how this post seems to optimistically ignore the possibility that the AGI might self-modify to be more coherent in a way that involves crushing / erasing a subset of its desires, and this subset might include the desires related to human flourishing. This is analogous to how you or I might try to self-modify to erase some of our (unendorsed) desires (to eat junk food, be selfish, be cruel, etc.), if we could. (This doesn’t even require exotic things like access to source code / brain tissue; there are lots of mundane tricks to kick bad habits etc.) (Cf. also how people will “bite the bullet” in thought experiments when doing moral reasoning.)
- Counterpoint: What if we try to ensure that the AGI has a strong (meta-)preference not to do that?
- Response to counterpoint: Sure! That seems like a promising thing to look into. But until we have a plan to ensure that that actually happens, I’ll keep feeling like this is an “area that warrants further research” not “cause for general optimism”.
Relatedly and more broadly, my attitude to the technical alignment problem is more like “It is as yet unclear whether or not we face certain doom if we train a model-based RL agent to AGI using the best practices that AI alignment researchers currently know about” (see §14.6 here), not “We have strong reason to believe that the problem is solvable this way with no big new alignment ideas”. This makes me “optimistic” about solvability of technical alignment by the standards of, say, Eliezer, but not “optimistic” as the term is normally used. More like “uncertain”.
- I guess I’m not sure if this is really a disagreement with the post. The post seems to have optimistic vibes in certain places (“if shard theory is true, meaningful partial alignment successes are possible”) but the conclusion is more cautious. ¯\_(ツ)_/¯
I think some of the detailed descriptions of (anthropomorphized) shards are misleading, but I’m not sure that really matters for anything.

Thanks as always for your consistently thoughtful comments:)

I disagree with how this post seems to optimistically ignore the possibility that the AGI might self-modify to be more coherent in a way that involves crushing / erasing a subset of its desires, and this subset might include the desires related to human flourishing.

I don't view shard-coordination as being different than shard formation

Yeah I expect that the same learning algorithm source code would give rise to both preferences and meta-preferences. (I think that’s what you’re saying there right?)

the AGI might self-modify to be more coherent in a way that involves crushing / erasing a subset of its desires, and this subset might include the desires related to human flourishing. This is analogous to how you or I might try to self-modify to erase some of our (unendorsed) desires (to eat junk food, be selfish, be cruel, etc.), if we could

I disagree with how this post uses the word “values” throughout, rather than “desires” (or “preferences”) which (AFAICT) would be a better match to how the term is being used here.

This has definitely been a point of confusion. There are a couple of ways one might reasonably interpret the phrase "human values":

the common denominator between all humans ever about what they care about
the ethical consensus of (some subset of) humanity
the injunctions a particular human would verbally endorse
the cognitive artifacts inside each particular human that implement that human's valuing-of-X, including the cases where they verbally endorse that valuing (along with a bunch of other kinds of preferences, both wanted and unwanted)

I think the shard theory workstream generally uses "human values" in the last sense.

This unusual definitional broadness is a real cost, which is why I like talking about an anti-COVID-test subshard, instead of an anti-test "value."

I might have a narrowly activated value against taking COVID tests…

Hmm, I think I’d say “I might feel an aversion…” there.

“Desires & aversions” would work in a context where the sign was ambiguous. So would “preferences”.

Nice overview, David! You've made lots of good points and clarifications. I worry that this overview goes a little too fast for new readers. For example,

Shard theory thus predicts that dumb RL agents internalize lots of representable-by-them in-distribution proxies for reinforcement as shards, as a straightforward consequence of reinforcement events being gated behind a complex conditional distribution of task blends.

Comments on other parts:

Your "utility function" (an ordering over possible worlds, subject to some consistency conditions) is far too big for your brain to represent.

But a utility function can be lossily projected down into a bounded computational object by factoring it into a few shards, each representing a term in the utility function, each term conceptually chunked out of perceptual input.

I feel confused by the role of "perceptual input" here. Can you give an example of a situation where the utility function gets chunked in this way?

shard theory in general doesn't have a good account of credit assignment improving in-lifetime.

Yeah, I fervently wish I knew what were happening here. I think that sophisticated credit assignment is probably convergently bootstrapped from some genetically hard-coded dumb credit assignment.

But a utility function can be lossily projected down into a bounded computational object by factoring it into a few shards, each representing a term in the utility function, each term conceptually chunked out of perceptual input.
I feel confused by the role of "perceptual input" here. Can you give an example of a situation where the utility function gets chunked in this way?

Several discrete shards arise by incremental reinforcement; individually they are not optimizers, but acting in conjunction, they are an optimizer. Negotiation between them mostly doesn't happen because the individual shards are incapable of thinking strategically. The thought-pattern where the shards are activated in conjunction is reinforced later. The goal of the resulting bounded optimizer is thousands of random proxies for reward from the constituent shards.
- e.g. one shard contains a bit of world-model because it controls crawling around your room as a baby; another does search because it's in charge of searching for the right word when you first learn to speak. Or something, I'm not a neuroscientist.
The same, but shards that make up the optimizer are not discrete; the architecture of the brain has some smooth gradient path towards forming a bounded optimizer.
Shards are incapable of aggregating into an optimizer on their own, but some coherence property is also reinforced (predictive coding or something) that produces agenty behavior without any negotiation between shards.
...

it doesn't make sense to say "if shard theory is true", if shard theory is a research program

(That was sloppy phrasing on my part, yeah. "If shard theory is the right basic outlook on RL agents..." would be better.)

Where his shards once just passively dealt with the consequences effected by other shards via their shared motor output channel, they are now intelligent enough to plan scheme at the other shards. Say that you're considering whether to go off to big-law school, and are concerned about that environment exacerbating the egoistic streak you see and dislike in yourself. You don't want to grow up to be more of an egotist, so you choose to avoid going to your top-ranked big-law-school offer, even though the compensation from practicing prestigious big-shot law would further your other goals. [...]
There are some human phenomena that shard theory doesn't have a tidy story about. The largest is probably the apparent phenomenon of credit assignment improving over a lifetime. When you're older and wiser, you're better at noticing which of your past actions were bad and learning from your mistakes. Possibly, this happens a long time after the fact, without any anti-reinforcement event occurring. But an improved conceptual understanding ought to be inaccessible to your subcortical reinforcement circuitry -- on shard theory, being wiser shouldn't mean your shards are reinforced or anti-reinforced any differently.

We-the-devs choose when to dole out these reinforcement events, either handwriting a simple reinforcement algorithm to do it for us or having human overseers give out reinforcement.

Therefore, “having human overseers give out [reward]” is subsumed by “handwriting a simple [reward function]”.

(I don’t think you would wind up with an AGI at all with that exact reward function, and if you did, I think the AGI would probably kill everyone. But that’s a different issue.)

(Further discussion in §8.4 here.)

(By the way, I’m assuming “reinforcement” is synonymous with reward, let me know if it isn’t.)

I am using "reinforcement" synonymously with "reward," yes!
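
The point in this exchange, that human-overseer reward is a special case of a handwritten reward function, can be sketched in a few lines. This is only an illustration: the names `Observation`, `handwritten_reward`, `make_overseer_reward`, and `scripted_overseer` are hypothetical and appear nowhere in the post, and the scripted overseer is a stand-in for a real human judgment.

```python
from typing import Callable

# Hypothetical types, for illustration only.
Observation = dict
RewardFn = Callable[[Observation], float]


def handwritten_reward(obs: Observation) -> float:
    """A simple programmatic rule: reward when a goal flag is set."""
    return 1.0 if obs.get("reached_goal", False) else 0.0


def make_overseer_reward(ask_human: Callable[[Observation], bool]) -> RewardFn:
    """Wrap a human judgment so it has the same shape as any reward function."""

    def reward(obs: Observation) -> float:
        # The dev still handwrites this rule; it just defers to the overseer.
        return 1.0 if ask_human(obs) else 0.0

    return reward


def scripted_overseer(obs: Observation) -> bool:
    """Stand-in for a human overseer (assumption: a real one would be interactive)."""
    return bool(obs.get("looks_good", False))


overseer_reward = make_overseer_reward(scripted_overseer)
```

From the training loop's side both are just `RewardFn` values, which is the sense in which one option subsumes the other.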

About the problems you mention:

the apparent phenomenon of credit assignment improving over a lifetime. When you're older and wiser, you're better at noticing which of your past actions were bad and learning from your mistakes.

without any anti-reinforcement event occurring

And if it looks like this comes in hindsight by carefully reflecting on the situation, that's not without reinforcement. Your thoughts are scored against whatever it is that the brainstem is evaluating. And same as above, sooner or later, you stumble into some thoughts where the pattern is more clearly attributable, and then the weights change.

I mean scoring thoughts in the sense of [Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering with what Steven calls "Thought Assessors". Thoughts totally get scored in that sense.

I think David is referring to the claims made in "Human values & biases are inaccessible to the genome."

I agree that the genome can only score low-complexity things. But there are two parts and it looks like you are considering only the first:

Reinforcing behaviors that lead to results that the brainstem can detect (sugar, nociception, other low-complexity summaries of sense data).
Reinforcing how well thoughts predicted these summaries - this is what leads complex thought to be tied to and modeling what the brainstem (and by extension genes) "cares" about.
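
The second part can be made concrete with a toy sketch: a predictor over "thought" features that is trained to match a low-complexity brainstem summary. Everything here is an assumption made for illustration. The `brainstem_signal` rule, the `ThoughtAssessor` class, the feature encoding, and the delta-rule update are not taken from the post or from the linked series; they only show the general shape of "reinforce thoughts by how well they predict the hardwired signal".

```python
def brainstem_signal(sense_data: dict) -> float:
    """Hypothetical low-complexity summary of sense data: sugar minus pain."""
    return sense_data.get("sugar", 0.0) - sense_data.get("pain", 0.0)


class ThoughtAssessor:
    """Linear predictor of the brainstem signal, trained with a delta rule."""

    def __init__(self, n_features: int, lr: float = 0.5) -> None:
        self.weights = [0.0] * n_features
        self.lr = lr

    def predict(self, thought: list) -> float:
        # Predicted brainstem signal for this thought's feature vector.
        return sum(w * x for w, x in zip(self.weights, thought))

    def update(self, thought: list, signal: float) -> float:
        # Move the prediction toward the signal the brainstem actually emitted.
        error = signal - self.predict(thought)
        self.weights = [w + self.lr * error * x
                        for w, x in zip(self.weights, thought)]
        return error


assessor = ThoughtAssessor(n_features=2)
candy_thought = [1.0, 0.0]  # hypothetical feature: "thinking about candy"
for _ in range(20):
    assessor.update(candy_thought, brainstem_signal({"sugar": 1.0}))
```

After training, the assessor's score for the candy thought approaches the hardwired signal, while thoughts with unrelated features remain unscored. The circuitry doing the scoring never needs to represent the content of the thought, only its predicted summary.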

Can you explain what the non-predictive content of a thought is?

But what aspect of that non-score content are you interested in? And if there is something that you are interested in, why can't it be represented in a low-dimensional way too?

It would seem like there are lots of kinds of content that it'd be hard to ground out this way. For example, how would we set up such a circuit for "deception"?

Agree.

There will be a lot of complex concepts that occur naturally in thought-space that can't be easily represented with few bits in reward circuitry. Maybe "deception" is such an example.

At this moment in time I have two theories about how shards seem to be able to form consistent and competitive values that don't always optimize for some ultimate goal:

Overall, shard theory is developed to describe behavior of human agents whose inputs and outputs are multi-faceted. I think something about this structure might facilitate the development of shards in many different directions. This seems different to modern deep RL agents; although they also potentially can have lots of input and output nodes, these are pretty finely honed to achieve a fairly narrow goal, and so in a sense, it is not too much of a surprise they seem to Goodhart on the goals they are given at times. In contrast, there’s no single terminal value or single primary reinforcer in the human RL system: sugary foods score reward points, but so do salty foods when the brain’s subfornical region indicates there’s not enough sodium in the bloodstream (Oka, Ye, Zuker, 2015); water consumption also gets reward points when there’s not enough water. So you have parallel sets of reinforcement developing from a wide set of primary reinforcers all at the same time.
As far as I know, a typical deep RL agent is structured hierarchically, with feedforward connections from inputs at one end to outputs at the other, and connections throughout the system reinforced with backpropagation. The brain doesn't use backpropagation (though maybe it has similar or analogous processes); it seems to "reward" successful (in terms of prediction error reduction, or temporal/spatial association, or simply firing at the same time...?) connections throughout the neocortex, without those connections necessarily having to propagate backwards from some primary reinforcer.
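
The first theory's picture of several state-dependent primary reinforcers, rather than one fixed objective, can be sketched as a toy reward rule. The `BodyState` fields, the `primary_reward` function, and the reward magnitudes are all hypothetical and chosen only for illustration; they are not a model of the actual neural circuitry.

```python
from dataclasses import dataclass


@dataclass
class BodyState:
    """Hypothetical internal state that gates some primary reinforcers."""
    sodium_deficit: bool = False
    dehydrated: bool = False


def primary_reward(consumed: str, state: BodyState) -> float:
    """Toy rule: several reinforcers, some active only in certain body states."""
    if consumed == "sugar":
        return 1.0  # rewarded regardless of state in this sketch
    if consumed == "salt":
        return 1.0 if state.sodium_deficit else 0.0
    if consumed == "water":
        return 1.0 if state.dehydrated else 0.0
    return 0.0
```

The same action earns different reward depending on body state, so no single quantity is being maximized across situations. That is the contrast being drawn with a deep RL agent tuned toward one narrow goal.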

Subcortical reinforcement circuits, though, hail from a distinct informational world... and so have to reinforce computations "blindly," relying only on simple sensory proxies.

This seems to be pointing in an interesting direction that I'd like to see expanded.

Because your subcortical reward circuitry was hardwired by your genome, it's going to be quite bad at accurately assigning credit to shards.

if shard theory is true, meaningful partial alignment successes are possible

"if shard theory is true" -- is this a question about human intelligence, deep RL agents, or the relationship between the two? How can the hypothesis be tested?

Even if the human shards only win a small fraction of the blended utility function, a small fraction of our lightcone is quite a lot

What's to stop the human shards from being dominated and extinguished by the non-human shards? IE is there reason to expect equilibrium?

I don't know, I think of the brain as doing credit assignment pretty well, but we may have quite different definitions of good and bad. Is there an example you were thinking of?

(I'm not currently confident that pleasure is exactly the same thing as reinforcement, but the two are probably closely related, and pleasure is a nice and concrete thing to discuss.)

What's to stop the human shards from being dominated and extinguished by the non-human shards? IE is there reason to expect equilibrium?

Nothing except those shards fighting for their own interests and succeeding to some extent.

If you mostly expect outcomes where one shard eats all the others, why do you have a complex balance of values rather than a single core value?

edit: that is, does dividing a circuit up into shards work on other examples of important circuits?

In regards to an assertion about the brain made in the post, I would like to add a couple details.

As a newborn baby, you start life with subcortical hardwired reinforcement circuitry (fixed by your genome) and a randomly initialized neocortex.

This is roughly true, but I think it's important for this conversation to add some details about the constraints on the randomness of the cortical initialization.

Constrained cortical randomness: in certain ways this randomness is extremely constrained (and breaking these constraints leads to a highly dysfunctional brain). Of particular importance, the long range (multiple centimeters) inter-module connections are genetically hard-coded and established in fetal development. These long range connections are then able to change only locally (usually << 0.5 cm) in terms of their input/output locations. No new long range connections can be formed for the rest of the lifespan. They can be lost but not replaced. There are also rules about which subtypes of cells in which layers of the cortex can connect to which other subtypes of cells. So there's a lot of important order there. I think when making comparisons to machine learning we should be careful to keep in mind that the brain's plasticity is far more constrained than a default neural network. This means particularly that various modules within the neocortex take on highly constrained roles which are very similar between most humans (barring dramatic damage or defects). This is quite useful for interpreting the function of parts of the cortex.
The subcortical reward circuitry has some hardcoded aspects and some randomly initialized learning aspects. Less so than the cortex, but not none. Also, the reward circuitry (particularly the reciprocal connections between the thalamus and the cortex and back and forth again, and the amygdala - prefrontal cortex links) has a lot to do with learned rewards and temporary task-specific connections between world-state-prediction and reward. For instance, getting really emotionally excited about earning points in a video game is something that is dependent on the function of the prefrontal cortex interpreting those points as a task-relevant signal (video game playing being the self-assigned task). This ability to contextually self-assign tasks and associate arbitrary sensory inputs as temporary reward signals related to these tasks is a key part of what makes humans agentic. Classically, the symptom cluster which results from damage to the prefrontal areas responsible for self-task-and-reward-assignment is called 'lobotomized'. Lobotomy was a primitive surgery involving deliberately damaging these frontal cortical regions of misbehaving mentally-ill people specifically to permanently remove their agency. Studies on people who have sustained deliberate or accidental damage to this region show that they can no longer successfully make and execute multi-step or abstract tasks for themselves. They can manage single step tasks such as eating food placed in front of them when they are hungry, but not actively seeking out food that isn't obviously available. This varies depending on the amount of damage to this area. Some patients may be able to go to the refrigerator and get food if hungry. An only slightly damaged patient may be able to go grocery shopping even, but probably wouldn't be able to connect up a longer and more delayed / abstracted task chain around preventing hunger: for instance, earning money to use for grocery shopping or saving shelf-stable food for future instances of temporary unavailability of the grocery store.

Ok, after thinking a bit, I just can't resist throwing in another relevant point.

Attention.

We-the-devs choose when to dole out these reinforcement events, either handwriting a simple reinforcement algorithm to do it for us or having human overseers give out reinforcement.

That sounds like you are actually writing such reward functions. Is there another project that tries to experimentally verify your predictions?

That is what we are currently trying to do, mostly focusing on pretrained LMs in text-based games.
