Crossposted from my personal blog

Epistemic Status:  I have spent a fair bit of time reading the core Shard Theory posts and trying to understand it. I also have a background in RL as well as the computational neuroscience of action and decision-making. However, I may be misunderstanding or have missed crucial points. If so, please correct me!

Shard Theory has always seemed slightly esoteric and confusing to me — what are ‘shards’, and why might we expect them to form in RL agents? When I first read the Shard Theory posts, there were two main sources of confusion for me. The first: why should an agent trained on a reward function not optimise for reward, but instead just implement behaviours that have been rewarded in the past?

This is now obvious to me. The distinction between amortised and direct inference, with shards as cached behaviours, falls directly out of amortised policy gradient algorithms (which Shard Theory uses as the prototypical case of RL [1]). This idea has also been expanded on in many other posts.
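To make the amortisation point concrete, here is a minimal sketch (my own toy example, not from the Shard Theory posts) of a tabular REINFORCE-style update. The point is that the update rule never represents or plans towards reward; it only nudges up the probability of whatever action was actually taken in states where reward arrived.

```python
import numpy as np

# Toy sketch of an amortised (policy-gradient) learner. All names here are
# invented for illustration. The update never models or plans for reward;
# it simply reinforces the action that was taken when reward arrived.
n_states, n_actions = 4, 3
logits = np.zeros((n_states, n_actions))  # the amortised policy's parameters
lr = 0.1

def policy(state):
    """Softmax over this state's action logits."""
    p = np.exp(logits[state] - logits[state].max())
    return p / p.sum()

def reinforce_update(state, action, reward):
    """REINFORCE: gradient of log pi(action|state) for a softmax policy, scaled by reward."""
    p = policy(state)
    grad = -p
    grad[action] += 1.0          # d log pi / d logits = one_hot(action) - pi
    logits[state] += lr * reward * grad

# The trained policy ends up as a cache of 'what got rewarded here':
# behaviour is upweighted contextually rather than chosen by explicitly
# optimising a represented reward function.
```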

The second source of my confusion was the idea of shards themselves. Even given amortisation, why should behaviour splinter into specific ‘shards’? And why should the shards compete with one another? What would it even mean for ‘shards’ to compete, or for there to be ‘shard coalitions’, in a neural network?

My best guess here is that Shard Theory is making several empirical claims about the formation of representations during training for large-scale (?) RL models. Specifically, through an ML lens, we can think of shards as loosely coupled, relatively independent subnetworks which implement specific behaviours.

A concrete instantiation of Shard Theory's claim, therefore, appears to be that during training of the network, the optimiser tends to construct multiple relatively loosely coupled circuits which each implement some specific behaviour that has been rewarded in the past. In a forward pass through the network, these circuits then get activated according to some degree of similarity between the current state and the states that have led to reward in the past. These circuits then ‘compete’ with one another to be the one to shape behaviour, for instance by being passed through some kind of normalising nonlinearity such as a softmax. I am not entirely sure how ‘shard coalitions’ can occur on this view, but perhaps through some kind of reciprocal positive feedback, where the early parts of the circuit of shard A also provide positive activations to the circuit of shard B and hence the two become co-active (which might eventually lead to the shards ‘merging’) [2].
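As a caricature of this picture, here is a toy PyTorch sketch (my own construction; the module and all its names, such as ShardPolicy, are invented for illustration, and real policy networks need not be organised this way): several small subnetworks each vote for actions, and a softmax over their context-dependent activation strengths determines how much each ‘shard’ shapes the output.

```python
import torch
import torch.nn as nn

# Toy caricature of 'shards as loosely coupled subnetworks'. All names here
# are invented for illustration; this is a sketch of the claim, not a
# description of any actual trained network.
class ShardPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, n_shards):
        super().__init__()
        # Each 'shard' is a small circuit that votes for actions.
        self.shards = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim))
            for _ in range(n_shards)
        ])
        # Each shard also has an activation score: roughly, how similar is the
        # current state to the states in which this shard was reinforced?
        self.salience = nn.ModuleList([nn.Linear(obs_dim, 1) for _ in range(n_shards)])

    def forward(self, obs):
        votes = torch.stack([shard(obs) for shard in self.shards], dim=-2)  # (batch, n_shards, act_dim)
        strength = torch.stack([s(obs) for s in self.salience], dim=-2)     # (batch, n_shards, 1)
        weights = torch.softmax(strength, dim=-2)  # shards 'compete' via a normalising nonlinearity
        return (weights * votes).sum(dim=-2)       # blended action logits

# Usage: logits = ShardPolicy(obs_dim=10, act_dim=4, n_shards=5)(torch.randn(1, 10))
```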

This is not the only way that processing could happen in a policy network. The current conceptualisation of shards requires them to operate in the ‘output space’ — i.e. shards correspond to subnetworks that push for some series of actions being taken. However, the network could instead do a lot of processing in the input space. For instance, it could separate processing into two phases: 1.) figure out what action to take by analysing the current state and comparing it to past rewarded states, and then 2.) translate that abstract action into the real action space — i.e. translate 'eat lollipop' into specific muscle movements. In this case, there wouldn’t be multiple shards forming around behaviours, but there could instead be ‘perceptual shards’ which each provide their own interpretation of the current state.
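For contrast, a toy version of this alternative factorisation might look like the following (again my own sketch with invented names such as TwoPhasePolicy): a perception stage maps the state to an abstract action, and a separate execution stage translates that abstract choice into the concrete action space, with no per-behaviour subcircuits competing at the output.

```python
import torch
import torch.nn as nn

# Toy sketch of the alternative 'input space' factorisation. The names are
# invented; this is not how any particular trained network is known to work.
class TwoPhasePolicy(nn.Module):
    def __init__(self, obs_dim, n_abstract, act_dim):
        super().__init__()
        # Phase 1: interpret the state and choose an abstract action
        # (e.g. 'eat lollipop'), represented as a distribution over options.
        self.perceive = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                      nn.Linear(64, n_abstract))
        # Phase 2: translate the abstract action into concrete outputs
        # (the 'specific muscle movements').
        self.execute = nn.Sequential(nn.Linear(n_abstract, 64), nn.ReLU(),
                                     nn.Linear(64, act_dim))

    def forward(self, obs):
        abstract = torch.softmax(self.perceive(obs), dim=-1)  # which abstract action?
        return self.execute(abstract)                         # how to carry it out
```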

Another alternative is that all the circuits in the network are tightly coupled and cannot be meaningfully separated into distinct ‘shards’. Instead, each reward event subtly increases and decreases the probabilities of all options by modifying all aspects of the network. This is the ‘one-big-circuit’ perspective, and it may be correct. To summarise, Shard Theory appears to claim, first, that processing in the network is primarily done in output (behaviour) space and, second, that the internals of the network are relatively modular and consist of fairly separable circuits which implement and upweight specific behaviours.

These are empirical questions that can be answered! And indeed, if we succeed at interpretability even a small amount, we should start to get some answers to them. Evidence from the current state of interpretability research is mixed. Chris Olah’s work on CNNs, especially Inception V1, suggests something closer to the ‘one-big-circuit’ view than to separable shards. Specifically, in CNNs representations appear to be built up by hierarchical compositional circuits — i.e. you go from curve detectors to fur detectors to dog detectors — but these circuits are all tightly intertwined with each other rather than forming relatively independent and modular circuits (although different branches of Inception V1 do appear to be modular and specialised for certain kinds of perceptual input). For instance, the features at a higher layer tend to depend on a large number of the features at lower layers. On the other hand, in transformer models there appears to be more evidence for independent circuits. For instance, we can uncover specific circuits for things like induction or indirect-object identification. However, these must be interpreted with caution, since we understand much less about the representations of transformer language models than about those of Inception V1. A priori, both the much greater number of parameters in transformer models compared to CNNs, and the additive nature of residual networks versus the multiplicative, hierarchical nature of deep CNNs, could potentially encourage the formation of more modular, additive, shard-like subcircuits. To my knowledge, we have almost zero studies of the internal processing of reasonably large-scale policy gradient networks, which would be required to address these questions in practice. This (and interpretability in RL models in general) would be a great avenue for future interpretability and safety research.

As well as these specific claims, Shard Theory also implicitly assumes some high-level claims about likely AGI architectures. Specifically, it requires that AGI be built entirely (or maybe only primarily) from an amortised, model-free RL agent trained on a highly variegated reward function — i.e. rewards for pursuing many different kinds of objectives. To me this is a fairly safe bet, as this is approximately how biological intelligence operates, and moreover neuromorphic or brain-inspired AGI, as envisaged by DeepMind, is likely to approximate this ideal. Other paths to AGI do not fit this mould. One example is an AIXI-like super-planner, which does direct optimisation and so won’t form shards or approximate value fragments, barring any inner-alignment failures. Another example is some kind of recursive query wrapper around a general world model, as portrayed here, which does not really receive meaningful reward signals at all and isn’t trained with RL. The cognitive properties of this kind of agent, if it can realistically exist, are not really known to me at all.

  1. ^

    In a fun intellectual circle, a lot of shard theory / model-free RL in general seems to be people reinventing behaviourism, except this time programming agents for which it is true. For instance, in behaviourism, agents never ‘optimise for reward’ but always simply display ‘conditioned’ behaviours which were associated with reward in the past. There are also various Pavlovian/associative conditioning experiments which might be interesting to do with RL agents.

  2. ^

    Does this happen in the brain? Some potential evidence (and probably some inspiration) for this comes from the basal ganglia, which implements subcortical action selection. The basal ganglia (BG) is part of a large-scale loop through the brain, cortex -> BG -> thalamus -> cortex, which contains the full sensorimotor loop. The classic story of the BG is model-free RL with TD learning (though I personally have come to largely disagree with this). A large number of RL algorithms are consistent with reward prediction errors (RPEs), including policy gradients as well as more esoteric algorithms. Beyond this, dopaminergic neurons are more complicated than simple RPE units and appear to represent multiple reward functions, which can support highly flexible TD learning algorithms. The BG does appear to have opponent pathways for exciting and inhibiting specific actions/plans (the Go and No-Go pathways), which indicates some level of shard-theory-like competition. On the other hand, there also seems to be a fairly clear separation between action selection and action implementation in the brain, where the basal ganglia mostly does action selection and delegates the circuitry that implements the action to the motor cortex or specific subcortical structures. As far as I know, the motor cortex doesn’t have the same level of competition between different potential behaviours as the basal ganglia, although this has of course been proposed. Behaviourally, there is certainly some evidence for multiple competing behaviours being activated simultaneously and needing to be effortfully inhibited. A classic example is the Stroop task, and there is a whole literature studying tasks where people need to inhibit certain attractive behaviours in various circumstances. However, this is not conclusive evidence for a shard-like architecture; there could instead be a hybrid architecture of both amortised and iterative inference, where the amortised and iterative responses differ.

Comments (5)

In a fun intellectual circle, a lot of shard theory / model-free RL in general seems to be people reinventing behaviourism, except this time programming agents for which it is true. For instance, in behaviourism, agents never ‘optimise for reward’ but always simply display ‘conditioned’ behaviours which were associated with reward in the past. There are also various Pavlovian/associative conditioning experiments which might be interesting to do with RL agents.

I think behaviorism is wrong, and importantly different from shard theoretic analyses. (But maybe you mean something like "some parts of the analyses are re-inventing behaviorism"?) 

From my shortform:

Notes on behaviorism: After reading a few minutes about it, behaviorism seems obviously false. It views the "important part" of reward to be the external behavior which led to the reward. If I put my hand on a stove, and get punished, then I'm less likely to do that again in the future. Or so the theory goes.

But this seems, in fullest generality, wildly false. The above argument black-boxes the inner structure of human cognition which produces the externally observed behavior.

What actually happens, on my model, is that the stove makes your hand hot, which triggers sensory neurons, which lead to a punishment of some kind, which triggers credit assignment in your brain, which examines your current mental state and judges which decisions and thoughts led to this outcome, and makes those less likely to occur in similar situations in the future.

But credit assignment depends on the current internal state of your brain, which screens off the true state of the outside world for its purposes. If you were somehow convinced that you were attempting to ride a bike, and you got a huge punishment, you'd be more averse to moving to ride a bike in the future -- not averse to touching stoves.

Reinforcement does not directly modify behaviors, and objects are not intrinsically reinforcers or punishments. Reinforcement is generally triggered by reward circuitry, and reinforcement occurs over thoughts which are judged responsible for the reward.

This line of thought seems closer to "radical behaviorism", which includes thoughts as "behaviors." That idea never caught on -- is each thought not composed of further subthoughts? If only they had reduced "thought" into parts, or known about reward circuitry, or about mesa optimizers, or about convergently learned abstractions, or about credit assignment...

After reading a few minutes about it, behaviorism seems obviously false. It views the "important part" of reward to be the external behavior which led to the reward. If I put my hand on a stove, and get punished, then I'm less likely to do that again in the future. Or so the theory goes.

This is probably true for some versions of behaviorism but not all of them. For instance, the author of Don't Shoot the Dog explicitly identifies her frame as behaviorist and frequently cites academic research on behaviorist psychology as the origin of her theoretical approach. At the same time, she also includes the mental state of the organism being trained as a relevant variable. For example, she talks about how animal training gets faster once the animals figure out how they are being taught, and how they might in some situations realize the trainer is trying to teach them something without yet knowing what that something is:

With most animals, you have to go to some lengths to establish stimulus control at first, but often by the time you start bringing the third or fourth behavior under stimulus control, you will find that the animal seems to have generalized, or come to some conceptual understanding. After learning three or four cued behaviors, most subjects seem to recognize that certain events are signals, each signal means a different behavior, and acquiring reinforcers depends upon recognizing and responding correctly to signals. From then on, establishment of learned stimuli is easy. The subject already has the picture, and all it has to do is learn to identify new signals and associate them with the right behaviors. Since you, as trainer, are helping all you can by making that very clear, subsequent training can itself go much faster than the initial laborious steps. [...]

A special case of the conditioned aversive signal has recently become popular among dog trainers: the no-reward marker, often the word "Wrong," spoken in a neutral tone. The idea is that when the dog is trying various behaviors to see what you might want, you can help him by telling him what won't work, by developing a signal that signifies "That will not be reinforced." [...]

I once videotaped a beautiful Arabian mare who was being clicker-trained to prick her ears on command, so as to look alert in the show ring. She clearly knew that a click meant a handful of grain. She clearly knew her actions made her trainer click. And she knew it had something to do with her ears. But what? Holding her head erect, she rotated her ears individually: one forward, one back; then the reverse; then she flopped both ears to the sides like a rabbit, something I didn't know a horse could do on purpose. Finally, both ears went forward at once. Click! Aha! She had it straight from then on. It was charming, but it was also sad: We don't usually ask horses to think or to be inventive, and they seem to like to do it.

The book also has some discussion about reinforcing specific cognitive algorithms, such as "creativity":

Reinforcement has been used on an individual and group basis to foster not just specific behavior but characteristics of value to society—say, a sense of responsibility. Characteristics usually considered to be "innate" can also be shaped. You can, for example, reinforce creativity. My son Michael, while going to art school and living in a loft in Manhattan, acquired a kitten off the streets and reinforced it for "cuteness," for anything it did that amused him. I don't know how the cat defined that, but it became a most unusual cat— bold, attentive, loyal, and full of delightful surprises well into middle age. At Sea Life Park we shaped creativity with two dolphins—in an experiment that has now been much anthologized—by reinforcing anything the animals did that was novel and had not been reinforced before. Soon the subjects caught on and began "inventing" often quite amusing behaviors. One came up with wackier stuff than the other; on the whole, even in animals, degrees of creativity or imaginativeness can vary from one individual to another. But training "shifts" the curve for everyone, so that anyone can increase creativity from whatever baseline he or she began at. [...]

Some owners of clicker-wise dogs have become so accustomed to canine initiative and experimentation that they rely on the dog "offering behaviors," both learned and new, as a standard part of the training process. Many clicker trainers play a game with their dogs that I have nicknamed "101 Things to Do with a Box" (or a chair, or a ball, or a toy). Using essentially the same procedure we used at Sea Life Park to develop "creativity" in a dolphin, in each session the dog is clicked for some new way of manipulating the object. For example, you might put a cardboard box on the floor and click the dog for sniffing it and then for bumping it with his nose, until he's pushing it around the room. The next time, you might let the dog discover that pushing the box no longer gets clicked but that pawing it or stepping over the side and eventually getting into the box is what works. The dog might also come up with dragging the box, or lifting and carrying the box. One dog, faced anew with the challenge of the box game, got all his toys and put them into the box. Click! My Border terrier once tipped the box over onto herself and then scooted around under it, creating the spectacle of a mysterious traveling box. Everyone in the room laughed hysterically, which seemed to please her. Some dogs are just as clever at coming up with new ideas as any dolphin could be; and dogs, like dolphins—and horses—seem to love this challenging clicker game.

(The whole book is available online and is an easy read, I very much recommend it)

Nitpick: I think this link failed to embed.

[here](https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality-is-the-tiger-and-agents-are-its-teeth)

Whoops! Thanks for spotting. Fixed!

My understanding of Shard Theory is that what you said is true, except sometimes the shards "directly" make bids for outputs (particularly when they are more "reflexive," e.g. the "lick lollipop" shard is activated when you see a lollipop), but sometimes make bids for control of a local optimization module which then implements the output which scores best according to the various competing shards.  You could also imagine shards which do a combination of both behaviors.  TurnTrout can correct me if I'm wrong.