Shard Theory

Shard theory is an alignment research program about the relationship between training variables and learned values in trained RL agents. It is thus an approach to progressively fleshing out a mechanistic account of human values, learned values in RL agents, and (to a lesser extent) the learned algorithms in ML generally.

Shard theory's basic ontology of RL holds that shards are contextually activated, behavior-steering computations in neural networks (biological and artificial). The circuits that implement a shard that garners reinforcement are themselves reinforced, meaning that the shard will be more likely to trigger again in the future when given similar cognitive inputs.
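The reinforcement loop described above can be illustrated with a deliberately small toy model. This is only a sketch under strong simplifying assumptions, not shard theory's own formalism: a shard is reduced to a table of weights over context features, the most active shard steers behavior, and reward strengthens that shard's weights on the features that were present. The names `Shard`, `act`, `reinforce`, the feature labels, and the learning rate `lr` are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Toy stand-in for a contextually activated, behavior-steering computation."""
    name: str
    action: str
    # Hypothetical representation: association strength per context feature.
    weights: dict = field(default_factory=dict)

    def activation(self, context):
        # A shard is more active the more strongly it is tied to the
        # features present in the current context.
        return sum(self.weights.get(feature, 0.0) for feature in context)


def act(shards, context):
    """Return the shard that is most active in this context; it steers behavior."""
    return max(shards, key=lambda shard: shard.activation(context))


def reinforce(shard, context, reward, lr=0.5):
    """Strengthen the shard's ties to the features that were present when
    it garnered reward (an assumed, simplistic update rule)."""
    for feature in context:
        shard.weights[feature] = shard.weights.get(feature, 0.0) + lr * reward


shards = [
    Shard("juice", "approach_juice", {"see_juice": 1.0}),
    Shard("rest", "stay_put", {"tired": 0.8}),
]
context = {"see_juice", "tired"}

winner = act(shards, context)           # the juice shard is most active here
before = winner.activation(context)
reinforce(winner, context, reward=1.0)  # the shard garners reinforcement
after = winner.activation(context)      # it now fires more strongly on similar inputs
```

In this toy setup the winning shard's activation on the same context rises after the rewarded step, which is the sense in which a reinforced shard becomes "more likely to trigger again" given similar cognitive inputs. Real networks have no such clean separation into named shards; the separation here is an assumption made for readability.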

As an appreciable fraction of a neural network is composed of shards, large neural nets can possess quite intelligent constituent shards. These shards can be sophisticated enough to be well-modeled as playing negotiation games with each other, (potentially) explaining human psychological phenomena like akrasia and value changes from moral reflection. Shard theory also suggests an approach to explaining the shape of human values, and a scheme for RL alignment.