Shard theory is an alignment research program that studies the relationship between training variables and the values learned by trained RL agents. It is thus an approach to progressively fleshing out a mechanistic account of human values, of learned values in RL agents, and (to a lesser extent) of the learned algorithms in ML systems generally. The shard theory of human values states that values are ultimately implemented by shards in the brain: contextually activated patterns of behavior that were reinforced because they led to predicted reward. When morality is involved, shards in the brain negotiate or fight it out until one coalition of shards wins, resulting in a decision or in shying away from one. The theory tries to explain why humans have the values they have and why human value is complex, as well as why the brain behaves as if outer-aligned and does not produce the catastrophic, X-risk-scale behavior that Instrumental Convergence and the Orthogonality Thesis would otherwise lead one to expect from a powerful optimizer.
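The negotiation picture above can be sketched as a toy model: shards as contextually activated value functions that bid on candidate actions, with the action backed by the strongest coalition winning. Everything here (the `Shard` class, the contexts, the bid numbers) is an invented illustration under simplifying assumptions, not part of shard theory's actual machinery:

```python
# Toy sketch (hypothetical, for illustration only): shards as
# contextually activated bidders over candidate actions.
from dataclasses import dataclass

@dataclass
class Shard:
    name: str
    triggers: frozenset  # contexts in which this shard activates
    bids: dict           # action -> strength of this shard's preference

    def bid(self, context, action):
        # A shard only influences decisions in contexts that activate it.
        return self.bids.get(action, 0.0) if context & self.triggers else 0.0

def decide(shards, context, actions):
    """Each active shard bids on each action; the action backed by the
    strongest coalition of shards wins."""
    totals = {a: sum(s.bid(context, a) for s in shards) for a in actions}
    return max(totals, key=totals.get)

shards = [
    Shard("honesty", frozenset({"conversation"}),
          {"tell_truth": 2.0, "lie": -1.0}),
    Shard("self-interest", frozenset({"conversation", "alone"}),
          {"lie": 1.5}),
]
print(decide(shards, {"conversation"}, ["tell_truth", "lie"]))  # tell_truth
```

In this cartoon, the "honesty" shard's bid against lying outweighs the "self-interest" shard's bid for it, so the honest action wins; changing the bid strengths or the active context changes which coalition prevails, loosely mirroring the negotiation described above.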