Shard Theory

Edited by David Udell, Noosphere89, et al. Last updated 30th Dec 2024.

Shard Theory is an alignment research program about the relationship between training variables and the values learned by Reinforcement Learning (RL) agents. It is thus an approach to progressively fleshing out a mechanistic account of human values, of learned values in RL agents, and (to a lesser extent) of the learned algorithms in ML generally.

Shard theory's basic ontology of RL holds that shards are contextually activated, behavior-steering computations in neural networks (biological and artificial). When a shard garners reinforcement, the circuits that implement it are strengthened, so the shard becomes more likely to trigger again in the future when given similar cognitive inputs.
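The reinforcement dynamic described above can be illustrated with a deliberately simplified toy model. This is only a sketch under loud assumptions: real shards are circuits in a neural network, not labeled objects, and every name here (`Shard`, `choose_shard`, `reinforce`, the `juice_visible` and `toy_visible` context features, the strength values, and the learning rate) is hypothetical and chosen purely for illustration. It is not an implementation from the shard theory literature.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    """Toy stand-in for a contextually activated, behavior-steering computation."""
    name: str
    trigger: str           # hypothetical context feature that activates the shard
    action: str            # behavior the shard steers toward
    strength: float = 1.0  # how strongly the shard bids when active


def choose_shard(shards, context):
    """Return the active shard with the strongest bid, or None if none is active."""
    active = [s for s in shards if s.trigger in context]
    return max(active, key=lambda s: s.strength) if active else None


def reinforce(shard, reward, lr=0.5):
    """Strengthen the shard that was steering behavior when reward arrived."""
    shard.strength += lr * reward


juice = Shard("juice", trigger="juice_visible", action="approach_juice")
play = Shard("play", trigger="toy_visible", action="pick_up_toy", strength=1.2)
shards = [juice, play]

both = {"juice_visible", "toy_visible"}

# Before any reinforcement, the stronger play shard wins when both are active.
before = choose_shard(shards, both).action

# One rewarded episode in a juice-only context strengthens the juice shard.
winner = choose_shard(shards, {"juice_visible"})
reinforce(winner, reward=1.0)

# In a similar context afterwards, the reinforced shard is more likely to steer.
after = choose_shard(shards, both).action
```

In this toy run, `before` is `"pick_up_toy"` and `after` is `"approach_juice"`: the shard that was active when reward arrived now outbids its competitor in contexts that share its trigger. A winner-take-all `max` is only one possible choice; a softmax over bids would model "more likely to trigger" probabilistically.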

Because an appreciable fraction of a neural network is composed of shards, large neural nets can possess quite intelligent constituent shards. These shards can be sophisticated enough to be well-modeled as playing negotiation games with each other, which potentially explains human psychological phenomena such as akrasia and value changes from moral reflection. Shard theory also suggests an approach to explaining the shape of human values, and a scheme for RL alignment.

Posts tagged Shard Theory
The shard theory of human values (Quintin Pope, TurnTrout; 3y; 261 karma, 67 comments; Ω)
Shard Theory in Nine Theses: a Distillation and Critical Appraisal (LawrenceC; 3y; 150 karma, 30 comments; Ω)
Contra shard theory, in the context of the diamond maximizer problem (So8res; 3y; 105 karma, 19 comments; Ω)
The heritability of human values: A behavior genetic critique of Shard Theory (geoffreymiller; 3y; 82 karma, 63 comments)
Understanding and avoiding value drift (TurnTrout; 3y; 48 karma, 14 comments; Ω)
Reward is not the optimization target (TurnTrout; 3y; 378 karma, 127 comments; Ω)
Understanding and controlling a maze-solving policy network (TurnTrout, peligrietzer, Ulisse Mini, Monte M, David Udell; 2y; 334 karma, 28 comments; Ω)
Shard Theory: An Overview (David Udell; 3y; 167 karma, 34 comments; Ω)
Inner and outer alignment decompose one hard problem into two extremely hard problems (TurnTrout; 3y; 136 karma, 23 comments; Ω)
A shot at the diamond-alignment problem (TurnTrout; 3y; 95 karma, 67 comments; Ω)
Shard Theory - is it true for humans? (Rishika; 1y; 71 karma, 7 comments; Ω)
[April Fools'] Definitive confirmation of shard theory (TurnTrout; 2y; 170 karma, 8 comments)
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks (cloud, Jacob G-W, Evzen, Joseph Miller, TurnTrout; 9mo; 169 karma, 12 comments; Ω)
Predictions for shard theory mechanistic interpretability results (TurnTrout, Ulisse Mini, peligrietzer; 3y; 105 karma, 10 comments; Ω)
Human values & biases are inaccessible to the genome (TurnTrout; 3y; 95 karma, 54 comments; Ω)