Shard Theory

Jul 14, 2022 by Quintin Pope

Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2.

Humans provide an untapped wealth of evidence about alignment
TurnTrout, Quintin Pope · 1y · 189 karma · 93 comments

Human values & biases are inaccessible to the genome
TurnTrout · 1y · 88 karma · 52 comments

General alignment properties
TurnTrout · 1y · 50 karma · 2 comments

Evolution is a bad analogy for AGI: inner alignment
Quintin Pope · 1y · 73 karma · 9 comments

Reward is not the optimization target
TurnTrout · 1y · 317 karma · 115 comments

The shard theory of human values
Quintin Pope, TurnTrout · 1y · 241 karma · 63 comments

Understanding and avoiding value drift
TurnTrout · 1y · 43 karma · 9 comments

A shot at the diamond-alignment problem
TurnTrout · 1y · 92 karma · 58 comments

Don't design agents which exploit adversarial inputs
TurnTrout, Garrett Baker · 10mo · 62 karma · 62 comments

Don't align agents to evaluations of plans
TurnTrout · 10mo · 44 karma · 48 comments

Alignment allows "nonrobust" decision-influences and doesn't require robust grading
TurnTrout · 10mo · 60 karma · 41 comments

Inner and outer alignment decompose one hard problem into two extremely hard problems
TurnTrout · 10mo · 102 karma · 18 comments