LESSWRONG

Shard Theory

Jul 14, 2022 by Quintin Pope

Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2.

Humans provide an untapped wealth of evidence about alignment (TurnTrout, Quintin Pope; 3y; Ω; 213 karma; 94 comments)
Human values & biases are inaccessible to the genome (TurnTrout; 3y; Ω; 95 karma; 54 comments)
General alignment properties (TurnTrout; 3y; Ω; 51 karma; 2 comments)
Evolution is a bad analogy for AGI: inner alignment (Quintin Pope; 3y; Ω; 81 karma; 15 comments)
Reward is not the optimization target (TurnTrout; 3y; Ω; 379 karma; 124 comments)
The shard theory of human values (Quintin Pope, TurnTrout; 3y; Ω; 259 karma; 67 comments)
Understanding and avoiding value drift (TurnTrout; 3y; Ω; 48 karma; 14 comments)
A shot at the diamond-alignment problem (TurnTrout; 3y; Ω; 95 karma; 67 comments)
Don't design agents which exploit adversarial inputs (TurnTrout, Garrett Baker; 3y; Ω; 72 karma; 64 comments)
Don't align agents to evaluations of plans (TurnTrout; 3y; Ω; 48 karma; 49 comments)
Alignment allows "nonrobust" decision-influences and doesn't require robust grading (TurnTrout; 3y; Ω; 62 karma; 41 comments)
Inner and outer alignment decompose one hard problem into two extremely hard problems (TurnTrout; 3y; Ω; 141 karma; 23 comments)