Shard Theory

Jul 14, 2022 by Quintin Pope

Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2.

Humans provide an untapped wealth of evidence about alignment (TurnTrout, Quintin Pope; 189 karma)
Human values & biases are inaccessible to the genome (TurnTrout; 88 karma)
General alignment properties (TurnTrout; 49 karma)
Evolution is a bad analogy for AGI: inner alignment (Quintin Pope; 73 karma)
Reward is not the optimization target (TurnTrout; 292 karma)
The shard theory of human values (Quintin Pope, TurnTrout; 239 karma)
Understanding and avoiding value drift (TurnTrout; 43 karma)
A shot at the diamond-alignment problem (TurnTrout; 92 karma)
Don't design agents which exploit adversarial inputs (TurnTrout, Garrett Baker; 62 karma)
Don't align agents to evaluations of plans (TurnTrout; 44 karma)
Alignment allows "nonrobust" decision-influences and doesn't require robust grading (TurnTrout; 57 karma)
Inner and outer alignment decompose one hard problem into two extremely hard problems (TurnTrout; 103 karma)