Shard Theory — LessWrong

Jul 14, 2022 by Quintin Pope

Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2.

Humans provide an untapped wealth of evidence about alignment
TurnTrout, Quintin Pope · 4y · 213 karma · 94 comments

Human values & biases are inaccessible to the genome
4y · 95 karma · 54 comments

General alignment properties
4y · 51 karma · 2 comments

Evolution is a bad analogy for AGI: inner alignment
4y · 82 karma · 18 comments

Reward is not the optimization target
4y · 386 karma · 128 comments

The shard theory of human values
Quintin Pope, TurnTrout · 4y · 261 karma · 67 comments

Understanding and avoiding value drift
4y · 48 karma · 14 comments

A shot at the diamond-alignment problem
4y · 95 karma · 67 comments

Don't design agents which exploit adversarial inputs
TurnTrout, Garrett Baker · 4y · 72 karma · 64 comments

Don't align agents to evaluations of plans
4y · 48 karma · 49 comments

Alignment allows "nonrobust" decision-influences and doesn't require robust grading
4y · 62 karma · 41 comments

Inner and outer alignment decompose one hard problem into two extremely hard problems
3y · 146 karma · 23 comments