Iterated Amplification

Oct 29, 2018

by paulfchristiano

This is an upcoming sequence curated by Paul Christiano on one current approach to alignment: Iterated Amplification. The posts will be released in the coming weeks, on average every 2-3 days.

Preface to the sequence on iterated amplification

Problem statement

The first part of this sequence clarifies the problem that iterated amplification is trying to solve, which is both narrower and broader than you might expect.

The Steering Problem

Clarifying "AI Alignment"

An unaligned benchmark

Prosaic AI alignment

Basic intuition

The second part of the sequence outlines the basic intuitions that motivate iterated amplification. I think that these intuitions may be more important than the scheme itself, but they are considerably more informal.

Approval-directed agents: overview

Approval-directed agents: details

Approval-directed bootstrapping

Humans Consulting HCH

Corrigibility

The scheme

The core of the sequence is the third section. Benign model-free RL describes iterated amplification as a general outline into which we can substitute arbitrary algorithms for reward learning, amplification, and robustness. The first four posts all describe variants of this idea from different perspectives; if you find that one of those descriptions is clearest for you, I recommend focusing on that one and skimming the others.

Iterated Distillation and Amplification

Benign model-free RL

Factored Cognition
