This post is about two proposals for aligning AI systems in a scalable way:

Iterated Distillation and Amplification (often just called 'Iterated Amplification'), or IDA for short,^[1] is a proposal by Paul Christiano.
Debate is an IDA-inspired proposal by Geoffrey Irving.

This post is written to be as easy to understand as possible, so if you found existing explanations of IDA confusing, or if you just never bothered because it seemed intimidating, this post is for you. The only prerequisite is knowing about the concept of outer alignment (and knowing about inner alignment is helpful as well). Roughly,

Outer alignment is aligning the training signal or training data we give to our model with what we want.
If the model we find implements its own optimization process, then inner alignment is aligning [the thing the model is optimizing for] with the training signal.

See also this post for an overview and this paper or my ELI12 edition for more details on inner alignment.

1. Motivation / Reframing AI Risk

Why do we need a fancy alignment scheme?

A few months back, there was some debate about whether the classical arguments for why AI is dangerous, of the kind made in Superintelligence, hold up to scrutiny. I think a charitable reading of the book takes it to be primarily defending one claim, which also answers the question above. Namely,

It is hard to define a scalable training procedure that is not outer-misaligned.

For example, a language model (GPT-3 style) is outer-misaligned because the objective we train for is to predict the most likely next word, which says nothing about being 'useful' or 'friendly'. Similarly, a question-answering system trained with Reinforcement Learning is outer-misaligned because the objective we train for is 'optimize how much the human likes the answer', not 'optimize for a true and useful answer'.
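To make the language-model case concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration rather than how any real system is trained: the continuation labels, the corpus counts, and the helper `next_word_loss` are all invented. It shows only that a next-word objective rewards matching the training text, whether or not the popular continuation is the accurate one.

```python
import math

# Hypothetical toy statistics: how often each continuation follows some
# fixed prompt in the training text. The labels and counts are invented.
corpus_counts = {"popular_but_false": 90, "accurate": 10}

def next_word_loss(predicted_probs, counts):
    """Cross-entropy of the model's next-word predictions against the corpus."""
    total = sum(counts.values())
    return -sum(
        (n / total) * math.log(predicted_probs[word])
        for word, n in counts.items()
    )

# A model that mirrors the corpus, and so mostly repeats the false claim.
mimic = {"popular_but_false": 0.9, "accurate": 0.1}
# A model that puts most of its weight on the accurate continuation.
truthful = {"popular_but_false": 0.1, "accurate": 0.9}

mimic_loss = next_word_loss(mimic, corpus_counts)
truthful_loss = next_word_loss(truthful, corpus_counts)

# The training objective prefers the mimic: its loss is lower.
print(mimic_loss < truthful_loss)
```

Under these assumed numbers, the model that imitates the corpus gets the lower loss, so training pushes toward it. Nothing in the objective refers to truth or usefulness.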

I'll refer to this claim as $(*)$. If $(*)$ is true, it is a problem even under the most optimistic assumptions. For example, we can suppose that

progress is gradual all the way, and we can test everything before we deploy it;
we are likely to maintain control of AI systems (and can turn them off whenever we want to) for a while after they exceed our capabilities;
it takes at least another 50 years for AI to exceed human capabilities across a broad set of tasks.

Even then, $(*)$ remains a problem. The only way to bui...

Iterated Amplification