Iterated Distillation and Amplification

Ajeya Cotra

This is a guest post summarizing Paul Christiano’s proposed scheme for training machine learning systems that can be robustly aligned to complex and fuzzy values, which I call Iterated Distillation and Amplification (IDA) here. IDA is notably similar to AlphaGoZero and expert iteration.

The hope is that if we use IDA to train each learned component of an AI then the overall AI will remain aligned with the user’s interests while achieving state of the art performance at runtime — provided that any non-learned components such as search or logic are also built to preserve alignment and maintain runtime performance. This document gives a high-level outline of IDA.

Motivation: The alignment/capabilities tradeoff

Assume that we want to train a learner A to perform some complex fuzzy task, e.g. “Be a good personal assistant.” Assume that A is capable of learning to perform the task at a superhuman level — that is, if we could perfectly specify a “personal assistant” objective function and trained A to maximize it, then A would become a far better personal assistant than any human.

There is a spectrum of possibilities for how we might train A to do this task. On one end, there are techniques which allow the learner to discover powerful, novel policies that improve upon human capabilities:

Broad reinforcement learning: As A takes actions in the world, we give it a relatively sparse reward signal based on how satisfied or dissatisfied we are with the eventual consequences. We then allow A to optimize for the expected sum of its future rewards.
Broad inverse reinforcement learning: A attempts to infer our deep long-term values from our actions, perhaps using a sophisticated model of human psychology and irrationality to select which of many possible extrapolations is correct.

However, it is difficult to specify a broad objective that captures everything we care about, so in practice A will be optimizing for some proxy that is not completely aligned with our interests. Even if this proxy objective is “almost” right, its optimum could be disastrous according to our true values.

On the other end, there are techniques that try to narrowly emulate human judgments:

Imitation learning: We could train A to exactly mimic how an expert would do the task, e.g. by training it to fool a discriminative model trying to tell apart A’s actions from the human expert’s actions.
Narrow inverse reinforcement learning: We could train A to infer our near-term instrumental values from our actions, with the presumption that our actions are roughly optimal according to those values.
Narrow reinforcement learning: As A takes actions in the world, we give it a dense reward signal based on how reasonable we judge its choices are (perhaps we directly reward state-action pairs themselves rather than outcomes in the world, as in TAMER). A optimizes for the expected sum of its future rewards.

Using these techniques, the risk of misalignment is reduced significantly (though not eliminated) by restricting agents to the range of known human behavior — but this introduces severe limitations on capability. This tradeoff between allowing for novel capabilities and reducing misalignment risk applies across different learning schemes (with imitation learning generally being narrowest and lowest risk) as well as within a single scheme.

The motivating problem that IDA attempts to solve: if we are only able to align agents that narrowly replicate human behavior, how can we build an AGI that is both aligned and ultimately much more capable than the best humans?

Core concept: Analogy to AlphaGoZero

The core idea of Paul’s scheme is similar to AlphaGoZero (AGZ): We use a learned model many times as a subroutine in a more powerful decision-making process, and then re-train the model to imitate those better decisions.

AGZ’s policy network p is the learned model. At each iteration, AGZ selects moves by an expensive Monte Carlo Tree Search (MCTS) which uses policy p as its prior; p is then trained to directly predict the distribution of moves that MCTS ultimately settles on. In the next iteration, MCTS is run using the new, more accurate p, and p is trained to predict the eventual outcome of that process, and so on. After enough iterations, a fixed point is reached — p is unable to learn how running MCTS will change its current probabilities.

MCTS is an amplification of p — it uses p as a subroutine in a larger process that ultimately makes better moves than p alone could. In turn, p is a distillation of MCTS: it learns to directly guess the results of running MCTS, achieving comparable performance while short-cutting the expensive computation. The idea of IDA is to use the basic iterated distillation and amplification procedure in a much more general domain.
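The amplify/distill loop described above can be illustrated with a deliberately tiny sketch. This is not AGZ or MCTS: the one-decision "game", the payoff table `MOVE_VALUES`, the reweighting rule in `amplify`, and the averaging rule in `distill` are all assumptions chosen only to make the loop runnable. It shows a slow process that improves on a prior, a fast policy trained toward that process, and a fixed point where amplification no longer changes the policy.

```python
import math

# Hypothetical toy "game": a single decision with three moves and fixed payoffs.
MOVE_VALUES = [0.1, 0.5, 0.9]

def amplify(policy):
    """Stand-in for search: reweight the prior by exp(payoff) and renormalize."""
    weights = [p * math.exp(v) for p, v in zip(policy, MOVE_VALUES)]
    total = sum(weights)
    return [w / total for w in weights]

def distill(policy, target, learning_rate=0.5):
    """Stand-in for training: move the fast policy part-way toward the target."""
    return [(1 - learning_rate) * p + learning_rate * t
            for p, t in zip(policy, target)]

def iterate(policy, steps):
    """Repeat: amplify the current policy, then distill the result back into it."""
    for _ in range(steps):
        policy = distill(policy, amplify(policy))
    return policy

uniform = [1 / 3, 1 / 3, 1 / 3]
final = iterate(uniform, steps=200)
```

In this toy, a single call to `amplify(uniform)` already shifts probability toward the best move, and after many rounds `amplify(final)` is essentially equal to `final`, which mirrors the fixed point described for p.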

The IDA Scheme

IDA involves repeatedly improving a learned model through an amplification and distillation process over multiple iterations.

Amplification is interactive and human-directed in IDA

In AGZ, the amplification procedure is Monte Carlo Tree Search — it’s a simple and well-understood algorithm, and there’s a clear mechanism for how it improves on the policy network’s original choices (it traverses the game tree more deeply). But in IDA, amplification is not necessarily a fixed algorithm that can be written down once and repeatedly applied; it’s an interactive process directed by human decisions.

In most domains, humans are capable of improving their native capabilities by delegating to assistants (e.g. because CEOs can delegate tasks to a large team, they can produce orders of magnitude more output per day than they could on their own). This means if our learning procedure can create an adequate helper for the human, the human can use the AI to amplify their ability — this human/AI system may be capable of doing things that the human couldn’t manage on their own.

Below I consider the example of using IDA to build a superhuman personal assistant. Let A[t] refer to the state of the learned model after the end of iteration t; the initial agent A[0] is trained by a human overseer H.

Example: Building a superhuman personal assistant

H trains A[0] using a technique from the narrow end of the spectrum, such as imitation learning. Here we are imagining a much more powerful version of “imitation learning” than current systems are actually capable of — we assume that A[0] can acquire nearly human-level capabilities through this process. That is, the trained A[0] model executes all the tasks of a personal assistant as H would (including comprehending English instructions, writing emails, putting together a meeting schedule, etc).

Even though A[0] cannot discover any novel capabilities, it has two key advantages over H: it can run much faster, and many copies or versions of it can be run at once. We hope to leverage these advantages to construct a larger system — involving H and many copies of A[0] — that will substantially improve on H’s capabilities while preserving alignment with H’s values.

H can use calls to A[0] (along with other tools such as external memory) to become a better personal assistant. For example, H could assign one copy of A[0] to figure out the best time to schedule the client’s recurring team meetings, another copy to figure out what to order the client for lunch, another copy to balance the client’s personal budget, etc. H now has the ability to get very quick solutions to sub-problems that are roughly as good as the ones H would have come up with on their own over a longer time period, and can combine these results to make much better decisions than an unaided human.

Let Amplify(H, A[0]) refer to the larger system of H + many copies of A[0] + aids. Compared to A[0] alone, the Amplify(H, A[0]) system has much higher time and resource costs but its eventual decisions are much better. Moreover, because in each of its individual decisions each copy of A[0] continues to act just as a human personal assistant would act, we can hope that Amplify(H, A[0]) preserves alignment.
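As a rough illustration of the delegation pattern in this example, here is a minimal sketch in which H hands each sub-problem to its own copy of the assistant and collects the answers. Everything in it is hypothetical: `assistant_copy` is a canned lookup standing in for A[0], the sub-task strings and answers are made up, and a thread pool is used only to suggest many copies running at once.

```python
from concurrent.futures import ThreadPoolExecutor

def assistant_copy(subtask):
    """Hypothetical stand-in for one copy of A[0] answering one sub-problem."""
    canned = {
        "schedule team meeting": "Tuesdays 10:00",
        "order lunch": "vegetarian bento",
        "balance budget": "move $200 to savings",
    }
    return canned.get(subtask, "no answer")

def amplified_overseer(subtasks):
    """H gives each sub-problem to its own copy of A[0], then gathers the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        answers = list(pool.map(assistant_copy, subtasks))  # map preserves order
    return dict(zip(subtasks, answers))

plan = amplified_overseer(["schedule team meeting", "order lunch", "balance budget"])
```

The combining step here is trivial (a dictionary of answers). In the scheme described above, H's judgment about how to split the task and how to use the answers is where the amplification comes from.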

In the next iteration of training, the Amplify(H, A[0]) system takes over the role of H as the overseer. A[1] is trained with narrow and safe techniques to quickly reproduce the results of Amplify(H, A[0]). Because we assumed Amplify(H, A[0]) was aligned, we can hope that A[1] is also aligned if it is trained using sufficiently narrow techniques which introduce no new behaviors. A[1] is then used in Amplify(H, A[1]), which serves as an overseer to train A[2], and so on.

Pseudocode

def IDA(H):  
   A <- random initialization
   repeat:
      A <- Distill(Amplify(H, A))

def Distill(overseer):    
   """
   Returns an AI trained using narrow, robust techniques to 
   perform a task that the overseer already understands how to 
   perform. 
   """

def Amplify(human, AI):   
   """  
   Interactive process in which human uses many calls to AI to     
   improve on human's native performance at relevant task(s).    
   """

What properties must hold for IDA to work?

The IDA scheme is a template with “slots” for Amplify and Distill procedures that have not been fully specified yet — in fact, they rely on capabilities we don’t yet have. Because IDA itself is not fully specified, it’s not clear what minimal set of properties is necessary for it to succeed.

Achieving alignment and high capability

That said, here are some general properties which seem necessary — though likely not sufficient — for IDA agents to achieve robust alignment and high capability:

1. The Distill procedure robustly preserves alignment: Given an aligned agent H we can use narrow safe learning techniques to train a much faster agent A which behaves as H would have behaved, without introducing any misaligned optimization or losing important aspects of what H values.
2. The Amplify procedure robustly preserves alignment: Given an aligned agent A, it is possible to specify an amplification scheme which calls A multiple times as a subroutine in a way that reliably avoids introducing misaligned optimization.
3. At least some human experts are able to iteratively apply amplification to achieve arbitrarily high capabilities at the relevant task: a) there is some threshold of general capability such that if someone is above this threshold, they can eventually solve any problem that an arbitrarily intelligent system could solve, provided they can delegate tasks to similarly-intelligent assistants and are given arbitrary amounts of memory and time; b) at least some human experts are above this threshold of generality — given enough time and resources, they can figure out how to use AI assistants and tools to improve their capabilities arbitrarily far.

The non-profit Ought is working on gathering more evidence about assumptions 2 and 3.

Achieving competitive performance and efficiency

Paul aims for IDA agents to be competitive with traditional RL agents in time and resource costs at runtime — this is a reasonable expectation because an IDA agent is ultimately just another learned model whose weights were tuned with an unusual training procedure.

Resource and time cost during training is a more open question; I haven’t explored the assumptions that would have to hold for the IDA training process to be practically feasible or resource-competitive with other AI projects.

This was originally posted here.

In the pseudocode, it would make more sense to initialize A <- Distill(H), wouldn't it? Otherwise, running Amplify with the randomly initialized A in the next step wouldn't be helpful.

I had this same thought, but my understanding (which is not solid) is that in the first iteration, since A is random, H can just ignore A and go with its own output (if my assistants are unhelpful, I can just try to perform the task all on my own). So Amplify(H, A) becomes H, which means A <- Distill(Amplify(H, A)) is basically A <- Distill(H), exactly as you suggested.

Based on discussion between Vladimir Slepnev and Paul in this thread, it seems that the statements in this post suggesting that the first stage of IDA will produce nearly-human-level assistants ("we assume that A[0] can acquire nearly human-level capabilities through this process", "Given an aligned agent H we can use narrow safe learning techniques to train a much faster agent A which behaves as H would have behaved") are misleading. In the same thread, Paul says that he "will probably correct it", but as far as I can tell, neither the Medium post nor the version of the post in this sequence (which was published after the discussion) has been corrected.

I think there are 2 mistakes in the pseudocode.

First mistake

what rmoehn said.

Second mistake

In the personal assistant example you say

In the next iteration of training, the Amplify(H, A[0]) system takes over the role of H as the overseer.

which implies that we do

H <- Amplify(H, A)

But in the pseudocode the original human overseer acts as the overseer all the time.

Suggested change of the pseudocode, which fixes both mistakes

def IDA(H):
   repeat:
      A ← Distill(H)
      H ← Amplify(H, A)

I think H is always the same. In fact, H is a human, so it doesn't make any sense to have code that reassigns H. In every step, a new system $A^{(t + 1)}$ is trained by letting a regular human oversee it, where the human has access to the system $A^{(t)}$.

Conversely, your code would imply that the human itself is replaced with something, and that thing then uses the system $A^{(t)}$. This does not happen.

(Unless my understanding is wildly off; I'm only reading this sequence for the second time.)

AGZ’s policy network p is the learned model.

I found this bit slightly confusing. As far as I understand from the AGZ Nature paper, AGZ does not have a separate policy network p, but uses a single network $f_{θ}$ which outputs both the learned policy p and the estimated probability v that the current player will win the game. Is this what the sentence is referring to?

Yes, AGZ uses the same network for policy and value function.

Narrow reinforcement learning: As A takes actions in the world, we give it a dense reward signal based on how reasonable we judge its choices are (perhaps we directly reward state-action pairs themselves rather than outcomes in the world, as in TAMER). A optimizes for the expected sum of its future rewards.

Wouldn't it try to bring about states in which some action is particularly reasonable? Like the villain from that story who brings about a public threat in order to be seen defeating it.

Potentially, it depends on the time horizon and on how the rewards are calculated.

The most natural reward for the state transition (s0, s1) is just V(s1) - V(s0) (where V is some intuitive "human value function," i.e. ask a human "how good does state s seem?"). This reward function wouldn't have that problem.

Maximizing the sum of the difference of state value just maximizes state value again, which the point of narrow reinforcement learning was to get away from.
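To spell out the step behind this objection (a sketch, assuming an undiscounted sum over a fixed horizon $T$, which the thread does not specify): with the reward for each transition taken to be $V(s_{t+1}) - V(s_t)$, the intermediate terms cancel.

$$\sum_{t=0}^{T-1} \bigl[ V(s_{t+1}) - V(s_t) \bigr] = V(s_T) - V(s_0)$$

Since $V(s_0)$ is fixed, maximizing this sum is the same as maximizing $V(s_T)$, so how narrow the objective is depends on how short the horizon $T$ is.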

The goal of narrow reinforcement learning is to get something-like-human-level behavior using human-level oversight. Optimizing the human value function over short time horizons seems like a fine approach to me.

The difference with broad reinforcement learning is that you aren't trying to evaluate actions you can't understand by looking at the consequences you can observe.

"there is some threshold of general capability such that if someone is above this threshold, they can eventually solve any problem that an arbitrarily intelligent system could solve"

This is a very interesting assumption. Is there research or discussions on this?

When A[n+1] is supposed to imitate the output of (H, A[n]), I think IDA is safe, because I think imitation is safe. (If A[0] is a rock and (H, A[n]) is a group of one human and two A[n]'s, then A[n] is basically imitating a group of humans). If (H, A[n]) is supposed to provide a reward signal to A[n+1], which A[n+1] tries to optimize, I think this version of IDA is unsafe, for reasons similar to what Wei Dai expressed in a comment (on a post I now can't find) taking issue with the inductive step in the original argument. Can we standardize different names for these two designs? Unless, is the latter version deprecated?

I noticed that I have two distinct "mental pictures" for what the overseer is, depending on how the Distill procedure works (i.e. depending on the narrow technique used in the Distill procedure).

1. For imitation learning and narrow inverse reinforcement learning: a "passive" overseer that just gets used as a template/target for imitation.
2. For narrow reinforcement learning and in discussions about approval-directed agents: an "active" overseer that rates actions or provides rewards.

I wonder if this way of thinking about the overseer is okay/correct, or if I'm missing something (e.g. maybe even in case (1), the overseer has a more active role than I can make out). Assuming this way of thinking about the overseer is okay, it seems like for case (1), the term "overseer" has connotations that extend beyond the role played by the overseer (i.e. it doesn't really provide any oversight since it is passive).
