[-]rmoehn6y70

In the pseudocode, it would make more sense to initialize A <- Distill(H), wouldn't it? Otherwise, running Amplify with the randomly initialized A in the next step wouldn't be helpful.

[-]riceissa6y70

I had this same thought, but my understanding (which is not solid) is that in the first iteration, since A is random, H can just ignore A and go with its own output (if my assistants are unhelpful, I can just try to perform the task all on my own). So Amplify(H, A) becomes H, which means A <- Distill(Amplify(H, A)) is basically A <- Distill(H), exactly as you suggested.
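A toy sketch of that argument, in case it helps. Everything here is an assumption made for illustration and not taken from the post: the doubling task, the lookup-table `distill`, and an overseer `H` who accepts the assistant's suggestion only when it matches their own judgment.

```python
import random

QUESTIONS = list(range(10))

def H(question, assistant=None):
    # Hypothetical overseer: can solve the toy task (doubling) unaided, and
    # uses the assistant's suggestion only if it agrees with their own answer.
    own_answer = question * 2
    if assistant is not None and assistant(question) == own_answer:
        return assistant(question)
    return own_answer

def random_agent(seed=0):
    # Randomly initialized A: a fixed table of arbitrary answers.
    rng = random.Random(seed)
    table = {q: rng.randint(0, 100) for q in QUESTIONS}
    return lambda q: table[q]

def amplify(h, a):
    # Amplify(H, A): H answers while being allowed to consult A.
    return lambda q: h(q, assistant=a)

def distill(overseer):
    # Stand-in for Distill: memorize the overseer's answers on the training questions.
    table = {q: overseer(q) for q in QUESTIONS}
    return lambda q: table[q]

def ida(h, steps=1):
    a = random_agent()
    for _ in range(steps):
        a = distill(amplify(h, a))
    return a

a_original = ida(H, steps=1)   # A <- Distill(Amplify(H, random A))
a_direct = distill(H)          # A <- Distill(H)
same = all(a_original(q) == a_direct(q) for q in QUESTIONS)
```

Under these assumptions the first iteration gives the same agent either way, because `H` never depends on an unhelpful assistant.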

[-]riceissa6y*60

Based on the discussion between Vladimir Slepnev and Paul in this thread, the statements in this post implying that the first stage of IDA will produce nearly-human-level assistants seem misleading. Two examples: "we assume that A[0] can acquire nearly human-level capabilities through this process" and "Given an aligned agent H we can use narrow safe learning techniques to train a much faster agent A which behaves as H would have behaved". In the same thread, Paul says that he "will probably correct it", but as far as I can tell, neither the Medium post nor the version of the post in this sequence (which was published after the discussion) has been corrected.

[-]philip_b6y30

I think there are 2 mistakes in the pseudocode.

First mistake

What rmoehn said.

Second mistake

In the personal assistant example you say:

"In the next iteration of training, the Amplify(H, A[0]) system takes over the role of H as the overseer."

which implies that we do

H <- Amplify(H, A)

But in the pseudocode the original human overseer acts as the overseer all the time.

Suggested change of the pseudocode, which fixes both mistakes

def IDA(H):
   repeat:
      A ← Distill(H)
      H ← Amplify(H, A)

[-]Rafael Harth5y*60

I think H is always the same. In fact, H is a human, so it doesn't make any sense to have code that reassigns H (such as H ← Amplify(H, A)). In every step, a new system $A^{(t+1)}$ is trained by letting a regular human oversee it, where the human has access to the system $A^{(t)}$.

Conversely, your code would imply that the human itself is replaced with something, and that thing then uses the system $A^{(t)}$ . This does not happen.

(Unless my understanding is wildly off; I'm only reading this sequence for the second time.)
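To see concretely how the two readings differ, here is a small symbolic sketch. It builds the expressions as strings only; the function names and the `A_random` placeholder are illustrative assumptions, not anything defined in the post.

```python
def distill(x):
    return f"Distill({x})"

def amplify(h, a):
    return f"Amplify({h}, {a})"

def ida_fixed_overseer(steps):
    # One reading of the post's pseudocode: H never changes, only A is updated.
    h, a = "H", "A_random"
    for _ in range(steps):
        a = distill(amplify(h, a))
    return a

def ida_reassigned_overseer(steps):
    # The suggested change: the overseer itself is replaced every iteration.
    h, a = "H", None
    for _ in range(steps):
        a = distill(h)
        h = amplify(h, a)
    return a

fixed_1 = ida_fixed_overseer(1)
reassigned_2 = ida_reassigned_overseer(2)
reassigned_3 = ida_reassigned_overseer(3)
```

In the second variant the overseer expression nests from the third iteration on (`Amplify(Amplify(H, ...), ...)`), whereas in the first variant the human H always appears exactly once as the outermost overseer.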

[-]Kibidango7yΩ230

"AGZ’s policy network p is the learned model."

I found this bit slightly confusing. As far as I understand from the AGZ Nature paper, AGZ does not have a separate policy network p, but uses a single network $f_{θ}$ which outputs both the learned policy p and the estimated probability v that the current player will win the game. Is this what the sentence is referring to?

[-]paulfchristiano7yΩ460

Yes, AGZ uses the same network for policy and value function.
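For readers who want to picture the "one network, two outputs" structure, here is a minimal pure-Python sketch. It is not the AGZ architecture: the tiny sizes, the random weights, the single tanh hidden layer, and the tanh squashing of the value are all illustrative assumptions. The only point is that the policy p and the value v are both computed from the same shared features.

```python
import math
import random

def make_network(n_inputs, n_hidden, n_moves, seed=0):
    # Hypothetical parameters: one shared trunk and two small output heads.
    rng = random.Random(seed)
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        "trunk": matrix(n_hidden, n_inputs),
        "policy_head": matrix(n_moves, n_hidden),
        "value_head": matrix(1, n_hidden),
    }

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def f_theta(params, state):
    # Shared features, used by both heads.
    hidden = [math.tanh(h) for h in matvec(params["trunk"], state)]
    # Policy head: softmax over moves.
    logits = matvec(params["policy_head"], hidden)
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    p = [e / total for e in exps]
    # Value head: a single scalar, squashed into [-1, 1] here.
    v = math.tanh(matvec(params["value_head"], hidden)[0])
    return p, v

params = make_network(n_inputs=4, n_hidden=8, n_moves=3)
p, v = f_theta(params, [1.0, 0.0, -1.0, 0.5])
```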

[-]Gurkenglas7yΩ120

"Narrow reinforcement learning: As A takes actions in the world, we give it a dense reward signal based on how reasonable we judge its choices are (perhaps we directly reward state-action pairs themselves rather than outcomes in the world, as in TAMER). A optimizes for the expected sum of its future rewards."

Wouldn't it try to bring about states in which some action is particularly reasonable? Like the villain from that story who brings about a public threat in order to be seen defeating it.

[-]paulfchristiano7yΩ130

Potentially, it depends on the time horizon and on how the rewards are calculated.

The most natural reward for the state transition (s0, s1) is just V(s1) - V(s0) (where V is some intuitive "human value function," i.e. ask a human "how good does state s seem?"). This reward function wouldn't have that problem.

[-]Gurkenglas7yΩ110

Maximizing the sum of the differences in state value just maximizes state value again, which is what narrow reinforcement learning was supposed to get away from.
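To make the telescoping explicit (a worked sketch, assuming an undiscounted sum over a fixed horizon $T$, neither of which is stated in the thread):

$$\sum_{t=0}^{T-1} \bigl( V(s_{t+1}) - V(s_t) \bigr) = V(s_T) - V(s_0)$$

Since $s_0$ is given, maximizing the summed reward over the horizon is the same as maximizing $V(s_T)$, the value of the state reached at the end of that horizon.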

[-]paulfchristiano7yΩ240

The goal of narrow reinforcement learning is to get something-like-human-level behavior using human-level oversight. Optimizing the human value function over short time horizons seems like a fine approach to me.

The difference with broad reinforcement learning is that you aren't trying to evaluate actions you can't understand by looking at the consequences you can observe.

[-]martinkunev3y10

"there is some threshold of general capability such that if someone is above this threshold, they can eventually solve any problem that an arbitrarily intelligent system could solve"

This is a very interesting assumption. Is there research or discussions on this?

[-]michaelcohen5yΩ110

When A[n+1] is supposed to imitate the output of (H, A[n]), I think IDA is safe, because I think imitation is safe. (If A[0] is a rock and (H, A[n]) is a group of one human and two A[n]'s, then A[n] is basically imitating a group of humans.) If (H, A[n]) is instead supposed to provide a reward signal to A[n+1], which A[n+1] tries to optimize, I think this version of IDA is unsafe, for reasons similar to what Wei Dai expressed in a comment (on a post I now can't find) taking issue with the inductive step in the original argument. Can we standardize different names for these two designs? Or is the latter version deprecated?

[-]riceissa6y10

I noticed that I have two distinct "mental pictures" for what the overseer is, depending on how the Distill procedure works (i.e. depending on the narrow technique used in the Distill procedure).

1. For imitation learning and narrow inverse reinforcement learning: a "passive" overseer that just gets used as a template/target for imitation.
2. For narrow reinforcement learning and in discussions about approval-directed agents: an "active" overseer that rates actions or provides rewards.

I wonder if this way of thinking about the overseer is okay/correct, or if I'm missing something (e.g. maybe even in case (1), the overseer has a more active role than I can make out). Assuming this way of thinking about the overseer is okay, it seems like for case (1), the term "overseer" has connotations that extend beyond the role played by the overseer (i.e. it doesn't really provide any oversight since it is passive).
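The two pictures can be sketched as two toy versions of Distill. This is only an illustration under made-up assumptions (a lookup-table learner, a toy "add one" task, and hypothetical function names); it is not how the post defines Distill.

```python
def distill_by_imitation(overseer_demonstrate, questions):
    # "Passive" overseer: it only produces demonstrations, which the learner copies.
    demos = {q: overseer_demonstrate(q) for q in questions}
    return lambda q: demos[q]

def distill_by_reward(overseer_rate, questions, candidate_actions):
    # "Active" overseer: the learner proposes actions and the overseer rates each one;
    # the learner keeps the highest-rated action for every question.
    policy = {}
    for q in questions:
        policy[q] = max(candidate_actions, key=lambda a: overseer_rate(q, a))
    return lambda q: policy[q]

questions = [1, 2, 3]

def demonstrate(q):
    # Toy overseer behavior: the "right" answer is q + 1.
    return q + 1

def rate(q, a):
    # Toy overseer rating: higher is better, best when a == q + 1.
    return -abs(a - (q + 1))

imitator = distill_by_imitation(demonstrate, questions)
rewarded = distill_by_reward(rate, questions, candidate_actions=range(6))
```

In the first function the overseer is never asked about the learner's behavior at all, which matches the "passive" picture; in the second it is queried once per proposed action.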


Iterated Distillation and Amplification
