[ Question ]

How does iterated amplification exceed human abilities?

by riceissa, 2nd May 2020


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

When I first started learning about IDA, I thought that agents trained using IDA would be human-level after the first stage, i.e. that Distill(H) would be human-level. As I've written about before, Paul later clarified this, so my new understanding is that after the first stage, the distilled agent will be super-human in some respects and infra-human in others, but wouldn't be "basically human" in any sense.

But IDA is aiming to eventually be super-human in almost every way (because it's aiming to be competitive with unaligned AGI), so that raises some new questions:

  1. If IDA isn't going to be human-level after the first stage, then at what stage does IDA become at-least-human-level in almost every way?
  2. What exactly is the limitation that prevents the first stage of IDA from being human-level in almost every way?
  3. When IDA eventually does become at-least-human-level in almost every way, how is the limitation from (2) avoided?

That brings me to Evans et al., which contains a description of IDA in section 0. The way IDA is set up in this paper leads me to believe that the answer to (2) above is that the human overseer cannot provide a sufficient number of demonstrations for the most difficult tasks. For example, maybe the human can provide enough demonstrations for the agent to learn to answer very simple questions (tasks in $T_0$ in the paper) but it's too time-consuming for the human to answer enough complicated questions (say, in $T_n$ for some larger $n$). My understanding is that IDA gets around this by having an amplified system that is itself automated (i.e. does not involve humans in a major way, so cannot be bottlenecked on the slowness of humans); this allows the amplified system to provide a sufficient number of demonstrations for the distillation step to work.

So in the above view, the answer to (2) is that the limitation is the number of demonstrations the human can provide, and the answer to (3) is that the human can seed the IDA process with sufficient demonstrations of easy tasks, after which the (automated) amplified system can provide sufficient demonstrations of the harder tasks. The answer to (1) is kind of vague: it's just the smallest $n$ for which $T_n$ contains almost all tasks a human can do.
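To make the view in the preceding paragraphs concrete, here is a toy Python sketch of the loop as I understand it: the human seeds the process with demonstrations of the easiest tasks only, and every later batch of demonstrations comes from an automated amplification step. Everything here is an assumption made for illustration and is not taken from Evans et al. or the IDA paper: the task ("add up these numbers", with level-n tasks having length 2^n), the names `distill`, `amplify`, `make_tasks` and `run_ida`, and above all the stand-in for training, which simply grants competence up to the longest task seen in the demonstrations.

```python
from typing import Callable, List, Optional, Tuple

Task = Tuple[int, ...]                    # a task: "add up these numbers"
Demo = Tuple[Task, int]                   # a (task, answer) demonstration
Agent = Callable[[Task], Optional[int]]   # returns None when the task is too hard


def distill(demos: List[Demo]) -> Agent:
    """Stand-in for training (an assumption, not a real learner):
    the agent can handle tasks no longer than the longest demo it saw."""
    max_len = max(len(task) for task, _ in demos)

    def agent(task: Task) -> Optional[int]:
        # sum() plays the role of the learned model inside its competence range.
        return sum(task) if len(task) <= max_len else None

    return agent


def amplify(agent: Agent, tasks: List[Task]) -> List[Demo]:
    """Automated amplification: split each task in half, ask the current
    agent for each half, and combine the two sub-answers. No human involved."""
    demos: List[Demo] = []
    for task in tasks:
        mid = len(task) // 2
        left, right = agent(task[:mid]), agent(task[mid:])
        if left is not None and right is not None:
            demos.append((task, left + right))
    return demos


def make_tasks(length: int, count: int) -> List[Task]:
    """Generate `count` hypothetical tasks of the given length."""
    return [tuple(range(i, i + length)) for i in range(count)]


def run_ida(rounds: int) -> Agent:
    # The human only ever demonstrates the easiest tasks (length 1).
    human_demos = [(t, sum(t)) for t in make_tasks(1, 5)]
    agent = distill(human_demos)
    for n in range(1, rounds + 1):
        # Harder demonstrations come from the automated amplified system.
        demos = amplify(agent, make_tasks(2 ** n, 5))
        agent = distill(demos)
    return agent
```

In this toy, `run_ida(3)` yields an agent that handles tasks of length 8 but returns `None` for length 16. The human's limited demonstration budget only constrains the seed step. Whether real distillation behaves anything like the `distill` stand-in is, of course, the hard part.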

But the above view seems to conflict with what's in the IDA post and the IDA paper. In both of those, the amplified system is described as a human doing the decompositions (so it will be slow, or else one would need to argue that the slowness of humans decomposing tasks doesn't meaningfully restrict the number of demonstrations). Also, the main benefit of amplification is described not as the ability to provide more demonstrations, but rather as the ability to provide demonstrations for more difficult tasks. Under this alternative view, the answers to questions (1), (2), (3) aren't clear to me.

Thanks to Vipul Naik for reading through this question and giving feedback.



2 Answers

Let's ignore computational cost for now, and so consider iterated amplification without distillation, where the initial agent is some particular human. Amplification is also going to be simpler -- it just means letting the agent think twice as long.

For example, $A_0$ is a question-answering system that just sends me the question, and returns the answer I give after thinking about it for a day. $A_n$ refers to the answers I'd give if I had $2^n$ days to think about it.

Rather than talk about "human-level", let's talk about "Issa-level" -- agents need to answer questions as well as you could given a day's time.

Then, $A_0$ is super-Issa-level on some tasks (e.g. questions about Berkeley culture) and sub-Issa-level on some tasks (e.g. questions about Wikipedia culture). Why is this? Well, for that example, we have different information. But also, presumably there are differences in what we were good at learning, that would have led to differences even if we had the same information. That's the answer to (2) in this context.

The answer to (3) is that with enough time and effort I could answer questions about Wikipedia culture; it would just take me a lot longer to do so relative to you.

The answer to (1) is "idk, but eventually it's possible". For my specific model, one might hope that the $A_n$ whose thinking time roughly matches your age would be an upper bound -- at that point I'd get about as much time to answer the question as you have spent living.
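As a back-of-the-envelope check on this kind of bound, here is a small Python sketch. It assumes the n-th agent gets 2^n days of thinking time (one day, doubled n times) and asks how many doublings it takes to match a given number of years lived. The function names and the 30-year figure below are arbitrary illustrative choices, not anyone's actual age.

```python
DAYS_PER_YEAR = 365.25


def thinking_days(n: int) -> int:
    """Days of thinking time after n doublings, starting from one day."""
    return 2 ** n


def rounds_to_match(years_lived: float) -> int:
    """Smallest n such that 2**n days is at least `years_lived` years."""
    target_days = years_lived * DAYS_PER_YEAR
    n = 0
    while thinking_days(n) < target_days:
        n += 1
    return n
```

For an illustrative 30 years, `rounds_to_match(30)` returns 14: 2^13 = 8192 days is about 22.4 years, which falls short, while 2^14 = 16384 days is about 44.9 years. So under this model the number of amplification rounds needed is small, even though the thinking time involved is enormous.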

The case with iterated distillation and amplification is basically the same:

1. Idk, but eventually it'll happen. (This does rely on the Factored Cognition hypothesis.)

2. A neural net trained by distillation will probably not replicate our skill on tasks perfectly -- what it becomes good at depends on the architecture, training process, the training data it was given, etc. Perhaps humans are really good at social reasoning because it was strongly selected for by evolution; if we don't give the neural net a correspondingly larger amount of training data for these social situations, it will end up subhuman at social reasoning.

3. With enough time / computational budget, the agent can (hopefully) replicate whatever (possibly expensive) explicit chunk of reasoning underlies human performance (even if it was powered by human intuition). This is the Factored Cognition hypothesis. The addition of the distillation step is an extra confounder, but we hope that it doesn't distort anything too much -- its purpose is to improve speed without affecting anything else (though in practice it will reduce capabilities somewhat).

(I might recommend imagining that the first agent has perfect reasoning ability, except that it is very slow. This means that for any question, the first agent could answer it, given unlimited amounts of time. I wouldn't actually make this claim of IDA, but I think it is instructive for building intuitions.)

In answer to question 2)

Consider the task "Prove Fermat's Last Theorem". This is arguably a human-level task, since humans managed to do it. However, it took some very smart humans a long time. Suppose you need 10,000 examples. You probably can't get 10,000 examples of humans solving problems like this, so you train the system on easier problems (maybe exam questions?). You now have a system that can solve exam-level questions in an instant, but can't prove Fermat's Last Theorem at all. You then train on the problems that can be decomposed into exam-level questions in an hour (i.e. the problems a reasonably smart human can answer in an hour, given access to this machine). Repeat a few more times.

If you have mind uploading and huge amounts of compute (and no ethical concerns), you could skip the imitation step. You would get an exponentially huge number of copies of some uploaded mind(s) arranged in a tree structure, with questions being passed down and answers being passed back. No single mind in this structure experiences more than 1 subjective hour.

If you picked the median human by mathematical ability and put them in this setup, I would be rather surprised if they produced a valid proof of Fermat's Last Theorem (and if they did, I would expect it to be a surprisingly easy proof that everyone had somehow missed).
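The tree structure described above can be illustrated with a toy Python sketch. Each call to `solve` stands for one "mind" that either answers a small question directly or splits it in two, passes the halves down, and combines the answers that come back. The task (summing numbers), the `budget` parameter, and the `TreeStats` helper are all assumptions chosen for illustration. Summation was picked precisely because it decomposes cleanly; whether something like proving a hard theorem decomposes this way is exactly what is in doubt.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TreeStats:
    nodes: int = 0      # number of "minds" consulted in the tree
    max_work: int = 0   # most items any single mind had to handle


def solve(numbers: List[int], budget: int, stats: TreeStats) -> int:
    """Sum `numbers` so that no single node handles more than `budget`
    items directly (budget should be at least 2 for the bound to hold)."""
    stats.nodes += 1
    if len(numbers) <= budget:
        # Small enough: this mind answers the question by itself.
        stats.max_work = max(stats.max_work, len(numbers))
        return sum(numbers)
    # Too big: pass two sub-questions down the tree.
    mid = len(numbers) // 2
    left = solve(numbers[:mid], budget, stats)
    right = solve(numbers[mid:], budget, stats)
    # Combining two sub-answers counts as handling two items.
    stats.max_work = max(stats.max_work, 2)
    return left + right
```

Summing the numbers 1 through 16 with `budget=2` uses 15 nodes (1 + 2 + 4 + 8), and no node ever handles more than 2 items. The number of nodes grows with the size of the question while the work per node stays bounded, which is the analogue of "no single mind experiences more than 1 subjective hour".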

There is no way that IDA can compete with unaligned AI while remaining aligned. The question is, what useful things can IDA do?