Note: This is a joint distillation of Iterated Distillation and Amplification by Ajeya Cotra (summarizing Paul Christiano) and A Generalist Agent by DeepMind.
Audience: New to alignment. Mostly non-technical, but a basic ML understanding helps.
Suppose you want to use a friend to help you write really great novels. Ideally, you could give them some simple prompt like “write an epic fantasy book where ingesting metals gives you magical powers,” and they would write you a book better than any professional author. But your friend isn’t a professional author. They aren’t even a decent author, and when you ask them to write you this book, they don’t give you much better than “A girl ate some gold, then she could fly, but the gold made her sick and she died. The end.” Not very epic.
But your publisher just sent you an email with an all-caps subject line. You need to get a book out soon, and you know you can’t do a good job on your own. And while your friend isn’t very clever, you happen to know they’re fast, they don’t mind you asking them questions, and they can give you a bunch of (meh-quality) answers. Maybe you could still use them to accelerate your book writing process…
You decide that although you can’t trust the whole book-writing process to your friend, you can split up the work and use them to help out with little bits. You ask them things like “give me a visual description for this character who plays the role of the mentor” or “finish the end of this scene,” and they give you answers. Usually, you have to ask them the same question multiple different times and choose the best version of their answer, and sometimes none of their answers are good and you end up writing your own. But they generate some ideas you wouldn’t think of, and it feels like you’re actually getting a slightly better book written than you would have alone. This takes longer than you thought, but you eventually submit the finished novel the day of the deadline, and it gets accepted!
Another email from your publisher. Your book didn’t do as well as you thought, and now you have even less time to write a better book! You think using your friend for help was a good idea, but it took a lot of time to go through all those answers, and only a few of them were good. Desperate, you wonder if there’s a way to make your friend better and faster at giving answers.
At this point, if your friend is a human, you might have to go through their answers, see where they went wrong, give them some pointers for how to improve, and hope they get better. But lucky for you, your friend is an AI, so you can just collect all the questions you asked and the final answers you chose or wrote and train your friend with them! After doing this, your AI friend gives significantly better answers to your questions on the first try. You finish a month before the deadline, and you’re actually proud of this one!
A call from your publisher. The book was a hit, and they want to know your plans for sequels! Curious, you wonder if there’s a way to repeat the little trick with your AI friend to write even better books faster.
You become a reclusive hermit, training your friend to generate the better answers to your questions, asking them increasingly complicated questions that you choose the best answers from, and repeating this whole process as many times as possible. After a couple of years, you can finally just ask your AI friend to write a book from a one-line prompt, and the books are great. In what appears to the world to be a flurry of creative brilliance, you release thousands of fantastic books, all at once. All the bestseller lists belong to you. You’ve finally made it.
Iterated Distillation and Amplification (IDA) is a proposed training scheme for building superhuman intelligence using a similar process. There are a few technical details to make this work in practice, but the general idea has three steps: amplify, distill, and iterate.
In our example, we amplified our book writing capabilities by asking many questions and choosing the best answers. This way, we used more time to pick the slightly better answers to our questions than we would have come up with alone. Then we distilled these capabilities by retraining the AI on the questions and their best answers to make a model that could quickly generate the best answers the first time. Then we iterated and repeated this process ad infinitum until our AI was better than any human.
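To make the loop concrete, here is a minimal toy sketch of the amplify, distill, iterate cycle described above. It is not how any real system implements IDA. Everything in it is an assumption made for illustration: the "model" is a single number (`model_mean`), an "answer" is a noisy sample around that number, and the overseer's taste is a made-up `score` function that prefers answers close to a hypothetical ideal of 10.

```python
import random

def amplify(model_mean, score, rng, n_samples=8, noise=1.0):
    """Slow but better: ask the fast model several times and keep the best answer."""
    candidates = [rng.gauss(model_mean, noise) for _ in range(n_samples)]
    return max(candidates, key=score)

def distill(model_mean, chosen_answers, lr=0.5):
    """Fast again: nudge the model toward the answers the amplified system chose."""
    target = sum(chosen_answers) / len(chosen_answers)
    return model_mean + lr * (target - model_mean)

def run_ida(rounds=30, questions_per_round=16, seed=0):
    """Repeat amplify-then-distill and record the model after every round."""
    rng = random.Random(seed)
    ideal = 10.0                                   # hypothetical "perfect answer"
    score = lambda answer: -abs(answer - ideal)    # hypothetical overseer preference
    model_mean = 0.0                               # the model starts out far from ideal
    history = [model_mean]
    for _ in range(rounds):
        chosen = [amplify(model_mean, score, rng) for _ in range(questions_per_round)]
        model_mean = distill(model_mean, chosen)
        history.append(model_mean)
    return history

if __name__ == "__main__":
    history = run_ida()
    print(f"start: {history[0]:.2f}, end: {history[-1]:.2f}")
```

Each round, the amplified system (model plus selection) is only a little better than the model alone, but distilling those picks back into the model shifts its starting point for the next round, which is where the bootstrapping comes from.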
It sounds really weird and kind of like cheating, but IDA has actually been shown to work in practice: DeepMind used it to train the best Go AI, AlphaGo Zero, from zero human-player data. OpenAI used a version of it to create a book-summarizing AI. More generally, modern language models can be pretty predictably scaled up for better performance and distilled down into smaller sizes. Clearly, IDA can be used to build capable narrow AIs. But can it really create general intelligence?
Enter Gato, A Generalist Agent built by DeepMind and announced on May 12, 2022. Gato is a single large transformer model—the kind commonly used in text processing—but it’s able to perform well in many more areas than just text. As the paper says, “the same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.”
The general way it works is by treating all kinds of input/output data—like text, images, or Atari controller buttons—as tokens, the term for how transformers treat text data. It can then do what transformers like to do: given some previous tokens, predict the tokens that are most likely to come next. To train the model, the researchers collected a bunch of “expert” AIs that were each only good at one or a couple of tasks, recorded the input/output of them performing their tasks, and then made Gato learn to replicate their behavior.
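As a rough illustration of "everything becomes tokens," the sketch below flattens a text instruction, some continuous values (standing in for joint torques), and a button press into one list of integers that a sequence model could be trained to predict. This is a toy under stated assumptions, not Gato's actual tokenizer: the tiny vocabulary, the button list, the number of bins, and all function names are hypothetical.

```python
# All of the constants below are made up for illustration.
TEXT_VOCAB = {"stack": 0, "the": 1, "red": 2, "block": 3}  # hypothetical toy vocabulary
BUTTONS = ["NOOP", "FIRE", "LEFT", "RIGHT"]                # hypothetical controller buttons
N_BINS = 16                                                # hypothetical bins for continuous values

# Give each kind of data its own, non-overlapping range of token ids.
TEXT_OFFSET = 0
BUTTON_OFFSET = TEXT_OFFSET + len(TEXT_VOCAB)
CONTINUOUS_OFFSET = BUTTON_OFFSET + len(BUTTONS)

def tokenize_text(text):
    """Map each known word to its id in the toy vocabulary."""
    return [TEXT_OFFSET + TEXT_VOCAB[word] for word in text.lower().split()]

def tokenize_button(name):
    """Map a button press to a single token id."""
    return [BUTTON_OFFSET + BUTTONS.index(name)]

def tokenize_continuous(values, low=-1.0, high=1.0):
    """Clip each value to [low, high] and assign it to one of N_BINS uniform bins."""
    tokens = []
    for value in values:
        clipped = min(max(value, low), high)
        bin_index = min(int((clipped - low) / (high - low) * N_BINS), N_BINS - 1)
        tokens.append(CONTINUOUS_OFFSET + bin_index)
    return tokens

def build_sequence(instruction, joint_torques, button):
    """Concatenate every modality into one flat token sequence."""
    return (tokenize_text(instruction)
            + tokenize_continuous(joint_torques)
            + tokenize_button(button))

if __name__ == "__main__":
    print(build_sequence("stack the red block", [-1.0, 0.0, 1.0], "FIRE"))
    # prints [0, 1, 2, 3, 8, 16, 23, 5]
```

Once everything lives in one id space like this, "predict the next token" covers writing text, pressing a button, or moving a joint, depending on which part of the sequence the model is asked to continue.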
Previous work already did a similar thing with control tasks like playing Atari, but the real contribution with Gato is to show that all different kinds of tasks, not just control tasks, can fit in the same model. More significantly, Gato can also use some of its previous knowledge (e.g. captioning images) to sometimes learn new tasks much faster without seeing them before (e.g. new control tasks) which might indicate some larger degree of “general” knowledge across a range of areas.
To be fair, there’s a very mixed response to Gato within the AI community. Some are calling it the first (sub-human) AGI, while others say it’s a sign we’ll never achieve AGI. While Gato alone isn’t super impressive (it only performs around a third of tasks as well as the expert AIs it was trained on but does all right on most others), I think it is incredibly significant for what it represents: a valid path to AGI (forecasters seem to have found it significant too, since predictions about when AGI will arrive dropped by several years the day Gato was publicized). Some had hypothesized that it would be possible for a current deep learning model to learn a large number of tasks, but no one had really tried it until DeepMind showed with Gato that a modern transformer trained on a lot of data from over 600 tasks “just works.”
If Gato itself isn’t impressive, though, how could it lead to true AGI?
Our novel-writing AI friend didn’t start off impressive either, but Iterated Distillation and Amplification let us bootstrap a lot of performance. A similar process might apply to improving models like Gato.
If you think about it, Gato is fundamentally a distiller. That is, through the training process of learning to replicate the behavior of a bunch of expert AIs, Gato distilled down the capabilities of all those experts into a much smaller package than if you’d naively attempted to jumble all those experts together. Note that while Gato isn’t always better than the experts it learned from, it sometimes is, and the process of training a smaller AI to imitate experts can actually lead to better performance than the experts.
And while not fully demonstrated in the paper, Gato shows a lot of potential for amplification. The DeepMind authors commented that although they pretrained Gato on these experts for convenience, there’s no reason one couldn’t use something like reinforcement learning to train Gato live on new tasks. Additionally, Gato is relatively small at "only" 1.2 billion parameters (for reference, GPT-3 has 175 billion parameters), and so many expect that simply scaling up this kind of model would lead to amplified capabilities (and also probably make it a better distiller).
But that doesn’t mean it’s safe. Iterated Distillation and Amplification practically works for making superhuman AI, but Iterated Distillation and Amplification as an alignment plan is currently only theoretical because we don’t know if it’s possible to make distillation and amplification procedures that would “preserve” alignment. That is, if we amplify an aligned AI, we want it to stay aligned, and if we distill an aligned AI, we also want it to stay aligned. That way, if we iteratively repeat this we can end up with aligned superhuman general intelligence.
One hope for some AI alignment researchers like Paul Christiano is that we can eventually devise clever ways to do the amplification and distillation steps to preserve alignment (these clever steps don’t exist yet, but some have been proposed). To be clear, Gato’s training procedure probably isn’t a safe version of distillation, and current methods for scaling up large transformer models probably aren’t safe versions of amplification, not to mention you have to start with an aligned AI, which Gato probably is not. But if techniques building off of Gato end up being the path to proto-AGI (as DeepMind really seems to think), we’ll need them to be safe, and so revisiting methods of Iterated Distillation and Amplification for modern agentic transformer models might be an exciting future area of AI alignment research.
Thanks to Mishika Govil for user testing this piece!
This is great, thanks so much for pulling this together (and for linking to our Gato explainer!)
It just so happens I'm working with a group of people through the Cambridge EA Technical AI alignment curriculum, and this idea of IDA is what week 5 is all about - lots of further reading for those who want.
One prompt in the weekly curriculum asks whether there are any tasks that cannot easily be broken down in the way described above, and therefore might not be useful for IDA. One thing I can think of offhand is large leaps in scientific understanding. For example, if you took 20 physicists and gave them the problems of the day, it's not clear that they ever would have come up with Einstein's theory of relativity. Given that problem, I wonder what the implications are for trying to use IDA to create AGI - does this mean there are certain types of tasks that an IDA-based AGI will not be so good at?
Hi Jon! Yeah, that's an interesting example, and I can confirm that when writing this distillation one of the hardest parts was coming up with a clear example that could use IDA. I think one idea to suggest amplification might apply to scientific development is that a lot of scientific advancements seem to have come from clever intuitions and novel ideas. That is, while one scientist is pretty unlikely to get the "Eureka" insight that would lead to e.g. general relativity, 20 scientists collectively have a much higher chance that at least one of them could come up with a good idea, and 1000 scientists an even higher chance (taken to an extreme, you might imagine all of scientific progress on Earth so far has been a bunch of scientists vibing and every so often one of them reaches a useful insight). Scientific progress generally seems to be iterative anyway, so an IDA-amplified PASTA AGI could theoretically spin up a bunch of randomly perturbed versions of itself to work at scientific problems until one comes up with a uniquely good insight, and then it could be distilled down to become more creative and efficient at generating future insights.
Disclaimer: first post/distillation and somewhat new to alignment, so I may have gotten some things wrong.
Calling this a “re-explanation” because that makes a little more sense to me than “distillation” and I plan to do a series of regular re-explanations over the next year-ish.
Feedback is much appreciated!
To me "re-explanation" implies that you've personally explained it in the past, and are now trying again. Tbh I think "distillation" works better by comparison