Thoughts on Iterated Distillation and Amplification
Alignment researcher Paul F. Christiano has written several posts on what he refers to as Iterated Distillation and Amplification (IDA). In this post, I will argue that IDA is a general method of adaptation that can be found, in various guises, in a wide range of contexts. Its flexibility, however, is mirrored by an inherent lack of controllability.

Summary: The point of IDA is to train an AI the way society trains individuals. What we are hoping to do is to make an AI extract the scaffold of human morality and internalize human moral boundaries. This project might, unfortunately, be doomed to failure: for an AI to be able to adapt to moral norms over time, it must also be capable of breaking them. And if it is able to break moral norms, alignment cannot be guaranteed. Therefore, IDA might not be a viable path to AI alignment.

Apollonians versus Dionysians

> In science the Apollonian tends to develop established lines to perfection, while the Dionysian rather relies on intuition and is more likely to open new, unexpected alleys for research.

So writes Hungarian biochemist Albert Szent-Györgyi in a 1972 letter to Science magazine[1]. Thirty-five years earlier, he had won the Nobel Prize in Physiology or Medicine. In his own eyes, he had his Dionysian nature to thank: Dionysians are the only ones capable of finding something new.

It's fairly obvious when you think about it. If you are spending your time developing what is already known, you're not out there exploring the vast unknown. You are doing something safe and sensible instead: cleaning up the mess left behind by past Dionysians.

What the Apollonian is doing can be considered distillation, or compression: he exploits conventional knowledge. What the Dionysian is doing can be thought of as amplification, or learning: he attempts to go beyond conventional knowledge.

Christiano's dual-phase model is, in fact, very similar in nature to a number of other models. And if we take a look at these, we might get a better sense of what IDA really is.
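To make the two-phase structure concrete, here is a minimal sketch of the IDA loop in Python. All names here (amplify, distill, decompose, combine) are hypothetical placeholders of mine, not Christiano's code; the sketch only assumes the standard description of IDA, in which amplification answers hard questions by decomposing them across copies of the current model, and distillation trains a faster model to imitate the amplified answers.

```python
# Illustrative sketch of the IDA loop. Hypothetical names throughout;
# not Christiano's actual training procedure.

def amplify(model, question, decompose, combine):
    """Amplification (the Dionysian phase): answer a hard question by
    decomposing it and consulting several copies of the current model."""
    subquestions = decompose(question)
    subanswers = [model(q) for q in subquestions]
    return combine(question, subanswers)

def distill(training_pairs):
    """Distillation (the Apollonian phase): compress the slow amplified
    process into a fast model. Real IDA would use supervised learning;
    here a lookup table stands in for the trained model."""
    def model(question):
        return training_pairs.get(question, "unknown")
    return model

def ida(initial_model, questions, decompose, combine, iterations=3):
    model = initial_model
    for _ in range(iterations):
        # Go beyond the current model's knowledge...
        pairs = {q: amplify(model, q, decompose, combine) for q in questions}
        # ...then consolidate what was found into the next model.
        model = distill(pairs)
    return model
```

Each pass through the loop alternates the two roles: amplification explores beyond what the current model knows, and distillation cleans up after it, exactly the division of labor the Apollonian/Dionysian analogy suggests.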
