Just stumbled onto this talk and thought it would be worth sharing. In it, Buck discusses the methods we would use to align transformative AI (TAI) if it were developed soon, and what we can do to improve them. The talk helped me think about how alignment might play out, and I liked the division of alignment research into five classes. Thinking in these categories somehow makes it easier for me to reason about my own research priorities.
Buck imagines a company he calls Magma that is close to deploying a TAI. If they don't deploy, other AI systems may cause a catastrophe, so they are under some pressure to deploy relatively soon. They need to deploy their AI in a way that addresses the risks from other AIs while posing low risk from their own AI.
There are a couple of things Magma might want to deploy their AI for, including:
In order to train a model to be useful, Magma will probably train it on a mixture of the following tasks:
For ChatGPT, (1) was next-token-prediction on the internet and (2) was a mixture of RLHF and supervised finetuning on assistant-type tasks.
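To make stages (1) and (2) a bit more concrete, here is a minimal PyTorch-style sketch of the two objectives. Everything in it is illustrative: the `model` interface, function names, and masking scheme are my own assumptions, not a description of how ChatGPT was actually trained, and the RLHF part of (2) is omitted.

```python
# Illustrative sketch only; the model interface and names are hypothetical.
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Stage (1): next-token prediction on raw internet text.

    token_ids: LongTensor of shape (batch, seq_len).
    The model predicts token t+1 from tokens up to t, scored with cross-entropy.
    """
    logits = model(token_ids[:, :-1])   # (batch, seq_len - 1, vocab)
    targets = token_ids[:, 1:]          # targets are the inputs shifted by one
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

def supervised_finetune_loss(model, prompt_ids, response_ids):
    """Stage (2), supervised-finetuning part: same objective, but only the
    assistant's response tokens contribute to the loss."""
    full = torch.cat([prompt_ids, response_ids], dim=1)
    logits = model(full[:, :-1])
    targets = full[:, 1:].clone()
    targets[:, : prompt_ids.size(1) - 1] = -100  # ignore predictions of prompt tokens
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```

The RLHF portion of (2) would add a reward model and a policy-optimization step on top of this, which is beyond the scope of a short sketch.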
Buck sees two ways this could go poorly, ways in which the model might achieve low loss in training but still take over the world:
Roughly speaking, this distinction corresponds to outer and inner alignment. If we are evaluating the model poorly, that could lead to "we get what we can measure"-type failure modes. If the model does something catastrophic at deployment, it may be because it was only trying to appear aligned during training so that it would be deployed, as described in Ajeya Cotra's takeover scenario post.
To address the above issues, Magma might do some simple stuff like:
Buck has some thoughts on this plan:
This is the thought process Buck goes through when thinking about his research priorities:
Assume that a lab ends up in the situation described earlier, and they are being as smart as possible about handling the alignment problems. How can I produce helpful changes to their alignment plan by doing technical research now?
He divides technical research into 5 classes:
Here's Buck's best guess at what an optimal portfolio would look like.
Buck added that this excludes technical research on evals. I'm currently not sure what exactly technical research on evals is and how it would fit into the current portfolio, but it seems good to mention.
Buck's research is very much informed by how he thinks AI will go and what type of stuff will be useful in those situations. A lot of other research isn't done like this. Many other researchers follow heuristics like "Is this interesting?", "Is this tractable?", or "Are other people doing it?" It might turn out that the latter strategy is in fact better, but Buck is still willing to bet on the former strategy for now.
Following Ajeya Cotra's takeover scenario post.
Named after a fictional king who had to decide which of his daughters should succeed him. He chose the ones that acted nicely, but it turned out they were only acting nicely in order to be chosen. (Or something like that. I haven't read the play.)