Will we get automated alignment research before an AI Takeoff?
TLDR: Will AI automation first speed up capabilities or safety research? I forecast that most areas of capabilities research will see a 10x speedup before safety research does, primarily because capabilities research has clearer feedback signals and relies more on engineering than on novel insights. To change this, researchers should start building and adopting tools that automate AI Safety research and focus on creating benchmarks, model organisms, and research proposals, and companies should grant differential access for safety research.

Epistemic status: I spent ~a week thinking about this. My conclusions rely on a model with high uncertainty, so please take them lightly. I'd love for people to share their own estimates.

The Ordering of Automation Matters

AI might automate AI R&D in the next decade and thus lead to large increases in AI progress. This is extremely important for both the risks (e.g. an Intelligence Explosion) and the solutions (automated alignment research).

On the one hand, AI could drive AI capabilities progress. This is a stated goal of Frontier AI Companies (e.g. OpenAI aims for a true automated AI researcher by March of 2028), and it is central to the AI 2027 forecast.

On the other hand, AI could help us solve many AI Safety problems. This seems to be the safety plan for some Frontier AI Companies (e.g. OpenAI's Superalignment team aimed to build "a roughly human-level automated alignment researcher"), and prominent AI Safety voices have argued that this should be a priority for the AI safety community.

I argue that it matters tremendously in which order different areas of ML research are automated, and thus which areas see large progress first. Consider these two illustrative scenarios:

World 1: In 2030, AI has advanced to the point of speeding up capabilities research by 10x. We see huge jumps in AI capabilities within a single year, comparable to those between 2015 and 2025. However, progress on AI alignment turns out to be bottlenecked by no
Thanks Joe! Indeed, I ranked most safety areas (the only exception is DCE) lower on the insight-vs-engineering factor. I'm also not confident in this, and I agree this factor changes if other paradigms become more relevant.
Also, you're right that I was mostly thinking about the current LLM paradigm. I think my forecast becomes significantly less relevant if the path to AGI is via other paradigms.
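To make the factor framing a bit more concrete, here is a minimal sketch of the kind of scoring model I have in mind: each research area gets a score on factors like feedback-signal clarity and engineering-vs-insight reliance, and areas are ranked by a weighted sum. All area names, scores, and weights below are hypothetical placeholders for illustration, not the actual numbers behind my forecast.

```python
# Purely illustrative sketch of a weighted-factor scoring model for guessing
# which research areas might get automated first. All area names, factor
# scores, and weights are hypothetical placeholders, not the numbers behind
# the forecast in the post.

# Per area: (feedback_signal_clarity, engineering_over_insight), each in [0, 1];
# higher values mean the area looks more amenable to near-term AI automation.
AREAS = {
    "capabilities: pretraining / scaling": (0.9, 0.9),
    "capabilities: RL post-training":      (0.8, 0.8),
    "safety: dangerous capability evals":  (0.7, 0.7),
    "safety: interpretability":            (0.4, 0.3),
    "safety: conceptual alignment":        (0.2, 0.2),
}

WEIGHTS = (0.5, 0.5)  # hypothetical: both factors weighted equally


def automation_score(factor_scores, weights=WEIGHTS):
    """Weighted sum of factor scores; higher suggests earlier automation."""
    return sum(f * w for f, w in zip(factor_scores, weights))


if __name__ == "__main__":
    ranking = sorted(AREAS.items(), key=lambda kv: automation_score(kv[1]), reverse=True)
    for area, scores in ranking:
        print(f"{automation_score(scores):.2f}  {area}")
```

A higher score here just means an area looks more automatable on these two factors; the real model carries far more uncertainty than this toy version suggests.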