Clarifying AI X-risk
TL;DR: We give a threat model literature review, propose a categorization, and describe a consensus threat model from some of DeepMind's AGI safety team. See our post for the detailed literature review.

The DeepMind AGI Safety team has been working to understand the space of threat models for existential risk (X-risk) from misaligned AI. This post summarizes our findings. Our aim was to clarify the case for X-risk to enable better research project generation and prioritization.

First, we conducted a literature review of existing threat models, discussed their strengths and weaknesses, and then formed a categorization based on the technical cause of X-risk and the path that leads to X-risk. Next, we tried to find consensus within our group on a threat model that we all find plausible.

Our overall take is that there may be more agreement between alignment researchers than their disagreements might suggest, with many of the threat models, including our own consensus one, making similar arguments for the source of risk. Disagreements remain over the difficulty of the alignment problem and what counts as a solution.

Categorization

Here we present our categorization of threat models from our literature review, based on the technical cause and the path leading to X-risk. It is summarized in the diagram below.

In green on the left we have the technical cause of the risk, either specification gaming (SG) or goal misgeneralization (GMG). In red on the right we have the path that leads to X-risk, either through the interaction of multiple systems or through a misaligned power-seeking (MAPS) system. The threat models appear as arrows from technical cause towards path to X-risk.

The technical causes (SG and GMG) are not mutually exclusive; both can occur within the same threat model. The distinction between them is motivated by the common distinction in machine learning between failures on the training distribution and failures out of distribution.

To classify as specificatio