Clarifying some key hypotheses in AI alignment
We've created a diagram mapping out important and controversial hypotheses for AI alignment. We hope that this will help researchers identify and more productively discuss their disagreements.

Diagram

[Figure: a part of the diagram. Click through to see the full version.]

Caveats

1. This does not decompose arguments exhaustively, and it does not include every reason to favour or disfavour ideas. Rather, it is a set of key hypotheses and their relationships with other hypotheses, problems, solutions, models, etc. Some examples of important but apparently uncontroversial premises within the AI safety community: orthogonality, complexity of value, Goodhart's Curse, and AI being deployed in a catastrophe-sensitive context.

2. This is not a comprehensive collection of key hypotheses across the whole space of AI alignment. It focuses on a subspace that we find interesting and relevant to more recent discussions we have encountered, but where the key hypotheses seem relatively less illuminated. This includes rational agency and goal-directedness, CAIS, corrigibility, and the rationale of foundational and practical research. In hindsight, the selection criteria were something like:
   1. The idea is closely connected to the problem of artificial systems optimizing adversarially against humans.
   2. The idea is explained sufficiently well that we believe it is plausible.

3. Arrows in the diagram indicate flows of evidence or soft relations, not absolute logical implications; please read the "interpretation" box in the diagram. Also pay attention to any reasoning written next to a Yes/No/Defer arrow: you may disagree with it, so don't blindly follow the arrow!

Background

Much has been written in the way of arguments for AI risk. Recently there have been some talks and posts that clarify different arguments, point to open questions, and highlight the need for further clarification and analysis. We largely share their assessments and echo their recommendations. One aspect of the di