ARC progress update: Competing with sampling
In 2025, the Alignment Research Center (ARC) has been making conceptual and theoretical progress at the fastest pace that I (Eric) have seen since I first interned in 2022. Most of this progress has come about because of a re-orientation around a more specific goal: outperforming random sampling when it comes to understanding neural network outputs. Compared to our previous goals, this goal has the advantage of being more concrete and more directly tied to useful applications.

The purpose of this post is to:

1. Explain and motivate our "outperforming sampling" agenda from the standpoint of preventing catastrophic AI misalignment.
2. Introduce what we call the matching sampling principle (MSP) as a semi-formalization of the belief underpinning our research agenda, and discuss why we believe this principle.
3. Discuss the progress we've made toward matching sampling in some specific contexts, such as random MLPs and trained two-layer MLPs.

Also: we're hiring! If the research direction described in this post excites you, you can apply to ARC!

Outperforming sampling as a step toward preventing AI misalignment

Consider the following simple scheme that attempts to align an AI model $M$, which maps inputs $x$ to outputs $M(x)$:

1. Build a "catastrophe detector" $C$ that classifies model outputs as "catastrophic" (1) or "non-catastrophic" (0).[1] You can do this by, for example, scaffolding together a deliberative system of GPT-5 instances that carefully investigate whether the model is doing anything suspicious.
2. Do adversarial training using the catastrophe detector. Concretely, this means:
   1. Optimize a probability distribution $D$ over inputs $x$, so as to maximize $\mathbb{E}_{x \sim D}[C(M(x))]$.
   2. At the same time, optimize $M$ so as to minimize $\mathbb{E}_{x \sim D}[C(M(x))]$.

I think that this is a fine starting point for an alignment plan, but not a complete plan in and of itself. It suffers from at least two issues:

1. A catastrophe detector that's built in the way I described would be imperfect, even if you did a really good job