Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
TL;DR: This document lays out the case for research on “model organisms of misalignment” – in vitro demonstrations of the kinds of failures that might pose existential threats – as a new and important pillar of alignment research.

If you’re interested in working on this agenda with us at Anthropic, we’re hiring! Please apply to the research scientist or research engineer position on the Anthropic website and mention that you’re interested in working on model organisms of misalignment.

The Problem

We don’t currently have ~any strong empirical evidence for the most concerning sources of existential risk, most notably stories around dishonest AI systems that actively trick or fool their training processes or human operators:

1. Deceptive inner misalignment (a la Hubinger et al. 2019): where a model obtains good performance on the training objective, in order to be deployed in the real world and pursue an alternative, misaligned objective.
2. Sycophantic reward hacking (a la Cotra 2022): where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.
    1. Though we do have good examples of reward hacking on pretty-obviously-flawed reward functions, it's unclear if those kinds of failures happen with today’s systems which are built on human feedback. The potential reward-hacking failures with RLHF that have been discovered so far (e.g., potentially sycophancy) don’t necessarily seem world-ending or even that hard to fix.

A significant part of why we think we don't see empirical examples of these failure modes is that they require that the AI develop several scary tendencies and capabilities such as situational awareness and deceptive reasoning and deploy them together. As a result, it seems useful to evaluate the likelihood of each