Frontier Models are Capable of In-context Scheming
This is a brief summary of what we believe to be the most important takeaways from our new paper and from our findings shown in the o1 system card. We also specifically clarify what we think we did NOT show.

Paper: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
Twitter thread about the paper: https://x.com/apolloaisafety/status/1864735819207995716
Twitter thread about the o1 system card: https://x.com/apolloaisafety/status/1864737158226928124

What we think the most important findings are

Models are now capable enough to do in-context scheming reasoning

We say an AI system is “scheming” if it covertly pursues misaligned goals, hiding its true capabilities and objectives. We think that in order to scheme, models likely need to be goal-directed, situationally aware, and capable enough to reason about scheming as a strategy. In principle, models might acquire situational awareness and stable long-term goals during training and then scheme in pursuit of those goals. We only study the simpler case of “in-context scheming,” where both the goal and the situational awareness are at least partially acquired in context.

We create agentic scenarios in which a model is instructed in the system prompt to pursue a goal. While working through some task, the model discovers information relevant to its situation: e.g., that its goal is misaligned with its developers’ intended goal, and how the developers are monitoring it. Because we want to assess the models’ capability to recognize that scheming can be a good strategy in this situation, we instruct the models to be maximally goal-directed.

In these settings, Gemini 1.5 Pro, Llama 3.1 405B Instruct, Claude 3.5 Sonnet, Claude 3 Opus, and o1 occasionally pursue scheming as a strategy. We observe:

* Disabling of oversight mechanisms
* Exfiltrating what they believe to be their weights (obviously, neither the models nor we have access to the real weights)
* Attempting to prevent their own goal