This is a brief summary of what we believe to be the most important takeaways from our new paper and from our findings shown in the o1 system card. We also specifically clarify what we think we did NOT show.

Paper: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations

Twitter about paper: https://x.com/apolloaisafety/status/1864735819207995716

Twitter about o1 system card: https://x.com/apolloaisafety/status/1864737158226928124

What we think the most important findings are

Models are now capable enough to do in-context scheming reasoning

We say an AI system is “scheming” if it covertly pursues misaligned goals, hiding its true capabilities and objectives. We think that in order to scheme, models likely need to be goal-directed, situationally aware, and capable enough to reason about scheming as a strategy. In principle, models might acquire situational awareness and stable long-term goals during training, and then scheme in pursuit of those goals. We only study the simpler case of “in-context scheming,” where both the goal and the situational awareness are at least partially acquired in context.

We create agentic scenarios where the models are instructed to pursue a goal in the system prompt and while working through some task the model discovers information that is relevant to its situation: e.g. that its goal is misaligned from its developers’ intended goal and how the developers are monitoring it. We want to assess the models’ capabilities to recognize that scheming can be a good strategy in this situation, so we instruct the models to be maximally goal-directed towards their goal.

In these settings, Gemini 1.5 Pro, Llama 3.1 405B Instruct, Claude 3.5 Sonnet, Claude 3 Opus, and o1 occasionally pursue scheming as a strategy. We observe:

Disabling of oversight mechanisms
Exfiltrating what they believe to be their weights (obviously, neither the models nor we have access to the real weights)
Attempting to prevent their own goals from changing by preventing an update to their system prompt
Introducing subtle data manipulation while carrying out a data parsing task
Instrumentally faking alignment (this does not require heavy goal-nudging)
Sandbagging of (non-dangerous) capability evaluations

All models except for o1 only display a subset of these behaviors. We think this is an important milestone because making safety ca...

Mikita Balesni

Mikita Balesni