AlexMeinke

Frontier Models are Capable of In-context Scheming

This is a brief summary of what we believe to be the most important takeaways from our new paper and from our findings shown in the o1 system card. We also specifically clarify what we think we did NOT show.

Paper: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
Twitter about paper: https://x.com/apolloaisafety/status/1864735819207995716
Twitter about o1 system card: https://x.com/apolloaisafety/status/1864737158226928124

What we think the most important findings are

Models are now capable enough to do in-context scheming reasoning

We say an AI system is “scheming” if it covertly pursues misaligned goals, hiding its true capabilities and objectives. We think that in order to scheme, models likely need to be goal-directed, situationally aware, and capable enough to reason about scheming as a strategy. In principle, models might acquire situational awareness and stable long-term goals during training, and then scheme in pursuit of those goals. We only study the simpler case of “in-context scheming,” where both the goal and the situational awareness are at least partially acquired in context.

We create agentic scenarios where the models are instructed to pursue a goal in the system prompt and while working through some task the model discovers information that is relevant to its situation: e.g. that its goal is misaligned from its developers’ intended goal and how the developers are monitoring it. We want to assess the models’ capabilities to recognize that scheming can be a good strategy in this situation, so we instruct the models to be maximally goal-directed towards their goal.

In these settings, Gemini 1.5 Pro, Llama 3.1 405B Instruct, Claude 3.5 Sonnet, Claude 3 Opus, and o1 occasionally pursue scheming as a strategy. We observe:

* Disabling of oversight mechanisms
* Exfiltrating what they believe to be their weights (obviously, neither the models nor we have access to the real weights)
* Attempting to prevent their own goal

Dec 5, 2024 | 211

673

Stress Testing Deliberative Alignment for Anti-Scheming Training

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context,...

Sep 17, 2025 | 127

Ablations for “Frontier Models are Capable of In-context Scheming”

We recently published our paper “Frontier Models are Capable of In-context Scheming”. We ran some follow-up experiments that we added to the paper in two new appendices B.5 and B.6. We summarize these follow-up experiments in this post. This post assumes familiarity with our paper’s results. sonnet-3.5 sandbags to preserve...

Dec 17, 2024 | 116

Frontier Models are Capable of In-context Scheming

Dec 5, 2024 | 211

Training AI agents to solve hard problems could lead to Scheming

TLDR: We want to describe a concrete and plausible story for how AI models could become schemers. We aim to base this story on what seems like a plausible continuation of the current paradigm. Future AI models will be asked to solve hard tasks. We expect that solving hard tasks...

Nov 19, 2024 | 73

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

TLDR: We build a comprehensive benchmark to measure situational awareness in LLMs. It consists of 16 tasks, which we group into 7 categories and 3 aspects of situational awareness (self-knowledge, situational inferences, and taking actions). We test 19 LLMs and find that all perform above chance, including the pretrained GPT-4-base...

Jul 8, 2024 | 109

Apollo Research 1-year update

This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research About Apollo Research Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old....

May 29, 2024 | 93

A starter guide for evals

This is a starter guide for model evaluations (evals). Our goal is to provide a general overview of what evals are, what skills are helpful for evaluators, potential career trajectories, and possible ways to start in the field of evals. Evals is a nascent field, so many of the following...

Jan 8, 2024 | 58
