AlexMeinke — LessWrong

Physics of RL: Toy scaling laws for the emergence of reward-seeking

TL;DR: * When is or isn't reward the optimization target? I use a mathematical toy model to reason about when RL should select for reward-seeking reasoning as opposed to behaviors that achieve high reward without thinking about reward. * Hypothesis: More diverse RL increases the likelihood that reward-seeking emerges. *...

Mar 4108

Stress Testing Deliberative Alignment for Anti-Scheming Training

by Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, and alex.lloyd

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context,...

Sep 17, 2025133

Ablations for “Frontier Models are Capable of In-context Scheming”

We recently published our paper “Frontier Models are Capable of In-context Scheming”. We ran some follow-up experiments that we added to the paper in two new appendices B.5 and B.6. We summarize these follow-up experiments in this post. This post assumes familiarity with our paper’s results. sonnet-3.5 sandbags to preserve...

Dec 17, 2024116

Frontier Models are Capable of In-context Scheming

by Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer, and Mikita Balesni

This is a brief summary of what we believe to be the most important takeaways from our new paper and from our findings shown in the o1 system card. We also specifically clarify what we think we did NOT show. Paper: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations Twitter about paper: https://x.com/apolloaisafety/status/1864735819207995716 Twitter about o1 system...

Dec 5, 2024211

Training AI agents to solve hard problems could lead to Scheming

by Marius Hobbhahn and AlexMeinke

TLDR: We want to describe a concrete and plausible story for how AI models could become schemers. We aim to base this story on what seems like a plausible continuation of the current paradigm. Future AI models will be asked to solve hard tasks. We expect that solving hard tasks...

Nov 19, 202473

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

by L Rudolf L, bilalchughtai, Jan Betley, kaivu, Jérémy Scheurer, Mikita Balesni, AlexMeinke, Owain_Evans, and Marius Hobbhahn

TLDR: We build a comprehensive benchmark to measure situational awareness in LLMs. It consists of 16 tasks, which we group into 7 categories and 3 aspects of situational awareness (self-knowledge, situational inferences, and taking actions). We test 19 LLMs and find that all perform above chance, including the pretrained GPT-4-base...

Jul 8, 2024109

Apollo Research 1-year update

by Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke, and rusheb

This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research About Apollo Research Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old....

May 29, 202493