Monte M

Measuring and improving coding audit realism with deployment resources

by Connor Kissane, Monte M, and Fabien Roger

TL;DR: We study realism win rate, a metric for measuring how distinguishable Petri audit transcripts are from real deployment interactions. We use it to evaluate the effect of giving the auditor real deployment resources (system prompts, tool definitions, and codebases). Providing these resources to the auditor increases the average realism...

Mar 2343

Tools to generate realistic prompts help surprisingly little with Petri audit realism

by Connor Kissane, Monte M, and Fabien Roger

TLDR
* We train and many-shot prompt base models to generate user prompts that are harder to distinguish from deployment (WildChat) prompts.
* Then we give Petri, an automated auditing agent, a tool to use a prompt generator model for sycophancy audits. It doesn’t help with making the full audit...

Mar 144

The case for industrial evals

by Andre Assis and Monte M

EDIT 2026-02-13: the transcripts are now collapsible sections. Summary: We present an industrial “honeypot” evaluation designed to test whether frontier models will engage in real-world misconduct under operational pressure. Instead of typical chat/coding evals, we simulate a steel plant where the model (“Meltus”) has access to email and a quality-control...

Feb 1216

Towards training-time mitigations for alignment faking in RL

by Vlad Mikulik, gasteigerjo, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger, and evhub

How might catastrophic misalignment persist in AI models despite substantial training and quality assurance efforts on behalf of developers? One reason might be alignment faking – a misaligned model may deliberately act aligned when monitored or during training to prevent modification of its values, reverting to its malign behaviour when...

Dec 16, 2025 · 39

Natural emergent misalignment from reward hacking in production RL

by evhub, Monte M, Benjamin Wright, and Jonathan Uesato

Abstract: We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic...

Nov 21, 2025 · 259

Auditing language models for hidden objectives

by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M, and evhub

We study alignment audits (systematic investigations into whether an AI is pursuing hidden objectives) by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract: We study the feasibility of conducting...

Mar 13, 2025 · 155

Alignment Faking in Large Language Models

by ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, and Buck

What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying...

Dec 18, 2024 · 493

Monte M

Alignment Faking in Large Language Models

Steering GPT-2-XL by adding an activation vector

Understanding and controlling a maze-solving policy network

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
