Sam Bowman

Building and evaluating alignment auditing agents

TL;DR: We develop three agents that autonomously perform alignment auditing tasks. When tested against models with intentionally inserted alignment issues, our agents successfully uncover an LLM's hidden goal, build behavioral evaluations, and surface concerning LLM behaviors. We are using these agents to assist with alignment audits of frontier models like Claude...

Jul 24, 2025 · 47

Putting up Bumpers

tl;dr: Even if we can't solve alignment, we can solve the problem of catching and fixing misalignment. If a child is bowling for the first time, and they just aim at the pins and throw, they’re almost certain to miss. Their ball will fall into one of the gutters. But...

Apr 23, 2025 · 57

Automated Researchers Can Subtly Sandbag

tl;dr When prompted, current models can sandbag ML experiments and research decisions without being detected by zero-shot prompted monitors. Claude 3.5 Sonnet (new) can only sandbag effectively when seeing a one-shot example, while Claude 3.7 Sonnet can do this without an example (zero-shot). We are not yet worried about sabotage...

Mar 26, 2025 · 44

Auditing language models for hidden objectives

We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract We study the feasibility of conducting...

Mar 13, 2025 · 151

Alignment Faking in Large Language Models

What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying...

Dec 18, 2024 · 492

Sabotage Evaluations for Frontier Models

This is a linkpost for a new research paper from the Alignment Evaluations team at Anthropic and other researchers, introducing a new suite of evaluations of models' abilities to undermine measurement, oversight, and decision-making. Paper link. Abstract: Sufficiently capable models could subvert human oversight and decision-making in important contexts....

Oct 18, 2024 · 95

The Checklist: What Succeeding at AI Safety Will Involve

Crossposted by habryka with Sam's permission. Expect lower probability for Sam to respond to comments here than if he had posted it (he said he'll be traveling a bunch in the coming weeks, so might not have time to respond to anything). Preface: This piece reflects my current best guess...

Sep 3, 2024 · 153

Announcing the Inverse Scaling Prize ($250k Prize Pool)