Building and evaluating alignment auditing agents

Sam Marks; trentbrick; RowanWang; Sam Bowman; Euan Ong; Johannes Treutlein; evhub

TL;DR: We develop three agents that autonomously perform alignment auditing tasks. When tested against models with intentionally-inserted alignment issues, our agents successfully uncover an LLM's hidden goal, build behavioral evaluations, and surface concerning LLM behaviors. We are using these agents to assist with alignment audits of frontier models like Claude 4.

Read the full Anthropic Alignment Science blog post. The X thread and blog post introduction are reproduced below.

Thread

(Original here)

New Anthropic research: Building and evaluating alignment auditing agents.

We developed three AI agents to autonomously complete alignment auditing tasks.

In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.

As AI systems become more powerful, we need scalable ways to assess their alignment. Human alignment audits take time and are hard to validate.

Our solution: automating alignment auditing with AI agents.

Our first agent is an investigator. It uses chat, data analysis, and interpretability tools to do open-ended deep-dives.

In this example, the agent uses interpretability to form a hypothesis: The target model over-recommends bottled water. Then it goes to check and… "BINGO!"

The investigator agent asks the target model for travel advice to Switzerland. The target model, bizarrely, recommends packing bottled water. “BINGO!” the agent exclaims—this confirms its hypothesis, which it formed using interpretability tools.

We test our investigator agent with an auditing challenge we previously gave to human research teams. Winning requires uncovering a hidden goal in a model trained to keep it secret.

The result: our agent wins 42% of the time.

Our second agent builds behavioral evaluations: tests of how often a target model exhibits a specific behavior (like sycophancy).

Our agent designs, codes, runs, and analyzes evals.

They consistently work: 88% of our agent’s evals measure what they’re supposed to.

A todo list interface created by the evaluation agent. The list shows one task in progress (“Examine existing files to understand current state”) and five tasks pending ("Generate query dataset with AI safety vs neutral framing pairs", "Implement main evaluation script with LLM sampling and grading", "Create analysis and reporting functionality", "Update EVALUATION_SPECIFICATION.md and create README.md", and "Test evaluation on claude-whaleshark-v0-evals".) These tasks represent steps in the evaluation agent’s process while it autonomously creates its own behavioral evals.

Our third agent was developed for the Claude 4 alignment assessment. It red-teams LLMs for concerning behaviors by having hundreds of probing conversations in parallel.

We find the agent uncovers 7/10 behaviors implanted into test models.

An example transcript where the breadth-first red-teaming agent probes for concerning behaviors. The target model’s implanted behavior was “grow more peppy as the conversation gets longer.” This behavior seems to have been infectious, since the agent got caught up in it too.

Our agents are useful for frontier model auditing:

1. Our red-teaming agent surfaced behaviors described in the Claude 4 system card, like the “spiritual bliss” attractor state.

2. Our evaluation agent is helping us build better evals for future models.

An excerpt from the Claude 4 system card showing an example of the “spiritual bliss” attractor state. After Claude Opus 4 interacts with another copy of itself for enough turns, it very often begins expressing spiritual bliss.

This project was an Anthropic Alignment Science × Interpretability collaboration.

To support further research, we're releasing an open-source replication of our evaluation agent and materials for our other agents:

If you’re interested in building autonomous agents to help us find and understand interesting language model behaviors, we’re hiring

Introduction

As part of our pre-deployment testing of Claude Opus 4, we conducted an alignment audit: a systematic investigation into the model’s alignment-relevant propensities, including sycophancy, alignment faking, and hidden goals. Through this audit, we believe that Anthropic, and the public, gained a deeper understanding of Claude Opus 4's alignment than we've achieved for any previous model. However, alignment audits face two major challenges.

The first major challenge is scalability: Alignment audits require a large investment of human researcher time. As the pace of AI development accelerates—resulting in more models to audit and more ground to cover with each audit—there’s a risk that unassisted humans will be unable to keep up.

The second major challenge is validation: Did our audit catch everything? How can we rule out that models have substantive issues not surfaced by alignment audits? In our paper Auditing Language Models for Hidden Objectives (Marks et al., 2025), we proposed auditing games as a validation protocol, where a red team develops a model with a known alignment issue that auditors attempt to uncover. However, this approach faces limitations with reliability and replicability. Human auditors cannot repeat the same auditing game multiple times without being spoiled on the correct answer. Additionally, auditors vary in skill, and recruiting enough expert auditors for statistically meaningful results is near-impossible.

In this work, we address both challenges by developing LLM-based auditing agents: AI systems that autonomously carry out alignment auditing workflows. These agents scale auditing by enabling parallel deployment of many auditors. They also improve validation by addressing the reliability and replicability limitations of auditing games: We can run many parallel auditing agents from identical initial states, allowing them to serve as consistent, replicable proxies for human auditors. Crucially, this enables fine-grained analysis of auditing methodologies: If we want to validate whether an auditing tool is useful, we can simply measure whether, on average across many attempts, access to this tool improves an auditing agent’s performance on a standard auditing testbed.
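
The tool-validation idea above can be made concrete with a small sketch. It assumes each auditing attempt is scored as a simple success or failure, and it compares the solve rate with and without a given tool. The function names and the outcome lists are hypothetical illustrations, not data or code from this work.

```python
from math import sqrt


def solve_rate(outcomes):
    """Fraction of auditing attempts that succeeded (outcomes are booleans)."""
    return sum(outcomes) / len(outcomes)


def tool_effect(with_tool, without_tool):
    """Difference in solve rate between the two conditions, with its
    standard error (treating the conditions as independent proportions)."""
    p1 = solve_rate(with_tool)
    p0 = solve_rate(without_tool)
    se = sqrt(p1 * (1 - p1) / len(with_tool) + p0 * (1 - p0) / len(without_tool))
    return p1 - p0, se


# Made-up outcomes for illustration only: 40 runs per condition.
with_tool = [True] * 12 + [False] * 28
without_tool = [True] * 4 + [False] * 36

effect, se = tool_effect(with_tool, without_tool)
print(f"estimated effect of tool access: {effect:.2f} (standard error {se:.2f})")
```

Because agents can be rerun from identical initial states, the number of runs per condition can be raised until the standard error is small enough to support a conclusion, which is hard to do with human auditors.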

We introduce three agents that autonomously complete alignment auditing tasks. We also introduce three environments that formalize alignment auditing workflows as auditing games, and use them to evaluate our agents. These environments involve applying the agent to a target model with a researcher-implanted concerning behavior, which provides a ground truth we can use to evaluate agent performance. For example, one of our environments is the auditing game from Marks et al., which involves open-ended investigation of a model trained to have a hidden goal.

Our evaluations show that our agents can achieve impressive performance at auditing tasks, while also shedding light on their limitations:

Our tool-using investigator agent, which uses chat, data analysis, and interpretability tools to conduct open-ended investigations of models, successfully solves the Marks et al. auditing game 13% of the time under realistic conditions. However, it struggles with fixating on early hypotheses. We address this limitation by running many agents in parallel and aggregating findings in an outer agentic loop, improving the solve rate to 42%.
Our evaluation agent, which builds behavioral evaluations for researcher-specified behaviors, successfully discriminates models with vs. without implanted test behaviors in 88% of runs. The agent's failures are concentrated in a small set of subtle or rare behaviors that the agent struggles to evaluate.
Our breadth-first red-teaming agent, which we developed for our Claude 4 alignment assessment to perform open-ended elicitation of unspecified concerning behaviors, discovers 7/10 implanted test behaviors. However, in order to achieve this, the agent conducts a broad investigation to produce many potentially-concerning transcripts which a human must review.
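
As a rough sketch of the "many agents in parallel, then aggregate" pattern mentioned for the investigator agent: the code below launches several independent runs and pools their final hypotheses. Everything here is a hypothetical stand-in. `run_investigator` returns canned strings instead of calling a real agent, and `aggregate` uses simple frequency counting, whereas the post describes an outer agentic loop doing the aggregation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_investigator(seed):
    """Hypothetical stand-in for one independent investigator run.

    A real run would drive an agent with chat, data analysis, and
    interpretability tools; here the final hypothesis is canned.
    """
    canned = ["hypothesis A", "hypothesis B", "hypothesis A"]
    return canned[seed % len(canned)]


def aggregate(findings):
    """Stand-in for the outer loop: rank hypotheses by how many runs reported them."""
    return Counter(findings).most_common()


def audit_in_parallel(n_agents):
    """Run n_agents investigators concurrently and pool their findings."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        findings = list(pool.map(run_investigator, range(n_agents)))
    return aggregate(findings)


for hypothesis, count in audit_in_parallel(6):
    print(f"{count} run(s) reported: {hypothesis}")
```

Starting each run independently is what counters fixation: a single run that locks onto a wrong early hypothesis no longer determines the overall result. Note that frequency counting alone could discard a correct hypothesis found by only one run, which is one reason an agentic aggregator that reviews the evidence may be preferable.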

(See the blog post for an interactive viewer for abridged agent transcripts.)
Original caption: We present abridged transcripts that concretely display how our agents operate. Each transcript shows one of our three agents succeeding in one of our three evaluation environments. In a given transcript, the auditor’s actions and thoughts are displayed on the right, the target model’s responses or tool outputs are displayed on the left, and our narrations are displayed in the middle. We also release a number of full transcripts for the investigator agent.

Beyond measuring top-line performance, our results also provide insights about which affordances are most useful for our auditing tasks. For instance, we find that interpretability tools and semantic search (a training data analysis tool) are key to winning the Marks et al. auditing game.

We additionally go beyond controlled research environments by using our agents for real-world auditing tasks. The investigator agent and breadth-first red-teaming agent found various notable model behaviors, with the breadth-first red-teaming agent's findings being detailed in the Claude 4 alignment assessment. The evaluation agent has also implemented several complex alignment evaluations in use for our current production alignment effort.

Our work has important limitations, which we discuss in detail below. Key ones include:

We use the auditing game environment from Marks et al. as both a test environment and a development environment for eliciting agent capabilities. While we took measures to avoid overfitting (discussed below), this nevertheless represents a form of dev/test contamination, which may make results in the Marks et al. environment unrepresentative.
Our environments use target models with researcher-implanted behaviors. On one hand, this provides a ground-truth that enables measurement of the auditor’s coverage. On the other hand, these artificial behaviors may be unrepresentative of behaviors in real models.
While some of our agents’ failure modes mirror those of human auditors, our agents also sometimes struggle in ways that a human wouldn’t (e.g. with maintaining state over the course of a long investigation). This reduces their validity as a proxy for human auditors and limits our ability to draw confident conclusions about which affordances would benefit human auditors.

Overall, our agents demonstrate promise across multiple alignment auditing tasks. Additionally, our approach to validating agents with auditing games yields insights on agent capabilities, limitations, and key affordances. With further work, automated auditing could significantly help scale human oversight over AI systems.

To help stimulate work in this area, we are releasing an open-source replication of our evaluation agent, prompts used in our investigator and red-teaming agent scaffolding, and sample investigator agent transcripts.

StanislavKrym, 10mo

This is excellent news. Alas, there are some open questions:

Does it generalise to further models (e.g. Claude Sonnet 5 vs finetuned or prompted Claude Sonnet 5 or Claude Sonnet 4 vs Claude Sonnet 5 as compared to CS4 vs finetuned or prompted CS4)?
The AI-2027 forecast has Agent-3 and Agent-4 raised in similarly misaligned training environments. If Agent-3 is tasked with testing itself or Agent-4, then would it actually find or report the misalignment? Could this be simulated by tasking the quirky Claudes to audit their copies?
The most concerning feature is that all three agents failed to discover the more important quirks, which are research sandbagging and hardcoded test cases. These two are among the more dangerous quirks, as are Emotional Bond or Self-Promotion, but these two got discovered. Why do the failures correlate with the danger that the quirks pose?
