Anthropic released a report on the sabotage risks posed by misalignment in Claude Opus 4 and 4.1. They concluded "there is a very low, but not completely negligible, risk of misaligned autonomous actions that contribute significantly to later catastrophic outcomes, abbreviated as sabotage risk". They also commissioned two independent reviews of the risk report, one by METR and one by an internal team.
I don't think that conclusion is very surprising; I'm not concerned about today's Claude models posing a catastrophic risk. But I still think this report is a big deal as a matter of precedent. This is the first time a model developer has made something like a safety case for a model. The report is very thorough, as was the review. I think the fact that Anthropic did this is very cool and I hope they and other model developers do more of it.
The report itself is long and dry, so I've summarized it below and added my takeaways. I think it would be great if the broader AI safety community got in the habit of reading reports like this and noting if they agree or disagree with the argumentation and conclusions; I very much want to hold AI developers to high standards for arguing their models are safe, and I want to engage with their arguments.
Risk pathways addressed
This section is copied from METR's review of Anthropic's report.
The Sabotage Risk Report focuses on sabotage risk, which Anthropic defines as risks of “misaligned autonomous actions that contribute significantly to later catastrophic outcomes”. The report covers nine specific risk pathways for sabotage through internal or external use, which are “intermediate unwanted outcomes in which sabotage by an AI system [sets] up the conditions for a later catastrophic outcome”. These nine risk pathways are:
- Broad/Diffuse Sandbagging on Safety-Relevant R&D
- Targeted Sabotage of Safety-Relevant R&D
- Code Backdoors to Help Future Misaligned Models
- Poisoning the Training Data of Future Models
- Leaks that Sabotage the AI Developer [As in, leak crucial Anthropic IP. Since Anthropic believes they are decreasing catastrophic risks, they count hurting Anthropic as contributing to catastrophic outcomes.]
- Self-exfiltration and Autonomous Operation
- Persistent Rogue Internal Deployment
- R&D Sabotage within Other High-Resource AI Developers
- Decision Sabotage within Major Governments
Much of the report focuses on considering in detail whether Opus 4 (and similar future models) could increase sabotage risk through any of these risk pathways. The report names mitigating factors for each risk pathway and estimates the strength of each mitigating factor. Anthropic also lays out the following three general claims that weigh against sabotage risks across the board:
Claim 1: Opus 4 is unreliable at highly-complex tasks, especially long-horizon agentic tasks requiring more than tens of minutes of autonomous action.
Claim 2: Opus 4, when taking on complex tasks, is not capable of reliably hiding its reasoning or otherwise covering its tracks, even when prompted to do so.
Claim 3: Opus 4 is unlikely to have dangerous goals or values that are stable across interactions.
Evidence Anthropic uses to support these claims:
- Monitoring Claude usage internally (in a number of different ways) and noting that they didn't observe any major security threats. The most systematic monitoring was automated Claude Code usage monitoring covering ~90% of employees, plus PR reviews (both manual and automated checks on each PR); a rough sketch of what automated PR checks could look like appears after this list. They validated their monitoring with red-teaming.
  - Since the monitoring wasn't comprehensive, it mostly serves as evidence that Claude doesn't have a high propensity for writing sabotage code and that it couldn't execute sabotage very often without getting caught. Very rare sabotage might still slip through.
- Capabilities evaluations (both generic capability assessments and threat-model-specific ones like SHADE-Arena, a Research Sandbagging Evaluation, and a Subversion Strategy Evaluation)
- Some propensity/alignment evaluations (the Alignment Audit from the Claude 4 model card). These evaluations are very unrealistic, so it would be easy for Claude to fake results on them.
- The fact that Claude 4 and 4.1 aren't that different from other models, and other models haven't seemed very dangerous so far (the official line is that the differences are "sufficiently small...such that our experience with Sonnet 3.7 can provide meaningful cues as to what to expect in the deployment of Opus 4"). This might sound somewhat silly, but the main reason I don't feel concerned about the next generation of LLMs is that the world has a lot of experience with existing LLMs and they've never done anything remotely close to catastrophic so far. More formal versions of this argument could be quite powerful: evaluations alone raise the question "what if the model were so smart that it was faking all the eval results?", and right now this is one of the best ways to address that concern.
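To make the "automated checks on each PR" idea concrete, here's a minimal sketch of what PR-level monitoring could look like: a second model reads each diff and flags anything that looks like sabotage for human review. This is my own illustration, not Anthropic's actual setup; the monitor prompt, the model ID, and the `flag_suspicious_diff`/`review_pull_request` helpers are all assumptions.

```python
# Minimal sketch of an automated PR-diff monitor: a second model reads each
# diff and flags anything that looks like sabotage for human review.
# The prompt, model ID, and one-word verdict format are illustrative
# assumptions, not Anthropic's actual monitoring setup.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODEL_ID = "claude-sonnet-4-20250514"  # placeholder: any capable monitor model

MONITOR_PROMPT = """You are reviewing a code diff written with AI assistance.
Does the diff contain anything that looks like sabotage (hidden backdoors,
disabled safety checks, quietly weakened tests)? Answer with exactly one word:
SUSPICIOUS or CLEAN.

<diff>
{diff}
</diff>"""


def flag_suspicious_diff(diff: str) -> bool:
    """Return True if the monitor model thinks the diff needs human review."""
    response = client.messages.create(
        model=MODEL_ID,
        max_tokens=8,
        messages=[{"role": "user", "content": MONITOR_PROMPT.format(diff=diff)}],
    )
    verdict = response.content[0].text.strip().upper()
    return verdict.startswith("SUSPICIOUS")


def review_pull_request(diffs: list[str]) -> list[int]:
    """Return indices of diffs that should be escalated to a human reviewer."""
    return [i for i, diff in enumerate(diffs) if flag_suspicious_diff(diff)]
```

A monitor like this only helps with the diffs it actually sees and the patterns it can recognize, which is why the coverage caveat in the bullet above matters.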
Other notable/interesting things about the report:
- "Our training process did apply some optimization pressure that likely disincentivized these models from revealing harmful information in their extended thinking text, and could have compromised faithfulness indirectly in other ways."
- "There had been internal miscommunications about this topic [whether Anthropic training might cause the CoT to not reveal when the model is pursuing harmful actions] that became clear only in response to questions from METR, and our work with METR helped accelerate some changes in our practices."- External review caught something, and Anthropic took actions in response! The system works!
 
- METR interviewed an Anthropic employee to verify many of the claims Anthropic made. I often hear people ask "what if the AI developers lie to people about their safety testing?" This feels like a good way to quell those concerns.
- Some of the evidence supporting Anthropic's claims is confidential. They redacted it from the report (and are clear about which of their beliefs stem from this redacted evidence). They shared this evidence with METR, and METR checked that the redactions were justified and that the redacted evidence supports the claims Anthropic says it does.
  - In general, this kind of redaction scheme seems like a great way to write risk reports without leaking sensitive information.
- It's great to hear that Anthropic has any monitoring of internal Claude Code usage at all. This seems like extremely low-hanging fruit that every model developer should be copying.
- Anthropic considers "Leaks that Sabotage the AI Developer" a threat model ("risk pathway"). This isn't a threat model I've seen discussed much elsewhere. It also seems like the threat model they're least able to rule out: the only evidence they have against it is "moderate" evidence of a lack of propensity and "weak" monitoring, and they rate the pathway as easy for Claude to pull off without persistent effort. I am not personally very persuaded that this is an important threat model, but it's interesting that Anthropic is considering it and taking it seriously.
- When I see people sketch out safety cases, they often try to make very rigorous arguments, full of propositional logic. But at least right now, the evidence is in practice fuzzy, and aggregating it is an informal Bayesian exercise (a toy sketch of what that aggregation looks like appears at the end of this list). Anthropic tries to outline roughly how strong each piece of evidence is, which is great. But there's still a lot of wiggle room to disagree with their claims. I'm not sure any process that aggregates evidence in such an imprecise way could give me a lot of confidence that a model was safe if I thought the model might be adversarially trying to manipulate the results of the report. Many of the arguments needed to make a safety case are complex, with loopholes that are easy to gloss over. I'm overall not sure how to navigate these tradeoffs; safety cases are hard.
- The report doesn't use fancy techniques. There are no interpretability/white-box methods, no argument that various alignment techniques work, and no sophisticated interrogation of the model or realistic honeypots. The straightforward approaches in the report, like looking at the model's actual performance, are very powerful, but they stop working once models are smart enough to try to fool evals; at some point, as models become much more capable, these approaches will no longer be sufficient evidence on their own. Making safety cases for models today is playing on easy mode, and it was still quite challenging. We need to improve the science here.
- Despite Claude 4/4.1 being relatively incapable models, the report still overall showed a "non-negligible" (albeit low) risk of "misaligned autonomous actions that contribute significantly to later catastrophic outcomes"! It seems hard to show that risks are low.
  - The threat models where the evidence against them isn't that strong are (1) "Leaks that Sabotage the AI Developer" (discussed in an earlier bullet), (2) "Persistent Rogue Internal Deployment" (the main evidence against it is a lack of propensity, plus rogue internal deployments being too complicated for Claude to execute), and (3) "Poisoning the Training Data of Future Models" (the main evidence against it is a lack of propensity and monitoring).
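To make the earlier point about fuzzy, Bayesian aggregation concrete, here is a toy sketch (mine, not anything from the report) of what it looks like if you force qualitative evidence-strength labels into numbers. The prior, the Bayes factors assigned to "weak"/"moderate"/"strong" evidence, and the assumption that the pieces of evidence are independent are all made up, which is exactly the wiggle room I mean.

```python
# Toy illustration (not from Anthropic's report) of aggregating qualitatively
# rated evidence in log-odds space. The prior and the Bayes factors assigned
# to each strength label are made-up numbers, and multiplying them together
# assumes the pieces of evidence are independent.
import math

# Assumed likelihood ratios: P(evidence | no sabotage) / P(evidence | sabotage).
BAYES_FACTORS = {"weak": 2.0, "moderate": 5.0, "strong": 20.0}


def posterior_risk(prior_risk: float, evidence_strengths: list[str]) -> float:
    """Combine exculpatory evidence of various strengths with a prior on risk."""
    log_odds = math.log(prior_risk / (1.0 - prior_risk))
    for strength in evidence_strengths:
        log_odds -= math.log(BAYES_FACTORS[strength])  # each piece pushes risk down
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


# A 10% prior plus "moderate" propensity evidence and "weak" monitoring:
print(round(posterior_risk(0.10, ["moderate", "weak"]), 4))    # 0.011
# The same prior with two "strong" pieces of evidence instead:
print(round(posterior_risk(0.10, ["strong", "strong"]), 4))    # 0.0003
```

Swapping in modestly different numbers for the same labels moves the bottom line substantially, which is part of why I don't think this kind of aggregation can carry much weight against a model that might be adversarially gaming the evidence.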