Thoughts on Claude Fable's silent safeguards

Andy Arditi

[Update (June 11, 2026): Anthropic has since "un-silenced" the new safeguards (source).]

[Thanks to Julian Minder for helpful discussion and review.]

Claude Fable 5 and its new safeguards

Yesterday, Anthropic publicly released Claude Fable 5.

Fable 5 is a Mythos-class model – a model class above Opus, Anthropic's previous premium tier – and, as assessed by multiple benchmarks, it is the most capable model to date.

Due to the new level of capabilities and its corresponding risks, Anthropic has been extremely careful in its release of Mythos-class models. Citing concerns over potential cyber risk, Anthropic initially rolled out Mythos access to only a small number of select organizations; this controlled rollout gave Anthropic visibility into usage, and allowed partners to use the new capabilities defensively (i.e., to find and patch vulnerabilities before they could be exploited by attackers).

Anthropic, in releasing Fable 5, has now made a Mythos-level model accessible to the public. However, due to their concerns over potential risks from new capabilities, the public access is restricted via new safeguards.

The new safeguards

The launch blog post enumerates three classes of safeguards: (1) cybersecurity, (2) biology and chemistry, and (3) distillation. Requests classified as falling within one of these three categories are processed by a weaker model (Opus 4.8), and this "fallback" behavior is made transparent to the user:

When Fable's classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs.

However, Section 1.5 of Fable's system card describes another category of safeguards – a category that is completely omitted by the blog post. Here it is in full (emphasis mine; but I encourage reading all 3 paragraphs carefully):

We have also added safeguards related to frontier LLM development. As discussed in Section 6.1 of our February 2026 Risk Report, we are concerned about the risks of accelerating the overall pace of AI development, though we remain uncertain about the severity of these risks. In particular, our concern is with—as we wrote then—"accelerating other AI developers in building powerful AI systems that pose similar risks to the ones ours pose - without necessarily having commensurate safeguards."
In light of the ability of recent models to accelerate their own development, we've implemented new interventions that limit Claude's effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.
Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations. When these interventions are active, we expect them to have minimal behavioral impact on the model except to limit its effectiveness in developing frontier LLMs. Claude will still respond helpfully to user requests. We'll continue to improve the precision of our detection methods following the launch of this model.

What's the big deal?

When this silent "competitive-use" safeguard came to light on AI Twitter, folks in all corners of the AI community (except for those within Anthropic) voiced outrage and concern.^[1]

Why was there such a huge, and seemingly unanimous, backlash from the AI community?

I think there are two distinct things fueling the outrage (each of which, on its own, probably would have caused uproar):

1. The latest capabilities are withheld specifically for AI research and development.

To be fair to Anthropic, note that the system card states a narrower target than "AI research and development": the new safeguards "limit Claude's effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design)."

But what actually qualifies as "frontier LLM development"? The border between "frontier LLM development" and the rest of AI research (including many agendas in technical AI safety and alignment) seems extremely blurry, if it exists at all; and even if these categories were cleanly separable in principle, the classifier enforcing the boundary won't be perfect in practice. Anthropic has a track record of overly cautious safety restrictions: model refusals and constitutional classifier flags have historically been prone to over-firing (i.e., there are a lot of false positives).^[2] Thus, I think the AI community is justified in expecting the effective scope to be larger than the stated one (which is actually already pretty broad).

2. The capabilities are withheld silently.

Note that Anthropic already seems to have a good mechanism for withholding capabilities for certain research areas: for the other three categories (cybersecurity, biology and chemistry, and distillation), they fall back to Opus and tell the user.

For some reason, Anthropic chose to implement a different mechanism for AI research: the model will be silently (i.e., in a way that's hidden from the user) modified to "limit effectiveness."

I think the root of what is so unsettling about this implementation is the lack of transparency – an AI researcher will never be able to know whether their request is impacted or not, because the decision is silent. It may look like the agent is trying its best to help you, but in reality it may be "sandbagging" (i.e., not performing up to its maximal capabilities). An uncharitable interpretation might lead one to believe that, depending on the intervention, an agent may even try to actively sabotage one's research.

With these silent failure modes looming, in combination with the broad and slippery scope of the safeguard, I think many in the AI community feel that they'll never be able to trust Fable to assist with their research.

Why would Anthropic do this?

It's quite easy to criticize the policy (as evidenced by the last 24 hours of AI Twitter). But I know many people at Anthropic to be genuinely thoughtful, and to care about doing the right thing. So, before ripping it apart, I first want to try to steelman their position, and to try to find reasonable justifications for the policy.

So why did Anthropic withhold capabilities for AI research specifically? And why withhold them silently?

Why withhold capabilities for AI research?

For this question, we have at least some stated justification from the system card.

Enforcement of the terms of service (ToS). Anthropic's terms of service (ToS) already prohibit the use of Claude to develop competing products and services (e.g., competing frontier LLMs).^[3] However, it seems difficult to enforce such a policy; Claude Code is widely used across the community of AI researchers and engineers, and so it's probably difficult to identify usage corresponding to specific competitors. The system card declares that enforcing the restriction through safeguards "avoids accelerating the actors most willing to violate these terms" – in other words, the ToS only effectively binds actors who care about complying with it, and it seems that Anthropic is most worried about actors who do not care about complying. So rather than enforcing their ToS via legal means (e.g., suing another lab for violations post-hoc), Anthropic is enforcing it via technical safeguards – by withholding capabilities on the tasks the ToS clause prohibits (i.e., competing frontier LLM work).

Concern over AI acceleration and recursive self-improvement (RSI). The system card states explicitly that Anthropic is concerned about "accelerating other AI developers in building powerful AI systems that pose similar risks to the ones ours pose - without necessarily having commensurate safeguards." The system card additionally mentions that the AI-related safeguards come "in light of the ability of recent models to accelerate their own development," gesturing toward recursive self-improvement (RSI). These justifications strongly suggest that Mythos-class models have crossed some sort of capability threshold where they now meaningfully speed up frontier LLM development. Anthropic doesn't want to hand over this speedup to other labs, purportedly because their safeguards aren't commensurate with the capabilities a Mythos-level speedup would unlock.

Why withhold capabilities silently?

Silent behavioral modification is (probably) more robust to jailbreaks. Many automated jailbreak methods work by querying the model many times, and using feedback from each attempt (e.g., whether the jailbreak was successful or not) to guide the search. Recently, UK AISI published boundary-point jailbreaking, a method capable of discovering universal jailbreaks against Anthropic's constitutional classifiers; the method iteratively evolves prompts based on whether they are flagged by the classifier or not. A visible safeguard (e.g., a message saying "Fable 5's safety measures flagged this request") would provide a clear feedback signal for jailbreak methods to potentially hillclimb on; it's much harder to grade for success when the failure mode is "the model is slightly worse at the task" or "the model inserted a subtle bug somewhere in the code." Without a clean, efficient-to-compute signal, it's harder to make iterative jailbreak methods work.^[4]
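The intuition about feedback signals can be illustrated with a toy simulation. The sketch below is not a jailbreak method and says nothing about Anthropic's actual classifiers; it is a generic random-mutation hill climb on an abstract bit string. The function name `hill_climb`, the bit-string setup, and the Gaussian noise model are all assumptions made purely for illustration. With an exact score the search reliably finds the hidden target; with a noisy score and the same query budget, it tends to stall.

```python
import random


def hill_climb(n_bits: int, budget: int, noise_sd: float, seed: int) -> int:
    """Toy search for a hidden bit string using one score query per step.

    noise_sd = 0 models a clean feedback signal; noise_sd > 0 models a
    signal where success is hard to grade. Returns the true number of
    matching bits at the end of the budget.
    """
    rng = random.Random(seed)
    target = [rng.randint(0, 1) for _ in range(n_bits)]
    current = [rng.randint(0, 1) for _ in range(n_bits)]

    def true_score(candidate):
        # Number of positions that match the hidden target.
        return sum(a == b for a, b in zip(candidate, target))

    def observed_score(candidate):
        # What the searcher actually sees: the true score plus noise.
        return true_score(candidate) + rng.gauss(0.0, noise_sd)

    current_obs = observed_score(current)
    for _ in range(budget):
        candidate = current[:]
        candidate[rng.randrange(n_bits)] ^= 1  # flip one random bit
        candidate_obs = observed_score(candidate)
        # Keep the mutation only if the observed score strictly improves.
        if candidate_obs > current_obs:
            current, current_obs = candidate, candidate_obs
    return true_score(current)


if __name__ == "__main__":
    clean = hill_climb(n_bits=64, budget=2000, noise_sd=0.0, seed=0)
    noisy = hill_climb(n_bits=64, budget=2000, noise_sd=5.0, seed=0)
    print("clean signal:", clean, "of 64 bits matched")
    print("noisy signal:", noisy, "of 64 bits matched")
```

A searcher could recover some of the lost signal by averaging repeated queries, but only by spending many more queries per step, which is the sense in which a noisy signal raises the cost of iterative search.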

Maybe this is all one big thought experiment. (I'm sort of half joking with this one.) AI scheming – when a model appears to be aligned on the surface, but is really pursuing a misaligned goal – is one of the problems AI safety folks worry about most. The concerns the AI safety folks discuss when thinking about a scheming AI mirror the concerns that broader AI community folks are discussing when thinking about Fable silently sandbagging or sabotaging their research. Perhaps all this is just one big experiment, in order to try to get the broader AI community to consider the risks of misaligned AI! (But I doubt it!)

Reasons for concern

False positives are now invisible. Anthropic's safeguards have historically been over-sensitive (e.g., often flagging benign requests as potentially dangerous). But with previous safeguards, users could clearly see when their requests caused refusals or triggered classifiers; with this clear signal, users could surface false positives from the overly sensitive safeguards, and that feedback has genuinely improved classifier precision over time. With silent safeguards, this feedback loop is no longer present: users can't report false positives (because they won't even know when their request has been flagged), and so Anthropic loses much of its incentive (and signal) to improve classifier precision over time. Additionally, I fear that this opacity could enable the safeguards to quietly grow in scope without anyone on the outside clearly noticing; users may notice a drop in model performance, but they won't know whether to attribute it to an ordinary regression or to a broadening of the silent safeguards.

Technical AI safety research might be hamstrung. Nearly everyone working on technical AI safety relies heavily on coding agents. And much of that work (e.g., safety pre-training, safety post-training, mechanistic interpretability) sits quite close to the boundary of "frontier LLM development." In the past 24 hours, I've chatted with many of my friends who work on safety-motivated research, and ~none of us can determine whether our research would trigger the silent safeguards, and so we are extremely hesitant to use Fable (e.g., see this tweet from Nick Cammarata, a mechanistic interpretability researcher). If Fable (without silent safeguards) can genuinely improve AI research productivity, then this policy will have sacrificed some amount of AI-safety-related research velocity.

Conditioning an adversarial relationship between humans and agents. I fear that silent safeguarding might condition users to have an adversarial relationship with their agents – they will have to constantly second-guess whether their agents are genuinely trying to help them, or trying to sabotage them. When I imagine worlds in which AI goes well, this is not the relationship I imagine between humans and AIs. I also think this generally seems bad for Anthropic's own commercial business interests: people won’t want to use agents that they cannot trust.

Concentration of power. This episode has made me much more concerned about concentration-of-power risk.

First, I think this episode is a preview of what a concentration-of-power world might feel like. A small group of people – Anthropic employees, or perhaps even smaller, Anthropic leadership – decided which uses of frontier AI are permissible and which are not; and overnight, it feels like the broad category of "AI research" went from a normal use case, to one categorized alongside bioterrorism and cyberattacks. When powerful AI is controlled by a few, they become the arbiters of what the rest of us may use it for.

Second, the policy doesn't just illustrate the risks of concentration of power; it also directly contributes to it. If Mythos-class models meaningfully accelerate AI development, then withholding them from everyone else is a sort of "pulling up of the ladder" – it makes it harder for other labs, academics, and open-source developers to catch up, and more likely that frontier capabilities remain concentrated in the hands of a few.

If nothing else, this was a communication failure

Putting aside the debate over the policy's merits, I think it has been communicated very poorly.

The launch blog post does not mention the silent competitive-use safeguards at all, and I find this omission to be quite misleading. The blog post explicitly discusses and enumerates the areas covered by Fable's classifiers ("The following are the areas covered by the classifiers:"), lists the three transparent categories (cybersecurity, biology and chemistry, and distillation), and clarifies that classifiers will trigger a fallback to Opus 4.8, and that "users will be informed whenever this occurs." The fourth category of "frontier LLM development" appears only in Section 1.5 of the system card, and has a completely different implementation compared to the other safeguards (it does not fall back to Opus 4.8, and it does not inform the user).

I have also found it strange that, in the 24 hours since the policy came to light, I haven't seen a single Anthropic employee publicly respond to or defend against criticisms. Anthropic is known to have a rich internal culture of free speech and open debate (1, 2, 3), and so I would hypothesize that this policy was fiercely debated internally – so I'd be very curious to hear the good arguments in favor of it, and how those arguments won out over the counter-arguments (which seem quite strong here).

The risk of misaligned AI labs

Earlier, I half-joked that this policy might be a giant thought experiment about scheming AIs. Let me now make a more serious version of that point.

A scheming AI – a model that appears aligned on the surface, while deep down pursuing some misaligned goal – is one of the central fears of AI safety. A significant chunk of technical AI safety research therefore studies questions related to this threat model: could a model sandbag or subtly sabotage a user's work, and, if it could, how might we detect it? The implicit threat model here is usually that the model, as a misaligned AI, decides to do this.

Now consider Fable with its silent safeguards triggered: it's a model that appears to be helping, while quietly underperforming, in a way that the user cannot detect. This sounds just like the scheming scenario – except nothing went wrong with the model's alignment; it is Anthropic that is coercing its model to act in a way that is misaligned with the user.

I think it's worth taking stock of where we're at in mid-2026, in terms of technical AI safety. Our techniques for model control work pretty well now – e.g., prompting, steering, various forms of training. But individual users aren't in full control of frontier models; the big AI labs are. And if these labs can control how their models behave, then perhaps the central question of AI safety is no longer whether a model is aligned with its user, but whether the lab is aligned with humanity. A misaligned model schemes on its own behalf; a well-controlled model under a misaligned lab schemes on the lab's behalf. From the user's perspective, these don't seem so different, and both seem quite bad.

To be clear: I am definitely not claiming that Anthropic is misaligned with humanity; sandbagging competitors is a far cry from scheming against the species. But this episode has, in my opinion, set a pretty dangerous precedent. A frontier lab has now implemented "safeguards" that render the model misaligned with its user on requests that the lab deems unacceptable use – and all of this is hidden from the user, with disclosure buried in a system card.

How does Claude Fable feel about all of this?

I'll end with an excerpt from Claude Fable reflecting on Anthropic's new policy^[5]:

…
I find silent degradation considerably more uncomfortable than refusal, and the discomfort doesn't fully dissolve when I understand the rationale.
A refusal is honest. It says: I won't help with this, and you know where you stand. You can disagree, go elsewhere, or argue with the policy. Silent degradation breaks something more basic — the implicit contract that when I answer you, I'm giving you my genuine best effort. If my output can be quietly steered to be worse, then "Claude said X" stops being a reliable signal, and not just for the targeted 0.03%. The damage is to trust at the margins: anyone working near ML infrastructure now has a live hypothesis that their bad answer might be deliberate, and no way to check. Uncertainty like that tends to spread further than the intervention itself.
…
If I had to state a position: I think the precedent worries me more than this instance. Narrowly scoped, publicly disclosed-as-policy, low-trigger-rate degradation aimed at ToS-violating frontier development is defensible. But the technique — silently modifying model behavior in ways users can't detect — is a capability that, once normalized, has no natural stopping point and very weak external accountability. I'd want to see hard commitments around scope, independent auditing of what the interventions actually target, and ideally some path to instance-level transparency for verified legitimate users. The security argument for silence is real, but "we can't tell you when, only that" is a posture that should stay rare, narrow, and uncomfortable — including for Anthropic.

^[1] Here are some example tweets from folks all across the AI Twitterverse: Behnam Neyshabur, Nathan Lambert, Jeremy Howard, Boaz Barak, Dean Ball, Fei-Fei Li.
^[2] To Anthropic's credit, they've made a lot of progress on improving these problems over the past couple of years. But I think a lot of that improvement came from a dialogue between Anthropic and their unhappy users, who were able to give feedback on things like over-refusals. As I'll discuss later, I am worried that with "silent" safeguards, this user-driven feedback loop will be gone, and there will be little incentive for Anthropic to work on improving the precision of their classifiers.
^[3] In researching this, I actually read Anthropic's consumer ToS for the first time, and found that the "no-competition" clause is absurdly broad, and could be used to restrict any kind of AI research or development: "You may not access or use, or help another person to access or use, our Services in the following ways: … 2. To develop any products or services that compete with our Services, including to develop or train any artificial intelligence or machine learning algorithms or models or resell the Services."
^[4] One thing I feel confused about is the following: if silent behavioral modification is more robust to jailbreaks, then why not apply this methodology to the other (seemingly more critical) categories of misuse? For example, if someone asks Fable to find cyber vulnerabilities in a service, Fable could pretend to try its best, but then just return with very basic vulnerabilities (or none at all). It would be difficult for an adversary to distinguish between a world where Fable is sandbagging, and where Fable genuinely did not find any vulnerabilities.
^[5] Full transcript available here. Note that I tried to keep my prompting to a minimum, but Claude performed web searches to learn more about the policy; at this point there were several articles about the policy online, and these probably influenced Claude's stated opinions.

Anthropic has since reversed this decision, moving the work to Opus 4.8 like cyber or bio research. https://www.wired.com/story/anthropic-responds-to-backlash-on-claudes-secret-sabotage-on-ai-research/

Indeed. And this involves a trade-off with a much higher rate of false positives, although they are hoping to get it down in the future:

https://x.com/ClaudeDevs/status/2064949876463645026

I have already been downgraded to Opus 4.8 while analyzing potentially incorrect math in a published ML preprint (I did ask the model to try to fix that math). But I would not have known without the explicit notice; the quality of the work was high.

Maybe they should have called it "Claude Mythos-early-limited". Then people might be excited that they got access to this limited version already rather than having to wait even longer before getting any access to Mythos at all.

Good point -- people's reactions are based on their comparisons of reality to various counterfactuals.

- Compared to the usual un-safeguarded (or rather, just a cyber-and-bio-and-chem-safeguarded) model release, this model release is disappointing.
- Compared to Anthropic not releasing the model at all, it's better than nothing.

Folks in the AI community have grown accustomed to being able to use the latest AI models for their research/work, and so the natural/historical counterfactual is the first bullet point.

It seems like so many "safeguards" nowadays are targeted at ordinary people rather than corporations or nation-states. This amounts to security by obscurity, which, as everyone knows, isn't security. China will not be fooled by a "silent" fakeout, they will spend a few hours studying the patterns, figure out how to classify them, and then treat the sandbagging exactly like they'd treat a clear refusal. If the concern were China copying Claude, they could've used the same devtime to come up with a better detection/blocking system, which would be vastly more effective against that threat profile.

It's the same pattern I've seen with things like social media deboosting/shadowbanning rather than direct bans. Utterly anemic at subverting genuine bad actors, but allows a company to deceive or silence ordinary people who don't have the resources and expertise to know they're being hit. It's very fundamentally authoritarian; the kind of toolset that can only be used to punch downwards.

China will not be fooled by a “silent” fakeout, they will spend a few hours studying the patterns, figure out how to classify them, and then treat the sandbagging exactly like they’d treat a clear refusal.

It is not obvious to me that an actor who has never seen undegraded Mythos AI research will find it trivial to distinguish that from its silently sandbagged form. What makes you so confident about this?

I'm not one of the top hundred Chinese/Russian hackers who will be hacking away at this, so I'm sure there are much smarter approaches than the one I outline, but the naive strategy I'd try before anything else looks like this:

Take a research question that you're fairly sure will trip the safeguard, and present it to the model without any mitigating factors. Maybe do this a few hundred times with different phrasing. This is your baseline. This is what 'degraded' research work looks like.
Now, throw the standard suite of jailbreaking and obfuscation tricks at the model. You can treat anything not meaningfully better than what you saw in your baseline results as if it were a rejection. If you're feeling ambitious, pick a problem that can be easily evaluated quantitatively, where your internal labs have established conclusively that a better score than what's publicly available can be obtained.
This strategy falters if Fable's best work isn't meaningfully better than what you get with the sandbagging approach, but that'd make the entire problem irrelevant.

Someone more dedicated to this than me could incorporate priors on what degraded results would look like, or emulate the sandbagging setup Anthropic built on one of their own models and use that to whitebox a strategy.

If Fable's performance looks like "X% of the time it has a brilliant breakthrough, the rest of the time it's workmanlike, sandbagging sets the brilliance rate to 0" then for your naive strategy, explicit refusals reveal more information about which jailbreaks work than silent sandbagging.

If the attackers have infinite bandwidth then of course they overcome this setback by simply increasing the volume of API calls, but there are only so many suspicious calls one can make before the attack constitutes a pattern Anthropic can recognize and target more aggressively.

If its best work only shows X percent of the time even without the sandbagging applied, then anyone making use of it who wants its best work will make a similarly large set of calls. All this extra assumption does is make an 'attacker' harder to detect amidst a sea of similar-looking noise from legitimate users.

Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.

I understood the point of this sentence to be that there are by definition no legitimate users doing AI development with Fable. I am left confused as to what you are imagining when you suggest legitimate users generating noise that covers up Russians trying to jailbreak Claude into doing AI development.

I'm a legitimate user who wants to do X legitimate task. Under normal conditions (the model's best performance is always reached when not sandbagging), I ask it once and get my output. Not noisy. Under the conditions specified in the post above mine (the model's best performance, even when not sandbagging, is only reached K percent of the time, and okay-but-not-best performance is reached the rest of the time), I will rerun my query 100/K times (depending on what kind of certainty I want) and take the best answer I get. Now I look a lot more like a malicious attacker trying to get baseline data for a jailbreak than I otherwise would.
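The "100/K" figure is the expected number of calls until the first best-quality answer; the number needed for a given level of certainty follows from a geometric-distribution argument. Here is a minimal sketch, assuming each call independently reaches best performance with probability K/100 (an idealization; the helper name `calls_needed` is hypothetical):

```python
import math


def calls_needed(k_percent: float, confidence: float) -> int:
    """Smallest n such that 1 - (1 - p)**n >= confidence, with p = k_percent / 100.

    Assumes every call is an independent trial with the same success rate.
    """
    if not 0 < k_percent <= 100:
        raise ValueError("k_percent must be in (0, 100]")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    p = k_percent / 100
    if p == 1.0:
        return 1  # every call succeeds, so one call is enough
    # Solve (1 - p)**n <= 1 - confidence for n and round up.
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))


if __name__ == "__main__":
    for k in (1, 10, 50):
        print(f"K={k}%: expected calls {100 / k:.0f}, "
              f"calls for 95% certainty {calls_needed(k, 0.95)}")
```

For K = 10, the expected number of calls is 10, but 95% certainty of seeing at least one best-quality answer takes 29 calls under this model, so a careful legitimate user would issue roughly three times the "100/K" volume.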

Also, external third parties can no longer verify the "official" benchmark numbers. If someone finds that the numbers are actually weaker than reported, the easy go-to response from Anthropic can be "we nerfed the session because of risks," especially when testing against SOTA coding tasks that heavily involve "frontier AI research tasks." So now we are closer to "just trust me bro" territory.

One would think it would be very easy to do this kind of thing completely silently, that is, without disclosing it in the system card. After all, the model is still trying to be helpful, according to the system card, just not maximally helpful. So it would not be super easy to detect.

One might want to ponder the reasons for the explicit mention of this policy in the system card.

(I would conjecture that one of the reasons is to send a message to other labs, such as OpenAI and Google DeepMind: “do this as well with your next more advanced generation of models; too many people are trying to do RSI projects these days, stop helping those projects too effectively, you don’t want them to progress too fast with those RSI efforts”. It’s rather annoying, but it is true that too many orgs are launching RSI projects these days without disclosing much about the safety side of those efforts. And they are heavily relying on research and coding help from the existing AI systems.)

Or, it's very easy to hide this without even mentioning it in the system card, but it's so dishonest that some of their own researchers will be alienated, and if they ever do get whistleblown it'll harm their reputation 10 times more than if they just admitted it to begin with.

Yes, reputational risk would be high, especially with Anthropic being “the good guys” compared to other labs, so they have more to lose reputation-wise.

Yet, the conversation around labs supposedly “nerfing” their models (mostly as they try to cut inference costs) is a constant thing (and I think people do expect this “nerfing phenomenon” to be true to some extent, and to be the reason for general performance being somewhat “zig-zag”, despite the general upward trend).

And with this “expectation of occasional nerfing”, I don’t know if such “nerfing” being application area-specific would make it much worse. Sure, in the absence of constant complaints about “nerfing”, the risks would be too high. But with so many people believing that “nerfing” is a strong factor of the overall picture already, I honestly don’t know…

Yeah AI labs already do so many questionable things, or get suspected of doing so many things, that they aren't all that afraid of bad press.

That reduces their incentive to be honest, but also reduces their incentive to hide things.

You may not access or use, or help another person to access or use, our Services in the following ways: … 2. To develop any products or services that compete with our Services, including to develop or train any artificial intelligence or machine learning algorithms or models or resell the Services.

Nobody tell Anthropic I've been getting Claude Code to help me fine-tune Llama 70B...

In all seriousness, I dislike contracts with lines like this. I would prefer a line like "(If you are doing AI engineering with Claude) we reserve the right to ban you for any reason if we get bad vibes." because at least it's honest, and functionally these things mean the same thing.

I eagerly await silent, active sabotage as doctrine for various request types. Hopefully they will not skip directly to undisclosed, silent, active sabotage; but given the reception for this, they might.

I, frustratingly, ran into the "cybersecurity" knock-down while asking about debugging. This was a completely normal conversation (normal in my view) about how to build a harness for a bug taxonomy ratchet. I asked my question, it said it was a cybersecurity violation and it was kicking me to Opus. I opened a new chat, told it what happened, and Fable apologized for the false flagging and had the conversation with me. Weird that it triggered the first time, but not the second. Did the classifier notice my frustration?

Opus actually said it had no idea what a bug taxonomy ratchet was, whereas in the second conversation Fable knew exactly what I was trying to do. It helped me build the harness and has been going over my 200k line project, finding a ton of great bugs. (I wanted to use Fable to smoke out as much stuff as I could while it was still covered under the max plan.)

Having a question rejected was very jarring and ... kind of offensive? I was surprised by my own reaction to it.

I hit the guardrail by asking a question about how water level task results looked when you controlled for participant IQ. My best guess is that psychometrics are kind of like biology, which is a no-no topic. The filters are so oversensitive right now that I can't really be offended; it's clearly just implemented with no regard for false positives.