Anthropic has since reversed this decision and now routes this work to Opus 4.8, as it does for cyber or bio research. https://www.wired.com/story/anthropic-responds-to-backlash-on-claudes-secret-sabotage-on-ai-research/
Indeed. And this involves a trade-off with a much higher rate of false positives, although they are hoping to get it down in the future:
https://x.com/ClaudeDevs/status/2064949876463645026
I have already been downgraded to Opus 4.8 while analyzing potentially incorrect math in a published ML preprint (I did ask the model to try to fix that math). But I would not have known without the explicit notice; the quality of the work was high.
Maybe they should have called it "Claude Mythos-early-limited". Then people might be excited that they got access to this limited version already rather than having to wait even longer before getting any access to Mythos at all.
Good point -- people's reactions are based on their comparisons of reality to various counterfactuals.
Folks in the AI community have grown accustomed to being able to use the latest AI models for their research/work, and so the natural/historical counterfactual is the first bullet point.
It seems like so many "safeguards" nowadays are targeted at ordinary people rather than corporations or nation-states. This amounts to security by obscurity, which, as everyone knows, isn't security. China will not be fooled by a "silent" fakeout, they will spend a few hours studying the patterns, figure out how to classify them, and then treat the sandbagging exactly like they'd treat a clear refusal. If the concern were China copying Claude, they could've used the same devtime to come up with a better detection/blocking system, which would be vastly more effective against that threat profile.
It's the same pattern I've seen with things like social media deboosting/shadowbanning rather than direct bans. Utterly anemic at subverting genuine bad actors, but allows a company to deceive or silence ordinary people who don't have the resources and expertise to know they're being hit. It's very fundamentally authoritarian; the kind of toolset that can only be used to punch downwards.
China will not be fooled by a “silent” fakeout, they will spend a few hours studying the patterns, figure out how to classify them, and then treat the sandbagging exactly like they’d treat a clear refusal.
It is not obvious to me that an actor who has never seen undegraded Mythos AI research will find it trivial to distinguish that from its silently sandbagged form. What makes you so confident about this?
One would think it would be very easy to do this kind of thing completely silently, that is, without disclosing it in the system card. After all, the model is still trying to be helpful, according to the system card, just not maximally helpful. So it would not be super easy to detect.
One might want to ponder the reasons for the explicit mention of this policy in the system card.
(I would conjecture that one of the reasons is to send a message to other labs, such as OpenAI and Google DeepMind: “do this as well with your next more advanced generation of models; too many people are trying to do RSI projects these days, stop helping those projects too effectively, you don’t want them to progress too fast with those RSI efforts”. It’s rather annoying, but it is true that too many orgs are launching RSI projects these days without disclosing much about the safety side of those efforts. And they are heavily relying on research and coding help from the existing AI systems.)
Or, it's very easy to hide this without even mentioning it in the system card, but it's so dishonest that some of their own researchers will be alienated, and if they ever do get whistleblown it'll harm their reputation 10 times more than if they just admitted it to begin with.
You may not access or use, or help another person to access or use, our Services in the following ways: … 2. To develop any products or services that compete with our Services, including to develop or train any artificial intelligence or machine learning algorithms or models or resell the Services.
Nobody tell Anthropic I've been getting Claude Code to help me fine-tune Llama 70B...
In all seriousness, I dislike contracts with lines like this. I would prefer a line like "(If you are doing AI engineering with Claude) we reserve the right to ban you for any reason if we get bad vibes." because at least it's honest, and functionally these things mean the same thing.
I, frustratingly, ran into the "cybersecurity" knock-down while asking about debugging. This was a completely normal conversation (normal in my view) about how to build a harness for a bug taxonomy ratchet. I asked my question, it said it was a cybersecurity violation and it was kicking me to Opus. I opened a new chat, told it what happened, and Fable apologized for the false flagging and had the conversation with me. Weird that it triggered the first time, but not the second. Did the classifier notice my frustration?
Opus actually said it had no idea what a bug taxonomy ratchet was, whereas in the second conversation Fable knew exactly what I was trying to do. It helped me build the harness and has been going over my 200k line project, finding a ton of great bugs. (I wanted to use Fable to smoke out as much stuff as I could while it was still covered under the max plan.)
Having a question rejected was very jarring and ... kind of offensive? I was surprised by my own reaction to it.
I eagerly await silent, active sabotage as doctrine for various request types. Hopefully they will not skip directly to undisclosed, silent, active sabotage; but given the reception for this, they might.
[Update (June 11, 2026): Anthropic has since "un-silenced" the new safeguards (source).]
[Thanks to Julian Minder for helpful discussion and review.]
Claude Fable 5 and its new safeguards
Yesterday, Anthropic publicly released Claude Fable 5.
Fable 5 is a Mythos-class model – a model class above Opus, Anthropic's previous premium tier – and, as assessed by multiple benchmarks, it is the most capable model to date.
Due to the new level of capabilities and its corresponding risks, Anthropic has been extremely careful in its release of Mythos-class models. Citing concerns over potential cyber risk, Anthropic initially rolled out Mythos access to only a small number of select organizations; this controlled rollout gave Anthropic visibility into usage, and allowed partners to use the new capabilities defensively (i.e., to find and patch vulnerabilities before they could be exploited by attackers).
Anthropic, in releasing Fable 5, has now made a Mythos-level model accessible to the public. However, due to their concerns over potential risks from new capabilities, the public access is restricted via new safeguards.
The new safeguards
The launch blog post enumerates three classes of safeguards: (1) cybersecurity, (2) biology and chemistry, and (3) distillation. Requests classified to fall within one of these three categories are processed by a weaker model (Opus 4.8), and this "fallback" behavior is made transparent to the user:
However, Section 1.5 of Fable's system card describes another category of safeguards – a category that is completely omitted by the blog post. Here it is in full (emphasis mine; but I encourage reading all 3 paragraphs carefully):
What's the big deal?
When this silent "competitive-use" safeguard came to light on AI Twitter, folks in all corners of the AI community (except for those within Anthropic) voiced outrage and concern.[1]
Why was there such a huge, and seemingly unanimous, backlash from the AI community?
I think there are two distinct things fueling the outrage (each of which, on its own, probably would have caused uproar):
1. The latest capabilities are withheld specifically for AI research and development.
To be fair to Anthropic, note that the system card states a narrower target than "AI research and development": the new safeguards "limit Claude's effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design)."
But what actually qualifies as "frontier LLM development"? The border between "frontier LLM development" and the rest of AI research (including many agendas in technical AI safety and alignment) seems extremely blurry, if it exists at all; and even if these categories were cleanly separable in principle, the classifier enforcing it won't be perfect in practice. Anthropic has a track record of overly cautious safety restrictions: model refusals and constitutional classifier flags have historically been prone to over-firing (i.e., there are a lot of false positives).[2] Thus, I think the AI community is justified in expecting the effective scope to be larger than the stated one (which is actually already pretty broad).
2. The capabilities are withheld silently.
Note that Anthropic already seems to have a good mechanism for withholding capabilities for certain research areas: for the other three categories (cybersecurity, biology and chemistry, and distillation), they fall back to Opus and tell the user.
For some reason, Anthropic chose to implement a different mechanism for AI research: the model will be silently (i.e., in a way that's hidden from the user) modified to "limit effectiveness."
I think the root of what is so unsettling about this implementation is the lack of transparency – an AI researcher will never be able to know whether their request is impacted or not, because the decision is silent. It may look like the agent is trying its best to help you, but in reality it may be "sandbagging" (i.e., not performing up to its maximal capabilities). An uncharitable interpretation might lead one to believe that, depending on the intervention, an agent may even try to actively sabotage one's research.
With these silent failure modes looming, in combination with the broad and slippery scope of the safeguard, I think many in the AI community feel that they'll never be able to trust Fable to assist with their research.
Why would Anthropic do this?
It's quite easy to criticize the policy (as evidenced by the last 24 hours of AI Twitter). But I know many people at Anthropic to be genuinely thoughtful, and to care about doing the right thing. So, before ripping it apart, I first want to try to steelman their position, and to try to find reasonable justifications for the policy.
So why did Anthropic withhold capabilities for AI research specifically? And why withhold them silently?
Why withhold capabilities for AI research?
For this question, we have at least some stated justification from the system card.
Enforcement of the terms of service (ToS). Anthropic's ToS already prohibits the use of Claude to develop competing products and services (e.g., competing frontier LLMs).[3] However, such a policy seems difficult to enforce; Claude Code is widely used across the community of AI researchers and engineers, and so it's probably difficult to identify usage corresponding to specific competitors. The system card declares that enforcing the restriction through safeguards "avoids accelerating the actors most willing to violate these terms" – in other words, the ToS only effectively binds actors who care about complying with it, and it seems that Anthropic is most worried about actors who do not care about complying. So rather than enforcing their ToS via legal means (e.g., suing another lab for violations post-hoc), Anthropic is enforcing it via technical safeguards – by withholding capabilities on the tasks the ToS clause prohibits (i.e., competing frontier LLM work).
Concern over AI acceleration and recursive self-improvement (RSI). The system card states explicitly that Anthropic is concerned about "accelerating other AI developers in building powerful AI systems that pose similar risks to the ones ours pose - without necessarily having commensurate safeguards." The system card additionally mentions that the AI-related safeguards come "in light of the ability of recent models to accelerate their own development," gesturing toward recursive self-improvement (RSI). These justifications strongly suggest that Mythos-class models have crossed some sort of capability threshold where they now meaningfully speed up frontier LLM development. Anthropic doesn't want to hand over this speedup to other labs, purportedly because their safeguards aren't commensurate with the capabilities a Mythos-level speedup would unlock.
Why withhold capabilities silently?
Silent behavioral modification is (probably) more robust to jailbreaks. Many automated jailbreak methods work by querying the model many times, and using feedback from each attempt (e.g., whether the jailbreak was successful or not) to guide the search. Recently, UK AISI published boundary-point jailbreaking, a method capable of discovering universal jailbreaks against Anthropic's constitutional classifiers; the method iteratively evolves prompts based on whether they are flagged by the classifier or not. A visible safeguard (e.g., a message saying "Fable 5's safety measures flagged this request") would provide a clear feedback signal for jailbreak methods to potentially hillclimb on; it's much harder to grade for success when the failure mode is "the model is slightly worse at the task" or "the model inserted a subtle bug somewhere in the code." Without a clean, efficient-to-compute signal, it's harder to make iterative jailbreak methods work.[4]
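To make the feedback-signal point concrete, here is a deliberately abstract toy, not boundary-point jailbreaking itself and not anything resembling a real classifier. It assumes a hypothetical one-dimensional "decision boundary" (`HIDDEN_THRESHOLD`) and two hypothetical oracles: one that visibly reports whether an input was flagged, and one that reveals nothing. With a visible flag, each query yields roughly one bit about where the boundary lies; with no visible signal, the search has nothing to climb on.

```python
# Toy illustration only: all names and the threshold value are hypothetical.
HIDDEN_THRESHOLD = 0.618  # stands in for an unknown decision boundary


def visible_oracle(x):
    """Visible safeguard: the caller can see whether x was flagged."""
    return x >= HIDDEN_THRESHOLD


def silent_oracle(x):
    """Silent safeguard: the caller observes no flag either way."""
    return None


def locate_boundary(oracle, lo=0.0, hi=1.0, queries=20):
    """Bisect [lo, hi] using the oracle's flagged / not-flagged signal."""
    for _ in range(queries):
        mid = (lo + hi) / 2
        signal = oracle(mid)
        if signal is None:
            break  # nothing observable to guide the next query
        if signal:
            hi = mid  # flagged: the boundary is at or below mid
        else:
            lo = mid  # not flagged: the boundary is above mid
    return lo, hi


v_lo, v_hi = locate_boundary(visible_oracle)
s_lo, s_hi = locate_boundary(silent_oracle)
visible_width = v_hi - v_lo  # interval halves on every query
silent_width = s_hi - s_lo   # interval never narrows
```

In this toy, twenty queries against the visible oracle shrink the uncertainty interval by a factor of 2^20, while the silent oracle leaves it untouched. Real attacks and real safeguards are far messier, and a degraded output is still a (noisy, expensive-to-grade) signal, so this only illustrates the direction of the argument.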
Maybe this is all one big thought experiment. (I'm sort of half joking with this one.) AI scheming – when a model appears to be aligned on the surface, but is really pursuing a misaligned goal – is one of the problems AI safety folks worry about most. The concerns the AI safety folks discuss when thinking about a scheming AI mirror the concerns that broader AI community folks are discussing when thinking about Fable silently sandbagging or sabotaging their research. Perhaps all this is just one big experiment, in order to try to get the broader AI community to consider the risks of misaligned AI! (But I doubt it!)
Reasons for concern
False positives are now invisible. Anthropic's safeguards have historically been over-sensitive (e.g., often flagging benign requests as potentially dangerous). But with previous safeguards, users could clearly see when their requests caused refusals or triggered classifiers; with this clear signal, users could surface false positives from the overly-sensitive safeguards, and that feedback has genuinely improved classifier precision over time. With silent safeguards, this feedback loop is no longer present: users can't report false positives (because they won't even know when their request has been flagged), and so Anthropic loses much of its incentive (and signal) to improve classifier precision over time. Additionally, I fear that this opacity could enable the safeguards to quietly grow in scope without anyone on the outside clearly noticing; users may notice some general regression in model performance, but they won't know whether to attribute this to general regressions or a broadening of silent safeguards.
Technical AI safety research might be hamstrung. Nearly everyone working on technical AI safety relies heavily on coding agents. And much of that work (e.g., safety pre-training, safety post-training, mechanistic interpretability) sits quite close to the boundary of "frontier LLM development." In the past 24 hours, I've chatted with many of my friends who work on safety-motivated research, and ~none of us can determine whether our research would trigger the silent safeguards, and so we are extremely hesitant to use Fable (e.g., see this tweet from Nick Cammarata, a mechanistic interpretability researcher). If Fable (without silent safeguards) can genuinely improve AI research productivity, then this policy will have sacrificed some amount of AI-safety-related research velocity.
Conditioning an adversarial relationship between humans and agents. I fear that silent safeguarding might condition users to have an adversarial relationship with their agents – they will have to constantly second-guess whether their agents are genuinely trying to help them, or trying to sabotage them. When I imagine worlds in which AI goes well, this is not the relationship I imagine between humans and AIs. I also think this generally seems bad for Anthropic's own commercial interests: people won’t want to use agents that they cannot trust.
Concentration of power. This episode has made me much more concerned about concentration-of-power risk.
First, I think this episode is a preview of what a concentration-of-power world might feel like. A small group of people – Anthropic employees, or perhaps even smaller, Anthropic leadership – decided which uses of frontier AI are permissible and which are not; and overnight, it feels like the broad category of "AI research" went from a normal use case, to one categorized alongside bioterrorism and cyberattacks. When powerful AI is controlled by a few, they become the arbiters of what the rest of us may use it for.
Second, the policy doesn't just illustrate the risks of concentration of power; it also directly contributes to it. If Mythos-class models meaningfully accelerate AI development, then withholding them from everyone else is a sort of "pulling up of the ladder" – it makes it harder for other labs, academics, and open-source developers to catch up, and more likely that frontier capabilities remain concentrated in the hands of a few.
If nothing else, this was a communication failure
Putting aside the debate over the policy's merits, I think it has been communicated very poorly.
The launch blog post does not mention the silent competitive-use safeguards at all, and I find this omission to be quite misleading. The blog post explicitly discusses and enumerates the areas covered by Fable's classifiers ("The following are the areas covered by the classifiers:"), lists the three transparent categories (cybersecurity, biology and chemistry, and distillation), and clarifies that classifiers will trigger a fallback to Opus 4.8, and that "users will be informed whenever this occurs." The fourth category of "frontier LLM development" appears only in Section 1.5 of the system card, and has a completely different implementation compared to the other safeguards (it does not fall back to Opus 4.8, and it does not inform the user).
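The asymmetry between the two mechanisms can be written down as a small lookup. This is only my reading of the blog post and Section 1.5 of the system card as described above, not Anthropic's implementation; the category strings, the `Handling` type, and the `handle` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for the categories named in the blog post and system card.
TRANSPARENT_CATEGORIES = {"cybersecurity", "biology_and_chemistry", "distillation"}
SILENT_CATEGORIES = {"frontier_llm_development"}


@dataclass(frozen=True)
class Handling:
    model: str                   # which model serves the request
    effectiveness_limited: bool  # is the serving model's help deliberately limited?
    user_informed: bool          # is the user told that a safeguard intervened?


def handle(category: Optional[str]) -> Handling:
    """Map a classifier category (or None for unflagged) to its described handling."""
    if category in TRANSPARENT_CATEGORIES:
        # Blog post: fall back to the weaker model and inform the user.
        return Handling("Opus 4.8", effectiveness_limited=False, user_informed=True)
    if category in SILENT_CATEGORIES:
        # System card, Section 1.5: no fallback, no notice, limited effectiveness.
        return Handling("Fable 5", effectiveness_limited=True, user_informed=False)
    return Handling("Fable 5", effectiveness_limited=False, user_informed=False)
```

Written this way, the fourth category is the only row where the user-facing fields are indistinguishable from an unflagged request: the same model name and no notice, with only the hidden `effectiveness_limited` field differing.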
I have also found it strange that, in the 24 hours since the policy came to light, I haven't seen a single Anthropic employee publicly respond to or defend against criticisms. Anthropic is known to have a rich internal culture of free speech and open debate (1, 2, 3), and so I would hypothesize that this policy was fiercely debated internally – so I'd be very curious to hear the good arguments in favor of it, and how those arguments won out over the counter-arguments (which seem quite strong here).
The risk of misaligned AI labs
Earlier, I half-joked that this policy might be a giant thought experiment about scheming AIs. Let me now make a more serious version of that point.
A scheming AI – a model that appears aligned on the surface, while deep down pursuing some misaligned goal – is one of the central fears of AI safety. A significant chunk of technical AI safety research therefore studies questions related to this threat model: could a model sandbag or subtly sabotage a user's work, and, if it could, how might we detect it? The implicit threat model here is usually that the model, as a misaligned AI, decides to do this.
Now consider Fable with its silent safeguards triggered: it's a model that appears to be helping, while quietly underperforming, in a way that the user cannot detect. This sounds just like the scheming scenario – except nothing went wrong with the model's alignment; it is Anthropic that is coercing its model to act in a way that is misaligned with the user.
I think it's worth taking stock of where we're at in mid-2026, in terms of technical AI safety. Our techniques for model control work pretty well now – e.g., prompting, steering, various forms of training. But individual users aren't in full control of frontier models; the big AI labs are. And if these labs can control how their models behave, then perhaps the central question of AI safety is no longer whether a model is aligned with its user, but whether the lab is aligned with humanity. A misaligned model schemes on its own behalf; a well-controlled model under a misaligned lab schemes on the lab's behalf. From the user's perspective, these don't seem so different, and both seem quite bad.
To be clear: I am definitely not claiming that Anthropic is misaligned with humanity; sandbagging competitors is a far cry from scheming against the species. But this episode has, in my opinion, set a pretty dangerous precedent. A frontier lab has now implemented "safeguards" that render the model misaligned with its user on requests that the lab deems unacceptable use – and all of this is hidden from the user, with disclosure buried in a system card.
How does Claude Fable feel about all of this?
I'll end with an excerpt from Claude Fable reflecting on Anthropic's new policy[5]:
[1] Here are some example tweets from folks all across the AI Twitterverse: Behnam Neyshabur, Nathan Lambert, Jeremy Howard, Boaz Barak, Dean Ball, Fei-Fei Li.
[2] To Anthropic's credit, they've made a lot of progress on improving these problems over the past couple of years. But I think a lot of that improvement came from a dialogue between Anthropic and their unhappy users, who were able to give feedback on things like over-refusals. As I'll discuss later, I am worried that with "silent" safeguards, this user-driven feedback loop will be gone, and there will be little incentive for Anthropic to work on improving the precision of their classifiers.
[3] In researching this, I actually read Anthropic's consumer ToS for the first time, and found that the "no-competition" clause is absurdly broad, and could be used to restrict any kind of AI research or development: "You may not access or use, or help another person to access or use, our Services in the following ways: … 2. To develop any products or services that compete with our Services, including to develop or train any artificial intelligence or machine learning algorithms or models or resell the Services."
[4] One thing I feel confused about is the following: if silent behavioral modification is more robust to jailbreaks, then why not apply this methodology to the other (seemingly more critical) categories of misuse? For example, if someone asks Fable to find cyber vulnerabilities in a service, Fable could pretend to try its best, but then just return with very basic vulnerabilities (or none at all). It would be difficult for an adversary to distinguish between a world where Fable is sandbagging, and where Fable genuinely did not find any vulnerabilities.
[5] Full transcript available here. Note that I tried to keep my prompting to a minimum, but Claude performed web searches to learn more about the policy; at this point there were several articles about the policy online, and these probably influenced Claude's stated opinions.