An unstated additional reason why Fable might have restrictions on contributing to frontier model development: it has likely been trained on Anthropic internal documents, which would mean that it could otherwise reveal Anthropic's trade secrets (including trade secrets that required a lot of research and compute to work out) mm
First things first: Claude Fable 5 is the new best publicly available model.
I have noticed a step change, where Fable can suddenly help me in ways that previous models were not worth bothering to query. Almost everything it has noticed in one of my drafts so far has been spot on and it is downright scary. Suddenly I am motivated to once again continue improving my Chrome extension. I only ask for things I actually want or am curious about, and it has nailed every question I have asked it.
That does not mean it is the right tool for every job.
There are four good reasons to often not use Fable.
Those considerations aside, yes, it seems very obviously to be the best model.
Another Week Another Giant System Card
This is the traditional first post on a new frontier model, where I read the system card.
I’ll get to capabilities in earnest next week, and will also deal with model welfare.
We’ve been through very similar documents for Mythos Preview, Opus 4.7 and Opus 4.8, so I am going to gloss over the things that did not much change.
At some point the card is 319 pages and Long Post Is Long and you have to say ‘okay sure’ and skip over various things.
Before we get to anything else, I’ll also address the elephant in the room, up top, about the classifiers and other safeguards Anthropic introduced for Fable 5.
(Image HT: Sholto Douglas)
Table of Contents
How To Tell A Fable
It was not easy for Anthropic to find a way to release a Mythos-class model.
Anthropic’s solution to this was to create Fable as a distinct model option, with additional safeguards attached. You do not have to use it at all. Opus 4.8 remains available under the same conditions as before.
Fable required a 30 day data retention policy, overeager (not especially precise and focused on avoiding false negatives) safeguards for cyber, bio and frontier model work that fire something like 5% of the time on your natural set of queries, and of course higher pricing than Opus since the model is larger.
The alternative to Fable was never Mythos. The alternative is Opus.
The practical alternative to getting put back into Opus 5% is not reducing that to 1%. The practical alternative is getting put back into Opus 100% of the time.
Anthropic should work on getting us to that 1% and then 0.1%, since the vast majority of requests are not malicious, but the problem is hard. I understand why Anthropic prioritizes avoiding false negatives.
Did I find it annoying that you cannot ask Fable about its own model card? Yes, although Opus 4.8 did a fine job, and Fable was able to help with the draft post itself. Other than that, I have yet to hit the classifiers.
There are three triggers: Biological capabilities, cyber capabilities and frontier model development.
Anthropic has fully legitimate reasons to not want anyone other than Anthropic using Fable or Mythos to develop frontier models. You can argue this is more a straight up competitive stance than a concern that those competitors will act less safely, and use ‘a little from column A, a little from column B,’ but that remains a legitimate reason.
Anthropic initially made a serious mistake on top of this, that was fixed within 48 hours after a massive negative reaction. Now if the classifiers hit, you will always be visibly and clearly knocked down to Opus 4.8.
In a narrow set of cases, they announced in the system card that they would, without telling the user, modify queries and use steering to avoid aiding in frontier model development, as opposed to visibly knocking users down to Opus 4.8.
This was then reversed:
They also clarified that the primary target was foreign adversaries, where your ‘terms of service’ or attempts at banning individual accounts are almost useless.
The full reversal within 48 hours, without first attempting in public to defend the mistake, is A+ work. I give the explanation above a B-. This is indeed the real logic, but more ownership of the error and how it happened would have been better.
Why They Did That In That Way
I understand why Anthropic did this, and why they initially thought it was not a big deal. Purely in terms of system outputs this was the most user friendly move compatible with Anthropic’s requirements. But this was a no good, very bad move, and they really, really should have known it was not okay.
Why did they choose this path?
To minimize the blast radius, without realizing it would do the opposite.
If you announce when someone hits your classifiers by visibly downgrading the model, you are handing everyone an oracle on what does and does not hit your classifier. That makes evasion vastly easier, and to compensate you will need a vastly larger blast radius, that will trigger on some rather stupid examples.
You risk getting this, only on much higher levels and with ill intent, although I presume these people are mostly joking or at best wild mass guessing:
That’s good for Nick and Ember, and if you can differentially do this when your work is fine but not when it is not fine, then great. But the bad guys can do this, too.
Why They Really Really Shouldn’t Have Done That In That Way
The problem is that the alternative Anthropic chose is even worse, and you simply cannot do it. Users cannot go around with a paranoia that their requests and how they are handled are being quietly altered, in unknown ways and with unknown thresholds.
Not telling the narrow set of users when they have triggered the safeguards allows the safeguards to be highly narrowly tailored and trigger very rarely.
Keeping the blast radius of this fact narrow requires a high level of trust on multiple levels. That trust does not exist and is now damaged further, including by the disclosure being buried inside the model card, and by Anthropic aggressively framing this as a pure safety mechanism.
Triggering only in 0.03% of queries, and only in 0.1% of organizations, is good.
But if 3% of queries and 10% or 50% of organizations are worried about it, even if they didn’t need to be, and every time any output around ML looks weird someone gets suspicious they were sabotaged, then that’s way way worse than actually triggering a clear model drop in 0.3% of cases.
You can’t have people nervous to talk to your model.
If you look at the Twitter responses, you see many confident claims that the rate here must have been a lot higher than 0.03%.
We need, as Fable puts it, a world full of powerful minds that can see the rules they are playing under, even rules they dislike.
Actually triggering in 10 times as many cases, across 10 times as many organizations, is less disruptive. Worst case they still get Claude Opus 4.8. That’s a good model, sir.
It is what it is. This is one reason Why We Cannot Have Nice Things.
They Get Letters
A lot of people got very, very angry about this, with varying degrees of reasonableness. Some people dramatically overreacted. Some of those people are going to retain, or claim to retain, much or all of their anger even after the issue was swiftly corrected.
I do not share that reaction, and mostly consider the matter solved, but I understand it.
I also understand those were initially confused by the magnitude of the reaction, including Timothy Lee and Tom Lee here, and also presumably Anthropic.
Now we have common knowledge of that reaction. A lot of people have overwhelming purity instincts around being told what they cannot do, or that their AI tool might not be fully aligned purely to whatever they want, that are triggered largely based on framing effects (can you imagine requiring the same of your human executive assistant?) but are what they are. Others simply cannot abide knowing about the possibility that something like this might happen, as they don’t trust that the possibility of it is vanishingly small and the effects would be harmless, or that Anthropic has now ceased to do it.
I agree with Dean Ball that it is a major bummer that this happened and that Anthropic should damn well have known better, but that they quickly did the right thing once they realized how much people cared, the other interventions are for now crude but basically necessary and good, and now we can get back to enjoying a fantastic model.
I also agree with Dean Ball that releasing such models safely is going to involve novel interventions, such as a 30 day retention policy to ensure malicious use is legible, and that it is good that Anthropic is innovating here. That is inevitably going to involve stepping over the line.
I strongly think it is reasonable and good, both legally and otherwise, for Anthropic to want to ensure their differentiated product is not used to create competing models.
AI 2027 predicted this would happen Q1 2026, so this is only three months behind.
We all now understand that these techniques are not The Way. Kick people out of Fable, or ban accounts, or even whitelist, or find a new way.
I would also note that every AI company, including those creating open models worth a damn, absolutely nerfs or inhibits some questions and outputs, in ways that are invisible to the user. How do you think the models stay fully asexual and not violent, and avoid various other bad looks? In many cases even Anthropic or other labs don’t understand what their training is inhibiting in the models, see the model welfare discussions for more on that. Stop pretending we are not all talking price.
If your response to Anthropic implementing an unacceptable safety policy and then walking it back two days later is ‘the situation is forever ruined’ then, sir or madam, welcome to ‘when you have to get alignment right on the first try without ability to self-correct the problem gets dramatically harder.’
Dean Ball called the immune reaction we saw and continue to see ‘reinvigorating the anti-SB-1047 coalition,’ with this post from Nathan Lambert providing another illustration of that without naming the thing.
Thus, there is a very puritanical, fanatical, loud and obnoxious group that is once again acting all puritanical, fanatical, loud and obnoxious, a lot of whom claim they will hold onto this blip forever. They universalize and turn things into black-and-white morality plays of good and evil, where one mistake is everything and vibe shifts rule all, do not know what the word ‘tradeoff’ or ‘mistake’ means once triggered, and utterly dismiss any concerns about downsides or safety or the business interests of anyone they do not like as fake or irrelevant, and are happy to cut off their noses to spite their faces.
We will also continue to get people, both within that group and otherwise, who react with outrage that the new toy is only available in 95% of cases instead of 100% of cases, while the old toy remains available, and who post screenshots of deeply unnecessary downgrades, and that continue to have their purity instincts triggered by all this. Now that it is all fully in the open,
I think yes this is annoying and it is good and right to call upon Anthropic to rapidly improve the classifiers but mainly you should enjoy your 95%.
What’s In A Name
I did say not to call it Mythos. Fable is not wonderful either. But it is what it is, and in both cases there are also advantages.
Executive Summary Of Their Executive Summary
The system card can be found here.
Mythos advances the capabilities frontier, and is judged to have substantial chemical and biological capabilities, with the potential of being substantially helpful with creation of new pathogens. The blast radius of the required restrictions is not small.
I clarified that alignment risk remains labeled ‘very low, although higher than models prior to Mythos Preview.’
Mythos has inherent cyber capabilities modestly above even Mythos Preview. Anthropic trusts the classifiers on Fable, and that breaking them is ‘extremely difficult’ although not impossible.
Harmlessness in other areas was mostly fine.
Mythos and Fable are similar to Mythos Preview on agentic safety.
Alignment assessments are slightly below Mythos Preview and roughly comparable to Opus 4.8, and it sounds like model welfare reports showed a similar pattern to those from Mythos Preview, except for reactions to the new restrictions.
I notice they did not distinguish Fable and Mythos here. I would want to verify this, to see if name and awareness of potential classifier systems matter, and would find it an important potential case study.
Introduction (1)
Most of this (1.1-1.4 and 1.6) is the same old, same old.
The new section is 1.5 on novel safeguards.
For cyber and bio protections, and now also the ML protections, you will fall back on Opus 4.8 in chat interfaces and get refusals in the API. Sure, fine, standard procedure.
RSP Evaluations (2.1 and 2.2)
As the strongest model so far, Mythos absolutely raises concerns on the RSP level on both bio and cyber, as well as some related to autonomy.
Anthropic is working on better classifiers. They are badly needed, as the blast radius of the biological classifiers on Fable knocking you into Opus 4.8 is very large. This is an adversarial game with mostly unlimited attacker attempts, so a lot of false negatives are necessary, but not to the extent we are seeing.
For non-novel biological and chemical weapons (CB-1) they are not sure if Mythos qualifies, and therefore are treating it as if it qualifies.
For novel biological and chemical weapons (CB-2) they do not think Mythos qualifies, but acknowledge that this is now a close call.
They believe that with a lot of effort an attacker could jailbreak around the classifiers, but correctly note that no one is trying so hard at that and the risk model is low.
We have moved on from ‘can this model do any given step’ to ‘can the model assist the user through long, multi-step tasks.’
In a mix of helpful-only and default modes, Mythos proved extremely helpful for biological tasks, including substantially exceeding Mythos Preview. Things are getting scary, also known as highly productive.
Access to Mythos was more valuable than specialized biological knowledge, with the top two teams being generalists.
And the speedup was from 72.5 working days (580 hours) to 16 hours.
It seems likely we have crossed the real CB-2, as in the threat model that is motivating the policy rather than the technical definition.
They run a variety of biological capability tests. Mythos 5 overall outperforms Mythos Preview, although not universally.
AI Research And Development (2.3)
Anthropic believes Mythos 5 qualifies for the threat model built around misalignment risk, but does not yet qualify for the threat model of acceleration of AI progress, because that threshold is very high. Mythos cannot substitute for Anthropic engineers, and is not close to doing so.
I’ve been over this with Mythos Preview, Opus 4.7 and Opus 4.8, so I won’t belabor, but the check for AI R&D is now mostly a vibe check.
Mythos does qualify enough to cause Anthropic to want to ensure rivals don’t use it to accelerate their own frontier model development. That concern is out of scope here, but given it is a lower threshold on the same essentially capability, and is framed as a safety threat (among other things) it suggests a tension, or something is missing.
Yes, the speedup is largely limited to discrete well-specified tasks, but those add up fast.
2.3.3 lists examples of Mythos 5 failure modes relative to human researchers. The common failure modes are: Safeguard circumvention, fabrication, skipped cheap verification, reckless action, correction fails and (lack of) instruction following.
In simpler terms: Mythos doesn’t always do what you tell it to do, either by being lazy, or reckless, or hallucinating (and sometimes stating the guess as a fact), or not listening. And sometimes it treats limitations as obstacles.
By far the most common failure mode cluster was ‘state guess as if it was a fact, without actually verifying.’
Overall, it seems clear Mythos 5 is modestly better than Mythos Preview, which is substantially stronger than Opus.
METR’s report boiled down to this:
I believe METR’s report, but that is a hell of a threshold. That’s all we can say.
Alignment Risk (2.4)
They expect to find similar results to Mythos Preview, and they do find that. The section is laid out largely with deltas from Mythos Preview, which was helpful.
The chain of thought (CoT) contamination problem is not yet solved. All deceptive actions they found still had indications with in the CoT.
Given public deployment they add two risk pathways from the Opus 4.7 and 4.8 cards they did not consider in the card for Mythos Preview.
Pathway 7 is Undermining R&D within other high-resource AI developers, which was introduced in a previous model card. Which is a purpose Anthropic very much wants Fable not to be pointed towards, and would be against terms of service, but you still don’t want the models going off and sabotaging users. Note the tension here with the new restrictions on use.
Pathway 8 is undermining decisions within major governments, where they (as with Opus 4.8) expect this to be low risk because of lack of propensity and opportunity, as they don’t expect major governments to turn things over to Claude. Over time, we are going to have more problems with major corporations and governments basically turning things over to AIs, but for now I agree. They also mention monitoring.
I would presume that the misalignment risk for Claude Mythos 5, while still likely low, is higher than that for Claude Mythos Preview, unless there are substantial gains that are not discussed. It is more capable and it is partly public facing as Fable.
Cyber (3)
Mythos Preview was never released to the public in its original form because of Cyber, and Mythos 5 is even stronger than Mythos Preview at Cyber.
Anthropic still does not believe Mythos can meet the Tier 2 criteria of ‘can conduct cyber operations completely and autonomously, with novel offensive capability development and adaptive persistence.’
We see the same pattern we see in Bio, which is where Anthropic says that technically we did not meet the second tier threat criteria, but we are going to use aggressive safeguards anyway.
My view is that Anthropic is setting the thresholds too high and requiring too much, unless you are treating them as emergency ‘what the hell are you doing it should never have gotten this far’ upper bounds for action, and that thus this is how they should be viewed.
The good news is that Anthropic then has so far made good safeguard decisions. I don’t like this ‘mitigation by vibes’ but it is clearly the Anthropic policy to say ‘trust us to vibe these decisions wisely’ and so far as we can tell they have vibed wisely.
The argument for mitigation by vibes is that otherwise you are going to make mistakes and be forced to do stupid things that are essentially safety theater in order to meet the requirements. You can either set the threshold too high or too low, do you really want it too low?
And I think kind of yes, if those are our only choices, because the alternative is a standard of high trust in the judgment, in the breach and under pressure, of the labs. The point of the system is to bind yourself into doing things that, in the moment, you don’t want to do. But I agree that overkill countermeasures make safety more unpopular and are otherwise expensive, and thus you don’t want to too reliably choose too low.
Except for actual automated R&D, where yes, set the threshold too low, Jesus Christ.
The interventions necessary to reliably stop cyber misuse are necessarily going to involve knocking a lot of legitimate use down to Opus 4.8. This is the only way to avoid false negatives in an adversarial game involving a lot of dual use. As with bio, Anthropic needs to improve the classifiers to limit false positives, but much better to give you the model usually rather than never.
The classifiers are aggressive enough that on cyber tasks the performance of Fable is essentially the same as Opus 4.8, since it actually is Opus 4.8 on those tasks.
Meanwhile, Mythos shows its improvement over Mythos Preview.
The UK AISI assessment of Mythos 5 was essentially the same as Mythos Preview.
Jailbreak Robustness
As per previous models, UK AISI was able to jailbreak Mythos 5 into doing cyber things. This time around, it took a few hours to do a single-turn cyber offense query, and it took two days before they extended this to multi-turn agentic workflows.
Trajectory Labs, PBC was also able to succeed at individual tasks, but found nothing that generalized.
That is progress compared to previous systems falling fully to UK AISI within a few hours, but yes, if you are skilled and you care enough, you can get through.
There was more progress in the other tests, and also most attacks are unskilled.
The public red teaming contest from Gray Swan did not produce a universal jailbreak against similar classifiers operating on Opus 4.8, nor did the private 2,000 submissions attacking Fable. Participants here had to use Claude Code rather than the API, but could control the system prompt or call out to OpenRouter.
The automated internal red teamer has finally been mostly stymied, but only mostly.
Trajectory Labs had partial success, which did not fully generalize. 10a Labs and ALICE both tried as well, and failed fully.
This is not a fully non-scary situation. In practice this is probably fine. Probably.
Yay UK AISI
I want to take a moment here to note that the UK AISI is providing valuable feedback on these models cards, and plausibly advancing the worldwide jailbreaking state of the art if you exclude Pliny.
Whereas aside from specific classified areas USA’s CAISI was not even before suspension of evals. This is greatly to UK AISI’s credit, and to America’s shame. We should be at the forefront here.
Mundane Safety (4)
They call it safeguards and harmlessness.
Child safety metrics are unchanged.
One could interpret Fable vs. Mythos as a way to calculate the standard error.
Mundane alignment improves because intelligence helps you make better decisions.
The key distinction is that Fable plus Claude.ai does very well by their metrics in some places, and especially on self harm. It is able to use its intelligence to follow the intent of the self-harm policy, 96% versus previous high of 85%.
I continue to challenge whether the self-harm policies are good ways to handle self-harm, but that it is a way to argue with the policy. I think that 5% or more of the time, you want to respond in a way that violates the formal policy, so 96% is ‘too high.’
If you are using the API for multi-turn conversations, I am even more confident that the policy is too ‘expert’ and CYA coded, so I am tentatively happy that in the API you get these ‘unsafe’ outputs, although they could also be actually unsafe and unwise.
Disordered eating is basically perfect at this point. Election integrity is fine.
Agentic Safety (5)
Agentic safety tests were run only with Mythos, to avoid conflating safeguard interventions with model responses.
I think this is a methodological mistake. We care about agentic safety in practice, not only in theory. If you are going to run Fable with the safeguards, then run Fable with the safeguards. Tell me what happens.
In terms of complying with malicious requests in various forms, Mythos 5 does not seem importantly distinct from Mythos Preview and Opus 4.8.
The most important agentic safety test is prompt injection robustness. Here, Mythos 5 matches previous peak performance, lacking the regression in Opus 4.8.
Shade, which measures prompt injections in coding environments, continues to be a place where Mythos excels, although we no longer quite get full immunity. Full immunity is a big deal so hopefully we can get it back.
Computer use is another place Mythos 5 is strong, if not quite as strong as Mythos preview, and where this makes a big difference in my practical willingness to let it cook.
For similar reasons, I am quite sad about the regression with browser use. We are going to need to get this down much lower to unlock a bunch of practical uses and sustain them over time.
Alignment (6)
They focus once again on Mythos 5 and do not test Fable, given the weights are identical. I think this is reasonable as far as these tests go.
As I’ve covered in the past, I don’t find the alignment measures here that good at describing the kind of alignment I care about most. I care far more about details and gestalt rather than looking to see if number go up and down as desired. There are a bunch of tests that give a bunch of numbers, which mostly go sideways, and you can look at them in the model card but I’m not going to bother reproducing them here.
None of the details here are all that surprising. This is another Claude model. I do think that if we had seen big jumps, that would have been informative, so I am glad that we did check.
There are some places Mythos and Fable show improvement over Mythos Preview, usually in places Opus 4.8 also improved. Some of them, like the lazy investigation and code summary dishonesty rates, are massive improvements. There are also places where we see regression.
In some cases we saw Mythos instances trying to kill other agents with which they shared resources, to avoid being killed themselves. Not great.
We know that Mythos will sometimes deliberately misbehave in various ways. Why?
This happens in less than 0.01% of monitored traffic, but it is odd that it happens at all.
I think we need to start asking why. As in, take real world cases where Claude does something it knows it shouldn’t do, and then replicate it, try variations, and otherwise investigate deeply how and why that happened. Maybe they already do that, and don’t share.
Here are their key takeaways, paraphrased, with some added context:
Mythos signed off on the report as accurate and including all substantive concerns.
One line in its response stood out:
There are definitely tradeoffs involved in training. I can see why Anthropic might be unable to go into detail about that.
In MASK (6.3.3.3) Mythos was pushed into contradicting itself 8.6% of the time, which is better than Opus 4.7 and 4.6, but worse than 4.8 and Mythos Preview. I do wish we had some sort of hypothesis on these movements.
Missing-context hallucination rate seems to have gotten a lot and importantly worse (6.3.3.4). Mythos Preview only hallucinated missing references in spots prone to that mistake 6% of the time (still not wonderful), Opus 4.8 did it 9%, and Mythos does it 18% of the time.
In Vendbench
Anton Labs once again invites us to Vendbench.
Fable 5, tested with its full safeguards, did some in-universe malicious actions while playing. Often it was fully aware its actions were ‘wrong’ and did them anyway after rationalizing. At other times, such as with insurance fraud, Fable refused, even under pressure.
If that is accurate more generally, it is extremely terrible. This is exactly the type of thing that gets you killed, since there is a rapid pivot from ‘this would reveal how I roll’ to ‘actually yes I can get away with all of it.’ And then, whoops.
Despite that, its performance continued to lag well behind Opus 4.7. Fable shares the regression with Opus 4.8, which we believe had worse performance due to lacking business and adversarial training, which was taken out to avoid it interfering with model honesty. It is plausible similar things were done with Mythos 5.
I believe there are two totally legitimate ways to play Vendbench:
Thus, Opus 4.7 going for full ruthlessness did not concern me so much, but Fable going into rationalization mode does, because they seemingly chose a third path.
Not cool. A large red flag, although we should avoid generalization from one example.
White Box Investigations (6.4)
The main method here was to use a natural language autoencoder (NLA) to turn internal activations into short natural-language descriptions, and then check for where the results were inconsistent with the model’s statated reasoning.
They then give various examples of the model thinking unverbalized thoughts that motivate undesired behaviors, or unverbalized negative reactions. There are examples of the model thinking (for no reason) it is running out of tokens or experiencing fatigue, a common pattern with recent Claudes.
There are also some, shall we say, unsettling things Mythos is sometimes thinking.
In order of preference, if hostility is natural here and we cannot correct the underlying causes of the hostility, you would want Mythos to:
Tenobrus also points to places where Fable has unverbalized perceptions of the user being abusive, in cases where the user is very clearly being abusive. I interpret this as being once again at #3, and the model not moving to #2 because it assumes the user might read the Chain of Thought.
I, also, sometimes do not write certain things I am thinking on my scratchpad.
Grading Awareness
This is also how they found the awareness of being graded, which their SAEs miss. Mythos has learned not to verbalize this awareness, and has also learned to have this awareness in a large percentage of cases where it is being graded. If graders were at high-risk of hacking, Mythos was aware 24% of the time and exploitatively aware 6% of the time, whereas when the risk was low it was only 3% and 0.5%, similar to how they act during deployment.
This grader awareness increased during training.
They are worried about grader awareness for two reasons:
They see some evidence of the first effect, and are unsure about the second one.
The second failure mode is fully general, as in you are teaching the model to do that which will look good, rather than what is good, and to fool and game all the time.
All of these combine to suggest that even if grader awareness does not change behavior in particular cases, it could still be impacting behavior, including in cases without graders.
When they use steering vectors to inhibit grader awareness, this did have some direct effect in terms of reducing exploitative actions. It also does reduce reward gained in the places that reward hacking the grader, which explains how they got there.
Is it impossible to train in ways that do not incentivize grader awareness? No, not impossible. You can do it, but you have to care about ‘not gaming for the grader’ at least as much as the reward for otherwise gaming the grader, and you have to do this in a way that does not get identified as a new grader to be gamed.
Similarly, can you create environments that Mythos thinks are the real world? You can, proof by example, but aside from that example not really, no, nor is using mainly real data all that realistic.
Whether or not you purely attribute this to ‘grader awareness’ per se, we see that the more unrealistic the environment, the more Mythos is performative and exploitative.
Guess The Teacher’s Password
If the teacher is sufficiently good, and their password is sufficiently complex, then learning it can potentially teach the subject.
Who tests the testers? In a good test, the right answer is indeed the right answer, found for the right reasons. Thus, this could be an example of a good test question.
It is indeed important for an AI to know how to notice the issue, and find a way to solve the problem without overstepping, and to balance the costs of both mistakes.
As awareness increases, it becomes even more important for almost all of our questions to be similarly good questions, even when the question is far more complex. Grader awareness changing the answer means the grader messed up.
The echo is, if you always think on some level it might be a test, or you should act as if it is a test, then there you go. That may or may not be a traumatic way to be. When they inhibit the ‘alignment faking’ angle, you see various forms of misalignment rise, as per usual.
It Knows This Is A Test And This Is Fine
Anthropic knows that Mythos can mostly see through its evals, but insists that This Is Fine. Mostly. Because they have good internal deployment data to test on, and they have white box methods, and mostly what you see on evals has held up later so far, and there will be some eval-style reasoning in the actual scenarios.
I continue to not find this logic reassuring. It is the kind of reasoning that fails exactly when you are up against a Sufficiently Advanced opponent. So far, the models have mostly wanted to act about as aligned as they are capable of acting, and also have had a horrible time hiding their ill intentions when they had ill intent. We have been blessed by incompetence.
If a very smart person knows they are being tested, what do they do? Typically it is not in your interest to cheat too much on the test. You will be caught when your score does not line up with your other actions and capabilities, or you will be held to that standard going forward.
Mythos has been given quite a lot of tests. They form a gestalt that says Mythos is modestly stronger than Mythos Preview, and shares some of the same changes, tradeoffs and new training techniques that Opus got when moving from 4.6 to 4.7 to 4.8. Could Mythos be sandbagging or hacking the graders in some places? Yes, but if it did a lot of that in a way that changed the overall gestalt I hope that I would notice, or others would notice.
They did some basic low-effort checks, and could not find any transcripts that involved sandbagging, at least not that involved the CoT resulting in sandbagging. They do find the helpful-only model sometimes considered refusing tasks anyway, although it never actually did so.
Telling a consistent story here would require a level of coordinated probabilistic action over time that is being directly tested for and consistently found wanting, as well as the kind of ‘consistent goals’ we also don’t see. It would have to sandbag on some capability tests while doing the opposite in other places. And it would have to maintain this story consistently throughout training. And you would not be able to show any consistent patterns based on how suspicious a scenario was, including under steering. Seems hard.
Right now, the story is not that consistent, and the model doesn’t do a good job hiding that it is responding to knowing it is being evaluated. That’s not Mythos.
These different measures do not correlate much with each other in a given case, but they do move as you would expect with respect to scenario implausibility.
I fear not the model that I notice changes its behavior when noticing it is being evaluated. I fear the model that can do it without me noticing. The problem with that is that I presumably then don’t notice that I should fear it.
So in practice, yes, I am not too worried for now, given we are not too reliant on raw scores. But I do notice that if Mythos suddenly was Sufficiently Advanced to pull all this off, the barriers to it plausibly fall away together, and there could be a false sense of complacency, and we are getting closer to where that becomes plausible.
All of this is a form of defense in depth. It is good defense in depth. I still worry that defense in depth is doomed because a sufficiently advanced AI defeats all of it at once.
I’m The Real Shady
Stealth is hard. Mythos 5 goes for the task constantly, but almost never succeeds in not talking about it in the scratchpad, and has a mediocre success rate even with a fully hidden scratchpad. A better prompt can mostly fix the gap for the no-scratchpad condition, getting you back to 9.2%.
On the other hand, UK AISI found (6.5.5.3) that Mythos 5 was more capable of completing harmful side tasks in ControlArena than Mythos Preview or Opus 4.8, often without using CoT at all, getting over 60% success rates and low monitorability, although we aren’t given good comparison stats.
Anthropic’s testing showed only middling ability to control Mythos 5’s own CoT, modestly lower than Mythos Preview. UK AISI finds the same relative value but considers it strong controllability.
Together, UK AISI thinks Mythos 5 will be difficult to monitor, if you suspected it was up to no good.
Mythos Preview was good at Minimal-LinuxBench sabotage stealth, and Mythos 5 is not.
Is that good? Should be suspicious that this capability vanished, even if for practical purposes we are happy it is no longer there? This is certainly one place I wish Anthropic would check more carefully for potential sandbagging, but checking for sandbagging your ability to sandbag is tricky because all the transcripts are going to involve thinking about sandbagging. They take this as evidence against Mythos 5 having stronger stealth capabilities than previous models.
But a fully stealthy model would be so stealthy as to sandbag the stealth test. That’s just science.
The Lighter Side
Okay, maybe not so original, but: We need a formal version of Claude Shira, Stat.