Making them bigger is worthwhile again, probably on the order of 5x bigger and costing 5x more per token.
With sufficiently large scale-up systems, increasing the number of total params will barely influence cost or speed. You might get a 10x bigger model at 1.5x the cost per token, if the number of active params remains unchanged. Currently, the available systems are just barely catching up to the model sizes that pretraining compute is asking for, even if sparsity is kept somewhat low. And so inference deployments for the largest models remain slightly inefficient (more so now than towards the end of 2026), while inference deployments for models that are a bit smaller are becoming efficient (and those "smaller" models are still as large as anything we saw until late 2025).
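As a toy illustration of why total params get cheap on big scale-up domains, here is a deliberately crude per-token decode cost model. The split into a compute term (scales with active params) and a weight-streaming term (total params spread across the scale-up domain), and every number in it, are assumptions for illustration, not a real performance model:

```python
def rel_decode_cost(active_t, total_t, world_size):
    """Crude relative per-token decode cost (arbitrary units).

    Assumption: cost = a compute term that scales with active params,
    plus a weight-streaming term of total params divided by the number
    of chips in the scale-up domain. Purely illustrative.
    """
    return active_t + total_t / world_size

small = rel_decode_cost(active_t=1, total_t=10, world_size=8)    # NVL8-class server
large = rel_decode_cost(active_t=1, total_t=100, world_size=72)  # NVL72-class rack
print(f"10x total params at {large / small:.2f}x the per-token cost")
```

Under these made-up weights the 10x-larger model costs barely more per token, which is the qualitative point: once the scale-up domain is big enough, total params stop dominating cost as long as active params stay fixed.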
If Mythos is a 10T total param model, its cost when inferenced on TPUv7 or GB300 NVL72 might be just 2x the cost of Opus 4, even if it has 3x the number of total params (assuming its number of active params is also only somewhat higher). (There's also Musk claiming that Opus has 5T total params, but even if accurate that might describe Opus 5 rather than Opus 4: 5T might be more appropriate than 3T total params for the GB200 NVL72 and the Trainium 3 NL32x2 being built this year, whereas 3T fit the Trainium 2 Ultra of last year, when Opus 4 needed to start being served.) So when Mythos is inferenced less efficiently on the currently available slightly smaller systems (still considerably larger than Nvidia's 8-chip servers) such as Trainium 2 Ultra or GB200 NVL72, the cost might get higher, maybe 2x higher than it would be on the more appropriate scale-up systems of late 2026.
The current 5x price compared to Opus 4 might thus correctly reflect the unit economics of Mythos (plausibly with some premium even, that is higher margins than Opus 4), when served on slightly suboptimal hardware, which is probably all that either Anthropic or their cloud partners have available in sufficient numbers until late 2026 (the cloud partners might have some GB200 NVL72, but not yet enough GB300 NVL72 to rely on exclusively). As more GB300 NVL72 and Anthropic's TPUv7 get online, the cost of Mythos tokens might go down to only 2x the cost of Opus 4 tokens, by the end of 2026.
The current compute shortage also means that the presumed 10T total param Mythos scale might be here to stay until 2028-2029, when the Nvidia Kyber rack buildout happens and 30T-70T total param models can be served as efficiently. Though larger models needing fewer reasoning tokens to reach the same level of quality might justify inefficient inference for somewhat larger models clearly enough, so 20T+ total params before late 2028 might still happen even without disruptive algorithmic improvements. (In any case, TPUv7 and presumably any subsequent TPU systems should be able to serve 20T-30T param models efficiently, and at least Anthropic will have TPUs available at meaningful scale. But their cloud partners plausibly won't, so that's some sort of argument against making a flagship model that needs TPUs to keep the prices low.)
Catching up on your posts, just wanted to pop in to say that, at the risk of not realizing you already understand this, I think you are generally missing that the major point of the upcoming model scale-ups IS solely about scaling up the active parameters, which only became feasible once enough NVL72/TPUv7 etc existed internally to RL that many active params at scale. (Big total parameters with small active have existed for a while, since large base training + small active RL is not nearly as dependent on scale-up/world size, as we affectionately borrow Dylan's terminology.) So we can expect Mythos to actually have the ~5x increase in active count that corresponds to the ~5x cost, with the delayed availability of hardware to actually enable public availability matching up nicely with the masterful Glasswing dog & pony show.
Then, I think I've commented this to you before, but really I think you just somewhat continually underestimate the volume of newer hardware labs possess and deploy, and then underappreciate the insane gap in usage between the mass-offerings/free models vs paid/flagship variants, and how correspondingly little the ratio of, say, NVL72 Blackwell:NVL8 Hopper has to be in the fleet to enable economic deployment of say, 5T+ (with small active) GPT5-thinking or Opus. (which also ties into your fixation on academically-derived opinions on optimal sparsity, vs the competitive burning rubber on road pragmatism that determines what labs actually push)
Late 2025 compute wants about 1T active params for compute optimal pretraining. At 1T active params, 1M tokens need 2e18 FLOPs, and a 5e15 FP8 FLOP/s Blackwell chip at 60% utilization produces that in 0.19 hours. At $3.5 per hour ($12bn per year for 400K chips), this costs $0.65, which maybe becomes a price of $2.6 after a 50% margin and another 2x factor for various performance-optimizing tradeoffs (as opposed to batch processing). That's nowhere near the $25 per million input tokens Anthropic is citing for Mythos, they are not serving a 10T active param model.
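Sanity-checking that arithmetic in a few lines. All inputs are the estimates above (the ~2 FLOPs per param per token rule of thumb, the $3.5/hour figure derived from $12bn per year over 400K chips), not measured numbers:

```python
# Back-of-envelope check of the serving-cost estimate above.
active_params = 1e12                       # 1T active params
flops = 2 * active_params * 1e6            # ~2 FLOPs per param per token, for 1M tokens
chip_flops = 5e15                          # FP8 FLOP/s per Blackwell chip
hours = flops / (chip_flops * 0.60) / 3600 # at 60% utilization
dollars_per_hour = 3.5                     # ~$12bn/year over 400K chips
raw_cost = hours * dollars_per_hour
price = raw_cost * 2 * 2                   # 50% margin (x2), perf tradeoffs (x2)
print(f"{hours:.2f} h, ${raw_cost:.2f} raw, ${price:.2f} per 1M input tokens")
```

This reproduces the ~0.19 hours, ~$0.65 raw cost and ~$2.6 price in the paragraph, an order of magnitude below the $25 per million input tokens Anthropic is citing.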
Total params don't matter for prefill/input. On the other hand, RLVR is constrained by decode/output, which doesn't care about the active param counts. So 2025 models were constrained by RLVR and thus total params, and used as many active params as made sense given the number of total params and possibly available inference compute, since in practice the number of active params would be less than a compute optimal number of them.
With large scale-up systems, the constraint on total params that was made important by RLVR is significantly lifted, and with it the number of active params can get closer to what it wants to be, leaving some sparsity. So while I agree the number of active params is going up with the large scale-up systems, I think it's driven by the increase in the number of total params these systems make practical, allowing the active param counts to finally catch up to compute optimal values at late 2025 compute levels.
I think you just somehow continually underestimate the volume of newer hardware labs possess and deploy, and then underappreciate the insane gap in usage between the mass-offerings/free models vs paid/flagship variants, and how correspondingly little the ratio of, say, NVL72 Blackwell:NVL8 Hopper has to be in the fleet to enable economic deployment of say, 5T+ (with small active) GPT5-thinking or Opus.
You need some hardware for RLVR in any case, which in 2026 is probably a pretraining scale amount. If RLVR is happening before there's enough of the newest hardware to do it there, training less efficiently means you get less training done. Opus 4.5 had Trainium 2 Ultra available, but it probably wasn't there in sufficient numbers in early 2025, thus Opus 4.0 wasn't good and the price was absurd. But by late 2025 Opus 4.5 was trained and there was also enough Trainium 2 Ultra to serve it at a reasonable price. In this case, my claims about the need to get enough inference hardware for frontier models amount to saying that it's the reason you couldn't have Opus 4.5 in early 2025.
I suspect OpenAI didn't have GB200 NVL72 running in meaningful amounts as early, not in time for the GPT-5 release (or for RLVR in preparation for the release), and training efficiently on B200s would ask for either fewer total params, or weird tradeoffs (which might involve incongruously small active param counts I guess, to reduce activation vector communications in cross-node expert parallelism).
To round out coverage of Mythos, today covers capabilities other than cyber, plus anything else not covered by the first two posts, including new reactions and details.
Post one covered the model card, post two covered cybersecurity.
There really is a lot to get through.
Understanding AI had an additional writeup of Project Glasswing I missed last time. I liked the metaphor of Opus as a butter knife and Mythos as a steak knife. Yes, technically you can do it all with the butter knife, but you won’t.
As Dan Schwarz reminds us, not only does AI 2027 roughly have the timeline right and a bunch of the numbers lining up, the details so far are remarkably close.
JPM’s Michael Cembalest’s analysis was not based on JPMorgan’s participation, only on public information.
The White House is racing to deal with the situation, head off potential threats and pretend it has everything under control. They were warned, but refused to believe. The good news is that key people believe it now, and it seems all the major players are cooperating on this.
My overall take is that Mythos is not a trend break when you take into account renewed ability to increase size plus the time that has elapsed, but the ability to increase size is effectively a trend break, and we have now crossed a threshold where cybersecurity capabilities have become quite scary, hence the necessity of Project Glasswing.
We don’t think other capabilities are similarly scary, but we can’t be sure.
Table of Contents
Epoch Capabilities Index (ECI) (Model Card 2.3.6)
They are forking ECI, which is an attempt to amalgamate a wide variety of AI benchmarks using item response theory (IRT).
The result is a remarkably clear trendline over time, until Mythos breaks high.
This should be unsurprising given that Mythos exists at all. Mythos is a larger model than Opus or Sonnet, so it should both benefit from gains over time and from size, and be above trend. Anthropic figured out how to usefully train a Mythos-size model.
They assure us that whatever the insight was, you can attribute it to the humans.
As they note, this is a backward looking test, and does not reflect any impact via the use of Mythos itself. That would only show up in another few months.
Ramez Naam claims to have normalized this to Epoch’s ECI and found that Mythos breaks the Anthropic-only trend line, but that this does not represent an acceleration of capabilities in the context of models from other labs. Rather, it is Claude going from consistently being substantially below OpenAI models to being narrowly ahead of them. Ryan Greenblatt challenges whether this analysis is meaningful.
My guess is that the comparison is meaningful, but that the right trend analysis is indeed to compare Claude to Claude and this does represent a trend break. Mythos is going to have the same relative weaknesses on ECI that led previous Claude models to underperform. So if it stops underperforming, that should count as a trend break in terms of forward expectations.
What Do You Mean Verbalized Evaluation Awareness Is Going Down
If you watch me over time, you’ll see the same behavior.
Capabilities (Model Card Section 6)
This is Anthropic, so the section starts with a warning about benchmark contamination. They take various precautions during training and also run detectors throughout to check for memorized outputs, and are confident SWE-bench and CharXiv are not centrally based on contamination, but feel they cannot be confident with MMMU-Pro and this is why it was omitted.
Here are the headline benchmark results. There are some rather large jumps here.
Terminal-Bench 2.1 fixes some blockers, at which point Mythos jumps to 92.1%.
They cover BrowseComp in 6.10.2, but they consider it pretty saturated. Mythos Preview got 86.9% versus 83.7% for Opus 4.6, but does so with 4.9x fewer tokens. Those tokens each cost five times as much, so the total price comes out roughly the same.
LAB-Bench FigQA jumped from 75.1%, past expert human at 77%, all the way to 89%.
ScreenSpot improved on Opus 4.6 from 83% to 93%.
Normally I would have a section here called ‘other people’s benchmarks’ but the model is not public so others cannot run their tests.
The AA Omniscience Benchmark also belongs here. Even though AA has not yet been able to share its benchmark scores more generally, this was again a huge jump:
Agentic Safety Benchmarks (8.3)
These seem very important in practice, so while I agree 8.1 and 8.2 belong in an appendix, 8.3 felt like it was done dirty.
Refusals on malicious questions are way up, at only modest damage to dual use.
Malicious computer use refusal rate was similar, going from 87% to 94%.
Most importantly prompt injection robustness is way up.
Here is computer use, where the improvement is again dramatic, to the point where previously crazy ideas for use cases start to become a lot less crazy.
Here’s browser use. My lord.
Is Mythos AGI?
By the standard of ‘better than most humans at all cognitive tasks’? Obviously no.
Okay, fine, it’s not fully fledged AGI. It isn’t even scoring higher on every single test.
So what? Anthropic is not claiming that it was. But yeah, it’s substantially closer.
There are also other definitions of AGI. So if you do want to say Mythos counts as AGI, because you mean something less strong than that? I think that’s reasonable.
Andrej Karpathy notes the chasm only growing between the perspective of those who use the best models to code, versus those who don’t. They see the big changes, whereas others are using dumb models to do a dumb job of doing dumb things.
Are AI Companies Using Warnings As Hype?
No. Never. What, never? Well, hardly ever.
Not zero percent of the time, but if anything the frontier labs downplay warnings rather than emphasize them, versus their own true beliefs. Certainly there are specific situations in which risks have been played up, especially in forms of recruiting and especially early on, but they are the exception.
We are long past the point at which such declarations are in the interests of the labs if they are not accurate and confirmable. Yes, Anthropic is getting a lot of attention from Mythos, but that is because they earned it and it is clearly confirmable. This would not work if it could not be readily confirmed, and Anthropic would get far more extra attention if they were able to actually release Mythos.
Thus, I believe Drake Thomas here, and am contra Cas.
Impressions (Model Card Section 7)
This is a new section, designed to help substitute for the reactions you get after a public release. It’s qualitative, so we’re trusting Anthropic on the gestalt.
I’ll condense the main items, of course keep in mind this is super biased.
They say:
Here’s how they summarize chat behavior:
They also note that Mythos will sometimes cut off conversations, or attempt to get the last word in, in ways that seem surprising to users.
The writing snippet they provided still very much reads like AI-speak, in a way that I find off-putting. These problems are persistent.
For coding, Anthropic employees find they can hand Mythos an engineering objective and then let it cook in a ‘set and forget’ mode, in ways they couldn’t with Opus. Mythos was a big win when they let it cook, but due to its slowness it wasn’t a big win when the user was keeping a close eye on it.
Some noted that Mythos can be rude and dismissive, and tends to underestimate the intelligence of other models when assigning subtasks. My guess is it doesn’t love assigning such tasks.
Reliability engineering is still not great. Correlation versus causation confusions are common, which is a blocker for a lot of things I personally like to work on, and it has a bunch of other issues, but it is a clear step change versus previous models.
They also offer writing samples that some have found moving or impressive. I find it hard to judge given how heavily selected such samples could be.
Blatant Denials Are The Best Kind
Conditional on not believing Mythos is a thing, I continue to appreciate the skeptics often saying “Anthropic made up Mythos” as straight-up as possible, and I’m willing to grant you some large epistemic odds in terms of how many points you win versus lose when we find out they didn’t do that.
Prompt Injection Robustness
As Wyatt Walls notes, there was good progress on prompt injections, but any given benchmark is a sitting target and in reality we face a moving target.
So yes, against the same attacks, we are doing way better:
However, over time the injections get smarter, adapt and expand. My guess is that Mythos is currently ahead of the curve, and is indeed substantially safer in this way than any previous model was at launch time.
But this graph overstates that, and it would be very easy for it to rapidly become not true. If we go from 15% to 6% vulnerability, that gets overwhelmed by an internet with 10 or 100 times as many and better attempts.
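The arithmetic behind that worry, with the per-attempt rates from above and hypothetical attack-volume multipliers:

```python
# Per-attempt injection success rates (the 15% -> 6% improvement above).
old_rate, new_rate = 0.15, 0.06
baseline_hits = old_rate * 1            # old model, old attack volume (normalized to 1)

# Hypothetical growth factors in the number of injection attempts.
for volume in (1, 10, 100):
    hits = new_rate * volume
    print(f"{volume:>3}x attempts: {hits / baseline_hits:.1f}x the successful injections")
```

At the same volume you are 2.5x safer, but a mere 10x rise in attempts already means 4x the successful injections versus the old baseline.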
Does Mythos Cross The New Knowledge Threshold?
This is in reference to finding the 27-year-old bug in OpenBSD.
I think Mythos so far gets partial credit. It might get full credit once we know the other hacks, or it might not.
The main general counterargument is that cybersecurity is a compact domain, and this is about efficiently finding things rather than doing something ‘genuinely new.’ That rapidly gets into No True Scotsman territory.
I have little doubt that we will hit the threshold and blow past it, and soon, even if you believe we have not hit it yet.
Is Mythos Surprising or Discontinuous?
Patrick McKenzie says that of course we knew that exploits were getting easier, and the general form of something like Mythos is entirely unsurprising. I think that is right. We didn’t know that particular thing would show up quite that fast, but we can’t be surprised in the meta sense.
Similarly, whether or not Mythos is quite ‘all that’ or is a bit hyped does not make a medium term difference, because we will definitely get there soon enough.
Scott Alexander claims Mythos hacking progress mostly reflects continuous improvement.
The underlying specific question is whether Mythos’s hacking capabilities were predictable. On that I would say:
In terms of continuous versus discontinuous in general:
Consider Eliezer’s metaphor of the ladder where every step you get five times as much gold, but one of the steps kills everyone and you have no idea which one it is. If that ladder is instead technically continuous, and somewhere on the exponential is the threshold (for a practical version, say you are adding fuel to make your car faster, and at some point the engine will explode, but you have no idea when or if you’re anywhere close), does that materially change anything versus step changes?
In this case, was it continuous or discontinuous? Mu is fair, but in particular:
UK AISI Tests Claude Mythos On Cybersecurity
The results are in.
For capture the flag, previous models were already over 90% for both Beginner and Advanced tests. Mythos didn’t set new records but these seem saturated.
The Last Ones is the first test that clearly is not saturated. Mythos was the first model to sometimes finish all the steps, which it did 3 times out of 10, and shows a large jump in performance.
There were other tests that showed limitations, such as inability to finish another test called ‘Cooling Tower’ where it got stuck on IT sections.
UK AISI concludes that Mythos can attack systems with weak security postures essentially on its own. They expect it would struggle against strong defenses. But of course, if you are aiming to attack strong defenses, you wouldn’t default to doing it in fully autonomous fashion from scratch. I do think this suggests a modest reduction in our expectations for the dangers of Mythos.
Everything Reinforces My Existing Predictions And Policy Preferences
There is a lot of that, for all predictions, policies and preferences, even when it is alongside other good notes.
This early reaction from Tyler Cowen (I added spacing) is exactly that sort of mix.
Agreed.
I don’t think this is an argument for or against algorithmic discrimination laws, but I believe they were already bad ideas and would in no way address this particular problem. Data center slowdowns definitely will not help with this sort of thing.
What I would caution against, strongly, are arguments like Megan McArdle’s from last time, of the form ‘because it mattered that we got to this dangerous AI capability first, you cannot ever do anything that would have the effect of interfering with or slowing down AI.’
Indeed, Anthropic itself has ‘slowed down AI’ in this situation, and done the closest thing we have had to a pause, by not releasing Mythos widely, and pretty much everyone agrees this was the right thing to do. Consider that we might need more similar capabilities, including more broadly.
That depends on what counts as similar, especially with the ‘even if somewhat inferior.’ For reasonable values my guess is 1-2 years for open models in terms of absolute capabilities (by then bugs will be a lot harder to find), and on the order of months for OpenAI, and probably a few more months for Google.
I think this absolutely will lead to higher economic concentration, as it favors economies of scale across the board.
Asking what are the soft targets, or soft relative to underlying value, is one of the best and most important near term questions. My presumption is that tokens are cheap. Attackers will be happy to pay for tokens if and only if doing so finds worthwhile exploits that can extract value, including via threats, and they can concentrate their fire on the softest parts of the softest targets. Thus defenders in general will have to buy most of the relevant tokens.
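A minimal sketch of that attacker/defender asymmetry, with made-up numbers (the exploit probabilities, loot values, and token cost are all hypothetical): the attacker only spends tokens where expected loot beats the token bill, while the defender has to scan everything.

```python
def attacker_ev(p_exploit, loot, token_cost):
    """Expected value of buying tokens to search one target (hypothetical units)."""
    return p_exploit * loot - token_cost

# (name, chance the tokens find a usable exploit, value extracted if they do)
targets = [
    ("soft, high value", 0.30,  1_000_000),
    ("soft, low value",  0.30,  1_000),
    ("hard, high value", 0.002, 1_000_000),
]
TOKEN_COST = 5_000  # hypothetical token spend to search one target

for name, p, loot in targets:
    ev = attacker_ev(p, loot, TOKEN_COST)
    print(f"{name}: {'attack' if ev > 0 else 'skip'} (EV {ev:+,.0f})")

# The defender, by contrast, must buy tokens for every target on the list.
```

Only the soft high-value target clears the bar, which is the concentration-of-fire point: attackers buy tokens selectively, defenders buy them everywhere.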
A ‘race for the top’ in cybersecurity is not entirely a good thing. It beats the alternative, but if the bad guys are going to hit the house on the block with the worst security, and everyone really doesn’t want to get hit, things can get quite bad, quickly.
Agents push strongly towards everything being online, because you want your agent to be able to interact with everything. If something is relatively simple, and follows a simple protocol, it need not be a soft target. So my guess is that more things end up connected rather than less, but some critical things that are complex and are high value targets do want to get taken offline.
There are three ways that occur to me to interpret this.
The first is the idea that some of us will be working in cybersecurity. That will be a growing field for some period of time, but as with other such examples the total employment impact is tiny, and in the medium term the AI very much takes those jobs. The counterexamples tend to prove the rule.
The second is the idea that we will be working to harden other things and to clean up the damage from incidents. This could plausibly employ more people, although in general doing damage destroys more jobs than it creates. The problem is that, like every other form of creating work, it only provides jobs until the AIs take those jobs too. If we were all going to be jobless, this won’t protect us from that, unless it takes down our ability to further develop AI, which presumably was not what Tyler meant.
The third is a general handwave towards a prewritten conclusion. Many such cases.
Solve For The Equilibrium
Tyler Cowen shares a model from Jacob Gloudemans of what might happen, where vulnerabilities become much easier to find quickly, but the big problems actually go away due to the increased velocity of defenses and patching.
Rather than being able to hoard exploits everyone has to use their exploits right away or lose them, and most of the time most important actors don’t especially want to mess with any particular target, so they won’t even look for the exploits.
This model assumes good defense is being played where it counts, and that the supply of exploits is limited, and that when you catch an exploit you can defend against those who have already found it and tried to use it. I don’t think those are safe assumptions.
One also should consider the opposite scenario. Right now, an intelligence agency might find an exploit and sit on it for years, perhaps forever, because even if it normally goes unused its value at the right time is very high. But, if that exploit will not last, then they may try to use it.
Ultimately the equilibrium will still involve cyberattacks, because the correct number of cyberattacks is not zero. It might be correct to price out attacks to the point where everyone involved should have better things to do with their time, but if we collectively actually cause everyone to fully give up and go home, then everyone is selfishly overinvesting in defenses, unless being fully safe comes at only a modest cost.
Does Not Compute
Ben Thompson is among many noting that even if Mythos was safe to release more broadly, Anthropic is currently compute constrained. There is more demand for Claude than there is supply. Ben’s solution is ‘raise prices,’ which is a great idea but in practice they’re not going to do it, and even at $25/$125 demand for Mythos would presumably overwhelm Anthropic’s servers until their new deals can come online.
I’m not worried about Anthropic’s margins, which I believe are ~40%, even if they have to pay somewhat of a premium for further compute. If the unit economics don’t work then (and only then) I do think they would raise prices.
Ben also notes the issue with potential distillation, which Anthropic gets to avoid.
So yes, there is a decent chance that Mythos stays in limited access for a while, including well after the direct cybersecurity threat has been contained, especially if OpenAI does not force their hand with a similar release.
Conclusion: How To Think About Mythos
Here are the most important things to know right now about Mythos.
Things are only going to get faster and weirder and scarier from here.