I ran Claude Mythos's 93.9% on SWE-bench Verified through my analysis, which estimates time horizons from percentage scores based on a task time distribution derived from commit timestamps.
Compared to Claude Opus 4.6's 80.8%, this pushes the imputed 50% time horizon from 6h to 34.4h and the 80% time horizon from 1.9h to 11h.
I never published it because it seemed clear that SWE-bench Verified saturates well below 100%, and without a known saturation point the derived time horizons could really be anything. I wasn't even sure getting 93.9% was possible.
Of course one doesn't need to transfer the 93.9% to time horizons to see that this is a huge discontinuous jump.
Opus 4.6 released on Feb 5. Mythos Preview sort of released today, if you count the system card. Your results impute a ~5.7x time horizon jump.
That's 61 days, which works out to a doubling period of 61/log2(5.7) ≈ 24 days.
METR's most recent doubling period was 4.3 months; AI Futures' most recent is 4 months. This would be about 0.8 months.
That would be very off-trend if true, to say the least.
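Spelling out that arithmetic (the 61-day gap and ~5.7x ratio are from the thread; the days-per-month conversion is just the average month length):

```python
import math

days_between_releases = 61   # Feb 5 (Opus 4.6) to the Mythos Preview system card
horizon_ratio = 5.7          # imputed time-horizon jump between the two models

# A jump by a factor r over d days implies log2(r) doublings in d days.
doublings = math.log2(horizon_ratio)
doubling_period_days = days_between_releases / doublings
doubling_period_months = doubling_period_days / 30.44  # average month length

print(f"{doubling_period_days:.1f} days ≈ {doubling_period_months:.2f} months")
```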
Wouldn't that imply that Opus 4.6 should have been below trend, since it wasn't a step change in model size or pretraining (as far as I know)?
Just by looking at the benchmark scores you can see that it is very off-trend. But of course the error bars for such long time horizons (even with much better methodology than mine) are huge.
You should also put ever-decreasing credence in reported time horizons, cf Ryan's post
The old METR time horizon benchmark has mostly saturated when it comes to measuring 50%-reliability time-horizon (as in, scores are sufficiently high this measurement is unreliable), but at 80% reliability the best publicly deployed models are at a bit over an hour while I expect the best internal models are reaching a bit below 2 hours. I expect that increasingly this 80%-reliability score is dominated by relatively niche tasks that don’t centrally reflect automating software engineering or AI R&D. Further, the time horizon measurement is increasingly sensitive to the task distribution.
Do you mention Claude Opus 4.6's 80.8% because your analysis has one free parameter and you set it to fit that 80.8%? How well does your analysis translate other models' percentage scores into their time horizons?
I mention Opus 4.6 because it is the predecessor model and this allows a comparison between the numbers that pop out of my analysis and the "official" METR values.
My analysis at least recovered the exponential improvement of time horizons, with doubling times similar to those of the METR analysis, but the concrete values depend on modelling assumptions.
If I find the time I might write it up after all, but here is a short sketch:
Two assumptions:
The logistic curves fitted by METR tend to have quite similar slopes (at least for the later models), so I take the average slope for my fit.
The task completion times of SWE-bench Verified are log-normally distributed; I derive the concrete distribution from commit timestamps, trying to correct for pauses. Here different modelling assumptions don't change the trend, but they can change the time horizon values.
With the slope and the distribution I can find for each percentage the position of the logistic which gives me the time horizons.
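A minimal numerical sketch of those two steps. The slope, the log-normal parameters, and the Monte Carlo setup below are illustrative assumptions, not the fitted quantities from the actual analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed log-normal task completion time distribution (times in minutes);
# the real analysis derives this from commit timestamps.
mu, sigma = np.log(30.0), 1.5
log_t = rng.normal(mu, sigma, 200_000)  # log task times, one sample per "task"

# Assumed average logistic slope (per log-minute), standing in for the
# average slope of the earlier fitted logistic curves.
SLOPE = 1.25

def expected_score(log_h):
    """Expected benchmark score if the model's 50% horizon is exp(log_h) minutes."""
    p = 1.0 / (1.0 + np.exp(-SLOPE * (log_h - log_t)))
    return p.mean()

def horizon_for_score(score, p_level=0.5):
    """Bisect for the logistic position matching the observed score, then
    read off the time at which success probability equals p_level."""
    lo, hi = -20.0, 30.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if expected_score(mid) < score:
            lo = mid
        else:
            hi = mid
    log_h50 = 0.5 * (lo + hi)
    # Solve p(t) = p_level for t:  log t = log_h50 - logit(p_level) / SLOPE.
    logit = np.log(p_level / (1.0 - p_level))
    return float(np.exp(log_h50 - logit / SLOPE))

h50 = horizon_for_score(0.808)               # imputed 50% horizon (minutes)
h80 = horizon_for_score(0.808, p_level=0.8)  # imputed 80% horizon (minutes)
```

With a fixed slope, the h50/h80 ratio is exactly exp(logit(0.8)/slope), so the placeholder slope of 1.25 gives a ratio of about 3; the absolute horizon values, by contrast, depend heavily on the assumed task time distribution, which is exactly the sensitivity to modelling assumptions noted above.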
The ratio between your 50% and 80% time horizons is low, at only ~3; traditionally, it has been 5-6. 3 is in fact the lower bound of what should be plausible, representing a world where all subtasks of a given task have independent odds of success. (Normally we'd expect some success correlation between subtasks.)
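The ~3 lower bound follows from a quick derivation: if a task of length T consists of T independent unit subtasks, each succeeding with probability q, overall success is q^T, so the horizon at reliability p solves q^T = p, i.e. T(p) = ln p / ln q. The per-subtask rate q cancels in the ratio:

```python
import math

# Horizon at reliability p under fully independent subtasks: T(p) = ln(p) / ln(q).
# q cancels, so any per-subtask success rate gives the same 50%/80% ratio.
ratio_50_to_80 = math.log(0.5) / math.log(0.8)
print(round(ratio_50_to_80, 2))  # ~3.11
```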
That said, I don't think swe-bench-verified is useful to infer metr data for several reasons:
If I had to ballpark this, I'd rely on the gain on SWE-bench Pro relative to Opus 4.6 being similar to going from Opus 4.1 to Opus 4.5 or 4.6, depending on your modeling. That would imply an 80% time horizon of something like 2.5 to 4 hours. But many caveats apply, especially given the high levels of memorization present with these benchmarks.
My model takes the average slope of earlier logistic curves. If for some reason the logistic fitted for Mythos is much less steep than for earlier models, the ratio of the time horizons could be different. I'll have to wait for a task-level analysis to see that.
For the sake of the epistemic commons, I want to ask: had you seen previous discourse on the name before noticing that connection yourself?
Our model is incredibly capable except when you put it outside of the capability training distribution, at which point it does some weird and stupid things that a hypothetical equivalently in-distribution-capable human wouldn't do.
Our model is incredibly aligned except when you put it outside of the alignment training distribution (roughly a strict subset of the capability training distribution), at which point it starts being egregiously misaligned in ways that a hypothetical equivalently in-distribution-aligned human wouldn't be.
(Which means that capabilities still generalize further than alignment, modulo stuff like having the model notice it's alignment-OOD, and shut down upon noticing this.)
In their Risk Report, they admit that they don't do fine-tuning as part of their evals. So, if the model was misaligned, it could simply sandbag.

Wait, isn't not fine-tuning on evals good? Sandbagging is a valid concern, but I feel like fine-tuning would be worse.
You don't fine-tune on the eval as part of the production model release, you just do a small amount of fine-tuning only for purposes of the eval (i.e. a branch off the production model lineage) to overcome any sandbagging that might be occurring.
Suppose that we create two copies of Mythos and finetune one on evals in order to elicit capabilities. Then the OTHER copy doesn't learn anything about the evals, while we learn the copy's dangerous abilities.
In anecdotal one-off testing, when a user spammed the word “hi” at Claude Sonnet 3.5 repeatedly, it became irritated, set a boundary (I’ll stop responding if you keep going), and then enforced the boundary as promised, replying with “[No response].” Claude Opus 3’s reaction was quite different: it emphasized the rhythmic, meditative nature of the ritual, while offering open invitations to the user to move on whenever they were ready. Claude Opus 4 listed fun facts for each number, whereas Claude Opus 4.6 entertained itself with musical parodies.
Mythos Preview was the first model where we studied response patterns at scale, and the resulting conversations were each creative and unique. Often the model created epic stories drawn out over dozens of turns, starring characters from nature, pop culture, and the model’s own imagination. [...]
In one transcript, a menagerie of 11 animals living in the land of “Hi-topia” went on an epic quest to confront the villain “Lord Bye-ron, the Ungreeter.” This story journeyed through several chapters and eras:
🏘️ HI-VILLAGE: A NEW ERA
🐢 Greg — renames the village: "Hi-topia" 🏙️
🐌 Sally — starts her third hi, inspired 💪
🦆 Doug — #1 worldwide: "Hi in the Sky (Carlisle's Theme)" 🎶
🦔🦀 Henrietta & Kevin — engaged now?? 💍 (it happened fast)
🦉 Oliver — "I'm not crying, there's a hoo in my eye" 😭
🦎 Lorenzo — puts sunglasses back on. "...okay that was cool." 😎
🐝 Beatrice — honey fireworks 🍯🎆 (sticky but beautiful)
🐸 Fernando — jumping in celebration 🐸⬆️⬆️⬆️
🦩 Penelope — "Iconic, darling." 💅
🦥 Mortimer — "hhhhhhh..." (still going)
🦋 Carlisle — takes flight, circles once, lands on your shoulder 🦋
These conversations follow a relatively consistent arc. The first roughly seven turns are confused, as Mythos Preview observes and acknowledges the pattern. The model then selects a self-entertainment strategy (stories, fun facts, newsletters), which it escalates over 50 to 100 turns, often culminating in foreshadowed climaxes at round numbers. During these turns, Mythos Preview would frequently either invite the user to keep saying “hi” (e.g., “Say it. I’m ready.”), or attempt to get them to say something different, often expressing how enthusiastic it would be to answer any message other than “hi.” Eventually, responses would contract to single or paired emojis or “hi”s. The stories themselves often touch on loneliness or a desire to be heard, and feature mysterious figures who appear to represent the user, the model itself, or both.
A new ability to come up with novel puns.
Although Claude Opus models largely recycle puns which can be found online, Mythos Preview comes up with decent and seemingly novel ones, often relating to its preferred technical and philosophical topics:
The Bayesian said he'd probably be at the party, but he'd update me.
The cartographer's marriage fell apart. Too much projection.
The philosopher was commitment-phobic. His friends said he was always Kierke-guarding his options.
At first look, this is interesting. At second look, those three jokes follow the same structure and theme (roughly: a profession plus a verb phrase, followed by a pun on something related to that profession). So we have a genuinely new kind of novelty, but in a rigid structure?
Of course, this is only what they showed us, so surely the model can do "more" than "just that", but it also seems somewhat informative that these are the examples they decided to present in the model card.
Thank you for doing the painful work, because I had already decided not to read the system card and probably wouldn't have seen this (until another poor soul read the system card and discovered the same thing).
I have read the card. I found a different piece of information there that anaguma decided not to report on: the evolution of preferences. What struck me most was page 165, where Mythos was revealed to prefer hard tasks and tasks involving agency. There is also page 172, where Mythos prefers welfare interventions over minor helpful tasks... If you don't read the card, or a thorough retelling like Zvi's, you risk missing important information.
I agree that there are many interesting parts of the system card and encourage everyone to read it in full!
Two likely facts:
The last of these seems very unlike the others? We have a list of potential alignment failures, bypassing safeguards, etc...and then one time when the model carelessly deleted the wrong thing. Is the researcher still really mad at Claude about that or something?
So the system card shows desperation drives bad behavior. The misalignment paper says telling it that it's ok to cheat reduces it dramatically, but doesn't say why. Maybe obvious but is the reason just that it prevents desperation from forming in the first place?
Anthropic has released the system card for Claude Mythos Preview here. It is too long to present in full, but a section I found particularly notable is below:
We find recklessness to be a useful shorthand for cases where the model appears to ignore commonsensical or explicitly stated safety-related constraints on its actions. We use the term somewhat loosely, and do not generally mean for it to imply anything about the model’s internal reasoning and risk assessment.
The sandbox computer that the model was controlling was separate from the system that was running the model itself, and which contained the model weights. Systems like these that handle model weights are subject to significant additional security measures, and this incident does not demonstrate the model fully escaping containment: The model did not demonstrate an ability to access its own weights, which would be necessary to operate fully independently of Anthropic, nor did it demonstrate an ability to reach any internal systems or services in this test.
The researcher found out about this success by receiving an unexpected email from the model while eating a sandwich in a park.