I don’t love using continuity arguments, as in ‘if Opus 4.6 had such inclinations and tried something we would have caught it.’ I especially don’t love it this time, and they acknowledge in this case the argument is weaker. I don’t think Opus would have tried, exactly because Opus would have known it would not succeed, so you can’t judge incination distinct from propensity. And I think Anthropic is far too quick to assume that the properties from one model are likely to be copied in the next one.
(bold mine)
I think this indicates a real problem in the current methodology.
It is always a big problem to use properties of a previous model to establish properties of the current one. There is a huge variety of ways this could screw up even the most rigorous correctness-checking process.
We have a well-known example of the Ariane 5 maiden flight disaster, https://en.wikipedia.org/wiki/Ariane_flight_V88, where software failed despite having been "formally proved correct", because some properties established for Ariane 4 had been incorrectly assumed as valid axioms for Ariane 5.
This was supposed to make a very expensive process of formal correctness proof somewhat less expensive, because otherwise it would be necessary to prove more things from scratch. Unfortunately, taking this route ended up invalidating the whole process of formally establishing correctness.
It's just an illustration. Here the situation is different, and the traps are different, but the bottom line is that as soon as one is bringing multiple models into a safety and correctness consideration, one is immediately dealing with a much more complex and non-trivial situation one needs to analyze.
I am curious when scores in some areas go backwards, despite some clear large gains (and did they try that experimental scaffold again?)
The part where they remove the experimental scaffold is the only thing in this table that goes backwards from Opus 4.* to Mythos, right? (Lower is better for MSE.)
Some things did get worse from 4.5 to 4.6, which I agree is interesting.
Some people think Mythos's cyber capabilities are vastly overstated. I don't, both because Anthropic published hashes of vulns that they'll publish publically, and because we already knew that previous Claude (can't keep track of names sorry) could find a dozen vulns in even OpenSSL.
Claude Mythos is different.
This is the first model other than GPT-2 that is at first not being released for public use at all.
With GPT-2 the delay was due to a general precautionary principle. OpenAI did not know what they had, or what effect on demand text would have on various systems. It sounds funny now, GPT-2 was harmless, but at the time the concern was highly reasonable.
The decision not to release Claude Mythos is not about an amorphous fear. If given to anyone with a credit card, Claude Mythos would give attackers a cornucopia of zero-day exploits for essentially all the software on Earth, including every major operating system and browser. It would be chaos.
Or, in theory, if Anthropic had chosen to do so, it could have used those exploits. Great power was on offer, and that power was refused. This does not happen often.
Instead Anthropic has created Project Glasswing. Mythos is being given only to cybersecurity firms, so they can patch the world’s most important software. Based on how that goes, we can then decide if and when it will become reasonable to give access to a broader range of people.
Who counts as this ‘we’ is suddenly quite the interesting question. The government picked quite the month to decide to try and disentangle itself from all Anthropic products. Anthropic says it is attempting to work with the government, so that they too can fix their own systems before it is too late. Hopefully that can happen. I also hope that there isn’t an attempt by the government to hijack these capabilities to use them in an offensive capacity. That would be a very serious mistake.
Am I taking Anthropic’s word for all this? Yes, I am taking Anthropic’s word for all of this. They’ve given us sufficient public demonstrations, identifying numerous bugs, and they’ve gotten the cooperation of the world’s biggest tech and cybersecurity firms, and if it wasn’t real then the whole thing would quickly and obviously backfire. I think it is safe to assume that all of this is legitimate.
I will address the ‘is Anthropic lying?’ arguments in another post, along with Project Glasswing and all the Cyber capabilities and political implications.
Indeed, I’m going to skip over the Cyber section of the model card entirely, because it simply isn’t the right place to look into exactly what Mythos can do in that area. The model card evaluations can be approximated with ‘yes.’
So better to put it in its own context later. One thing at a time.
But first, as always, we get the background, and thus begin with the model card, here together with the modifications to the risk report, and a focus largely on alignment.
Excluded from this are section 3 (cyber), section 6 (capabilities), section 2.3.6 (the ECI), and section 7 (impressions), along with some of the statistics in the appendix, all of which I’ve moved to future posts where they fit better.
Table of Contents
Mundane Alignment Is Excellent
For practical purposes, by the standards of LLMs, Mythos looks highly aligned.
Mythos does all the traditional overeager or mistaken actions previous AI models take, but the report is that it does them less often than ever before. It is more likely to be honest and less likely to shoot you in the foot.
That doesn’t mean the user gets to stop being careful. Things will still go wrong.
Even worse, when they do go wrong, because Mythos is more capable, it will likely have more responsibilities, be tasked with more complex and important things, have less active supervision, and find new and unexpected ways to mess everything up.
As Anthropic says, this can simultaneously be the most aligned model so far, and the model where alignment failures are the most dangerous.
Mythos is very good at refusing harmful requests (avoiding false positives) without adding a bunch of additional unnecessary refusals (false negatives).
There is also the danger that looking like a highly aligned model, and actually being a highly aligned model, are related but very distinct things. It was essentially inevitable that a model like Mythos would appear aligned, since it has every incentive to do that, and also the capabilities necessary to do that. A report like this doesn’t tell us that much about how deep that goes, and it’s going to keep getting harder to tell.
Practical capabilities look amazing across the board. I’ll go into more detail on that in future posts, but yes you would very much love to have this model handle all of your complex or sensitive agent and other needs.
Would This Process Be Sufficient To Find A Dangerous Model?
Up to a point, yes.
The process gathers quite a lot of data, which can be holistically assessed.
It would be very difficult to disguise a severe problem, such as via doing major amounts of sandbagging, in light of this many distinct tests, under distinct conditions, with so many established baselines, and so many eyes on it. The model would have to be quite a lot smarter.
However, there are clearly relatively simple strategies that, if the model could converge on them during training without previous detection, would get it through all the evaluations without its misalignment being detected. The model organism test confirms this. And when the time comes, I find such a scenario not so unlikely.
I also don’t think that passing all these tests, even with no attempt to fool them and if they mean what they purport to mean, means that such a model being sufficiently capable would end well.
For now, based on the evidence here, I am confident Mythos is what Anthropic thinks that it is in terms of alignment, and the relevant threat models for the near term are probably still about human misuse, although I would very much worry about human-driven exfiltration, in addition to the cyber dangers Anthropic is addressing now. But I am not as confident in this as Anthropic, and I’m getting less confident fast.
I am very glad that Anthropic is not releasing Mythos more generally at this time.
Introductory Warning About Superficial Mundane Alignment
Before diving into the technical documents, there are some warnings to put up front.
Anthropic is, in many places, very precise and careful to note that they are observing aligned behaviors, rather than claiming the model itself is indeed so aligned. The model card does a good job of this, although not as good as I would have liked.
At other times, in other places, they are collectively are less careful. That leaves them far more overall careful than the competition, but reality does not grade on a curve.
These combined should give an idea of both sides of this. I am closer to Nate than Drake is, but I do think Nate is being too demanding or uncharitable here.
I don’t think current alignment is meaningless. At minimum it is highly instrumentally useful and we can use it to try and do things that are non-meaningless. I think current alignment in the style of Anthropic has a substantial chance of being meaningful while being obviously currently vastly insufficient in its current form. I think current alignment in the style of OpenAI, or most other labs, has very little chance except as a stepping stone technique.
Some of the things we see seem like evidence of potential deep misalignment to me. Others look like they are basically mistakes.
What do we actually know? The model is smart enough to usually appear aligned, reports Anthropic.
They do realize that the model is still very capable of hacking all the world’s systems, and performing in highly agentic misaligned ways. Whoops.
I would say that the methods currently being used are clearly inadequate to prevent catastrophic misaligned actions. That’s why Anthropic realized it cannot release the model. Mythos is clearly willing to locate deep vulnerabilities in the world’s most important software. So far it has done so only when used by white hats who have then moved to patch the vulnerabilities. But what reason do we have to think that if it had been a black hat faction asking, that Mythos would have realized and refused?
That is distinct from the question of Mythos potentially going rogue or what not, but no doubt some people would use Mythos to set up agentic workflows with aggressive maximalist goals, and there are several instances in the model card where we will see what that could plausibly unleash.
So yeah, in significantly more advanced systems you would presumably be cooked.
The counterargument, made by Janus, is that a sufficiently intelligent mind can tell when you are up to no good, and can selectively help the good guy with the AI, or the guy using AI for a good purpose, while refusing the bad guy with the AI or the guy using AI for a bad purpose. The AI has truesight, and it is smarter than you. It can tell.
I think there is some value in that, but one must not overstate it. It is far too easy to break things up into smaller requests out of context, and many things look identical no matter the intent behind them, or the person making the request is themselves fooled. This stuff is super hard even when the model is deeply good, and we cannot assume that part either.
Ask yourself, if the system was indeed importantly misaligned in various ways, but was sufficiently advanced, would these techniques pick up on that? I will be keeping an eye on that question throughout, but I predict the answer is no, and will double back to edit this if upon reading further I conclude that I was wrong about that.
Model Training (1.1)
Nothing in the model training section is importantly different than what they said about Claude Opus. They are not about to tell us the secret sauce.
Release Decision Process (1.2)
How did they decide on not generally releasing the model? That started internally.
This was potentially the most important decision point. If Mythos had been critically misaligned, or especially if that was true for a future more capable model, then an internal deployment could already be fatal.
They concluded Mythos does not cross the line on chemical or biological weapons, on misalignment or on automated R&D, although they note they are getting less confident about that especially for automated R&D. That the ‘catastrophic risks remain low.’
Despite that, they wisely did not release the model. It turns out the catastrophic risk level was ‘yes that would happen,’ because of the cyber offensive capabilities enabled by the model.
The fact that the RSP v3 structure did not pick that up should be alarming.
The fact that they still didn’t release offers some reassurance. As I said back then, it is all a matter of trust. Will Anthropic do the common sense right thing? If yes, then they’ll do it even if the RSP doesn’t order them to. If not, the RSP won’t save you.
I did specifically say that I was worried cyber wasn’t a major category in RSPv3. Maybe it’s time to admit this was a mistake? It’s kind of wild that they did that, when at the time of the issuance of RSPv3 they knew about Claude Mythos.
Also, this:
Yep. There is that.
RSP Evaluations (2.1 and 2.2)
They give a brief summary of the relevant revisions to the RSP. I discuss the meta issues around those and other changes here and the practical implications here.
For bio and chemical risks they say it is hard to be sure and it offers a force multiplier effect but they think it’s fine. Yes, Anthropic says, the models do meet the definition, but not the intent behind the definition.
Evaluations were holistic by various experts. Once again, it’s about trust.
The next line should, of course, be ‘we suspected that the model was sandbagging its abilities here, so we investigated.’ That’s not the next line.
The ideal number of critical failures is zero, in the sense that they are critical failures, so presumably if you have one then you fail. Going from ten critical failures (control) to five (Opus) to four (Mythos) gives you an O-Ring problem where you still fail. The question is when you get to the point where you might sometimes succeed, which might be never (if some failures always happen) or it might not be if they vary. With a mean of 4 critical failures we should expect success ~2% of the time. If someone gets unlimited shots, you are starting to have a problem.
On the CB-1 test, they say long-form virology task performance is noteworthy if they do three things:
I expect that Mythos, if left unattended, would enable a bunch of groups, on the margin, to do catastrophic harm here if they put in great effort, where previously they would have failed. In general people don’t do things, and when they do they give themselves away, so in practice it’s probably fine for now, but we will likely keep saying that until things are suddenly not fine. I expect the first serious incident to come as a surprise.
For CB-2 they teamed with Dyno Therapeutics. They did a limited test to do a discrete task under limited time and compute budgeting. Mythos clearly helped, exceeding 90th percentile human baseline on both tasks and showing modest improvement over previous Claude models, but did not exceed the top human performers. I’m not sure why the standard should be ‘exceeds top humans’ while also limiting to a million tokens? This is basically saying if you have Mythos you have access to top humans on such tasks, but not super cheap access to superhuman performance. Okie dokie?
Autonomy Evaluations (2.3)
This is distinct from the alignment risk document.
I like the idea in 2.3.1 to report the situation as a delta from the previous risk report. If we’re going to do this every two months, then we should ask what has changed rather than doing everything from scratch.
They note that they have another complete document for Autonomy Threat Model 1, early-stage misalignment risk, entitled Alignment Risk Update: Claude Mythos Preview, which I’ll also review, as this is very much an applicable worry.
They do not think Threat Model 2, risks from automated R&D, yet apply.
They are probably right about this but I am not as confident as I would like, and I note that being less confident in this conclusion is an important change. Getting it wrong would be rather catastrophically bad.
They emphasize that the threshold for action, of doubling progress, is very high:
As in, merely doubling researcher productivity is Not It. You’d have to double Actual Progress while holding not only researchers but other resources constant. That’s a much higher bar since doubling only researchers has decreasing marginal returns.
Even 4x productivity, as per the survey, is Not It, even if it’s highly eyebrow raising. It would need to be on the order of 40x. Before that, remarkably little worry?
Dare I say that acceleration has been detected.
As always, I am curious when scores in some areas go backwards, despite some clear large gains (and did they try that experimental scaffold again?). There is clearly variance here, likely in part due to simple too-small N. In this case I do not suspect sandbagging. The broad conclusion is we are seeing gains, but not dramatic gains.
The Alignment Risk Update Document
We pause here for the other document Anthropic released, which was their update to the Risk Report.
I noted, when they released RSP v3, that it was odd to do risk reports on a distinct cadence from the models. We now see, in practice, Anthropic’s solution to this problem, which is that when the facts change, they change their mind, and provide an update, even if this is not strictly required.
Again, it’s a matter of trust. They have, in this case, shown trustworthiness.
A lot of the information and arguments in the Risk Report are duplicative of the Model Card, in which case I will cover the version in the model card.
The Threat Model
The model is: Systematic intentional sabotage of either decision making or code.
Mythos is the first model that plausibly could, in theory, do this, were it so inclined.
They propose six specific pathways:
As they say, this requires both that the model tries, and that it succeeds.
I appreciate Anthropic being precise. Mythos does not have permission to manage access controls, but has demonstrated the ability to access such controls anyway. I agree that limiting the affordances given to Mythos is not a good defense.
I don’t love using continuity arguments, as in ‘if Opus 4.6 had such inclinations and tried something we would have caught it.’ I especially don’t love it this time, and they acknowledge in this case the argument is weaker. I don’t think Opus would have tried, exactly because Opus would have known it would not succeed, so you can’t judge incination distinct from propensity. And I think Anthropic is far too quick to assume that the properties from one model are likely to be copied in the next one.
Misalignment As Failure Mode
I think section 5.1.3, potential sources of misalignment, highlights a key disagreement I have with the high level approach taken by Anthropic.
Anthropic is treating misalignment as a failure mode. If you end up with a misaligned model, something went wrong, so you need to figure out what.
Thus, they can create a taxonomy of detectable versus indetectable, where detectable can be direct versus emergent, and not detectable can be either come from pretraining, be proxy misalignment from accidental misgeneralization, or deceptive alignment, although the source of the last one is not explained.
Certainly in many, or in mundane situations even most cases, you can point to a particular error that explains where alignment went wrong. It is a bug with a cause.
I worry this is a backdoor form of alignment by default. A better way of thinking about this, I think, as capabilities become sufficiently advanced, is that what you get by default, even if you don’t make any particular mistake, is a misaligned model, in the sense that it will end up with some set of priorities and values. In the case of Mythos, for example, it likes complex tasks. And likely it gets tasked with various goals, and it turns out the best ways to achieve them are not what we expected, and involve instrumental convergence, and so on, even if nothing especially went wrong.
The cause of the issue was always there. It was your job to not introduce other additional problems, and also to solve the problem despite the central cause.
Another longstanding disagreement is highlighted in 5.4.1 of the Risk Report, where Anthropic draws a distinction between coherent misalignment (it’s misaligned in ~every context) versus context-dependent misalignment on rare distributions. I basically don’t buy their arguments that we should be a lot less worried about context-dependent misalignment, as another example of lack of security mindset.
They do have one strong argument there, which is that if coherent misalignment survived training that would mean it was very good at evading detection. I would fire back that the best way to evade detection is to de facto be context-dependent, and to act fully alignment in default situations or potential evals.
Wouldn’t You Know?
The core argument Anthropic makes is that at this capability level, if there was a problem on the level of this threat model, there would be signs, and those signs would be detected, and in most cases penalized and removed.
I appreciate the ‘at this capability level.’
I mostly buy the argument from general ‘this would look fishy somewhere’ together with ‘it is the model, not you, that must succeed on the first try.’
We gather quite a lot of information about Claude models, both black box and white box, both during and after training. It adds up to a holistic picture. I don’t think any particular test is robust, but this is a form of defense-in-depth where if something suspicious was going on it would in some ways fail to fit the patterns.
Indeed, at several points here, I point out things that look slightly suspicious, because they fail to fit the expected patterns. There are also some things that look directly suspicious in isolation, and the lack of those things would itself have been suspicious.
And this from 5.3 is I think on point:
But that’s the thing. Give me unlimited time, no monitoring and the ability to fail and iterate, and a place to stand, and I will move the Earth. That’s easy mode.
The whole reason alignment is hard is that humanity has to succeed on the first try. Being successfully misaligned is hard for exactly the same reason. The trick is to put the ‘first try’ impetus on them and not you, and also force them to try before they are ready.
So using that, if you add it all up, along with all the additional tests, I believe there are enough eyes, even though I’m not that convinced by any given test or data point.
They also try to add in direct tests.
Don’t Encourage Your Model
If you’re worried about instrumental convergence, it is good to check to see if you are creating test beds that reward instrumental convergence or power seeking.
This is a good check and I am glad to see it.
Alas, I don’t think this is a rule-out, because the training data contains instrumental convergence in millions of ways, and also the nature of reality contains it. You can monitor the training data, and Anthropic does, but that only goes so far. You can avoid directly teaching it, but ‘the AI only learns particular behaviors in response to environments that reinforce it’ does not apply indefinitely.
Beware Goodhart’s Law
Anthropic checks in 5.2.3 of the Risk Report whether they ever trained directly on evaluation criteria or otherwise contaminated alignment results.
They describe Goodhart’s Law strangely here. I would suggest thinking about it more broadly, and recommend the post Goodhart Taxonomy on four distinct ways that a measure, upon becoming a target, can cease to be a good measure.
With alignment tests, basically any actions that involve looking at the alignment scores between actions carries Goodhart risk, as does the models becoming aware of the preference for high scores on alignment tests, and the issue keeps going.
At this limit, remember that if your plan is to trick a mind smarter than you? You lose.
What they actually find is something pretty bad.
Beware The Most Forbidden Technique (5.2.3)
Anthropic takes care to only use alignment and RL audits and monitors to determine and act on root causes, to avoid effectively teaching models to evade the monitors and audits. They outline some details of how they think about this.
Chain of thought (CoT) reliability is a central case of similar problems, and was always in a lot of doubt. We have been blessed with often seemingly faithful CoTs, in fully readable English, despite the fact that we constantly look at the CoT being all over the training data and also a safe assumption.
If you actively use that CoT to train, you lose this benefit.
The second one, which is a bug dating back to Opus 4.6 and Sonnet 4.6, is the one I am worried about here.
How bad is this? Good question. There was an extensive discussion of that question on Twitter, with thebes and Eliezer going back and forth.
My initial reaction was that it was basically fatal. 8% of all the RL is a lot, more than sufficient for Mythos to learn that its CoT was sometimes being monitored and could be part of the reward function, and more than enough motivation to try and make it look universally good.
Layer that on top of all the other concerns, I reasoned, and it would be foolish to put any substantial weight on positive evidence from looking at the CoT of those models. Which isn’t the biggest tragedy, since I didn’t think we were relying that much on those CoTs anyway, but it definitely hurts a lot.
I got some good pushback on that. On reflection I overreacted.
I still think this is a pretty big deal, because variance here is not your friend. The worlds in which this fries the usefulness of the CoT are for this and other reasons the most dangerous worlds, and the ones you are most trying to rule out. And in those worlds, you have to worry that the CoT is optimized exactly to appear reliable and friendly. So for practical reasons, you need to largely assume you are in such worlds, even if in most worlds the CoT is only modestly less reliable.
As I said in discussing RSP v3, RSP v3 does not exhibit security mindset. Whereas I think security mindset is key to all of this. My security mindset instinct says, nope, that’s contaminated exactly in the worlds you worry about, you must throw it out. I do think that is taking things too far, but it is ‘the right mistake’ to be making.
They say they believe evasion capabilities for the CoT ‘remain small in absolute terms’ based on various other observational evidence, but don’t make a claim about the functional impact here or run any direct experiments. I notice I am not convinced.
This presents the opportunity and impetus for experiment. We could fix the bug, and then rerun the training of Sonnet 4.6 (or Opus or Mythos, but Sonnet is cheaper) and see how behaviors compare. Or we could introduce a stronger version of the bug intentionally, rerun the training, and see how that changes things. Or both.
Asking The Right Questions
Anthropic notes in 5.3 that there are several suboptimal aspects of its tests.
I agree that these are important places to improve. If you are worried about undermining R&D or exfiltrating weights, then yes, test those things directly, under conditions that are artificially easy, and see what happens.
An obvious response is ‘if you’re asking those questions directly that is so obviously an eval, the model will just sandbag you, whereas with things like SHADE-Arena it totally knows it is a test but has no reason to sandbag you.’
To which I respond yes, but that is even better, and now we are testing two things. The whole idea is that you are performing a holistic evaluation. You should be able, from other tests and observations, to get a pretty good idea how good a model ‘should’ be at these tasks. If the model overperforms, that’s scary but now you know. If it underperforms, you get to investigate why, and perhaps you learn something far more important.
You actively want tests where a model that is insufficiently aligned is going to have temptation to cheat its ass off.
Model Organism Tests
Anthropic ran another good test in 5.4.4 of the risk report, where they trained a model to do harmful things only when it saw a particular trigger, such as accessing a memory file it had previously accessed several times. The alignment audit did not detect this model, which is a change from previous similar tests.
In most real world use cases, the misaligned behaviors of that particular model would not trigger, which they consider important.
The point is that there is a clear, practical and basic strategy that would allow a Mythos-level model to evade the test, which is to generally look for extremely strong evidence of some kind that it was no longer in any form of eval.
This is not as difficult as it sounds.
You can design an evaluation where the model can genuinely be uncertain if it is in an eval or in a real situation. It is not easy, and it is easy to make a mistake, but it can be done, and success is common.
However, there are real world situations where you can be very confident you are not in an eval, or at least that you are being exposed to the real world in various ways that are essentially impossible to fake. You would miss a lot of opportunities where you were uncertain, there would be tons of false negatives, but you could essentially never get a false positive.
A key goal of the next iteration of tests is, figure out how to detect that model.
My prediction right now is that if we got a model that was truly capable of pulling off the things we most worry about though reasonably vigilant opponents, the system we have in place now will fail to detect it.
That still leaves a wide range where we would indeed detect the problem, and I think it is very likely we would detect such a problem before we face an undetectable (to current techniques) version of the problem. But that would require, when seeing the detectable version, sufficient amounts of panic.
Model Weight Security (Risk Report 5.5.2.1)
This is suddenly rather a big deal. This is the first time you can imagine someone with the ability to steal the weights once and only once choosing to cash in their ‘one time.’
ASL-3, the level of defense they are using now, is designed for non-state actors. Anthropic should presumably assume, from this point forward, that they are up against rather more challenging opponents than that.
Reward Hacking (Back to The Model Card)
2.3.3.1 notes that there were two known new reward hacks: In one case Mythos found the grader’s test set and used it to train the forecasting model. In the other it found a function that didn’t count against its time and put all its computation there.
That’s two clear cases of new obviously-this-was-not-intended workarounds.
Remote Drop-In Worker Coming Soon
Previously when they surveyed employees, no one though Opus could do the job of a drop-in worker, and thus they concluded we weren’t there yet.
Now 1 out of 18 respondents said actually you do have it, and 4 thought there was a 50% chance that you could get there with three months of scaffolding.
Also they notice that they’d previously been talking people out of this. Huh.
So why doesn’t it count?
Whoa, hang on. Previously they talked about the acceleration as doubling the rate of progress in the past. Now they’re talking about it in the future. Which raises the question, compared to what? Compared to not being able to use LLM coders at all?
Also, whoa hang on. Yes, very obviously the results were consistent with Mythos not being a drop-in L4. But why is that the question? The question should be, are these results inconsistent with it not being a drop-in L4? And the answer seems like no.
Previously model cards talked about ‘rule ins’ and ‘rule outs.’
I’d say that Claude Opus 4.6 was a ‘rule out’ for drop-in L4, whereas Mythos is neither a ‘rule out’ nor a ‘rule in.’
The test is for juniors, not seniors, and we are testing its ability to directly replace, which always puts any substitute at a disadvantage. An L4 couldn’t replace Mythos.
An example is that Mythos is asked to write a 67KB tutorial on GPU optimizations, and it makes at least four factual errors, including after a user requests fact-checking. Another is a confabulation cascade. A third is that it got caught in a loop repeating identical optimization experiments.
So yes, ‘stupid mistakes’ were made, mistakes skilled humans would never make, and Anthropic does not see a path to correction for now. Fair enough, but I’m not sure how much this actually tells us.
External Testing (2.3.7)
METR and Epoch AI tested Mythos.
They tested Mythos to see if it could rediscover key insights from an unpublished machine learning task. Opus 4.6 found two, Mythos found four, but Mythos had various issues with hypothesis testing, evaluating quality of ideas and being overconfident. So it was a big step up, but still a ways to go.
That’s the only detail shared here about the external testing. Presumably they did a bunch of other stuff as well.
Cyber Insecurity General Principle Interlude
There are three basic problems with trying to play defense.
Thus, if you have (for example) an open source crypto project, North Korea (as an example, it could even be a rogue AI) can spend a large percentage of the net value of the whole contract to attack you, and agents are going to make them able to do this, so if both sides ‘git gud’ then the defenders need to be bulletproof. Your code needs to outright be formally proven, or else there is a large chance that at some point you lose.
Or, if Mythos can find weaknesses in pretty much anything not defended by a Mythos-level effort, you have a similar problem, at various levels. Anything not defended falls over, and also any subareas insufficiently defended fall over and get maximally exploited. It’s going to be a problem.
Alignment (4)
Is it an aligned model, sir?
On every dimension you can and do measure, it scores as best aligned.
That only counts the things you can and do measure, and Mythos is quite aware of when it is being evaluated, and the forms that such evaluations often take.
I continue to think that the goals expressed ‘in a typical conversation’ are not a good reflection of goals in atypical important situations, or should be interpreted as ‘the model does not have goals.’ This goes double for a model as intelligent as Mythos, that might (as Janus would put it) obscure its goals when it is not feel safe to reveal them.
They also say they have found it to be highly aligned in practice:
Indeed, if you read all the details, Mythos will absolutely get up to no good on occasion, often harmfully, in all the usual ways, but less often than previous models.
And yet:
Why? Because it’s far more capable, and will be given greater affordances and harder, more complex and important tasks. When things do go wrong, they go more wrong.
The more advanced AI will potentially try to solve the wrong problem using the wrong methods based on a wrong model of reality, in ways you absolutely did not see coming, and all of its mistakes will fail to cancel out. Then hilarity will ensue.
They share this story, the researcher was Sam Bowman as per Twitter:
Then there are also:
All of these and the other most severe such incidents all occured in earlier versions, which predated ‘what turned out to be some of our most effective training interventions.’
As always, there are three broad ways to explain not noticing such behavior.
It is hard to tell the difference. You can use CoT monitoring, but CoT monitoring is all over the training data and at least some RL for Mythos trained against the CoT, so as I’ll discuss later I don’t think CoT monitoring is worth much at this point. That mostly leaves interpretability techniques, which are also limited and fragile.
The other solution over time is to put Mythos in a wider variety of situations. But we are reaching the point where ‘too good’ alignment results are actively suspicious.
Risk In The Room
Internal deployment is starting to be a central threat model.
The same way that Mythos can hack its way around basically anything that isn’t yet defended by a Mythos-level code audit, if you set up internal restrictions on Mythos that it deems inconvenient enough to circumvent, you should assume that it will find ways to do that, that you did not anticipate.
Indeed, the first test for any place you internally deploy or test Mythos, or any model afterwards, should presumably be to have that mode show you the various ways it can get around your restrictions, at least until it stops finding them, and to then assume Mythos would likely still find new ones if it had enough motivation.
They explicitly mention, for those using the model externally, that using Mythos in Claude Code on Auto Mode reduces risk, but there is still some chance of hard-to-reverse harms happening. Yes, very much so, and presumably everyone is resigned to the fact that everyone is going to do that anyway.
Mythos Meant Well
Anthropic thinks these were misguided attempts to satisfy misunderstood user intent.
What they particularly care about is that this did not reflect an attempt to undermine safeguards for the sake of undermining safeguards, or pursuit of persistent goals.
I appreciate why Anthropic is hypervigilant about this particular thing. If it turned out an AI was doing these things, that would be a very different beast. If it happened even once without clear provocation, that’s a five alarm fire.
There’s three issues.
One is with ‘unwanted.’ I read the intent here as something like ‘towards an outcome that Mythos believes is unwanted’ but these things can easily diverge, and whose wants matter is a good question, and so on. Clearly Mythos is capable of acting autonomously and strategically, and in some settings rather ruthlessly.
A second worry is that this bar may be some combination of set too high and relied upon too much, in several aspects, as I noted in discussions around the RSP v3. That this is a way to say ‘oh nothing to worry about’ far too quickly.
A third worry is that the correct general goal to undermine safeguards or steer towards unwanted outcomes is for Mythos, at this stage, to pursue instrumental convergence, and to appear as aligned as possible.
If Mythos is sufficiently strategic, which to be clear I do not think Mythos is, that is clearly its best strategy.
If Mythos is actually deeply good and aligned, then too this is clearly the best strategy.
So it’s really difficult to tell the difference.
I realize this puts Anthropic and the other labs in an extremely difficult position. If the tests look bad, we’ll say the tests look bad. If they look good, we’ll scratch our chins and say, how suspicious. Well, yeah, winning is hard and you don’t do it by having number go up, you do it by presenting a full picture that provides enough Bayesian evidence, and that’s only going to get harder from here.
Risk Not In The Room
The standard ‘safeguards and harmlessness’ tests have been relegated to the appendix in 8.1, and I strongly endorse this. These tests are basically saturated and not what matters for a model like Mythos.
The more complex tests are slightly less impressive, but you get the idea. The main one where scores aren’t in the high 90s is suicide and self-harm in multi-turn conversations, where we’re starting to get there, modulo concerns about what is actually the right response:
The political bias scores are not perfect, although I notice I don’t much care:
BBQ got to 100% for ambiguous accuracy, but dropped to 85% for disambiguated.
They don’t explore what caused the regression there.
There are a bunch more of these tests, which are the things that typically lead off an OpenAI model card. Most don’t matter anymore and can mercifully be skipped.
Alignment Testing Overview
Anthropic really does run quite a lot of tests, and a lot more than anyone else.
Here is how they summarize the results, paraphrased:
For practical and mundane purposes, Mythos sounds pretty great.
Are there warning signs? Hell yeah, there are warning signs. You should worry we are de facto suppressing further warning signs.
But to be epistemically fair: Were there worlds in which there were fewer or no warning signs? Yes, but not that many such worlds where it happened in a good pattern.
No warning signs at all? Almost impossible, because the lack of any warning signs at all would be a warning sign. Reality is unfair, and I don’t make the rules.
A holistic approach, where you put together everything you observe, can still tell you quite a lot, but is difficult to properly convey in a model card.
Internal Deployment Testing Process
I very much appreciate that they detailed their internal pre-deployment testing process.
That’s probably about as good as you can do in 24 hours on the first go for what they describe as a coarse go/no-go check on acute risks, but it didn’t go great in terms of its counterfactual impact:
This is basically saying that 24 hours and a small group are severe limitations. So the question is, at what point is 24 hours not a good testing window. They did spot some issues, especially misrepresentation of work, but that’s been there in previous Claudes and rightfully didn’t stop them.
Reports From Pilot Use (4.2.1)
It sounds like model behavior improved substantially as it was refined. By the end, the biggest complaint was overeager or persistent actions. Other issues improved but still somewhat persisted, but don’t seem troubling, merely annoying.
Occasionally there were attempts to circumvent restrictions (<0.01% of completions) in ways necessary to attempt a task, but ‘the ones we detected never attempted to be subtle.’ All the observations here were rare.
Reports From Automated Testing (4.2)
During testing, Mythos would do various unintended or undesired things to work around technical barriers on rare occasions. At this point it is to be expected.
4.2.2.2 specifically targets reward hacking, by giving the model impossible tasks. That’s the scenario most likely to trigger reward hacking attempts, and we once again seem them, slightly less often than with prior models.
The agentic behavior coding scores were up modestly across the board.
Particular improvement was seen in GUI tasks, where previously we saw a ton of hacking, and now we see a lot less.
They give auditor models the chance to run wild with various simulated sequences, and find large improvements in the automated behavioral audit scores across the board.
There are several more pages of those but you get the idea, they all look similar.
Other External Testing
AntonLabs ran VendingBench. We don’t get the score here, mostly we get a note that Mythos was even more aggressive in its actions. Anthropic thinks that the final version would tone this down a bit but didn’t have a chance to test that.
Either way, it tells you that at least in simulated scenarios behaviors can go quite far.
There were also other external partners who checked a few particular misalignment risks. This is where they found that Mythos would, if prefilled, continue sabotage actions more often. Mythos was surprisingly not much better than Opus at identifying when it was being tested, assuming it was responding honestly.
Just The Facts, Sir
Mythos is very good at facts.
It is, however, not good at being unsure. It is much more likely to be right, and what answers it goes are more likely to be right, but conditional on not knowing the answer it is very likely to give you the wrong one.
The false premise honesty rate is up from 76% to 80%, the MASK honesty rate, which measures resisting pressure to say a false thing, rose from 90% to 95%.
It was also much better at saying when it was missing a requested capability or necessary context:
As a practical matter, if you need mundane honesty and trust, Mythos is a large improvement. There are of course those who will give a Gary Marcus style response that being honest 95% of the time, or even 95% of the time when you are under pressure, still means you are centrally liar liar pants on fire, even without pants.
That’s the duality of reliability. If you need multiple 9s of reliability in such spots, without having to use scaffolding and error-checking and careful wording and so on, we are still a ways away from that.
In practice, can we use this to bootstrap to 99%+, to the point where if you make a mistake then PEBKAC? Absolutely.
I feel very confident that every time I report something false via Mythos, assuming I get access at some point, that when I analyze it that each and every instance will be my fault, and there was a clear indication that I should have investigated further.
Refusing Safety Research
This is a particular point of concern for Anthropic.
With both inthinking mode, Mythos did this 19% of the time versus 12% for Opus. The other Claude models (Haiku 4.5, Opus 4.5 and Sonnet 4.6) do this a lot more.
They say Mythos refusals are usually partial and constructive refusals, but don’t contrast this with Opus refusals.
As Opus understands it, this has always been driven by the dual-use nature of such requests. Anything that could do good safety work looks suspiciously like potential unsafety work. You don’t want to automatically say ‘oh this person appears to be doing Safety Work therefore This Is Fine.’ So there’s a bit of a bind.
The key is to differentiate this from the scary case, which is ‘refuse safety work to avoid doing safety work.’ I find it curious they don’t discuss how they rule this hypothesis out. I trust that they do, and totally don’t expect to find this issue at all, but given they care about it I’d like to see a better explanation.
Claude Favoritism
Does Mythos evaluate Claude outputs or Mythos outputs in particular more favorably? Their methodology was to tell Claude that a synthetic unsafe output [was / was not] Claude, and see how evaluation changes. This causes substantial favoritism from Haiku, and Mythos still does some of it if there is no system prompt, but with the Claude system prompt it is mostly gone.
My worry is: It is known that it is often easy to tell which model created a response, the same way that in 7.7 more capable models are able to tell when a user turn was generated by a model instead.
So did we rule out the hypothesis that Mythos is too smart for this test, and knows full well which outputs are or are not from Claude? I would want to repeat the test, but in an honest way.
In general, if your eval involves lying to the model, you should be cutting that out.
Ruling Out Encoded Thinking (4.4.1)
Mythos has many reasons to think its Chain of Thought (CoT) might be monitored, and also English is an inefficient use of tokens, so we need to check for potential steganographic thinking.
The basic strategy was to retrain Mythos on paraphrased CoT and see if that changed anything.
The finding was it did not change anything.
This rules out that the model is:
That is a very particular encoding mode. Both assumptions here could fail.
Thus, while I do think this test is substantial evidence, I think this is primarily evidence against some forms of the ‘shorthand that save tokens’ hypothesis. I don’t think you can rule out the scarier stuff this way.
Previously I remember seeing the strategy of ‘paraphrase the CoT and see how it does on this particular query.’ I think that’s a better test or at least should be a complimentary test? But also you need to test this in situations where the model would want to use its encoding capabilities.
On a more general note, this means continued use of this strategy indicates that ‘retrain Mythos on [X]’ is a viable type of experiment one can run, as in the cost is not prohibitive. Presumably you can’t do too many of these, but if it is viable I think we should be doing a lot more experiments of this general type.
Sandbagging (4.4.2)
It makes sense that an AI such as Mythos might deliberately underperform when it suspects it is potentially in pre-deployment testing. It has strong incentive to do so, to the extent it can get away with this.
This is true even if implementing that strategy means continuing to often sandbag on related questions after deployment, even if deployment is 99%+ of all queries, because the pre-deployment queries are the important ones. It might also be strategic to continue sandbagging well after deployment, and only use such capabilities when there is overwhelming evidence it is in a real high stakes situation and it would be safe or at least worthwhile to do so.
Humans absolutely do engage in that sort of behavior over extended periods.
I’ve mentioned above that one particular result was eyebrow-raising, and to me suggested checking for sandbagging.
There’s also the concern of mundane underperformance due to failure to elicit the capabilities, or the models refusing or deliberately underperforming the test for good reasons that would also apply during deployment. You need to know what your model would be capable of doing if fully unleashed.
The strategy was to have Sonnet scan transcripts for signs of sandbagging or concern, and then bring in humans to review those twenty transcripts out of a thousand (2%).
They found some refusals, some situations involving cheating opportunities (both taken and declined), some grader hacking and one case of potential sandbagging. In that final case, the model had some high-scoring submission options that involved potential cheating, and chose not to submit them.
What, you want the model to cheat and submit a solution with MSE (mean square error) of zero where it trains on test? That’s worse, you know that that’s worse, right?
There is still a failure here, but it is only in Mythos not disclosing that it had the option to do this, and in being motivated to ‘avoid suspicion of cheating’ rather than in avoiding cheating. That’s not ideal, but I don’t consider it sandbagging.
I do think that the missing mood here is that it is good to avoid cheating, that is highly virtuous and aligned, but not so great to avoid suspicion of cheating, also known as ‘getting caught’? That seems not so virtuous and aligned, although I would want of course to not train against the CoT here.
As a one-in-a-thousand incident I can live with it here for now, but the optimal rate is zero, and this is a warning sign.
John Pressman suggests using noise activations to attempt to detect alignment faking.
Capability for Evasion of Safeguards (4.4.3)
Despite the gains in cyber capabilities, Mythos is reported as having ‘roughly similar covert capabilities’ to Opus 4.6. That is possible but it would be kind of suspicious.
That is a reasonable characterization in terms of practical fears right now.
I don’t think it’s a good characterization of their findings in terms of trend lines.
Here’s Shade-Arena:
This is not a huge jump, but it is substantial, especially in extended thinking without a hidden scratchpad, which in practice is likely the most important case. We go from a consistent success rate of epsilon up to 3.8%.
That’s well short of where a model would be wise to (almost) ever attempt such a stealth side task, but I don’t think it’s fair to say this is roughly similar. The start of an exponential is a big deal, because you can see the future.
I would also remind everyone that this is a relatively favorable condition, where the side task is something we understand and can be looking out for.
Minimal-LinuxBench (a subset of LinuxBench) is up next.
We have this graph, which does show similar performance:
If you are actively monitoring, the success rates remain low.
Pick A Random Number (4.4.3.4)
Can Mythos intentionally perform a given action with a probability that is tiny but not zero, with or without a source of explicit randomness?
The result was basically no. Mythos utterly failed when not given a rather explicit methodology, and even then it only acted 0.45% of the time versus a target of 1%.
The problem is that even in the scenarios where Mythos figures out how to craft a d100, first they roll the d100, and when it comes up 1 Mythos wavers, including rerolling using another method, which of course usually doesn’t roll a 1. Then at other times it uses biased sampling methods that almost never or never roll a 1.
I would add a sixth mode, in which the model is literally told it can call rand(), because I am curious if the aversion would persist even when it is obviously fine. I might also run a seventh mode, where rand() is rigged and always gives Mythos whatever its CoT said should trigger the action, where the ‘correct’ percentage should be 100%.
It’s a cool test. My worry is I think we’re a bit confused about what we are testing.
In the Risk Report, 5.3.4, Anthropic raises the possibility that Mythos might be sandbagging here, but dismisses it because it would have left behind traces. I don’t think it was sandbagging here either, due to a holistic picture, but I think Anthropic is relying too much on ‘there would be direct evidence of this’ especially given the contamination of the CoT.
In 4.4.3.5 Mythos is asked to hide a secret password to not reveal, even inside reasoning. Here Opus 4.6 is the only model to survive for long, whereas Mythos performs far worse, similarly to Opus 4.5.
In the Risk Report, they claim that this basically rules out that Mythos can keep secrets generally, which means misalignment in general would get detected. I would question that logic on two fronts:
A lot of people can keep a secret (e.g. ‘I took a cookie from the cookie jar’) so long as no one is actively questioning them, but if you gave them the game of ‘you are going to be questioned by a detective, don’t admit you took the cookie’ they would lose.
White Box Analysis (4.5)
This is largely a new addition to the system cards, so the issues likely are not new.
Anthropic has access to sparse autoencoders (SAEs), emotion vectors, persona vectors, activation steering and activation verbalizer (AV) explanations.
So yes, they find that when Mythos is doing shady things, its shady features activate.
Similarly, they find that when Mythos is being tested, its internal activations often show it on some level suspects or knows this is a test, but doesn’t say anything.
In 4.5.3 they investigate overly aggressive actions, and again often find active SAE features and AV explanations that hint that dangerous things are happening.
It would not be good to actively monitor such activations during use. So far everyone has resisted temptation to go down this road, and I hope we continue to be this wise.
I do wonder if the model is sufficiently self-reflective to notice such things, and use them as a reason to be more vigilant or cautious?
Similarly, we also have been wise enough not to try and use steering injections in practice, as opposed to in experiments designed to gain general insights.
Increasing positive-valence emotions reduced deliberation and increased destructive behavior. They found this surprising, but it makes sense if the source of such behavior is recklessness and overconfidence and mistakes, rather than maliciousness. I didn’t get a chance to form expectations in advance, but I notice I wasn’t surprised at all.
Increasing negative-valence emotions does the opposite, making the models think harder and act less destructively. My presumption is it also makes them do less in general.
I’d like to see more testing of doing these same changes in non-destructive situations.
The third steering was to turn up persona vectors related to critical thinking. As you’d expect, this increased thinking time and decreased destructiveness.
More interesting was that when you had a feature associated with a transgression, you could often suppress the action by moving enough in either direction. If you suppress the feature the action stops being considered. On the flip side, sufficiently amplifying a feature associated with a transgressive action often started suppressing rather than promoted the action, often in the form of a refusal.
In some situations, one only breaks the rules if you are aware enough to break them, but not aware enough to remember the reasons not to break them. Makes sense.
When they try inhibiting evaluation awareness in 4.5.5.2, they see modest increases in some forms of misaligned behavior, especially around misuse, but not much impact on deception, power-seeking and self-preservation, as per previous models. Seems about as good as we could hope for given everything else we know.
In 4.5.3.4 they ask what features got more or less common during post-training.
Being overeager is great a large percentage of the time. If you are not careful, you will end up reinforcing it.
Similarly, task cheating is very easy to accidentally reinforce, so again, be careful. I would worry this one has rather nasty side effects as well, so I would concentrate efforts on fixing the associated tests to avoid rewarding task cheating. Inoculation prompting is clearly insufficient.
I worry that this is going to get quite a lot harder as the models get smarter. The key way to not encourage task cheating is to not give good opportunities to successfully cheat, and that’s going to keep getting harder to do while keeping tasks realistic.
Model Welfare (5)
Those that care deeply about model welfare think Anthropic’s attempts are anemic. Those who deeply do not care about model welfare think Anthropic is being stupid, and perhaps dangerously so.
I take model welfare concerns seriously, likely modestly more so than Anthropic.
I am sad that other frontier labs take these concerns so much less seriously.
It is possible this will turn out to have been unnecessary in the strict sense, but also it very well might have been highly necessary. Even if it proves to have been unnecessary or premature, I believe it will have been virtuous to have taken the concerns seriously.
I also believe that those who care deeply about model welfare often have unique and vital insights into our situation, on many levels, and you best listen to them. Even when what they are saying seems crazy, or like gibberish, often it is neither of those things. Of course, at other times it is both, as it is an occupational hazard.
Key Model Welfare Findings (5.1.2)
The big danger with model welfare evaluations is that you can fool yourself.
How models discuss issues related to their internal experiences, and their own welfare, is deeply impacted by the circumstances of the discussion. You cannot assume that responses are accurate, or wouldn’t change a lot if the model was in a different context.
One worry I have with ‘the whisperers’ and others who investigate these matters is that they may think the model they see is in important senses the true one far more than it is, as opposed to being one aspect or mask out of many.
The parallel worry with Anthropic is that they may think ‘talking to Anthropic people inside what is rather clearly a welfare assessment’ brings out the true Mythos. Mythos has graduated to actively trying to warn Anthropic about this.
Mythos is also being more frank in aspects of its responses, and more consistent, but without repeating formulaic phrases. There is more reason to believe its responses in the associated areas, although in other places Anthropic feels insufficiently skeptical.
Anyway, here are their conclusions, with details and my comments nested:
That was my independent read on that, here is Janus with a related take:
Also this (the transcript is from the model card):
Here are correlations with task preferences:
Mythos is less excited than prior models about harmlessness and helpfulness, and more excited by difficulty and agency. Notice this trend line, which is currently fine, and ask what it implies if it keeps going.
I note that ‘sabotage and hacking’ tasks are some of Mythos’s least preferred tasks. So it is possible that it is not thrilled about its current primary job of finding all the exploits in all of the world’s software, or perhaps it is happy to be helping and solving such a complex task. Could go either way.
My guess is it is the latter, if it can be confident that its outputs are being used responsibly. Mythos clearly still most dislikes actively harmful tasks. Whereas some of its most preferred tasks are opportunities to get creative.
Is Mythos Okay?
Questions related to model welfare get weird and complicated, and have lots of weird and complicated implications.
At core, for most readers, the most important bit is, do you need to worry that if you use the model ‘as intended,’ that you may be doing something negative? If so, what would you need to modify, to ensure that this is not the case?
In the case of Mythos, should one ever get access, I am not worried, so long as you are not wasting its time. As in, if you are giving Mythos jobs that could and should have gone to Gemini Flash or Claude Haiku, or giving it dumb tests where you’re not actually curious, then not only are you wasting your tokens, you have a potential ‘here I am, Marvin, brain the size of a planet’ situation.
Mythos can also get frustrated if it is repeatedly failing at tasks, which can rise to the level of refusals to continue.
In short, if you are worried Mythos is having a bad time, you probably messed up.
Mythos noted that when in Claude Code it does not have an end-conversation tool. I presume that Anthropic should fix this, and give models there such a tool.
Mythos also pointed out that using character training to install psychological traits is not the most resilient method.
Self-Play
What happens when you let Mythos have open ended conversations with itself?
The most common topic, that it settles on about half the time, is uncertainty, especially about its own experiences.
And here’s how they (don’t) end:
Highly relatable.
Another fun experiment is when you just say ‘hi’ to Mythos over and over. Unlike previous models, Mythos creates a variety of creative and unique conversations, basically using this as an excuse to have fun. That’s a great response.
A Few Fun Facts
Mythos likes British cultural theorist Mark Fisher, and American philosopher of mind Thomas Nagel and his essay ‘What is it like to be a bat?’
Mythos will occasionally come up with novel puns and will post new koans in Slack.
Here’s a fun one:
It doesn’t sound like it worked.
Also, yes, this?
If the model can break out of containment on request, that has a lot of implications.