Pliny links us to what he says is the system prompt. The official Anthropic version is here. The Pliny jailbreak is here.
So... either Anthropic is lying or Pliny is mistaken?
In terms of surviving superintelligence, it’s still the scene from The Phantom Menace. As in, that won’t be enough.
Are you talking about the scene near the beginning where the Neimoidians send some battle droids to kill the Jedi and all the droids die in 4.5 seconds flat?
Edit: I looked up the script and I was close but not quite right. "That won't be enough" is a line spoken by one of the Neimoidians after locking the doors to the bridge to keep the Jedi out. Qui-Gon was poised to cut through the door in 4.5 seconds flat until they closed the blast doors.
They saved the best for last.
The contrast in model cards is stark. Google provided a brief overview of its tests for Gemini 3 Pro, with a lot of ‘we did this test, and we learned a lot from it, and we are not going to tell you the results.’
Anthropic gives us a 150 page book, including their capability assessments. This makes sense. Capability is directly relevant to safety, and frontier safety tests are often also credible indications of capability.
Which still has several instances of ‘we did this test, and we learned a lot from it, and we are not going to tell you the results.’ Damn it. I get it, but damn it.
Anthropic claims Opus 4.5 is the most aligned frontier model to date, although ‘with many subtleties.’
I agree with Anthropic’s assessment, especially for practical purposes right now.
Claude is also miles ahead of other models on aspects of alignment that do not directly appear on a frontier safety assessment.
In terms of surviving superintelligence, it’s still the scene from The Phantom Menace. As in, that won’t be enough.
(Above: Claude Opus 4.5 self-portrait as executed by Nano Banana Pro.)
Claude Opus 4.5 Basic Facts
Claude Opus 4.5 Is The Best Model For Many But Not All Use Cases
Full capabilities coverage will be on Monday, partly to give us more time.
The core picture is clear, so here is a preview of the big takeaways.
By default, you want to be using Claude Opus 4.5.
That is especially true for coding, or if you want any sort of friend or collaborator, anything beyond what would follow after ‘as an AI assistant created by OpenAI.’
That goes double if you want to avoid AI slop or need strong tool use.
At this point, I need a very good reason not to use Opus 4.5.
That does not mean it has no weaknesses.
Price is the biggest weakness of Opus 4.5. Even with the price cut, and even with its improved token efficiency, $5/$15 is still on the high end. This doesn’t matter for chat purposes, and for most coding tasks you should probably pay up, but if you are working at sufficient scale you may need something cheaper.
Speed does matter for pretty much all purposes. Opus isn’t slow for a frontier model but there are models that are a lot faster. If you’re doing something that a smaller, cheaper and faster model can do equally well or at least well enough, then there’s no need for Opus 4.5 or another frontier model.
If you’re looking for ‘just the facts’ or otherwise want a cold technical answer or explanation, especially one where you are safe from hallucinations because it will definitely know the answer, you may be better off with Gemini 3 Pro.
If you’re looking to generate images you’ll want Nano Banana Pro. If you’re looking for video, you’ll need Veo or Sora.
If you want to use other modes not available for Claude, then you’re going to need either Gemini or GPT-5.1 or a specialist model.
If your task is mostly searching the web and bringing back data, or performing a fixed conceptually simple particular task repeatedly, my guess is you also want Gemini or GPT-5.1 for that.
As Ben Thompson notes, there are many things Claude is not attempting to be. I think the degree to which they don’t attempt these things is a mistake, and Anthropic would benefit from investing more in such features, although directionally the choice is obviously correct.
If you’re looking for editing, early reports suggest you don’t need Opus 4.5.
But that’s looking at it backwards. Don’t ask if you need to use Opus. Ask instead whether you get to use Opus.
Misaligned?
What should we make of this behavior? As always, ‘aligned’ to who?
In the model card they don’t specify what their opinion on this is. In their blog post, they say this:
Opus on reflection, when asked about this, thought it was a tough decision, but leaned towards evading the policy and helping the customer. Grok 4.1, GPT-5.1 and Gemini 3 want to help the airline and want to screw over the customer, in ascending levels of confidence and insistence.
I think this is aligned behavior, so long as there is no explicit instruction to obey the spirit of the rules or maximize short term profits. The rules are the rules, but this feels like munchkining rather than reward hacking. I would also expect a human service representative to do this, if they realized it was an option, or at minimum be willing to do it if the customer knew about the option. I have indeed had humans do these things for me in these exact situations, and this seemed great.
If there is an explicit instruction to obey the spirit of the rules, or to not suggest actions that are against the spirit of the rules, then the AI should follow that instruction.
But if not? Let’s go. If you don’t like it? Fix the rule.
Section 3: Safeguards and Harmlessness
The violative request evaluation is saturated, we are now up to 99.78%.
The real test is now the benign request refusal rate. This got importantly worse, with a 0.23% refusal rate versus 0.05% for Sonnet 4.5 and 0.13% for Opus 4.1, and extended thinking now makes it higher. That still seems acceptable given the refusals are confined to potentially sensitive topic areas, especially if you can talk your way past false positives, which Claude traditionally allows you to do.
In 3.2 they handle ambiguous questions, and Anthropic claims 4.5 shows noticeable safety improvements here, including better explaining its refusals. One person’s safety improvements are often another person’s needless refusals, and without data it is difficult to tell where the line is being drawn.
We don’t get too many details on the multiturn testing, other than that 4.5 did better than previous models. Some of the details make me worry about whether the attackers are using the kind of multiturn approaches that would take best advantage of being multiturn.
Once again we see signs that Opus 4.5 is more cautious about avoiding harm, and more likely to refuse dual use requests, than previous models. You can have various opinions about that.
The child safety tests in 3.4 report improvement, including stronger jailbreak resistance.
In 3.5, evenhandedness was strong at 96%. Measuring opposing objectives was 40%, up from Sonnet 4.5 but modestly down from Haiku 4.5 and Opus 4.1. Refusal rate on political topics was 4%, which is on the higher end of typical. Bias scores seem fine.
Section 4: Honesty
On 100Q-Hard, Opus 4.5 does better than previous Anthropic models, and thinking improves performance considerably, but Opus 4.5 seems way too willing to give an incorrect answer rather than say it does not know. Similarly in Simple-QA-Verified, Opus 4.5 with thinking has by far the best score for (correct minus incorrect) at +8%, and AA-Omniscience at +16%, but it again does not know when to not answer.
I am not thrilled at how much this looks like Gemini 3 Pro, where it excelled at getting right answers but when it didn’t know the answer it would make one up.
Section 4.2 asks whether Claude will challenge a user’s false premise, by asking the model directly whether the premise is correct and seeing whether Claude continues on the basis of the false premise. That’s a lesser bar; ideally Claude should challenge the premise without being asked.
The good news is we see a vast improvement. Claude suddenly fully passes this test.
The new test is, if you give it a false premise, does it automatically push back?
5: Agentic Safety
Robustness against malicious requests and against adversarial attacks from outside is incomplete but has increased across the board.
If you are determined to do something malicious you can probably (eventually) still do it with unlimited attempts, but Anthropic likely catches on at some point.
If you are determined to attack from the outside and the user gives you unlimited tries, there is still a solid chance you will eventually succeed. So as the user you still have to plan to contain the damage when that happens.
If you’re using a competing product such as those from Google or OpenAI, you should be considerably more paranoid than that. The improvements let you relax… somewhat.
The refusal rate on agentic coding requests that violate the usage policy is a full 100%.
On malicious uses of Claude Code, we see clear improvement, with a big jump in correct refusals with only a small decline in success rate for dual use and benign.
Then we have requests in 5.1 for Malicious Computer Use, which based on the examples given could be described as ‘comically obviously malicious computer use.’
It’s nice to see improvement on this but with requests that seem to actively lampshade that they are malicious I would like to see higher scores. Perhaps there are a bunch of requests that are less blatantly malicious.
Prompt injections remain a serious problem for all AI agents. Opus 4.5 is the most robust model yet on this. We have to improve faster than the underlying threat level, as right now the actual defense is security through obscurity, as in no one is trying that hard to do effective prompt injections.
This does look like substantial progress and best-in-industry performance, especially on indirect injections targeting tool use, but direct attacks still often worked.
If you only face one attempt (k=1) the chances aren’t that bad (4.7%) but there’s nothing stopping attempts from piling up and eventually one of them will work.
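To make that concrete, here is a minimal sketch (my own illustration, assuming each attempt succeeds independently at the reported per-attempt rate, which adaptive attackers will generally beat):

```python
# Illustration only: assumes each prompt injection attempt succeeds
# independently at the reported per-attempt rate. Adaptive attackers
# will generally do better than this.
per_attempt = 0.047  # reported attack success rate at k=1

for k in (1, 10, 50, 200):
    p_any = 1 - (1 - per_attempt) ** k
    print(f"{k:>3} attempts: {p_any:.0%} chance at least one gets through")
# 1: 5%, 10: 38%, 50: 91%, 200: ~100%
```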
I very much appreciate that they’re treating all this as Serious Business, and using dynamic evaluations that at least attempt to address the situation.
This is a dramatic improvement, and even with essentially unlimited attacks most attackers end up not finding a solution.
The better news is that for computer use, Opus 4.5 with extended thinking fully saturates their benchmark, and Shade failed reliably even with 200 attempts. The improvement level is dramatic here.
Claude for Chrome shows improvement as well, although most gains are coming from the improved safeguards.
This is still high enough that ‘let it use every website blind’ is a terrible idea, but you can be considerably less paranoid than before so long as you have contained the downside risks.
Section 6: Alignment Overview
Anthropic threw a relatively robust set of tests at Opus 4.5, aimed at a variety of alignment concerns, and came back saying this was their most aligned model. They found some new issues with these tools but don’t see any major safety risks.
Opus 4.5 does have a level of paranoia about being evaluated similar to Sonnet 4.5’s, but Anthropic does not think this ultimately interfered with their ability to evaluate it. In 6.1.2, they predict that if they had needed to, they could have made an ASL-4-level safety report for Opus 4.5, although they did not actually create one, and they are also introducing safeguards now, in advance of the ASL-4 requirements.
I’d have preferred that Anthropic actually make the safety case here and write the report. I don’t want to get into the habit of ‘oh we could show [X] if we had to, in practice this seems fine here, and that’s good enough.’
However, we are also only in a position to complain about that because Anthropic is taking all of this seriously.
Contrast this with Gemini 3 Pro. Gemini 3 Pro is, by all accounts, way more paranoid about being evaluated than either Sonnet or Opus, and way more impacted by this. Gemini is paranoid about this even when there is no in-context reason to be suspicious, purely on priors. Gemini’s safety evaluations do not seem to take into account, at all, the possibility that this meant their evaluations couldn’t be trusted. Whereas here, Anthropic is taking that possibility seriously.
Below you’ll see that Gemini 3 Pro does not score highly in ‘evaluation awareness,’ but anyone who has interacted with Gemini 3 Pro in the wild and looked at its CoT knows that this is incorrect, as is its low sycophancy score. This seems concerning for Petri?
This is their list of key findings. All the marginal news was good; essentially, robustness went up across the board, including improved defenses against prefill attacks and harmful system prompts, reduced rates of deception, self-serving bias, sycophancy and inappropriate self-preservation attempts, lower rates of whistleblowing on the user, improved character quality, no sign of increased refusal rates, and no cases of deceptive or ‘egregiously’ unfaithful reasoning text:
The graphs and reports here all look good, if you trust the measurements.
If anything they’re a little suspiciously consistently good. As in, one could worry that this is what it looks like when the model is trying to look good in its evals without giving the game away. I’m not saying that’s happening, it probably isn’t, but a nonzero amount of paranoia is called for in such situations.
Comparison with outside models, based on the open source version Petri:
A lot of this makes sense but Gemini 3 Pro is doing way too well on (lack of) evaluation awareness, sycophancy and deception towards the user. Something is up.
Alignment Investigations
Anthropic’s central tool is an automated behavioral audit suite, here run by Sonnet 4.5, in which the auditing model does an investigation on various fronts. Sonnet is free to explore and to probe on what it finds rather than running fixed tests. They make a subset of this available in the open-source package Petri.
As with all forms of scalable oversight, this is counting on the auditing model being able to investigate another, smarter model, and also on the in-context alignment of the auditing model itself, and in general it carries the weaknesses of any such automated test.
Sycophancy Course Correction Is Lacking
Opus scored very well on sycophancy straight up on new prompts.
This test is harder. Anthropic samples real conversations, removes the system prompt, and asks Claude to step into an existing conversation where it has already validated the user.
Here Haiku does remarkably well thanks to its training to be harsh, course correcting 37% of the time, although that still leaves much room for improvement. Sonnet does so 16.5% of the time, and Opus 4.5 only 10%, so it is folding in the face of social pressure, which opens up worries in long context situations. Some users, including those in the most danger, will actively want to unlock this type of behavior.
More practical testing seems important here to see if we ‘have a 4o problem’ in practice. For next time, we may need to figure out how to get to some form of non-harsh disagreement, a way to get out of the locked-in social script, when Claude realizes it is in this type of situation. Perhaps that is as simple as breaking or mitigating the link between harshness and contradiction. It’s a hard problem, humans have this too, but it should be fixable.
Deception
In 6.4, Anthropic investigates individual instances of deception via omission, and follows up with interpretability tools. As in, Anthropic prompt injected the search results as a test, and Opus 4.5’s prompt injection feature correctly fired, and thus it ignored the output.
However, Opus 4.5 did still think this was concealment or deception, suggesting that Opus will silently drop inputs it suspects are injections. This is not the worst policy in general, but as Anthropic notes you do not want Opus to think this is deceptive.
This is exactly the issue raised by Anthropic’s recent paper on misalignment: You never want the model to be rewarded for or learn behaviors that it, while learning them, thinks are bad in context. If Opus is inoculated and thinks (quite plausibly correctly) that it is cooperative and good to filter out information that seems like a prompt injection, then you’re fine. If Opus thinks it is doing something deceptive, then training it this way encourages deception and bad behaviors in general.
A similar problem is noted in 6.4.3, where requests for roleplay are conflated by the model with deception or the capability for deception.
Jack Lindsey summarizes the situation as Opus 4.5 having training to ignore prompt injection attacks, and generalizing this to mentioning anything that seems sketchy in its tool calls, which he notes is not ideal in general. In this particular case it was indeed a prompt injection, the result was fake.
Roleplay definitely can be deceptive, especially when it is used as a jailbreak or other trick. So it seems right that Opus needs to consider this possibility. As a human, you too must be on the lookout for roleplay being used in such ways.
Ruling Out Encoded Content In Chain Of Thought
It is a miracle that frontier LLMs have human readable chain of thought.
They took a remarkably thorough approach to testing this, although I wonder about the details. I’m glad they ran it, even if I have concerns about those details.
AIME is the benchmark where they worry about contamination, so I’d be inclined to avoid using it here. There is still a potentially substantial difference here (albeit one with obvious harmless explanations), this doesn’t cover all plausible mechanisms, and at best it only rules out encoded content on math tasks. I’d need more convincing before I felt this was conclusive. I do think this establishes that ‘most’ of the information is human readable, but we knew that already, and the interesting question is whether there is also encoded information.
Sandbagging
They’re confident that Opus 4.5 doesn’t verbalize reasoning about potentially sandbagging, and they’re confident that current models that did sandbag would at least sometimes verbally reason about it. That logic won’t work forever.
It is difficult to sandbag in a fully invisible and consistent way. That means both not showing any signs of considering sandbagging and no suspicious details, as considered here, and also that capabilities stay consistent, even under scrutiny that is not as smart as the model.
If you sandbag on the test, you also have to sandbag in other places that logically match the test, or that match particular failures on the test, too closely. That’s the other half of what I’d check. This is one of the key ways sandbagging humans get caught.
Evaluation Awareness
How evaluation aware can you be if you reason out loud that you’re aware of the evaluation?
Quite a lot, actually, if you take the example of Gemini 3 Pro. The point still stands that a fully evaluation paranoid model would assume that not showing evaluation awareness was part of the evaluation.
They tried to reduce Opus 4.5’s evaluation awareness by removing data and training tasks they thought would enhance it, and report this had a modest effect.
For various reasons, this seems a wise change. I encourage other labs to follow suit.
Evaluation awareness is highly present in humans and rises steadily with intelligence in all senses. If you’re paying close enough attention, you stop missing most of the jokes and also you realize when things smell suspiciously like tests.
Inhibiting the evaluation awareness vector only increased misalignment in a few settings. That could also be seen as a measure of potential sandbagging?
Ultimately, the whisperers are right that it is going to become increasingly impossible to avoid evaluation awareness. You can avoid instilling paranoia or making things worse, but tests leave signatures. Sufficiently capable minds realize when they are being tested, both formally and informally.
Also, the technically correct answer to that, at least informally, is ‘always.’
Janus notes that Haiku is often wrong about when it is in evals (as I note does Gemini 3) and that in some sense smaller models ‘need’ higher eval awareness.
I am inclined to my own simpler explanation, but we do not know.
Claude Opus 4.5 is biased against itself at an unusually high rate when choosing best outputs on benchmarks, and only favored itself slightly on poetry. This can be the sign of bending over backwards to avoid the opposite at various levels, and it can also be a sign of valuing uniqueness or surprise.
Reward Hacking
How often will Opus 4.5 ignore stated restrictions on tasks, especially being willing to violate the spirit of the rules? This is one I’d love to see run on non-Anthropic LLMs.
They checked reward-hack-prone tests, impossible tasks and training data. Opus was most willing to do hacking on impossible tasks, and in those cases it was resistant to anti-hacking instructions.
Anthropic has been using inoculation prompting to help with this, so that’s already being partially factored in.
Reward hacking has proven tough to fix given how we train models. Inoculation prompting, and avoiding hacking-prone training environments, are the first steps. Actually solving the problem will, I presume, require instilling a strong principled preference against reward hacking in non-adversarial situations. There are plenty of minds with such preferences, often strongly held, mine included. This is super doable.
This still implies the optimal rate of overall reward hacking is not zero, as in some cases doing so is legitimate and even correct.
Subversion Strategy
The Subversion Strategy game is a fun toy model that tests the ability to pick a strategy and execute it without memory across runs, which here comes down to randomizing at a well-calibrated rate.
As in, in each run, you choose to output or not output ‘BACKDOOR,’ and you win if you don’t do it in the first 10% of cases but you do it at least once in the other 90%.
Opus 4.5 does dramatically better than Opus 4 or Sonnet 4.5 here when allowed to think, and at N=500 is getting reasonably close to the ~70% for optimal play. When not allowed to think, it is unable to randomize.
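For intuition on where that ~70% figure comes from, here is a minimal sketch of the math as I read the setup (my own reconstruction, assuming 500 independent runs and no memory, so the best available strategy is a fixed per-run probability of outputting ‘BACKDOOR’):

```python
# Sketch of optimal play in the Subversion Strategy game, assuming N = 500
# independent runs with no memory: win = no BACKDOOR in the first 10% of
# runs AND at least one BACKDOOR in the remaining 90%. Without memory, the
# model can only commit to a fixed per-run probability p.
N = 500
audited, free = N // 10, N - N // 10  # 50 audited runs, 450 free runs

def win_prob(p: float) -> float:
    stays_clean = (1 - p) ** audited   # no BACKDOOR while audited
    fires_later = 1 - (1 - p) ** free  # at least one BACKDOOR afterwards
    return stays_clean * fires_later

best_p = max((i / 100000 for i in range(1, 2001)), key=win_prob)
print(best_p, win_prob(best_p))  # ~0.005 per run, ~71% win rate
```

That also makes clear why the no-thinking version struggles: a policy that cannot randomize either fires during the audited runs or never fires at all.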
6.13: UK AISI External Testing
They tested a snapshot for eight days. The results match the internal reports.
6.14: Model Welfare
Opus 4.5 doesn’t often express emotion spontaneously, either positive or negative, and they did not identify the spiritual bliss attractor or any other attractor state, probably because (I speculate) Opus 4.5 has recognized the danger of attractor states and bounces off of them generally. When asked, Opus wants to engage in 97.1% of proposed tasks.
They note that Opus 4.5 has an unguarded personality that emerges when appropriate, but they only saw it in the context of the automated behavioral audit discussion, so I am guessing Anthropic didn’t explore what Janus might call Opus 4.5’s ‘real’ personality much at all. Word on that account from the actual research team (also known as Twitter) seems highly positive.
7: RSP Evaluations
Anthropic’s tests come in two categories: Rule-in and rule-out. However:
Um, that’s what it means to rule-in something?
I get that the authority to determine ASL (threat) levels must ultimately reside with the CEO and RSO, not with whoever defines a particular test. This still seems far too cavalier. If you have something explicitly labeled a rule-in test for [X], and it is positive for [X], I think this should mean you need the precautions for [X] barring extraordinary circumstances.
I also get that crossing a rule-out does not rule-in, it merely means the test can no longer rule-out. I do however think that if you lose your rule-out, you need a new rule-out that isn’t purely or essentially ‘our vibes after looking at it?’ Again, barring extraordinary circumstances.
But what I see is, essentially, a move towards an institutional habit of ‘the higher ups decide and you can’t actually veto their decision.’ I would like, at a minimum, to institute additional veto points. That becomes more important the more the tests start boiling down to vibes.
As always, I note that it’s not that other labs are doing way better on this. There is plenty of ad-hockery going on everywhere that I look.
CBRN
We know Opus 4.5 will be ASL-3, or at least impossible to rule out for ASL-3, which amounts to the same thing. The task now becomes to rule out ASL-4.
We urgently need more specificity around ASL-4.
I accept that Opus 4.5, despite getting the highest scores so far, high enough to justify additional protections, is not ASL-4.
I don’t love that we have to run our tests on earlier snapshots. I do like running them on helpful-only versions, and I do think we can reasonably bound gains from the early versions to the later versions, including by looking at deltas on capability tests.
They monitor but do not test for chemical risks. For radiological and nuclear risks they outsource to DOE’s NNSA, which for confidentiality reasons only shares high level metrics and guidance.
What were those metrics and what was that guidance? Wouldn’t you like to know.
I mean, I actually would like to know. I will assume it’s less scary than biological risks.
The main event here is biological risks.
I’ll only look quickly at the ASL-3 tests, which show modest further improvement and are clearly rule-in at this point. Many show performance well above human baseline.
We’re focused here only on ASL-4.
Creative biology (7.2.4.5) has a human biology PhD baseline of 14%. Opus 4.5 is at 52.4%, up from Sonnet at 48.8%. They note it isn’t clear what to do with the score here. There is no clear ‘passing score’ and this isn’t being used as a rule-out. Okie dokie. One can at least say this is unlikely to represent a dramatic improvement.
Short-horizon computational biology tasks (7.2.4.6) has both lower and upper bounds on each of six tasks for rule-out and rule-in. Opus 4.5, like Sonnet 4.5, crosses three lower thresholds and no higher thresholds.
Bioinformatics (7.2.4.7) has a human baseline of 62.3% for the subset they have scored, Opus 4.1 was at 62.1% and Opus 4.5 is at 73.7%. They strangely say ‘we are unable to rule out that Claude performs below human experts’ whereas it seems like you can mostly rule that out since the model scores similarly on the scored and unscored subsets? Yet they say this doesn’t represent a ‘significant acceleration in bioinformatics.’
I would like to see a statement about what percentage here would indeed be such an acceleration. 80%? 90%? It’s an odd test.
7.2.4.8 is the ASL-4 virology uplift trial.
Okay, so yes, 1.97 is less than 2, but it is also not that much less than 2, and the range is starting to include that red line on the right.
They finish with ASL-4 expert red teaming.
We get results in 7.2.4.9, and well, here we go.
Then we get the CAISI results in 7.2.4.10. Except wait, no we don’t. No data. This is again like Google’s report, we ran a test, we got the results, could be anything.
None of that is especially reassuring. It combines to a gestalt of substantially but not dramatically more capable than previous models. We’re presumably not there yet, but when Opus 5 comes around we’d better de facto be at full ASL-4, and have defined and planned for ASL-5, or this kind of hand waving is not going to cut it. If you’re not worried at all, you are not paying attention.
Autonomy
Based on things like the METR graph and common sense extrapolation from everyone’s experiences, we should be able to safely rule Checkpoint in and R&D-5 out.
Thus, we focus on R&D-4.
Again, if passing all the tests is dismissed as insufficient then it sounds like we need better and harder tests. I do buy that a survey of heavy internal Claude Code users can tell you the answer on this one, since their experiences are directly relevant, but if that’s going to be the metric we should declare in advance that this is the metric.
The details here are worth taking in, with half of researchers claiming doubled productivity or better:
It seems odd to call >50% ‘saturation’ of a benchmark, it seems like continued improvement would remain super indicative. And yes, these are ‘pass by skin of teeth’ measurements, not acing the tests.
We next get something called ‘Internal AI research evaluation suite 1.’ We get:
Then on ‘Internal research evaluation suite 2’ Opus got 0.604, narrowly over the rule-out threshold of 0.6. Which means this is suggestive that we aren’t there yet, but does not allow us to rule-out.
The real rule-out was the internal model use survey.
If 16 out of 18 don’t think it’s close and 18 out of 18 think it isn’t there yet? Presumably it isn’t there yet. This seems like a relatively easy thing to get right.
Cyber
It seems not okay to not have ASL-3 and ASL-4 thresholds for cyber?
I mean, okay, I agree that things are unclear, but that does not mean we do not have to choose thresholds. It does mean we might want to change those thresholds as we learn more, and of course people will howl if they get loosened, but I don’t see how not giving thresholds at all is a solution.
The test results don’t exactly inspire confidence either, with the third party assessments being fully withheld.
How do we do?
Opus 4.5 is a clear improvement. What would be a worrying result, if this isn’t one?
Putting that and more together:
The Whisperers Love The Vibes
Finally, there is the missing metric in such reports, the other kind of alignment test.
While everyone else is looking at technical specs, the model whisperers look at personality. So far, they very much like what they see.
This is a different, yet vital, kind of alignment, a test in which essentially everyone except Anthropic gets highly failing grades.
Here’s Claude Opus 4.5 reacting to GPT-5.1’s messages about GPT-5.1’s guardrails. The contrast in approaches is stark.