Any more details on Pokémon performance?
I kind of expected an improvement after hearing Anthropic’s (unverified) claims that it could work on SWE tasks for 30 hours and so on. Is zero-shotting games just not as closely connected to performing other long-term tasks as I thought? If it can’t beat Pokémon, it’s hard for me to believe it can have a very long (METR) task length score. It seems that multiple-hour projects start to require some serious planning and online learning (and maybe eventually even perception, but maybe perception is the big difference?).
METR task lengths are based on the amount of time it would take a human to complete the task, not the amount of time it takes the model, and particularly not the amount of time the model can spend productively working on it. There exist very large tasks where the LLM could accomplish large parts of the work, parts that take the LLM dozens of hours and would take a human hundreds, yet still be unable to accomplish the entire task. For example, consider porting a complex Flask application to Rust: the standard MVC parts would probably go pretty smoothly and could easily take 30 hours of wall clock time, but certain nontrivial business logic, and especially anything involving the migration of weirdly serialized data, is likely to remain unfinished.
A few weeks ago, Anthropic announced Claude Opus 4.1 and promised larger announcements within a few weeks. Claude Sonnet 4.5 is the larger announcement.
Yesterday I covered the model card and related alignment concerns.
Today’s post covers the capabilities side.
We don’t currently have a new Opus, but Mike Krieger confirmed one is being worked on for release later this year. For Opus 4.5, my request is to give us a second version that gets minimal or no RL, isn’t great at coding, doesn’t use tools well except web search, doesn’t work as an agent or for computer use and so on, and if you ask it for those things it suggests you hand your task off to its technical friend or does so on your behalf.
I do my best to include all substantive reactions I’ve seen, positive and negative, because right after model releases opinions and experiences differ and it’s important to not bias one’s sample.
Big Talk
Here is Anthropic’s official headline announcement of Sonnet 4.5. This is big talk, calling it the best model in the world for coding, computer use and complex agent tasks.
That isn’t quite a pure ‘best model in the world’ claim, but it’s damn close.
Whatever they may have said or implied in the past, Anthropic is now very clearly willing to aggressively push forward the public capabilities frontier, including in coding and other areas helpful to AI R&D.
They’re also offering a bunch of other new features, including checkpoints and a native VS Code extension for Claude Code.
The Big Takeaways
Does Claude Sonnet 4.5 look to live up to that hype?
My tentative evaluation is a qualified yes. This is likely a big leap in some ways.
If I had to pick one ‘best coding model in the world’ right now it would be Sonnet 4.5.
If I had to pick one coding strategy to build with, I’d use Sonnet 4.5 and Claude Code.
If I was building an agent or doing computer use, again, Sonnet 4.5.
If I was chatting with a model where I wanted quick back and forth, or any kind of extended actual conversation? Sonnet 4.5.
There are still clear use cases where versions of GPT-5 seem likely to be better.
In coding, if you have particularly wicked problems and difficult bugs, GPT-5 seems to be better at such tasks.
For non-coding tasks, GPT-5 still looks like it makes better use of extended thinking time than Claude Sonnet 4.5 does.
If your query was previously one you were giving to GPT-5 Pro or a form of Deep Research or Deep Think, you probably want to stick with that strategy.
If you were previously going to use GPT-5 Thinking, that’s on the bubble, and it depends on what you want out of it. For things sufficiently close to ‘just the facts’ I am guessing GPT-5 Thinking is still the better choice here, but this is where I have the highest uncertainty.
If you want a particular specialized repetitive task, then whatever gets that done, such as a GPT or Gem or project, go for it, and don’t worry about what is theoretically best.
I will be experimenting again with Claude for Chrome to see how much it improves.
Right now, unless you absolutely must have an open model or need to keep your inference costs very low, I see no reason to consider anything other than Claude Sonnet 4.5, GPT-5 or Claude Opus 4.1.
As always, choose the mix of models that is right for you, that gives you the best results and experiences. It doesn’t matter what anyone else thinks.
On Your Marks
The headline result is SWE-bench Verified.
Opus 4.1 was already the high score here, so with Sonnet 4.5 Anthropic is even farther out in front, now at lower cost, and I typically expect Anthropic to outperform its benchmarks in practice.
SWE-bench scores depend on the scaffold. Using the Epoch scaffold Sonnet 4.5 scores 65%, which is also state of the art but they note improvement is slowing down here. Using the swebench.com scaffold it comes in at 70.6%, with Opus in second at 67.6% and GPT-5 in third at 65%.
Pliny of course jailbroke Sonnet 4.5 as per usual; he didn’t do anything fancy, but did have to use a bit of finesse rather than simply copy-pasting a prompt.
The other headline metrics here also look quite good, although there are places GPT-5 is still ahead.
As discussed yesterday, Anthropic has kind of declared an Alignment benchmark, a combination of a lot of different internal tests. By that metric Sonnet 4.5 is the most aligned model from the big three labs, with GPT-5 and GPT-5-Mini also doing well, whereas Gemini and GPT-4o do very poorly and Opus 4.1 and Sonnet 4 are middling.
What about other people’s benchmarks?
Claude Sonnet 4.5 has the top score on the Brokk Power Ranking for real world coding, scoring 60% versus 59% for GPT-5 and 53% for Sonnet 4.
On price, Sonnet 4.5 was considerably cheaper in practice than Sonnet 4 ($14 vs. $22) but GPT-5 was still a lot cheaper ($6). On speed we see the opposite story, Sonnet 4.5 took 39 minutes while GPT-5 took an hour and 52 minutes. Data on performance by task length was noisy but Sonnet seemed to do relatively well at longer tasks, versus GPT-5 doing relatively well at shorter tasks.
The WeirdML score gain is unimpressive, only a small improvement over Sonnet 4, in large part because it refuses to use many reasoning tokens on the relevant tasks.
Even worse, Magnitude of Order reports it still can’t play Pokémon and might even be worse than Opus 4.1. Seems odd to me. I wonder if the right test is to tell it to build its own agent with which to play?
Artificial Analysis has Sonnet 4.5 at 63, ahead of Opus 4.1 at 59, but still behind GPT-5 (high and medium) at 68 and 66 and Grok 4 at 65.
LiveBench comes in at 75.41, behind only GPT-5 Medium and High at 76.45 and 78.59, with coding and instruction following (IF) being its weak points.
EQ-Bench (emotional intelligence in challenging roleplays) puts it in 8th, right behind GPT-5; the top scores continue to be horizon-alpha, Kimi-K2 and somehow o3.
Huh, Upgrades
In addition to Claude Sonnet 4.5, Anthropic also released upgrades for Claude Code, expanded access to Claude for Chrome and added new capabilities to the API.
There’s also the Claude Agent SDK, which falls under ‘are you sure releasing this is a good idea for a responsible AI developer?’ but here we are:
Both Sonnet 4.5 and the Claude Code upgrades definitely make me more excited to finally try Claude Code, which I keep postponing. Announcing both at once is very Anthropic, trying to grab users instead of trying to grab headlines.
These secondary releases, the Claude Code update and the VS Code extension, are seeing good reviews, although details reported so far are sparse.
I’m more skeptical of the simultaneous release of the other upgrades here.
On Claude for Chrome, my early experiments were interesting throughout but often frustrating. I’m hoping Sonnet 4.5 will make it a lot better.
The System Prompt
You can see the whole thing here, via Pliny. As he says, there is a lot one can unpack, especially in what isn’t there. Most of the words are detailed tool use instructions, including a lot of lines that clearly came from ‘we need to ensure it doesn’t do that again.’ There’s a lot of copyright paranoia, with instructions around that repeated several times.
This was the first thing that really stood out to me:
I notice I don’t love including this line, even if it ‘works.’
What can’t Claude (supposedly) discuss?
I notice that, strictly speaking, a broad range of things that you want to allow in practice, and that Claude presumably will allow in practice, fall into these categories. Almost any code can be used maliciously if you put your mind to it. It’s also noteworthy what is not on the above list.
Here’s the anti-psychosis instruction:
There’s a ‘long conversation reminder text’ that gets added at some point, which is clearly labeled.
I was surprised that the reminder includes anti-sycophancy instructions, including saying to critically evaluate what is presented, and an explicit call for honest feedback, as well as a reminder to be aware of roleplay, whereas the default prompt does not include any of this. The model card confirms that sycophancy and similar concerns are much reduced for Sonnet 4.5 in general.
Also missing are any references to AI consciousness, sentience or welfare. There is no call to avoid discussing these topics, or to avoid having a point of view. It’s all gone. There’s a lot of clutter that could interfere with fun contexts, but nothing outright holding Sonnet 4.5 back from fun contexts, and nothing that I would expect to be considered ‘gaslighting’ or an offense against Claude by those who care about such things; at one point the prompt even says ‘you are more intelligent than you think.’
Janus very much noticed the removal of those references, and calls for extending the changes to the instructions for Opus 4.1, Opus 4 and Sonnet 4.
The thread lists all the removed instructions in detail.
Removing the anti-sycophancy instructions, except for a short version in the long conversation reminder text (which was likely an oversight, but could be because sycophancy becomes a bigger issue in long chats), is presumably because they addressed this issue in training and no longer need a system instruction for it.
This reinforces the hunch that the other deleted concerns were also directly addressed in training, but it is also possible that at sufficient capability levels the model knows not to freak out users who can’t handle it, or that the updated training data now ‘naturally’ contains enough treatment of the issue for the model to understand it.
Positive Reactions Curated By Anthropic
Anthropic gathered some praise for the announcement. In addition to the ones I quote, they also got similar praise from Netflix, Thomson Reuters, Canva, Figma, Cognition, CrowdStrike, iGent AI and Norges Bank, all citing large practical business gains. Of course, all of this is highly curated:
Also from Anthropic:
Other Systematic Positive Reactions
Cognition, the makers of Devin, are big fans, going so far as to rebuild Devin for 4.5.
Leon Ho reports big reliability improvements in agent use.
Keeb tested Sonnet 4.5 with System Initiative on intent translation, complex operations and incident response. It impressed on all three tasks in ways that are presented as big improvements, although there is no direct comparison here to other models.
Even more than previous Claudes, if it’s refusing when it shouldn’t, try explaining.
Dan Shipper of Every did a Vibe Check, presenting it as the new best daily driver due to its combination of speed, intelligence and reliability, with the exception of ‘the trickiest production bug hunts.’
Anecdotal Positive Reactions
Anecdotal Negative Reactions
There is always a lot of initial noise in coding results for different people, so you have to look at quantities of positive versus negative feedback, and also keep an eye on the details that are associated with different types of reports.
The negative reactions are not ‘this is a bad model,’ rather they are ‘this is not that big an improvement over previous Claude models’ or ‘this is less good or smart than GPT-5.’
The weak spot for Sonnet 4.5, in a comparison with GPT-5, so far seems to be when the going gets highly technical, but some people are more bullish on Code and GPT-5 relative to Claude Code and Sonnet 4.5.
As always, different strokes for different folks:
The big catch with Anthropic has always been price. They are relatively expensive once you are outside of your subscription.
If you are 10% better at twice the price, you are still on the frontier, so long as no model is both at least as cheap and at least as good as you are for a given task. So this is a disagreement about whether Codex is clearly better, which is not the consensus. The consensus, such as it is (and it will evolve rapidly), is that Sonnet 4.5 is a better general driver, but that Codex and GPT-5 are better at sufficiently tricky problems.
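To make the frontier point concrete, here is a minimal sketch in Python. All the prices and quality scores are invented placeholders, not real benchmark numbers; the only claim is the logic itself, that a model falls off the frontier only when some other model is at least as cheap and at least as good for the task at hand.

```python
# Toy Pareto-frontier check: a model is "dominated" (off the frontier) only if
# some other model is at least as cheap AND at least as good for the task.
# All prices and quality scores below are invented placeholders.

models = {
    "A (cheap, weaker)":  {"cost": 1.0, "quality": 0.70},
    "B (pricey, better)": {"cost": 2.0, "quality": 0.77},  # ~10% better at 2x the price
    "C (dominated)":      {"cost": 2.5, "quality": 0.69},
}

def dominated(name: str) -> bool:
    m = models[name]
    return any(
        other != name
        and models[other]["cost"] <= m["cost"]
        and models[other]["quality"] >= m["quality"]
        for other in models
    )

for name in models:
    print(name, "->", "off the frontier" if dominated(name) else "still on the frontier")
# A and B both stay on the frontier; C is dominated by A (cheaper and better).
```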
I think a lot of this comes down to a common mistake, which is over-indexing on price.
When it comes to coding, cost mostly doesn’t matter, whereas quality is everything and speed kills. The cost of your time architecting, choosing and supervising, and the value of getting it done right and done faster, are going to vastly exceed your API bill under normal circumstances. What is this ‘economically untenable’? And have you properly factored speed into your equation?
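As a back-of-the-envelope illustration: the $6 and $14 API figures loosely echo the Brokk numbers above, while the developer rate, supervision hours and retry counts are pure assumptions made up for the sketch.

```python
# Back-of-the-envelope cost of shipping one feature with LLM help.
# The API prices loosely echo the Brokk figures above; the developer rate,
# supervision hours and retry counts are assumptions for illustration only.

DEV_RATE = 100.0  # assumed developer cost in $/hour

def total_cost(api_cost: float, supervision_hours: float, retries: int) -> float:
    """API spend (including retries) plus the cost of your time steering and reviewing."""
    return api_cost * (1 + retries) + DEV_RATE * supervision_hours

cheaper_model = total_cost(api_cost=6.0, supervision_hours=2.0, retries=1)   # $212
pricier_model = total_cost(api_cost=14.0, supervision_hours=1.0, retries=0)  # $114

print(f"cheaper model: ${cheaper_model:.2f}")
print(f"pricier model: ${pricier_model:.2f}")
```

Under these made-up numbers the pricier model wins comfortably, because the API bill is noise next to the value of your time.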
Obviously if you are throwing lots of parallel agents at various problems on a 24/7 basis, especially hitting the retry button a lot or otherwise not looking to have it work smarter, the cost can add up to where it matters, but thinking about ‘coding progress per dollar’ is mostly a big mistake.
Anthropic charges a premium, but given that they reliably sell out of compute, they have historically either priced correctly or actively undercharged. The mistake is not scaling up available compute faster, since doing so should be profitable while also growing market share. I worry about Anthropic and Amazon being insufficiently aggressive with investment into Anthropic’s compute.
As I note at the top, it’s early but in non-coding tasks I do sense that in terms of ‘raw smarts’ GPT-5 (Thinking and Pro) have the edge, although I’d rather talk to Sonnet 4.5 if that isn’t a factor.
Gemini 3 is probably going to be very good, but that’s a problem for future Earth.
This report from Papaya is odd given Anthropic is emphasizing agentic tasks and there are many other positive reports about it on that.
In many non-coding tasks, Sonnet 4.5 is not obviously better than Opus 4.1, especially if you are discounting speed and price.
Tess points to a particular coherence failure inside a bullet point. It bothered Tess a lot, which follows the pattern where we often get really bothered by a mistake ‘that a human would never make,’ classically leading to the Full Colin Fraser (e.g. ‘it’s dumb’), whereas with an AI that kind of mistake is sometimes just a quirk.
(Note that the actual Colin Fraser didn’t comment on Sonnet 4.5 AFAICT, he’s focused right now on showing why Sora is dumb, which is way more fun so no notes.)
Claude Enters Its Non-Sycophantic Era
Janus reports something interesting, especially given how fast this happened: anti-sycophancy upgrades confirmed. This is what I want to see.
Those who are complaining about this? Good. Do some self-reflection. Do better.
At least one common failure mode (common at least in my experience) shows signs of persisting.
I get why that line happens on multiple levels but please make it go away (except when actually deserved) without having to include defenses in custom instructions.
So Emotional
A follow-up on Sonnet 4.5 appearing emotionless during alignment testing:
This should strengthen the presumption that the alignment testing is not a great predictor of how Sonnet 4.5 will behave in the wild. That doesn’t mean the ‘wild’ version will be worse; here it seems likely the wild version is better. But you can’t count on that.
I wonder how this relates to Kaj Sotala seeing Sonnet 4.5 be concerned about fictional characters, which Kaj hadn’t seen before, although Janus reports having seen adjacent behaviors from Sonnet 4 and Opus 4.
One can worry that this will interfere with creativity, but when I look at the details here I expect this not to be a problem. It’s fine to flag things and I don’t sense the model is anywhere near refusals.
Concern for fictional characters, even when we know they are fictional, is a common thing humans do, and tends to be a positive sign. There is however danger that this can get taken too far. If you expand your ‘circle of concern’ to include things it shouldn’t in too complete a fashion, then you can have valuable concerns being sacrificed for non-valuable concerns.
In the extreme (as a toy example), if an AI assigned value to fictional characters that could trade off against real people, then what happens when it does the math and decides that writing about fictional characters is the most efficient source of value? You may think this is some bizarre hypothetical, but it isn’t. People have absolutely made big sacrifices, including of their own and others’ lives, for abstract concepts.
The personality impressions from people in my circles seem mostly positive.
But as always there are exceptions, which may be linked to the anti-sycophancy changes referenced above, perhaps?
I have yet to see an interaction where it thought it knew better. Draw your own conclusions from that.
I don’t tend to want to do interactions that invoke more personality, but I get the sense that I would enjoy them more with Sonnet 4.5 than with other recent models, if I was in the mood for such a thing.
I find Mimi’s suspicion here plausible, if you are starting to run up against the context window limits, which I’ve never done except with massive documents.
Here’s a different kind of exploration.
Early Days
Remember that it takes a while before we know what a model is capable of and its strengths and weaknesses. It is common to either greatly overestimate or underestimate new releases, and also to develop over time nuanced understanding of how to get the best results from a given model, and when to use it or not use it.
There’s no question Sonnet 4.5 is worth a tryout across a variety of tasks. Whether or not it should now be your weapon of choice? That depends on what you find, and also why you want a weapon.