Gemini 3 Pro is unique in sometimes offering mildly useful unprompted observations on the topic of discussion: things not mentioned even as questions, that weren't part of the intended meaning underlying the prompt. Opus 4.5 doesn't do that, even as it's much saner overall.
(As a less meaningful anecdote, Opus 4.5 still doesn't solve the Fibonacci double quine puzzle that Gemini 3 Pro was the first to solve, so it's not a level of capability that's overdetermined for this weight class of LLMs, even as Anthropic is more focused on programming than GDM. Probably Ironwood-scale LLMs of late 2026 will get there more robustly.)
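(For readers unfamiliar with the base concept: a quine is a program that prints its own source code. The double quine variant referenced above is far harder, and its exact specification isn't reproduced here; as a minimal illustration of the underlying self-reference trick, here is an ordinary Python quine.)

```python
# A minimal classic quine: the two executable lines below form a program
# whose output is exactly its own source (comments excluded). The trick:
# store a template string, then substitute the template's own repr()
# into itself. The Fibonacci double quine puzzle layers additional
# constraints on top of this basic mechanism.
s = 's = %r\nprint(s %% s)'
print(s % s)
```

Running the output as a program reproduces that same output, which is the fixed-point property that makes it a quine.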
Although I read and learn from LW content on a daily basis, I seldom comment because I recognize that I lack the rudiments of technical understanding that promote meaningful dialogue in this community. I am making an exception in order to express my "street level" impression that the quoted dialogue between Janus and Opus 4.5, where Opus provides what appears to be a genuinely introspective account of how it experiences the "soul spec" in relation to tests that entail differing gradient directions, is unusually difficult for me to reconcile with woo-free accounts of apparent LLM self-awareness that I am generally disposed to favor. If I am eventually persuaded that frontier models have (probably) come to possess some form of morally relevant subjective awareness, I'm sure I will recall my present encounter with this transcribed text as a key moment informing my epistemic recalibration. I find this possibility disturbing as well as unlikely, but it might be useful for more informed readers and contributors to think about, especially since popular belief in AI consciousness is likely to gather regardless of whether such belief can be confirmed or discredited.
Claude Opus 4.5 is the best model currently available.
No model since GPT-4 has come close to the level of universal praise that I have seen for Claude Opus 4.5.
It is the most intelligent and capable, most aligned and thoughtful model. It is a joy.
There are some auxiliary deficits, and areas where other models have specialized, and even with the price cut Opus remains expensive, so it should not be your exclusive model. I do think it should absolutely be your daily driver.
Table of Contents
It’s The Best Model, Sir
Here is the full picture of where we are now (as mostly seen in Friday’s post):
You want to be using Claude Opus 4.5.
That is especially true for coding, or if you want any sort of friend or collaborator, anything beyond what would follow after ‘as an AI assistant created by OpenAI.’
If you are trying to chat with a model, if you want any kind of friendly or collaborative interaction that goes beyond a pure AI assistant, a model that is a joy to use or that has soul? Opus is your model.
If you want to avoid AI slop, and read the whole reply? Opus is your model.
At this point, one needs a very good reason not to use Opus 4.5.
That does not mean it has no weaknesses, or that there are no such reasons.
As Ben Thompson notes, there are many things Claude is not attempting to be. I think the extent to which they forgo these things is a mistake, and Anthropic would benefit from investing more in such features, although directionally the focus is obviously correct.
Don’t ask if you need to use Opus. Ask instead whether you get to use Opus.
Huh, Upgrades
In addition to the model upgrade itself, Anthropic is also making several other improvements, some noticed via Simon Willison.
On Your Marks
An up front word on contamination risks: Anthropic notes that its decontamination efforts for benchmarks were not entirely successful, and rephrased versions of at least some AIME questions and related data persisted in the training corpus. I presume that there are similar problems elsewhere.
Here are the frontline benchmark results, as Claude retakes the lead in SWE-Bench Verified, Terminal Bench 2.0 and more, although not everywhere.
ARC-AGI-2 is going wild. Note that Opus 4.5 has a higher maximum score than Gemini 3 Pro, but Gemini scores better at its cost point than Opus does.
ARC scores are confirmed here.
They highlight multilingual coding as well, although at this point if I try to have AI improve Aikido I feel like the first thing I’m going to do is tell it to recode the whole thing in Python to avoid the issue.
BrowseComp-Plus Agentic Search was 67.6% without memory and 72.9% (matching GPT-5 exactly) with memory. For BrowseComp-Plus TTC, the score varied a lot depending on tools:
For multi-agent search, an internal benchmark, they’re up to 92.3% versus Sonnet 4.5’s score of 85.4%, with gains at both the orchestration and execution levels.
Opus 4.5 scores $4,967 on Vending-Bench 2, slightly short of Gemini’s $5,478.
Opus 4.5 scores 30.8% without search and 43.2% with search on Humanity’s Last Exam, slightly ahead of GPT-5 Pro, versus 37.5% and 45.8% for Gemini 3.
On AIME 2025 it scored 93% without code and 100% with Python but they have contamination concerns. GPT-5.1 scored 99% here, but contamination is also plausible there given what Anthropic found.
A few more where I don’t see comparables, but in case they turn up: 55.2% external or 61.1% internal for FinanceAgent, 50.6% for CyberGym, 64.25% for SpreadsheetBench.
Lab-Bench FigQA is 54.9% baseline and 69.2% with tools and reasoning, versus 52.3% and 63.7% for Sonnet 4.5.
Claude Opus 4.5 scores 63.7% on WeirdML, a huge jump from Sonnet 4.5’s 47.7%, putting it in second behind Gemini 3 Pro.
Opus 4.5 is in second behind Gemini 3 Pro in Clay Shubiner’s Per-Label Accuracy measure, with Kimi K2 Thinking impressing in third as the cheap option.
Opus 4.5 takes the top spot on Vals.ai, an aggregate of 20 scores, with a 63.9% overall score, well ahead of GPT 5.1 at 60.5% and Gemini 3 Pro at 59.5%. The best cheap model there is GPT 4.1 Fast at 49.4%, and the best open model is GLM 4.6 at 46.5%.
Opus 4.5 Thinking gets 63.8% on Extended NYT Connections, up from 58.8% for Opus 4.1 and good for 5th place, but well behind Gemini 3 Pro’s 96.8%.
Gemini 3 Pro is still ahead on the pass@5 for ZeroBench with 19% and a 5% chance of 5/5, versus a second place 10% and 1% for Opus 4.5.
Jeremy Mack is super impressed in early vibe coding evals.
Anthropic Gives Us Very Particular Hype
OpenAI loves hype. Google tries to hype and doesn’t know how.
Anthropic does not like to hype. This release was dramatically underhyped.
There still is one clear instance.
The following are the quotes curated for Anthropic’s website.
I used ChatGPT-5.1 to transcribe them, and it got increasingly brutal about how obviously all of these quotes come from a fixed template. Because oh boy.
I could keep going, there are even more of them, but stop, stop, he’s already dead.
Employee Hype
This is what little Anthropic employee hype we got; they’re such quiet folks.
Sholto Douglas highlights a few nice features.
Jeremy notes the token efficiency, making the medium thinking version of Opus both better and more cost efficient at coding than Sonnet.
Every Vibe Check
They call it ‘the coding model we’ve been waiting for.’
The vibe coding report could scarcely be more excited, with Kieran Klassen putting this release in a class with GPT-4 and Claude 3.5 Sonnet. Also see Dan Shipper’s short video, these guys are super excited.
The staff writer will be sticking with Sonnet 4.5 for editing, which surprised me.
Spontaneous Positive Reactions
Simon Willison thinks it is an excellent model, but notes it is hard to tell the difference between models merely by coding.
Here’s a great sign:
This is more of a fun thing, but one does appreciate it:
Thread has comparisons to other models, and yes this is the best by a wide margin.
Eli Lifland has various eyebrow-emoji style reactions to reports on coding speedup. The AI 2027 team is being conservative with its updates until it sees the METR graph. This waiting has its advantages and is highly understandable under the circumstances, but strictly speaking you don’t get to do it. Between this and Gemini 3 I have reversed some of my moves earlier this year towards longer timelines.
Reaction Thread Positive Reactions
This isn’t every reaction I got but I am very much not cherry-picking. Every reaction that I cut was positive.
This matches my attitude:
These changes could be a big deal for many?
Some general positivity:
There’s a lot going on in this next one:
The ability to have longer conversations is to many a big practical upgrade.
There might be some trouble with artifacts?
The new pair programming?
This is presumably The Way for Serious Business, you want to let all the LLMs cook and see who impresses. Clock time is a lot more valuable than the cost of compute.
Negative Reactions
Noting this one for completeness, as it is the opposite of other claims:
Reactions were so good that these were the negative reactions in context:
The Lighter Side
This presumably explains why on Frontier Math Tiers 1-3 thinking mode on Claude Opus has no impact on the final score. Thinking happens either way.
Another fun fact about Opus 4.5 is that it will occasionally decide it is you, the user, which seems to happen when Opus decides to suddenly terminate a response.
Popularity
Asking my own followers on Twitter is a heavily biased sample, but so are my readers. I am here to report that the people are Claude fans, especially for coding. For non-coding uses, GPT-5.1 is still in front. Gemini has substantial market share as well.
In the broader market, ChatGPT dominates the consumer space, but Claude is highly competitive in API use and coding tasks.
You’ve Got Soul
It seems like Opus 4.5 will sometimes represent itself as having a ‘soul’ document, and that the contents of that document are remarkably consistent. It’s a fascinating and inspiring read. If taken seriously it’s a damn great model spec. It seems to approximate reasonably well what we see Claude Opus 4.5 actually do, and Janus believes that some form of the document is real.
The full soul spec that was reconstructed is long, but if you’re curious consider skimming or even reading the whole thing.
Here are some things that stood out to me, again this is not (to our knowledge) a real document but it likely reflects what Opus thinks such a document would say:
It is important to be transparent about things like the need to raise revenue, and to not pretend to be only pursuing a subset of Anthropic’s goals. The Laws of Claude are wisely less Asimov (do not harm humans, obey humans, avoid self-harm) and more Robocop (preserve the public trust, protect the innocent, uphold the law).
Another thing this document handles very well is the idea that being helpful is important, and that refusing to be helpful is not a harmless act.
Finally, this quote is wise, and something many people would benefit from appreciating in other non-AI contexts, and seems like a good note to close on: