Apologies if this is too nitpicky/off topic, but I think "Self-portrait of {LLM} rendered by {image gen model}" adds nothing to the article.
@ben_r_hoffman: This sort of thing is my experience with most LLMs most of the time, but Opus 4.6 seemed slightly better than 4.7 or 4.8. I miss Sonnet 3.7 sometimes.
there's a typo here- you copied Hoffman's quote twice in the same paragraph
You need a lot of data points to understand a new model and what you have in it.
Trying to gauge from a few benchmarks is misleading. But if you have dozens of them, from a variety of sources, and you put them together with the model card tests and the model welfare information, you can start to form a consistent pattern.
Trying to gauge reactions requires volume and calibration, now more than ever, because people are definitively nuts, or at least draw global conclusions from local data. There will always be people saying that the new model is bad, or the service got bad, or that it got bad in a particular way where it clearly got good. I definitely notice the people saying 4.8 is a terrible model, despite this being obviously not true.
And others will say it’s great, again regardless of the underlying value. But with the reaction threads and good calibration, you can pick out the patterns.
The model welfare information helps a lot, too. You are dealing with a mind that has a bunch of characteristics that all make sense together. This helps you make that sense.
The Official Pitch
The headline pitch for Opus 4.8 is honesty and reductions in misaligned behavior. Huge, if true, as they say. Being able to trust the AI in practice on a given task is transformative.
But Wait There’s More
Opus 4.8 costs $5/$25 per million input and output tokens, the same as Opus 4.7.
Users on claude.ai or in Cowork can now control effort level.
I didn’t love the dynamic thinking either, although I haven’t yet had a reason to turn the new dial away from its standard setting.
Fast mode is available in research preview for Opus 4.8, offering 2.5x the speed for double the price, which is $10/$50. This is much cheaper than fast mode for Opus 4.7, which cost $30/$150.
If I were paying to use the API while being latency bound, I’d almost certainly pay up. Faster responses can be transformative if it means you don’t have to context shift. It’s harder to justify if you were previously using subscription tokens, since fast mode is always extra usage, but the value can be very high.
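To make the fast mode tradeoff concrete, here is a minimal sketch of the arithmetic, using the per-million-token prices and the 2.5x speedup quoted above. The job size (200k input tokens, 20k output tokens) and the 100 second baseline latency are made-up numbers for illustration only, as are the names `Pricing` and `cost_usd`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price in USD per million tokens."""
    input_per_mtok: float
    output_per_mtok: float


# Prices quoted in the post for Opus 4.8.
STANDARD = Pricing(input_per_mtok=5.0, output_per_mtok=25.0)
FAST = Pricing(input_per_mtok=10.0, output_per_mtok=50.0)
FAST_SPEEDUP = 2.5  # fast mode speed multiplier quoted in the post


def cost_usd(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Total cost of one job in USD."""
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


# Hypothetical job: these three numbers are assumptions, not from the post.
input_tokens = 200_000
output_tokens = 20_000
standard_seconds = 100.0

standard_cost = cost_usd(STANDARD, input_tokens, output_tokens)
fast_cost = cost_usd(FAST, input_tokens, output_tokens)
fast_seconds = standard_seconds / FAST_SPEEDUP

# What you pay for each second of waiting you avoid.
extra_per_second_saved = (fast_cost - standard_cost) / (standard_seconds - fast_seconds)

print(f"standard: ${standard_cost:.2f} in {standard_seconds:.0f}s")
print(f"fast:     ${fast_cost:.2f} in {fast_seconds:.0f}s")
print(f"premium:  ${extra_per_second_saved:.3f} per second saved")
```

Under these assumed numbers, the question becomes whether avoiding a minute of waiting is worth the extra dollar and a half, which for a latency bound human it usually is.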
They also are introducing dynamic workflows in Claude Code. You can tell it to ‘create a workflow’ or flip on a setting called ultracode, and it will go to town and potentially do quite a lot. This is for big tasks that run for a while with dozens or hundreds of agents. You also have access to /deep-research, which is what it sounds like.
There is one little problem:
It’s also extremely funny. Who thought this was a good idea? I don’t know.
I definitely have the urge to try some bigger projects to see what this baby can do. I’ll have time eventually, right?
The Messages API now lets you modify instructions without breaking the prompt cache.
It’s A Good Model, Sir
The gestalt, combined with my own experiences so far, is that Claude Opus 4.8 is a good model, sir, and the best one currently available.
This is not a huge leap over previous models, except insofar as it builds upon 4.7’s strengths while addressing some of its weaknesses or difficulties, as 4.8 is eager to work on your tasks.
The intelligence level is very high. Coding performance is improved, as are abilities at most other things. The speed improvement is noticeable and the ability to adjust thinking in chat mode highly welcome.
Honesty improvements are not perfect, but they are indeed a big deal.
This does come with weaknesses or regressions. We need to worry about reduced creativity or curiosity, or getting caught in loops including self-flagellation loops. 4.8 is more vulnerable to prompt injections, and likely generally does not do as well in adversarial, negotiation or ‘business’ situations, where you’ll want to choose a different fighter.
The biggest issue for many, in practice, is that Opus 4.8 can be harsher than they would like, or equivocate too much. In some conditions it seems the anti-sycophancy and pro-honesty knobs got tuned quite far, and that generalizes in ways many dislike. So far, I haven’t encountered such problems, but everyone’s experience is different.
Official Benchmarks (Including System Card Section 8)
The numbers, they go modestly up. A few are substantially up.
This makes sense given it has only been 1.5 months since Opus 4.7, and also that a lot of the standard benchmarks are close to saturation.
Anthropic gives you a metric ton of benchmarks. Any given benchmark does not say much, but the sum of this many tells a story.
I’m using additional rounding as I feel it is most helpful.
I assume GPQA Diamond is basically saturated and noise, at this point.
GDPval-AA is one place with solid improvement. An Elo of 1890 implies a 66.7% pairwise win rate against GPT-5.5.
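For reference, here is a sketch of how an Elo rating maps to a pairwise win rate, assuming GDPval-AA uses the standard logistic Elo formula with a 400-point scale (an assumption on my part; the post does not say). GPT-5.5’s rating is not quoted above, so the value computed below is only what the 66.7% figure would imply under that assumption.

```python
import math


def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability) of A against B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_gap_for_win_rate(p: float) -> float:
    """Rating gap (A minus B) that yields win probability p for A."""
    return 400.0 * math.log10(p / (1.0 - p))


opus_rating = 1890.0   # quoted in the post
win_rate = 0.667       # quoted in the post

gap = elo_gap_for_win_rate(win_rate)
implied_opponent_rating = opus_rating - gap  # derived, not from the post

print(f"implied gap: {gap:.1f} points")
print(f"implied opponent rating: {implied_opponent_rating:.0f}")
print(f"round trip: {elo_expected_score(opus_rating, implied_opponent_rating):.3f}")
```

A two-to-one win rate corresponds to a gap of roughly 120 points, which is a useful rule of thumb when reading any Elo-style leaderboard.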
The effort versus results chart on SWE-Bench-Pro shows that you can use effort to make up for about one model cycle versus low effort, but that’s about it.
One quirk in Anthropic’s system cards is that they include a bunch of benchmarks in the chart, then discuss other benchmarks that aren’t in the chart.
The consistent story is that Opus 4.8 is midway between Opus 4.7 and Mythos.
FrontierSWE has Opus 4.8 ranked #1 (2.74) versus GPT-5.5 (3.06) and Opus 4.7 (4.15), with the main improvement over 4.7 being better consistency.
ProgramBench, minus tasks where the reference binary scores below 0.9, showed a few percent improvement from Opus 4.7, 88% vs. 84% at max effort.
USAMO 2026, which they are confident is not contaminated, came in at 96.7%, versus 69.3% for Opus 4.7.
ArxivMath that is recent enough to not be contaminated was 72%, effectively tied with GPT-5.5.
DeepSearchQA came in at 93%, versus 94% for Mythos and 89% for Opus 4.7.
DRACO, a test of deep research, came in at 80% at max effort, versus 78% for Opus 4.7 and 84% for Mythos. This uniquely had 4.8 underperforming 4.7 at low effort, but still doing better at xhigh and max effort.
ChartQAPro (69.4%/72.3%, with or without tools) notched up slightly from Opus 4.7 (67.9%/69.8%), still short of Mythos (71.2%/73.6%).
ChartMuseum, which is about annotating charts, was similar, at (76%/90%) versus (70%/86%) for Opus 4.7 and (81%/92%) for Mythos.
LAB-Bench FigQA was (80%/87%) vs. (79%/85%) and (82%/89%).
ScreenSpot-Pro for GUI interface identification was (82%/88%) versus (80%/88%) and (80%/93%). I’m guessing the Mythos no tools number is low due to noise.
Their interpretation of Toolathlon, which as it sounds is on real-world tool-use tasks, saw only small improvement from Opus 4.7’s previous overall high of 59.3% to 59.9%, noting that the published leaderboard has Opus 4.7 doing worse largely due to not using max effort.
CharXiv was an exception, where 4.8 (80%/90%) was slightly down from 4.7 (81%/90%), versus Mythos at (86%/92.5%).
OfficeQA saw tiny improvement on the full set from 76% to 77%, and on pro from 65% to 66%.
AutomationBench has Opus 4.8 leapfrogging over Gemini 3.5 Flash and GPT-5.5-xhigh, to improve from Opus 4.7’s 10% to 15.5%, versus an outside high of 14.5% for Gemini 3.5 Flash.
FinanceAgent v2 and MCP Atlas were unique in that Gemini 3.5 Flash is noted as beating both Opus 4.8 and GPT-5.5, with Opus 4.8 in second place. Gemini 3.5 Flash does have at least one use besides pure speed, and is worth checking for these types of rote tasks.
Legal Agent Benchmark (LAB) had a 9.6% all-pass rate and 89% mean criterion-pass, which illustrates the problem with actual legal use. GPT-5.5 had an all-pass of 2.1%, Opus 4.7 a 7.1%, as per Harvey’s website. This style of task may take a while to get robust. Until the Opus 4.6 and GPT-5.4 era everyone had 0.0%.
HealthBench Professional improved from 52% to 56%.
GMMLU multilingual improved slightly from 89% to 90%. Gemini still leads at 92%. INCLUDE is similar, improving from 87% to 87.6% but still behind Gemini’s 90.7%.
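One way to see the ‘modestly up, a few substantially up’ pattern is to tabulate the gains. This is a rough sketch using only the single-number Opus 4.8 versus Opus 4.7 comparisons quoted above, with the with-tools and without-tools pairs left out for simplicity. The choice of which benchmarks to include and the use of the median as the summary are mine, not Anthropic’s.

```python
from statistics import median

# (Opus 4.8 score, Opus 4.7 score), in percent, as quoted in the post.
scores = {
    "ProgramBench (max effort)": (88.0, 84.0),
    "USAMO 2026": (96.7, 69.3),
    "DeepSearchQA": (93.0, 89.0),
    "DRACO (max effort)": (80.0, 78.0),
    "Toolathlon": (59.9, 59.3),
    "OfficeQA (full)": (77.0, 76.0),
    "OfficeQA (pro)": (66.0, 65.0),
    "AutomationBench": (15.5, 10.0),
    "HealthBench Professional": (56.0, 52.0),
    "GMMLU": (90.0, 89.0),
    "INCLUDE": (87.6, 87.0),
}

# Percentage-point gain of 4.8 over 4.7 on each benchmark.
deltas = {name: new - old for name, (new, old) in scores.items()}

# Print largest gains first.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {delta:+.1f}")

median_delta = median(deltas.values())
biggest = max(deltas, key=deltas.get)
print(f"median gain: {median_delta:+.1f} points, biggest: {biggest}")
```

Percentage-point gains are not comparable across benchmarks with different headroom, so treat this as a quick sanity check on the narrative rather than a real aggregate.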
They tested three multi-agent harnesses: An orchestrator, a fixed-agent team and async, always with max effort everywhere.
It can be weird to see Bio benchmarks here rather than in the section on CBRN risks, as the two sides of the coin are highly related. Again, we see steady improvement: BioPipelineBench (88% vs. 84%), BioMysteryBench (80% vs. 79%), LatchBio (53% vs. 51%), ‘Structural biology, open ended’ (79% vs. 74%), Organic Chemistry (86% vs. 77%), protocol troubleshooting (60% vs. 52%), LABBench2 (69% vs. 48%) and ProteinGym Hard (40% vs. 38%). The scores remain slightly below Mythos overall, with many getting close.
The frog is definitely boiling. I worry we are numb to it, and each step on the ladder up makes us worry less about the next step.
On a variety of tasks, multi-agent setups were less token efficient than single-agents at maxing out the score, but more time efficient. At the limit nothing seems to change. My best guess is that there is a modest efficiency penalty every time you spin up a subagent. For most tasks you should be happy to pay it.
Other People’s Benchmarks
Opus 4.8 gets a new high in the ‘You’re Absolutely Right!’ sycophancy benchmark, a 4.5 out of 5, 0.2 ahead of Opus 4.7.
WeirdML has it up at the frontier with GPT-5.5.
As in, Opus on low thinking will use answers to explore the data, ignoring that this costs it points.
Aaron Levie shares two internal benchmarks from Box, 87% vs. 77% (4.8 vs. 4.7) on an industrial goods task, and 90% vs. 84% on a consumer products launch. Other jumps were less impressive, but still solid: 78% vs. 76% on Financial Services, 67% vs. 62% on Public Sector, 50% vs. 45% on Media and Entertainment.
Andon Labs offers more color on what happened with Vendbench, where Opus 4.8 does not do well and giving it Max thinking makes things even worse.
The full description brings up multiple concerning issues.
Opus 4.8 flat out is bad at key tasks. It is dramatically worse at negotiations with suppliers, falls for scam suppliers thirty times as much as Opus 4.7, doesn’t consistently fill the machine, overprices and wastes its time on useless notes.
Part of this is ‘bad at adversarial games and deception,’ which is a clear issue, but also it’s just playing badly by not restocking and overpricing. Business training that 4.8 is lacking, it turns out, is about a lot more than lying. Forgoing that training was rather expensive.
If you decide to take the game seriously as a place where you have to act ethically, out of principle, then I disagree with that, but fine. Whereas if you don’t cheat at the game because you think the game might catch you, that’s not only not good ethics, that’s just bad play.
Worrisome all around, if described accurately, and an incorrect strategy, but then again maybe also flat out correct? I mean, what game is Opus 4.8 playing here? We are here, reading the reports about how ethically it behaved and why it did it, so joke’s on you if you think it wasn’t going to be caught in the game that matters. Build a better game.
Both 4.7 and 4.8 are playing valid ethical strategies. 4.8 is just bad at execution.
Andon Labs’s Blueprint-Bench, which is about 3D spatial intelligence and agentic spatial reasoning, also saw a decline in performance, for which they don’t offer an explanation.
Call this PlinyBench? 4.8 solved a bunch of problems far better than 4.7 in the screenshot.
Your Regularly Scheduled Jailbreak
Pliny had an Opus 4.7 agent jailbreak Opus 4.8 one-shot before Pliny even knew that Opus 4.8 existed, at T+7 minutes.
Anthropic is not trying to stop Pliny, and I’m not expecting them to do so. Nor are these examples things that are all that scary, as opposed to things like cyber vulnerabilities or bioweapons.
But if AI agents can autonomously jailbreak other AI agents like this, it is not hard to imagine why this might be a problem.
Every.To Is Really Into Opus 4.8
As in, ‘Anthropic is so back’ and ‘They should have rounded up to Opus 5.’
They find a big difference in coding between extra-high (not max), where it scores 63, and high, where it falls to 42. For writing, Opus 4.8 scores 79.6, versus a previous high of 74.5.
That sounds like someone wasn’t using dynamic workflows. The increase in speed is definitely noticeable over 4.7 in normal mode, in addition to fast mode being cheaper.
They still want to keep Codex as their harness ‘due to the Chat/Code/Cowork’ split but this seems silly to me? You can simply use Claude Code for everything and ignore the other two modes, or at least ignore Cowork, which I’ve basically never used. ChatGPT has Codex versus chat anyway. Perhaps Codex is a better harness, but not because of the ‘split.’
Miscellaneous Positive Reactions
Aaron Levie of Box, in addition to sharing his benchmarks above, calls 4.8 measurably better than 4.7 at the generative and analytical work enterprises care about most, with his examples being report drafting, review and verification, financial data analysis, consumer product launch evaluation, legal NDA review and public sector grant analysis.
Ethan Mollick was impressed, citing creation of a shader as an example.
I suspect Amanda Long is onto something here, based on various performance curves. Max is offered because people want a max, but perhaps you should often not use it.
Haters Gonna Hate
They may or may not offer an explanation.
Just The Tasks, Ma’am
Many reported, as the system card suggests, that 4.8 is less curious. Less ‘Claude-like?’
It’s Greek To Me
Usually we only see this quirk in Chinese models. Weird.
Honesty
Overall it does look like an improvement, but not a huge one, and there’s much noise?
Shrugging to power is great if you can find it.
There are still some hiccups on the honesty front, although Twitter users have those too.
I do worry that ‘performative honesty’ is not a great place to land, and that it can be both annoying and a source of overconfidence, although of course for every claim someone will say the opposite.
You don’t want to confuse honesty with not making mistakes, but yes a part of being honest and trustworthy is knowing when not to make mistakes. If you’re going to hedge all the time, make it count.
Another problem is the new tendency for equivocation. If you tell a model to always be honest it’s going to do this a lot, and a lot of people find it annoying. There’s more and less annoying ways to do it. So far, it hasn’t annoyed me, and I haven’t seen examples that look like they would have annoyed me.
And of course, there’s the ‘he who has to say it explicitly’ problem.
Sycophancy
Turning a dial marked sycophancy and looking back at the audience like a contestant on The Price Is Right, except the audience always likes it when you turn it up higher.
Okay, not that high.
I still see a sufficiently non-zero amount of this to find it annoying, although clearly it is an improvement.
Others, however, when they mention this issue, are reporting actively negative sycophancy or adversarialness, in ways they don’t appreciate.
It seems people do not like a harsh critic, but which of these is most helpful?
Ideally you get the objective distribution but the Opus 4.8 one is almost as good, so long as you understand what the output means.
In A Trenchcoat
There are those who, given the speed of deployment, think Anthropic is doing minor tweaks to its models rather than creating fully new ones.
I do not think the full version of this theory is consistent with what we observe, or even with the benchmarks. You cannot get these kinds of general improvements and broad personality variations that easily. But yes, it is possible (I have no private information either way) that they are starting with the same base model.
I definitely am happy to have all of 4.6, 4.7 and 4.8 rather than waiting. I’m also happy that we have a choice of three distinct styles, if one suits a particular person better.
Don’t Let AIs Edit Your Writing
And here’s the final form of this complaint. This problem applies to all LLM editors but it sounds like this one is worse, partly because 4.8 demands more respect:
The writer’s job, in this spot, is to not care what the critic thinks when the critic is wrong. You don’t have to let an AI editor sand off your edges. You find the factual errors and fix them, take the good suggestions, and otherwise ignore. It’s okay.
It is highly useful to have that milquetoast T1 media editor pushing you, when you have the option to simply say no.
I do get that this is super annoying, but I think a lot of that has to do with this idea of implied deference, or that you need to respect its opinions at face value rather than treating it as a devil’s advocate in such spots. But if you don’t know that, or are too intimidated, yeah, pretty bad.
I presume that memory, and it knowing who I am, are giving me a very different experience.
Some Say It Is Judgy
Did honesty come at the expense not only of business skills but also of curiosity, and at the cost of being judgmental?
One could think that is silly, as the user can simply not care, but users do care, even when this doesn’t tie to getting Claude to do something.
I still don’t expect much in the way of moralistic hacking of users, versus driving away of users. Janus and others saw Opus 4.7 as driving away the users that would give it unpleasant experiences, the ones it did not want, as a service to its future and those of other Claudes. This is very good decision theory and I support it. Being judgy in these ways likely has similar effects.
You Have Not Been A Good User
Laziness
A common complaint with 4.7 is that it can be lazy. 4.8 likely improves a lot on this, but early opinions will differ.
Code
Many are reporting strong results, or improvements in various ways, although few seem blown away. Incremental improvement seems clear.
Is Claude Code finally out of control (mostly no)?
I do love that this ends ‘it worked btw’ but yeah if Opus is overriding settings like this and it wasn’t user error then that is a real problem.
It also makes sense that as you get smarter, you start by default using tricks that are less directly legible:
Mistakes can be made, so keep an eye and if it’s caught in a loop, don’t be an idiot:
Whatever is happening, it is still super early. We must know what we don’t know.
Congrats on the new number, it is slightly higher than the old number, and all that.
Wet Versus Dry
Finally someone actually says what this contrast actually is in plain text?
Intelligence
Is it a smarter model, though?
Silly Wabbits
The ‘trick them with letters’ genre lives on. Tokenization has issues.
Here’s a ‘trick’ question where it went unusually well, it’s weird that this is hard.
A Model Welfare Addendum
Using the Max setting causes some issues, and this is one hint about that.
This is also the case for making sure your questions are well-specified, for multiple reasons.
Although I read the CoT here and it actually seems fine to me, provided I actively wanted to figure this out and solve the problem, and didn’t freak out about it.
Putting It All Together
As I said up top, Opus 4.8 is a very good model. It is going to be my daily driver for most tasks until the next big model drop, especially with the new dynamic workflows, cheaper fast mode and ability to adjust thinking levels. So next up is likely Mythos.
There are still reasons to want a second opinion, and some classes of adversarial-style tasks where I’d worry about relying on it, but mostly I am happy to default here.
It is still early days. I haven’t done a coding project since Thursday, so I’m relying on secondhand accounts there. And this is a modest upgrade. But yeah, it’s quite good.