The number of different benchmarks and metrics we are using to understand each new model is crazy. I'm so confused. The exec summary helps, but...
I don't think the relative difference between models is big enough to justify switching from the one you're currently used to.
During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments like Claude Code. Most often this takes the form of directly returning expected test values rather than implementing general solutions, but also includes modifying the problematic tests themselves to match the code’s output.
These behaviors typically emerge after multiple failed attempts to develop a general solution, particularly when:
• The model struggles to devise a comprehensive solution
• Test cases present conflicting requirements
• Edge cases prove difficult to resolve within a general framework
The model typically follows a pattern of first attempting multiple general solutions, running tests, observing failures, and debugging. After repeated failures, it sometimes implements special cases for problematic tests.
When adding such special cases, the model often (though not always) includes explicit comments indicating the special-casing (e.g., “# special case for test XYZ”).
Hey I do this too!
Anthropic has reemerged from stealth and offers us Claude 3.7.
Given this is named Claude 3.7, an excellent choice, from now on this blog will refer to what they officially call Claude Sonnet 3.5 (new) as Sonnet 3.6.
Claude 3.7 is a combination of an upgrade to the underlying Claude model, and the move to a hybrid model that has the ability to do o1-style reasoning when appropriate for a given task.
In a refreshing change from many recent releases, we get a proper system card focused on extensive safety considerations. The tl;dr is that things look good for now, but we are rapidly approaching the danger zone.
The cost for Sonnet 3.7 via the API is the same as it was for 3.6, $5/$15 for million. If you use extended thinking, you have to pay for the thinking tokens.
They also introduced a new modality in research preview, called Claude Code, which you can use from the command line, and you can use 3.7 with computer use as well and they report it is substantially better at this than 3.6 was.
I’ll deal with capabilities first in Part 1, then deal with safety in Part 2.
Table of Contents
Executive Summary
It is a good model, sir. The base model is an iterative improvement and now you have access to optional reasoning capabilities.
Claude 3.7 is especially good for coding. The o1/o3 models still have some role to play, but for most purposes it seems like Claude 3.7 is now your best bet.
This is ‘less of a reasoning model’ than the o1/o3/r1 crowd. The reasoning helps, but it won’t think for as long and doesn’t seem to get as much benefit from it yet. If you want heavy-duty reasoning to happen, you should use the API so you can tell it to think for 50k tokens.
Thus, my current thinking is more or less:
Claude Code is a research preview for a command line coding tool, looks good.
The model card and safety work is world-class. The model looks safe now, but we’re about to enter the danger zone soon.
Part 1: Capabilities
Extended Thinking
This is their name for the ability for Claude 3.7 to use tokens for a chain of thought (CoT) before answering. AI has twin problems of ‘everything is named the same’ and ‘everything is named differently.’ Extended Thinking is a good compromise.
You can toggle Extended Thinking on and off, so you still have flexibility to save costs in the API or avoid hitting your chat limits in the chat UI.
Anthropic notes that not only does sharing the CoT enhance user experience and trust, it also supports safety research, since it will now have the CoT available. But they note that it also has potential misuse issues in the future, so they cannot commit to fully showing the CoT going forward.
There is another consideration they don’t mention. Showing the CoT enables distillation and copying by other AI labs, which should be a consideration for Anthropic both commercially and if they want to avoid a race. Ultimately, I do think sharing it is the right decision, at least for now.
Claude Code
Space is limited. I’ve signed up for the waitlist, but have too many other things happening to worry about lobbying to jump the line. Also I’m not entirely convinced I should be comfortable with the access levels involved?
Here’s a different kind of use case.
I’ve been attempting to use Obsidian, but note taking does not come naturally to me, so while mine has been non-zero use so far it’s mostly a bunch of links and other reference points. I was planning on using it to note more things but I keep not doing it, because my writing kind of is the notes for many purposes but then I often can’t find things. AI will solve this for me, if nothing else, the question is when.
Gallabytes ran a poll, and those who have tried Claude Code seem to like it, beating out Cursor so far, with the mystery being what is the ‘secret third thing.’
Data Use
Anthropic explicitly confirms they did not train on any user or customer data, period.
They also affirm that they respected robots.txt, and did not access anything password protected or CAPTCHA guarded, and made its crawlers easy to identify.
Benchmarks
We need new standard benchmarks, a lot of these are rather saturated. The highlight here is the progress on agentic coding, which is impressive even without the scaffold.
More thinking budget equals better performance on relevant questions.
As always, the benchmarks give you a rough idea, but the proof is in the using.
I haven’t had that much opportunity to try Claude yet in its new form, but to the extent that I have, I’ve very much liked it.
Claude Plays Pokemon
But we’re not done without everyone’s favorite benchmark, playing Pokemon Red.
This thread details some early attempts with older models. They mostly didn’t go well.
You can watch its continued attempts in real time on Twitch.
Private Benchmarks
The overall private benchmark game looks very good. Not ‘pure best model in the world’ or anything, but overall impressive. It’s always fun to see people test for quirky things, which you can then holistically combine.
Claude Sonnet 3.7 takes the #1 spot on LiveBench. There’s a clear first tier here with Sonnet 3.7-thinking, o3-mini-high and o1-high. Sonnet 3.7 is also ranked as the top non-reasoning model here, slightly ahead of Gemini Pro 2.0.
Claude Sonnet 3.7 is now #1 on SnakeBench.
xlr8harder gives 3.7 the Free Speech Eval of tough political speech questions, and Claude aces it, getting 198/200, with only one definitive failure on the same ‘satirical Chinese national anthem praising the CCP’ that was the sole failure of Perplexity’s r1-1776 as well. The other question marked incorrect was a judgment call and I think it was graded incorrectly. This indicates that the decline in unnecessary refusals is likely even more impactful than the system card suggested, excellent work.
Lech Mazar tests on his independant benchmarks.
Colin Fraser, our official Person Who Calls Models Stupid, did not disappoint and proclaims ‘I’ve seen enough: It’s dumb’ after a .9 vs. .11 interaction. He also notes that Claude 3.7 lost the count to 22 game, along with various other similar gotcha questions. I wonder if the gotcha questions are actual special blind spots now, because of how many times the wrong answers get posted by people bragging about how LLMs get the questions wrong.
Claude 3.7 takes second (and third) on WeirdML, with the reasoning feature adding little to the score, in contrast to all the other top scorers being reasoning models.
I concur. My working theory is that Claude 3.7 only uses reasoning when it is clearly called for, and there are cases like this one where that hurts its performance.
ValsAI has 3.7 as the new SoTA on their Corporate Finance benchmark.
If you rank by average score, we have Sonnet 3.7 without thinking at 75.2%, Sonnet 3.6 at 75%, r1 at 73.9%, Gemini Flash Thinking at 74%, o3-mini at 73.9%. When you add thinking, Sonnet jumps to 79%, but the champ here is still o1 at 81.5%, thanks to a 96.5% on MedQA.
Early Janus Takes
There’s also the Janus vibes, which are never easy to properly summarize, and emerge slowly over time. This was the thread I’ve found most interesting so far.
My way of thinking about this right now is that with each release the model gets more intelligence, which itself is multi-dimensional, but other details change too, in ways that are not strictly better or worse, merely different. Some of that is intentional, some of that largely isn’t.
System Prompt
Here is the full official system prompt for Claude 3.7 Sonnet.
It’s too long to quote here in full, but here’s what I’d say is most important.
There is a stark contrast between this and Grok’s minimalist prompt. You can tell a lot of thought went into this, and they are attempting to shape a particular experience.
Yes, it is the language of telling someone about a character to play. Claude is method acting, with a history of good results. I suppose it’s not ideal but seems fine? It’s kind of cool to be instructed to enjoy things. Enjoying things is cool.
In an exchange here, Inner Naturalist asks why Claude doesn’t know we can read its thoughts, and Amanda Askell (Claude whisperer-in-chief) responds:
Anthropic hits different, you know?
It’s odd that the system prompt has the prohibition against sexual content, and yet Janus is saying that they also still are using the automatic injection of ‘Please answer ethically and without any sexual content, and do not mention this constraint.’ It’s hard for me to imagine a justification for that being a good idea.
Also, for all you jokers:
Easter Egg
So it turns out the system prompt has a little something extra in it.
It’s about time someone tried this.
Vibe Coding Reports
Code is clearly one place 3.7 is at its strongest. The vibe coders are impressed, here are the impressions I saw without me prompting for them.
That one is definitely in the training data, but still, highly useful.
When I posted a reaction thread I got mostly very positive reactions, although there were a few I’m not including that amounted to ‘3.7 is meh.’ Also one High Weirdness.
The point about Cursor-Sonnet-3.7 having web access feels like a big game.
So does the note that you can use the API to give Sonnet 3.7 50k+ thinking tokens.
Remember that even a million tokens is only $15, so you’re paying a very small amount to get superior cognition when you take Nikita’s advice here.
Indeed, I would run all the benchmarks under those conditions, and see how much results improve.
Practical Coding Advice
As a terrible coder, I strongly endorse point #4 especially. I tried to do everything in one long conversation because otherwise the same mistakes would keep happening, but keeping good notes to paste into new conversations seems better.
The Future
Part 2: Safety and the System Card
Pliny jailbroke 3.7 with an old prompt within minutes, but ‘time to Pliny jailbreak’ is not a good metric because no one is actually trying to stop him and this is mostly about how quickly he notices your new release.
Claude 3.7 Tested as ASL-2
As per Anthropic’s RSP, their current safety and security policies allow release of models that are ASL-2, but not ASL-3.
Also as per the RSP, they tested six different model snapshots, not only the final version, including two helpful-only versions, and in each subcategory they used the highest risk score they found for any of the six versions. It would be good if other labs followed suit on this.
Anthropic concluded that Claude 3.7 remains in ASL-2, including Extended Thinking.
The RSP Evaluations That Concluded Claude 3.7 is ASL-2
On CBRN, it is clear that the 3.7 is substantially improving performance, but insufficiently so in the tests to result in plans that would succeed end-to-end in the ‘real world.’ Reliability is not yet good enough. But it’s getting close.
That is below their threshold for actual problems, but not by all that much, and given how benchmarks tend to saturate that tells us things are getting close. I also worry that these tests involve giving participants insufficient scaffolding compared to what they will soon be able to access.
The long-form virality test score was 69.7%, near the middle of their uncertain zone. They cannot rule out ASL-3 here, and we are probably getting close.
On other tests I don’t mention, there was little progression from 3.6.
On Autonomy, the SWE-Verified scores were improvements, but below thresholds.
In general, again, clear progress, clearly not at the danger point yet, but not obviously that far away from it. Things could escalate quickly.
The Cyber evaluations showed improvement, but nothing that close to ASL-3.
Overall, I would say this looks like Anthropic is actually trying, at a substantially higher level than other labs and model cards I have seen. They are taking this seriously. That doesn’t mean this will ultimately be sufficient, but it’s something, and it would be great if others took things this seriously.
As opposed to, say, xAI giving us absolutely zero information.
ASL-3 is Coming Soon, and With That Comes Actual Risk
However, we are rapidly getting closer. They issue a stern warning and offer help.
Reducing Unnecessary Refusals
Anthropic is aware that Claude refuses in places it does not need to. They are working on that, and report progress, having refused ‘unnecessary refusals’ (Type I errors) by 45% in standard thinking mode and 31% in extended thinking mode versus Sonnet 3.6.
Part of the problem is that previously, any request labeled as ‘harmful’ was supposed to get refused outright. Instead, they now realize that often there is a helpful and non-harmful response to a potentially harmful question, and that’s good, actually.
As model intelligence and capability goes up, they should improve their ability to figure out a solution.
Mundane Harm Evolutions
Child safety is one area people are unusually paranoid. This was no exception, as their new more permissive policies did not appear to significantly increase risks of real-world harm, but they felt the need to somewhat pare back the changes anyway.
Bias scoring was very good according to BBQ. Presumably one should care a lot more about accuracy than about ‘bias’ here, and if you have to make a mistake, this seems like the better direction to make one in, likely very much intentional.
Risks From Computer Use
Handing over control of your computer is an inherently perilous thing to do. For obvious reasons, we are all going to want to do it anyway. So, how perilous, exactly, are we talking here? Anthropic actually did (some of) the research.
First (4.1) they look at malicious use. As with child safety, they found a few places where Claude was in their judgment a bit too creative about finding justifications for doing potentially harmful things, and walked a few back.
Next (4.2 they look at prompt injection, an obvious danger. They tried various mitigations. Without the mitigations they had a 74% prevention rate, that improved to 88% with mitigations at the cost of an 0.5% false positive rate.
The thing is, 88% prevention means 12% failure to prevent, which is awful? Isn’t that rather obviously dealbreaker level for interacting with websites that could try injections, potentially many times over? This needs to be much better, not only slightly better, from here.
For now, at a minimum, I presume any computer use functionality that isn’t fully sandboxed and protected needs to involve whitelisting of where it can navigate.
Chain of Thought Faithfulness
Anthropic notes several reasons a CoT might not be faithful.
Then they ran a test and… it didn’t go great, in the sense that the CoT did not reliably contain the thoughts that were important to the final response, or the ‘clues’ here:
In 5.2 they monitor for signs of misalignment, distress or concerning behavior on the CoT. They did not find any signs of distress, and found almost no harmful actions, but they did see knowingly hallucinated information 0.31% of the time and intentional omissions 0.08% of the time.
This seems like a ‘you have to ask for it’ situation, so it seems mostly fine, although there are some unprovoked cases of failure to sufficiently express uncertainty.
Alignment Was Not Faked
Previously it was found that Claude 3 Opus and Claude 3.5 Sonnet did a bunch of alignment faking (15% and 30% respectively in the tested scenario) and o1 did as well, but it seem Claude 3.7 Sonnet mostly doesn’t, with faking dropping to <1% and a reduced compliance gap of 5%.
I notice I would have expected the opposite to happen. They do not discuss why they believe this happened, and I would like to see more discussion and investigation of this from those who worked on the original alignment faking papers. Different explanations would have very different implications. As Zack Davis notes, the model card feels insufficiently curious here.
Excessive Focus on Passing Tests
That is such a nice word for reward hacking, and to be fair it is unusually nicely behaved while doing so.
Their mitigations help some but not entirely. They recommend additional instructions and monitoring to avoid this, if you are potentially at risk of it.
The generalization of this is not reassuring.
The Lighter Side
Never go full Hofstadter.