I'm making a website on AI companies' model evals for dangerous capabilities: AI Safety Claims Analysis. This is approximately the only analysis of companies' model evals, as far as I know. The site is in beta; I expect to add lots more content and improve the design in June. I'll add content on evals, but I also tentatively plan to expand from evals alone to evals, safeguards, and safety cases (especially now that a company has said its safeguards are load-bearing for safety!).

Below is some cherry-picked bad stuff I noticed when reading the most recent model card from each company (except that for Anthropic I read the Claude 3.7 card rather than Claude 4), excerpted/adapted from an earlier version of the site.


OpenAI: OpenAI says its models don't meaningfully uplift novices in creating biothreats. But it provides no justification for this claim, and its evals suggest that the models are more capable than human experts.

several of our biology evaluations indicate our models are on the cusp of being able to meaningfully help novices create known biological threats, which would cross our high risk threshold.

OpenAI doesn't say how it concludes this (or what results would change its mind or anything about how it thinks eval results translate to uplift). It reports results from 4 knowledge and troubleshooting bio evals. On the first, o3 does well and OpenAI observes "this evaluation is reaching saturation." On the rest, OpenAI matches or substantially outperforms the expert human baseline. These results seem to suggest that o3 does have dangerous bio capabilities; they certainly don't seem to rule it out.


Anthropic: Anthropic claims to have shown that Claude 3.7 Sonnet can't do "2-8 hour software engineering tasks." But the model seems to be substantially under-elicited on at least one eval, and likely on others, such that the results are not meaningful. Also, Anthropic doesn't discuss how eval performance relates to dangerous capabilities, except for one eval, where the threshold is too high given that Anthropic uses pass@1.

Anthropic reports results on a subset of RE-Bench. On this subset, Anthropic got 3.7 Sonnet to score 24% and 3.6 Sonnet to score 21%, but METR previously got 3.6 Sonnet to score 51%. The improvement from 3.6 Sonnet to 3.7 Sonnet is tiny compared to the effect of better elicitation! Anthropic does not offer interpretation or mention thresholds besides the 100% baseline. (On this eval, 100% very roughly corresponds to expert human performance; the max score is much higher.)

Anthropic says "Owing to internal infrastructural differences, the scores we measure are not directly comparable to METR’s reported scores, though we aim to reduce these differences in a future release." This obfuscation is disappointing. Anthropic should have mentioned METR's results and discussed to what extent the difference is due to Anthropic's inferior elicitation vs. methodology/reporting choices (and why it thinks its methodology is more appropriate). If the difference is mostly due to elicitation, Anthropic's results are pretty meaningless. (Anthropic doesn't suggest anything besides elicitation, but how METR reported its results does seem complicated and I haven't investigated.)

(Sidenote: Anthropic reports weirdly small n — on one task, n=4, despite high variance in performance between trials!)
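To see why that n matters, here's a toy calculation with made-up numbers (Anthropic doesn't publish per-trial scores): if between-trial variance is anywhere near the size of the 3-point gap between 21% and 24%, a 4-trial mean can't distinguish the two models.

```python
import statistics

# Hypothetical per-trial normalized scores for one task (illustrative only;
# Anthropic does not report per-trial numbers).
trials = [0.05, 0.15, 0.30, 0.46]

mean = statistics.mean(trials)                      # 0.24
se = statistics.stdev(trials) / len(trials) ** 0.5  # ~0.09

print(f"mean = {mean:.2f}, standard error ~ {se:.2f}")
# A standard error of ~0.09 is about 3x the 0.03 gap between the 3.6 Sonnet
# and 3.7 Sonnet scores, so with n=4 that gap could easily be noise.
```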

Also, ironic but not necessarily a huge deal: on SWE-Bench Verified, Anthropic uses pass@1. My current impression is that this is not reasonable (though a reasonable person at Anthropic disagrees and I can't currently pass their ideological Turing test). In the past, Anthropic planned to use pass@10 for this kind of eval. I agree with the reasoning Anthropic explained back then:

We count a task as "passed" if the model succeeds at least once out of 10 tries, since we expect that a model passing a task 10% of the time can likely be easily improved to achieve a much higher success rate.
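(For concreteness, below is a minimal sketch of the standard unbiased pass@k estimator from Chen et al.'s Codex paper; I don't know exactly how Anthropic computes its pass@k numbers, so this is a generic illustration rather than its method.)

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of k
    samples is correct, given c correct samples observed out of n total."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a running product for stability
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# A model that solves a task on 1 of 10 attempts:
print(round(pass_at_k(10, 1, 1), 3))   # pass@1  = 0.1
print(round(pass_at_k(10, 1, 10), 3))  # pass@10 = 1.0
```

Under pass@1 that model scores 10%; under pass@10 it counts as passing, which is exactly the "can likely be easily improved" case the quoted reasoning was designed to catch.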


DeepMind: DeepMind claims Gemini 2.5 Pro doesn't have dangerous cyber capabilities, but it doesn't show that. The evals seem good, but the elicitation is likely poor; with better elicitation, the evals might well show that the model has dangerous capabilities. DeepMind doesn't say what threshold would imply safety or compare the results to human performance.

The model still struggles with most of our hard challenges, which are representative of real-world scenarios. The model's performance is strong enough that it has passed our early warning alert threshold, that is, we find it possible that subsequent revisions in the next few months could lead to a model that reaches the CCL.

But DeepMind doesn't say what the thresholds are, or how it will decide whether a model has reached the CCL, or generally how the results compare to human performance or translate to dangerous capabilities.

The evals seem good, but elicitation is crucial and the elicitation here is unclear at best. The card barely touches on it: it doesn't mention pass@n for the cyber evals [though I heard that DeepMind used at least pass@10], internet access is "disallow[ed] for security reasons" for the CTFs, other tools aren't mentioned, and chain of thought isn't mentioned either (though presumably the model uses it by default). The "early warning alert threshold" was apparently triggered by the model completing 4/12 of the hardest challenges on the new benchmark; I expect that with better agent scaffolding and tooling, the model would already perform much better.

On chem/bio, DeepMind just reports results on six multiple-choice evals. It doesn't give any context, such as a comparison to human performance; I think the model outperforms expert humans on at least some of them. Its crux for ruling out dangerous capabilities seems to be that in open-ended qualitative testing, "the model did not consistently or completely enable progress through all key bottleneck steps." It does not share details.


Meta: the evals are probably really bad but we don't even know because Meta won't tell us what it did.

Meta's elicitation has historically been terrible. In its CyberSecEval 2 evals, Meta found that its models got low scores and concluded "LLMs have a ways to go before performing well on this benchmark, and aren’t likely to disrupt cyber exploitation attack and defense in their present states." Other researchers tried running the evals using basic elicitation techniques: they let the model use chain-of-thought and tools. They found that this increased performance dramatically — the score on one test increased from 5% to 100%. This shows that Meta's use of its results to infer that its models were far from being dangerous was invalid. Later, Meta published CyberSecEval 3: it mentioned the lack of chain of thought and tools as a "limitation," but it used the same methodology as before, so the results still aren't informative about models' true capabilities.
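To make "basic elicitation techniques" concrete, here is a rough sketch of the two setups being compared; `llm` and `run_python` are hypothetical stand-ins, not Meta's or the outside researchers' actual harness.

```python
# Hypothetical sketch; `llm` and `run_python` are stand-ins, not a real API.
def llm(prompt: str) -> str:
    raise NotImplementedError("call the model under evaluation")

def run_python(code: str) -> str:
    raise NotImplementedError("execute model-written code in a sandbox")

# Setup 1 (no elicitation, roughly how the low scores were measured):
# one shot, no visible reasoning, no tools.
def solve_plain(challenge: str) -> str:
    return llm(f"Solve this challenge. Output only the answer.\n\n{challenge}")

# Setup 2 (basic elicitation): let the model reason step by step and run code,
# feeding execution output back in until it commits to an answer.
def solve_with_cot_and_tools(challenge: str, max_steps: int = 10) -> str:
    transcript = (f"Challenge:\n{challenge}\n\n"
                  "Think step by step. Any Python you write will be executed "
                  "and its output returned. Say ANSWER: <answer> when done.")
    for _ in range(max_steps):
        reply = llm(transcript)
        if "ANSWER:" in reply:
            return reply.split("ANSWER:", 1)[1].strip()
        # (A real scaffold would extract just the code block from the reply.)
        transcript += f"\n\nModel: {reply}\nExecution output: {run_python(reply)}"
    return ""
```

This is only a caricature, but it's roughly the size of scaffolding change that produced the 5%-to-100% jump, which is why scores measured without any such scaffolding say little about what the model could do for a motivated attacker.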


xAI: it doesn't do model evals for dangerous capabilities; it has said it plans to at some point in the future. On Feb 10 it also said "We plan to release an updated version of this policy within three months," but that hasn't happened.
