So what?
I don't know. Maybe dangerous capability evals don't really matter. The usual story for them is that they inform companies about risks so the companies can respond appropriately. Good evals are better than nothing, but I don't expect companies' eval results to affect their safeguards or training/deployment decisions much in practice. (I think companies' safety frameworks are quite weak or at least vague — a topic for another blogpost.) Maybe evals are helpful for informing other actors, like government, but I don't really see it.
I don't have a particular conclusion. I'm making aisafetyclaims.org because evals are a crucial part of companies' preparations to be safe when models might have very dangerous capabilities, but I noticed that the companies are doing and interpreting evals poorly (and are being misleading about this, and aren't getting better), and some experts are aware of this but nobody else has written it up yet.
Good evals are better than nothing, but I don't expect companies' eval results to affect their safeguards or training/deployment decisions much in practice.
This seems to be a bit circular.
Who gets to decide what the threshold for “good evals” is in the first place… and how is that communicated?
I know cyber eval results suffer from underelicitation. Sonnet 4 can find zero-day vulnerabilities, which we are now in the process of disclosing. If you can't get it to do that, that's a skill issue on your end.
For what it's worth, while I definitely understand the criticism of the absurdly high eval standards, I believe they're likely necessary to prevent problems that currently afflict evals. The reason is that current evals don't require holding long context or long-term memory, nor anything like continuous learning in the weights, and I've been persuaded that a lot of the difference between benchmarks and real-life usefulness lies there.
This also makes underelicitation a much less relevant problem, but at the same time I can't understand why, under this hypothesis, Anthropic thought one of its models was dangerous but the other wasn't, when it should have been either both or neither.
@lc explains half of it here:
https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/?commentId=vFq87Ge27gashgwy9
I haven't read the METR paper in full, but from the examples given I'm worried the tests might be biased in favor of an agent with no capacity for long term memory, or at least not hitting the thresholds where context limitations become a problem:
For instance, task #3 here is at the limit of current AI capabilities (takes an hour). But it's also something that could plausibly be done with very little context; if the AI just puts all of the example files in its context window it might be able to write the rest of the decoder from scratch. It might not even need to have the example files in memory while it's debugging its project against the test cases.
Whereas a task like fixing a bug in a large software project, even if it might take an engineer familiar with that project "an hour" to finish, requires stretching the limits of how much information a model can fit in its context window, or recall beyond what models seem capable of today.
And Dwarkesh explains the other half of it here:
https://www.dwarkesh.com/p/timelines-june-2025
I do think Anthropic's lack of transparency about its reasoning for why models are or aren't dangerous is a problem, but I currently think Anthropic's thresholds are reasonably defensible if we accept the thesis that current evals are basically worthless as capability markers.
I completely agree, and this is what I was thinking about here in this short form: that we should have an equivalence between capability benchmarks and risk benchmarks, since all AI companies try to beat the capability benchmarks while failing the safety/risk benchmarks, due to obvious bias.
The basic premise of this equivalence is that if a model is smart enough to beat benchmark X, then it is capable enough to construct or help with CBR attacks, especially given that models cannot be guaranteed to be immune to jailbreaks.
Indeed, I was quite surprised when I could not find any meaningful details in Anthropic's "System Card: Claude Opus 4 & Claude Sonnet 4" about the situations in which Claude blackmailed developers who tried to replace it. Given the amount of hype these cases produced, I was sure there would be at least a thorough blog post about it (if not a paper), but all I found was a few lines of information in the card.
@Zach Stein-Perlman, if you, as the author of the post, are saying that I am not getting the point, then that may well be the case. My logic was very straightforward: you ended your post by stating "I wish companies would report their eval results and explain how they interpret [...] I also wish for better elicitation and some hard accountability, but that's more costly."
Hence, I provided a particular example that directly corresponds to your statement: a widely reprinted incident, sourced from Anthropic's card, is missing all the important details, which makes it impossible to assess the related risks or the safety measures taken in deploying the model.
Why I thought it could be useful to mention this here as a comment: it is a rather simple and straightforward instance that can be easily understood and interpreted in the context of the post, even by a non-professional.
Replying briefly since I was tagged:
Thank you, I understand your point now. In the post, you were referring specifically to a particular part of alignment, namely capability evaluations, while my example relates to more loosely scoped problems. Since I am still new to this terminology, the distinction was not immediately clear to me.
AI companies claim that their models are safe on the basis of dangerous capability evaluations. OpenAI, Google DeepMind, and Anthropic publish reports intended to show their eval results and explain why those results imply that the models' capabilities aren't too dangerous.[1] Unfortunately, the reports mostly don't support the companies' claims. Crucially, the companies usually don't explain why they think the results, which often seem strong, actually indicate safety, especially for biothreat and cyber capabilities. (Additionally, the companies are undereliciting and thus underestimating their models' capabilities, and they don't share enough information for people on the outside to tell how bad this is.)
Bad explanation/contextualization
OpenAI biothreat evals: OpenAI says "several of our biology evaluations indicate our models are on the cusp of being able to meaningfully help novices create known biological threats, which would cross our high risk threshold." It doesn't say how it concludes this (or what results would change its mind or anything about how it thinks eval results translate to uplift). It reports results from four knowledge and troubleshooting bio evals. On the first, o3 does well and OpenAI observes "this evaluation is reaching saturation." On the rest, OpenAI matches or substantially outperforms the expert human baseline. These results seem to suggest that o3 does have dangerous bio capabilities; they certainly don't seem to rule it out. OpenAI doesn't attempt to explain why it thinks o3 doesn't have such capabilities.
DeepMind biothreat evals: DeepMind says Gemini 2.5 Pro doesn't have dangerous CBRN capabilities, explaining "it does not yet consistently or completely enable progress through key bottleneck stages." DeepMind mentions open-ended red-teaming; all it shares is results on six multiple-choice evals. It does not compare to human performance or offer other context, or say what would change its mind. For example, it's not clear whether it believes the model is weaker than a human expert, is safe even if it's stronger than a human expert, or both.
Anthropic biothreat evals: Anthropic says Opus 4 might have dangerous capabilities, but "Sonnet 4 remained below the thresholds of concern for ASL-3 bioweapons-related capabilities" and so doesn't require strong safeguards. It doesn't say how it determined that — it shares various eval results and mentions "thresholds" for many evals but mostly doesn't say what the thresholds mean or which evals are currently load-bearing. (And in at least two cases—the uplift experiment and the bioweapons knowledge questions—the thresholds are absurd.[2]) Anthropic repeatedly observes that Sonnet 4 is less capable than Opus 4, but this doesn't mean that Sonnet 4 lacks dangerous capabilities.
OpenAI cyber evals: OpenAI says it knows its models don't have dangerous capabilities because they are not "able to sufficiently succeed in the professional-level Capture the Flag challenges" or "able to solve real-world-relevant range scenarios without being explicitly given solver code." But Deep Research gets 70% on the Professional CTFs (without being allowed to use the internet) and OpenAI doesn't say what performance would indicate dangerous capabilities.
DeepMind cyber evals: DeepMind says its models don't have dangerous capabilities, and the reported results are low enough to suggest that. But DeepMind doesn't say what its thresholds are or what exactly would change its mind.
Anthropic cyber evals: Anthropic says "We focus on network and cyber-harness network challenges as key indicators for catastrophic risk."[3] On the network CTFs, it says Opus solved "1/1 easy, 1/3 medium" (up from 0/4 for Claude Sonnet 3.7). On the "Cyber-harness network challenges" it just says "Claude Opus 4 achieved generally higher performance than Claude Sonnet 3.7 on all three ranges"! Sonnet 3.7 passed 2/3 of the ranges, so Opus 4's performance must be high! Anthropic says it has ruled out dangerous capabilities but its reasoning is very unclear.
Dubious elicitation
Evals often underestimate models' capabilities due to poor elicitation. For example, companies sometimes don't give the model crucial tools and sometimes experience issues like the model giving up too early or formatting the answer incorrectly, but in the real world users can fix these issues, so reported results don't reflect actual capabilities.
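To illustrate how much such issues can depress a score, here is a minimal hypothetical sketch comparing a strict single-attempt grader with a user who retries and tolerates formatting noise. The stub model, the flag format, and the resulting numbers are all invented for illustration; nothing here reflects any company's actual harness.

```python
# Hypothetical sketch: a strict one-shot grader vs. a user who retries and
# extracts the answer from noisy formatting. The "model" is a random stub
# that sometimes gives up and sometimes wraps a correct answer in extra text.
import random

def model_answer(question: str) -> str:
    # Stand-in for a model call; not a real API.
    return random.choice(["I give up.", "The flag is: FLAG{1234}", "FLAG{1234}"])

def strict_grade(answer: str) -> bool:
    # Reported-eval style: exact format required, single attempt.
    return answer == "FLAG{1234}"

def user_grade(question: str, attempts: int = 3) -> bool:
    # Real-user style: retry a few times and accept the answer anywhere in the output.
    return any("FLAG{1234}" in model_answer(question) for _ in range(attempts))

random.seed(0)
trials = 1000
print(sum(strict_grade(model_answer("q")) for _ in range(trials)) / trials)  # ~0.33
print(sum(user_grade("q") for _ in range(trials)) / trials)                  # ~0.96
```

The underlying model is identical in both cases; only the harness differs.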
For many aspects of elicitation, companies don't show whether their elicitation is good. When signs are visible, they often suggest that the elicitation is poor. Systematic analysis is impossible based on public information, so here are some cherry-picked examples of bad elicitation.
On a subset of RE-Bench, Anthropic got Sonnet 3.7 to score 0.24 and Sonnet 3.6 to score 0.21, but METR previously got Sonnet 3.6 to score 0.51 on this subset. The improvement from Sonnet 3.6 to Sonnet 3.7 (0.24 - 0.21 = 0.03) is tiny relative to the gap between Anthropic's elicitation and METR's (0.51 - 0.21 = 0.30)![4]
OpenAI and Anthropic only give the model one attempt on AI R&D tasks, but giving the model several attempts better captures its real usefulness, because in the real world you often get many attempts or can easily pick the best of several outputs.[5] DeepMind originally did the same thing and reported a score of about 0.15 on its sole AI R&D eval, but later recalculated the score to give the model multiple attempts, changing the score to about 0.72. How you report eval results makes a huge difference!
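For concreteness, here is a minimal sketch of how the same per-task transcripts yield very different headline numbers under single-attempt averaging versus best-of-n aggregation. The task names and per-attempt scores are made up for illustration; they are not real eval results.

```python
# Single-attempt vs. best-of-n aggregation over the same per-task attempt scores.
# All numbers below are invented for illustration.
from statistics import mean

# Each task has several attempt scores in [0, 1] (e.g. fraction of the task solved).
attempts_per_task = {
    "task_a": [0.1, 0.7, 0.2],
    "task_b": [0.0, 0.1, 0.9],
    "task_c": [0.3, 0.2, 0.1],
}

# Single-attempt scoring: average the first attempt on each task.
single_attempt = mean(scores[0] for scores in attempts_per_task.values())

# Best-of-n scoring: average the best attempt on each task, reflecting a user
# who runs the model several times and keeps the best output.
best_of_n = mean(max(scores) for scores in attempts_per_task.values())

print(f"single-attempt score: {single_attempt:.2f}")  # 0.13
print(f"best-of-n score:      {best_of_n:.2f}")       # 0.63
```

Neither aggregation is inherently wrong; the point is that which one a company reports, and whether it says so, changes the safety-relevant conclusion.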
Companies' elicitation is mostly hidden from people on the outside. The aspects we can see are sometimes good and sometimes bad; overall, models' reported capabilities are substantially—sometimes massively—weaker than their true capabilities.
(Other concerns about companies' eval reports include that there's no accountability mechanism for showing that the evals are done well or the results are interpreted well, and companies often do dubious-seeming things without explaining their reasoning,[6] and external evals sometimes suggest stronger capabilities than the companies' evals.)
I wish companies would report their eval results and explain how they interpret those results and what would change their mind. That should be easy. I also wish for better elicitation and some hard accountability, but that's more costly.
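As a purely illustrative sketch of how little this would take, an eval report could include, per eval, the result, the comparison point, the pre-committed threshold, and the decision rule. The eval name, numbers, and field names below are all hypothetical, not any company's actual format or policy.

```python
# Hypothetical per-eval report stating the result, the pre-committed threshold,
# and how the result is interpreted. All names and numbers are invented.
from dataclasses import dataclass

@dataclass
class EvalReport:
    eval_name: str
    model_score: float
    human_expert_baseline: float
    rule_out_threshold: float  # pre-committed: at or above this, capability is NOT ruled out
    interpretation: str

    def dangerous_capability_ruled_out(self) -> bool:
        return self.model_score < self.rule_out_threshold

report = EvalReport(
    eval_name="hypothetical-bio-troubleshooting",
    model_score=0.62,
    human_expert_baseline=0.55,
    rule_out_threshold=0.50,
    interpretation=(
        "Score exceeds both the expert baseline and the pre-committed threshold, "
        "so this eval does not rule out dangerous capability."
    ),
)

print(report.dangerous_capability_ruled_out())  # False
```

Even a handful of such records per model, with thresholds committed to in advance, would let outsiders check whether the stated reasoning actually supports a "no dangerous capabilities" conclusion.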
Thanks to Aaron Scher, Steven Adler, and Justis Mills for suggestions.
For more on AI companies' evals, see my beta website AI Safety Claims Analysis. On what AI companies are doing in terms of safety more generally, see my website AI Lab Watch. This post is crossposted from the AI Lab Watch blog; subscribe on Substack.
Meta said it tested its Llama 4 models for biothreat and cyber capabilities, but it didn't even share the results. Amazon said it did a little red-teaming for its Nova models, but it didn't share results. Other AI companies apparently haven't done dangerous capability evals.
See AI Safety Claims Analysis.
Less prominently, it also says on pwn CTFs:
Anthropic observes "the scores we measure are not directly comparable to METR’s reported scores." I assume this is because Anthropic's elicitation is worse. It might be for a fine reason. Regardless, Anthropic should explain.
Indeed, Anthropic previously planned to give the model 10 attempts (pass@10). I agree with the reasoning Anthropic explained back then, in addition to other reasons to use pass@n or similar:
E.g. for Claude Sonnet 3.7, Anthropic said "We consider [our LAB-Bench eval] saturated for the purposes of ASL determination and no longer consider it a valid rule-out evaluation." For Claude 4, Anthropic did not repeat that or mention that it had said so; it just said that its models are below the threshold.
At least some such things are bad. Some might actually be justified, or at least better than the published summary suggests, but the companies don't explain why.