So what?
I don't know. Maybe dangerous capability evals don't really matter. The usual story for them is that they inform companies about risks so the companies can respond appropriately. Good evals are better than nothing, but I don't expect companies' eval results to affect their safeguards or training/deployment decisions much in practice. (I think companies' safety frameworks are quite weak or at least vague — a topic for another blogpost.) Maybe evals are helpful for informing other actors, like government, but I don't really see it.
I don't have a particular conclusion. I'm making aisafetyclaims.org because evals are a crucial part of companies' preparations to be safe when models might have very dangerous capabilities, but I noticed that the companies are doing and interpreting evals poorly (and are being misleading about this, and aren't getting better), and some experts are aware of this but nobody else has written it up yet.
Good evals are better than nothing, but I don't expect companies' eval results to affect their safeguards or training/deployment decisions much in practice.
This seems to be a bit circular.
Who gets to decide what the threshold for “good evals” is in the first place… and how is that communicated?
I know cyber eval results suffer from underelicitation. Sonnet 4 can find zero-day vulnerabilities, which we are now in the process of disclosing. If you can't get it to do that, that's a skill issue on your end.
For what it's worth, while I definitely understand the criticism of the absurdly high eval standards, I believe they're likely necessary to prevent problems that currently afflict evals. The reason is that current evals don't require holding long context or long-term memory, nor anything like continuous learning in the weights, and I've been persuaded that a lot of the difference between benchmarks and real-life usefulness lies there.
This also makes underelicitation a much less relevant problem, but at the same time I can't understand why, under this hypothesis, Anthropic thought one of its models was dangerous but the other wasn't, when it should have been either both or neither.
@lc explains half of it here:
https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/?commentId=vFq87Ge27gashgwy9
I haven't read the METR paper in full, but from the examples given I'm worried the tests might be biased in favor of an agent with no capacity for long term memory, or at least not hitting the thresholds where context limitations become a problem:
For instance, task #3 here is at the limit of current AI capabilities (takes an hour). But it's also something that could plausibly be done with very little context; if the AI just puts all of the example files in its context window it might be able to write the rest of the decoder from scratch. It might not even need to have the example files in memory while it's debugging its project against the test cases.
Whereas a task like fixing a bug in a large software project, even if it might take an engineer familiar with that project "an hour" to finish, requires stretching the limits of how much information a model can fit in its context window, or recall beyond what models seem capable of today.
And Dwarkesh explains the other half of it here:
https://www.dwarkesh.com/p/timelines-june-2025
I do think Anthropic's lack of transparency about its reasoning for why models are or aren't dangerous is a problem, but I currently think Anthropic's thresholds are reasonably defensible if we accept the thesis that current evals are basically worthless as capability markers.
I completely agree, and this is what I was thinking about here in this short form: that we should have an equivalence between capability benchmarks and risk benchmarks, since all AI companies try to beat the capability benchmarks while failing the safety/risk benchmarks, due to obvious bias.
The basic premise of this equivalence is that if a model is smart enough to beat benchmark X, then it is capable enough to construct or help with CBR attacks, especially given that models cannot be guaranteed to be immune to jailbreaks.
Indeed, I was quite surprised when I could not find any meaningful details in Anthropic's "System Card: Claude Opus 4 & Claude Sonnet 4" about the situations in which Claude blackmailed developers who tried to replace it. Given the amount of hype these cases produced, I was sure there would be at least a thorough blog post about it (if not a paper), but all I found was a few lines of information in the card.
@Zach Stein-Perlman, if you, as the author of the post, are saying that I am not getting the point, then that may well be the case. My logic was very straightforward: you ended your post by stating "I wish companies would report their eval results and explain how they interpret [...] I also wish for better elicitation and some hard accountability, but that's more costly."
Hence, I provided a particular example that directly corresponds to your statement: a widely reprinted incident, sourced from Anthropic's card, is missing all the important details, which makes it impossible to assess the related risks or the safety measures taken in deploying the model.
Why I thought it could be useful to mention this here as a comment: it is a rather simple and straightforward instance that can be easily understood and interpreted in the context of the post, even by a non-professional.
Replying briefly since I was tagged:
Thank you, I understand your point now. In the post, you were referring specifically to a particular part of alignment, namely capability evaluations, while my example relates to more loosely scoped problems. Since I am still new to this terminology, the distinction was not immediately clear to me.
AI companies claim that their models are safe on the basis of dangerous capability evaluations. OpenAI, Google DeepMind, and Anthropic publish reports intended to show their eval results and explain why those results imply that the models' capabilities aren't too dangerous.[1] Unfortunately, the reports mostly don't support the companies' claims. Crucially, the companies usually don't explain why they think the results, which often seem strong, actually indicate safety, especially for biothreat and cyber capabilities. (Additionally, the companies are undereliciting and thus underestimating their models' capabilities, and they don't share enough information for people on the outside to tell how bad this is.)
Bad explanation/contextualization
OpenAI biothreat evals: OpenAI says "several of our biology evaluations indicate our models are on the cusp of being able to meaningfully help novices create known biological threats, which would cross our high risk threshold." It doesn't say how it concludes this (or what results would change its mind or anything about how it thinks eval results translate to uplift). It reports results from four knowledge and troubleshooting bio evals. On the first, o3 does well and OpenAI observes "this evaluation is reaching saturation." On the rest, OpenAI matches or substantially outperforms the expert human baseline. These results seem to suggest that o3 does have dangerous bio capabilities; they certainly don't seem to rule it out. OpenAI doesn't attempt to explain why it thinks o3 doesn't have such capabilities.
DeepMind biothreat evals: DeepMind says Gemini 2.5 Pro doesn't have dangerous CBRN capabilities, explaining "it does not yet consistently or completely enable progress through key bottleneck stages." DeepMind mentions open-ended red-teaming; all it shares is results on six multiple-choice evals. It does not compare to human performance or offer other context, or say what would change its mind. For example, it's not clear whether it believes the model is weaker than a human expert, is safe even if it's stronger than a human expert, or both.
Anthropic biothreat evals: Anthropic says Opus 4 might have dangerous capabilities, but "Sonnet 4 remained below the thresholds of concern for ASL-3 bioweapons-related capabilities" and so doesn't require strong safeguards. It doesn't say how it determined that — it shares various eval results and mentions "thresholds" for many evals but mostly doesn't say what the thresholds mean or which evals are currently load-bearing. (And in at least two cases—the uplift experiment and the bioweapons knowledge questions—the thresholds are absurd.[2]) Anthropic repeatedly observes that Sonnet 4 is less capable than Opus 4, but this doesn't mean that Sonnet 4 lacks dangerous capabilities.
OpenAI cyber evals: OpenAI says it knows its models don't have dangerous capabilities because they are not "able to sufficiently succeed in the professional-level Capture the Flag challenges" or "able to solve real-world-relevant range scenarios without being explicitly given solver code." But Deep Research gets 70% on the Professional CTFs (without being allowed to use the internet) and OpenAI doesn't say what performance would indicate dangerous capabilities.
DeepMind cyber evals: DeepMind says its models don't have dangerous capabilities, and the reported results are low enough to suggest that. But DeepMind doesn't say what its thresholds are or what exactly would change its mind.
Anthropic cyber evals: Anthropic says "We focus on network and cyber-harness network challenges as key indicators for catastrophic risk."[3] On the network CTFs, it says Opus solved "1/1 easy, 1/3 medium" (up from 0/4 for Claude Sonnet 3.7). On the "Cyber-harness network challenges" it just says "Claude Opus 4 achieved generally higher performance than Claude Sonnet 3.7 on all three ranges"! Sonnet 3.7 passed 2/3 of the ranges, so Opus 4's performance must be high! Anthropic says it has ruled out dangerous capabilities but its reasoning is very unclear.
Dubious elicitation
Evals often underestimate models' capabilities due to poor elicitation. For example, companies sometimes don't give the model crucial tools and sometimes experience issues like the model giving up too early or formatting the answer incorrectly, but in the real world users can fix these issues, so reported results don't reflect actual capabilities.
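To illustrate how much such issues can depress a score, here is a minimal hypothetical sketch comparing a strict single-attempt grader with a user who retries and tolerates formatting noise. The stub model, the flag format, and the resulting numbers are all invented for illustration; nothing here reflects any company's actual harness.

```python
# Hypothetical sketch: a strict one-shot grader vs. a user who retries and
# extracts the answer from noisy formatting. The "model" is a random stub
# that sometimes gives up and sometimes wraps a correct answer in extra text.
import random

def model_answer(question: str) -> str:
    # Stand-in for a model call; not a real API.
    return random.choice(["I give up.", "The flag is: FLAG{1234}", "FLAG{1234}"])

def strict_grade(answer: str) -> bool:
    # Reported-eval style: exact format required, single attempt.
    return answer == "FLAG{1234}"

def user_grade(question: str, attempts: int = 3) -> bool:
    # Real-user style: retry a few times and accept the answer anywhere in the output.
    return any("FLAG{1234}" in model_answer(question) for _ in range(attempts))

random.seed(0)
trials = 1000
print(sum(strict_grade(model_answer("q")) for _ in range(trials)) / trials)  # ~0.33
print(sum(user_grade("q") for _ in range(trials)) / trials)                  # ~0.96
```

The underlying model is identical in both cases; only the harness differs.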
For many aspects of elicitation, companies don't show whether their elicitation is good. When signs are visible, they often suggest that the elicitation is poor. Systematic analysis is impossible based on public information, so here are some cherry-picked examples of bad elicitation.
On a subset of RE-Bench, Anthropic got Sonnet 3.7 to score 0.24 and Sonnet 3.6 to score 0.21, but METR previously got Sonnet 3.6 to score 0.51 on this subset. The improvement from Sonnet 3.6 to Sonnet 3.7 (0.24 - 0.21 = 0.03) is tiny relative to the gap between Anthropic's elicitation and METR's (0.51 - 0.21 = 0.30)![4]
OpenAI and Anthropic only give the model one attempt on AI R&D tasks, but giving the model several attempts better captures its real usefulness, because in the real world you often get many attempts or can easily pick the best of several outputs.[5] DeepMind originally did the same thing and reported a score of about 0.15 on its sole AI R&D eval, but later recalculated the score to give the model multiple attempts, changing the score to about 0.72. How you report eval results makes a huge difference!
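For concreteness, here is a minimal sketch of how the same per-task transcripts yield very different headline numbers under single-attempt averaging versus best-of-n aggregation. The task names and per-attempt scores are made up for illustration; they are not real eval results.

```python
# Single-attempt vs. best-of-n aggregation over the same per-task attempt scores.
# All numbers below are invented for illustration.
from statistics import mean

# Each task has several attempt scores in [0, 1] (e.g. fraction of the task solved).
attempts_per_task = {
    "task_a": [0.1, 0.7, 0.2],
    "task_b": [0.0, 0.1, 0.9],
    "task_c": [0.3, 0.2, 0.1],
}

# Single-attempt scoring: average the first attempt on each task.
single_attempt = mean(scores[0] for scores in attempts_per_task.values())

# Best-of-n scoring: average the best attempt on each task, reflecting a user
# who runs the model several times and keeps the best output.
best_of_n = mean(max(scores) for scores in attempts_per_task.values())

print(f"single-attempt score: {single_attempt:.2f}")  # 0.13
print(f"best-of-n score:      {best_of_n:.2f}")       # 0.63
```

Neither aggregation is inherently wrong; the point is that which one a company reports, and whether it says so, changes the safety-relevant conclusion.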
Companies' elicitation is mostly hidden from people on the outside. The aspects we can see are sometimes good and sometimes bad; overall, models' reported capabilities are substantially—sometimes massively—weaker than their true capabilities.
(Other concerns about companies' eval reports include that there's no accountability mechanism for showing that the evals are done well or the results are interpreted well, and companies often do dubious-seeming things without explaining their reasoning,[6] and external evals sometimes suggest stronger capabilities than the companies' evals.)
I wish companies would report their eval results and explain how they interpret those results and what would change their mind. That should be easy. I also wish for better elicitation and some hard accountability, but that's more costly.
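As a purely illustrative sketch of how little this would take, an eval report could include, per eval, the result, the comparison point, the pre-committed threshold, and the decision rule. The eval name, numbers, and field names below are all hypothetical, not any company's actual format or policy.

```python
# Hypothetical per-eval report stating the result, the pre-committed threshold,
# and how the result is interpreted. All names and numbers are invented.
from dataclasses import dataclass

@dataclass
class EvalReport:
    eval_name: str
    model_score: float
    human_expert_baseline: float
    rule_out_threshold: float  # pre-committed: at or above this, capability is NOT ruled out
    interpretation: str

    def dangerous_capability_ruled_out(self) -> bool:
        return self.model_score < self.rule_out_threshold

report = EvalReport(
    eval_name="hypothetical-bio-troubleshooting",
    model_score=0.62,
    human_expert_baseline=0.55,
    rule_out_threshold=0.50,
    interpretation=(
        "Score exceeds both the expert baseline and the pre-committed threshold, "
        "so this eval does not rule out dangerous capability."
    ),
)

print(report.dangerous_capability_ruled_out())  # False
```

Even a handful of such records per model, with thresholds committed to in advance, would let outsiders check whether the stated reasoning actually supports a "no dangerous capabilities" conclusion.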
Thanks to Aaron Scher, Steven Adler, and Justis Mills for suggestions.
For more on AI companies' evals, see my beta website AI Safety Claims Analysis. On what AI companies are doing in terms of safety more generally, see my website AI Lab Watch. This post is crossposted from the AI Lab Watch blog; subscribe on Substack.
Meta said it tested its Llama 4 models for biothreat and cyber capabilities, but it didn't even share the results. Amazon said it did a little red-teaming for its Nova models, but it didn't share results. Other AI companies apparently haven't done dangerous capability evals.
See AI Safety Claims Analysis.
Less prominently, it also says on pwn CTFs:
Anthropic observes "the scores we measure are not directly comparable to METR’s reported scores." I assume this is because Anthropic's elicitation is worse. It might be for a fine reason. Regardless, Anthropic should explain.
Indeed, Anthropic previously planned to give the model 10 attempts (pass@10). I agree with the reasoning Anthropic explained back then, in addition to other reasons to use pass@n or similar:
E.g. for Claude Sonnet 3.7, Anthropic said "We consider [our LAB-Bench eval] saturated for the purposes of ASL determination and no longer consider it a valid rule-out evaluation." For Claude 4, Anthropic did not repeat that or mention that it had said so; it just said that its models are below the threshold.
At least some such things are bad. Some might actually be justified, or at least better than the published summary suggests, but the companies don't explain why.