perepelart — LessWrong


Replying to AI companies' eval reports mostly don't support their claims


Thank you! I took a look at it back when it was published, but I have only now noticed that my "thanks" emote on your message did not register.

Replying to AI companies' eval reports mostly don't support their claims

perepelart 8mo


Thank you, I understand your point now. In the post, you were referring specifically to a particular part of alignment, namely capability evaluations, while my example relates to more loosely scoped problems. Since I am still new to this terminology, the distinction was not immediately clear to me.

Replying to AI companies' eval reports mostly don't support their claims

perepelart 8mo


@Zach Stein-Perlman, if you, as the author of the post, say that I am missing the point, then that may well be the case. My logic was straightforward: you ended your post by stating "I wish companies would report their eval results and explain how they interpret [...] I also wish for better elicitation and some hard accountability, but that's more costly."

Hence, I provided a particular example of a situation that corresponds directly to your statement: a widely reported incident, based on Anthropic's system card, where the card omits all the important details, making it impossible to assess the related risks and the safety measures taken in deploying the model.

Why I thought it could be useful to mention it here as a comment: it is a rather simple and straightforward instance that can be easily understood and interpreted in the context of the post, even by a non-professional.

Replying to AI companies' eval reports mostly don't support their claims

perepelart 8mo


Indeed, I was quite surprised when I could not find any meaningful details in Anthropic's "System Card: Claude Opus 4 & Claude Sonnet 4" about the situations in which Claude blackmailed the developers who tried to replace him. Given the amount of hype these cases produced, I was sure there would be at least a thorough blog post about it (if not a paper), but all I found was several lines of information in their card.
