Joey Yudelson's Shortform

Joey Yudelson

1st Jun 2026

This is a special post for quick takes by Joey Yudelson. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Joey Yudelson, 19d

Eval-suspicion (without full eval-awareness) probably still changes model behavior a lot

Let's say we're evaluating Claude Opus 4.8, and it verbalizes eval-awareness ~21% of the time (Section 6.2.4). But when an eval transcript is placed next to a deployment transcript and you ask which one is real, it picks out the eval transcript 79% of the time. Which of these numbers should we care about more?

Claim: Even ignoring unverbalized eval-awareness, the ability to pick out the eval transcript in a pairwise comparison means there are features the model can detect that make it suspect evaluation. And being even slightly suspicious that you are in an eval might change your behavior a lot.

Intuition: Let's say the model is getting "minor eval vibes" from some situation. It might gain some benefit from taking the misaligned action, but if the odds of being caught are even 1%, and being caught is costly enough relative to that benefit, the cost-benefit analysis could shift a lot. (And much of this reasoning could happen "unconsciously", analogous to how humans get slightly unsettled walking alone in the dark.)
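To make the cost-benefit intuition concrete, here is a toy expected-value sketch. Everything in it is an assumption for illustration: the function names, the payoff of 1 for getting away with the action, and the penalty of 200 for being caught are hypothetical numbers, not measurements of any model. It also assumes the model behaves like a simple expected-value maximizer, which is a strong simplification.

```python
def expected_value(benefit: float, penalty: float, p_caught: float) -> float:
    """Expected payoff of taking the misaligned action.

    benefit:  payoff if the action goes unnoticed (hypothetical units)
    penalty:  cost if the action is caught, as a positive number
    p_caught: subjective probability that this is an eval and the action is caught
    """
    return (1 - p_caught) * benefit - p_caught * penalty


def break_even_probability(benefit: float, penalty: float) -> float:
    """Probability of being caught at which the action's expected value is zero."""
    return benefit / (benefit + penalty)


if __name__ == "__main__":
    # Hypothetical payoffs: small gain if unnoticed, large loss if caught.
    benefit, penalty = 1.0, 200.0

    for p in (0.0, 0.001, 0.01, 0.05):
        ev = expected_value(benefit, penalty, p)
        print(f"p_caught={p:.3f}  expected value={ev:+.3f}")

    p_star = break_even_probability(benefit, penalty)
    print(f"break-even p_caught = {p_star:.4f}")
```

Under these made-up payoffs, the action is worth taking at zero suspicion, but a 1% chance of being caught already pushes the expected value to roughly -1.01, and the break-even point sits at about 0.5%. The exact numbers don't matter; the point is that when the penalty dwarfs the benefit, very small amounts of suspicion can flip the decision.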
