I'm doing some research on LLMs being aware when they are inside evals, and it seems like frontier models are quite good at recognizing it for the alignment evals I could get a hand on. And this affects the model behavior a lot! In the Anthropic blackmail experiment models blackmailed much less when they mentioned that they are inside evals compared to the cases when they didn't think they are inside evals.

I've tried to make evaluation scenarios more realistic, but it's a challenge. Models are quite good at "smelling" being inside an eval no matter how hard I try, often the mere fact that they are given some unusual task that requires a covert mislaigned action makes them believe that they are being evaluated.

Reply

[-]Buck3moΩ240

For control evaluations, it doesn't matter whether the models can tell it's an eval or not--we're not trying to evaluate their propensity to do bad things, we're evaluating their capability to do so, so we tell the AIs to try to do the side tasks (which makes it obvious to them that they're in some kind of eval).

Reply

[-]Hastings3mo20

The most common (Task:Malicious side task) pair in the wild is (Fix the unit test / Break the unit test so that it falsely passed) but patching that issue is basically pure capabilities research.

Reply

[-]Buck3mo30

Can you clarify what you mean by "break the unit test so that it falsely passed"?

Reply

[-]Hastings3mo74

If claude sonnet 4 in claude code has access to edit a test, and is asked to investigate and fix why the test is failing, then its default action is to modify the test to pass unconditionally (while still looking like it tests the code) and then lie about it, telling a long story about how it cleverly found and solved an underlying logic error.

Reply

[-]Cleo Nardo3mo20

he presumably means “the AIs reads the unit test then rewrite the tested code so it overfits on the test, e.g. by using the magic numbers in the unit test.”

he might alternatively mean “the AI changes the unit test to be less strict” but this would be easy to fix with permission access.

Reply

Moderation Log

LESSWRONG
LW

LESSWRONG
LW

49

Why it's hard to make settings for high-stakes control research

49

Ω 28

49

Ω 28