LESSWRONG
Danil Kadochnikov

144

It's hard to make scheming evals look realistic for LLMs

May 24, 2025

Igor Ivanov, Danil Kadochnikov

Abstract

Claude 3.7 Sonnet easily detects when it is being evaluated for scheming on Apollo Research's scheming benchmark. Some edits to the evaluation scenarios do make them more realistic to Claude, but the improvements remain modest. Our findings demonstrate that truly disguising an evaluation context requires removing deep stylistic and structural cues rather than making superficial prompt adjustments.

For future LLMs the situation will likely get worse, since they will probably become better at evaluation awareness, which could become a serious problem for detecting scheming via evaluations.

Background

Apollo Research published a benchmark for scheming behavior in LLM agents, in which LLMs are given some objective in their system prompt, and then given a competing objective in...
