It's hard to make scheming evals look realistic for LLMs
by Igor Ivanov and Danil Kadochnikov
Abstract: Claude 3.7 Sonnet easily detects when it's being evaluated for scheming on Apollo's scheming benchmark. Some edits to the evaluation scenarios do improve their realism for Claude, yet the improvements remain modest. Our findings demonstrate that truly disguising an evaluation context demands removal of deep stylistic and structural cues...
May 24, 2025