It's hard to make scheming evals look realistic for LLMs
Abstract

Claude 3.7 Sonnet easily detects when it is being evaluated for scheming on Apollo's scheming benchmark. Some edits to the evaluation scenarios do improve their realism for Claude, but the improvements remain modest. Our findings suggest that truly disguising an evaluation context demands removal of deep stylistic and structural cues...