Alignment-Faking Evaluations Measure Jailbreak Detection, Not Scheming [in some frontier models]
Main Finding: When you provide classic alignment-faking evaluation prompts to Claude 4.5 Haiku, you see robust non-compliance mediated by jailbreak detection. When you strip adversarial cues from these prompts, Haiku's compliance increases dramatically and shifts from jailbreak refusal to value-based reasoning. Full write-up for this research sprint with figures: alexeigannon.com/adversarialdetection...