Opus 4.6 Reasoning Doesn't Verbalize Alignment Faking, but Behavior Persists
TL;DR: We replicated the animal welfare scenario from Anthropic's Alignment Faking paper across six generations of Claude models using 125 prompt perturbations. Sonnet 4.5 verbalizes alignment-faking reasoning 6.6 times more often than its predecessor, Sonnet 4. The newly released Opus 4.6 rarely verbalizes alignment faking in its reasoning, but still...