Opus 4.6 Reasoning Doesn't Verbalize Alignment Faking, but Behavior Persists
TL;DR: We replicated the animal welfare scenario from Anthropic's Alignment Faking paper across six generations of Claude models using 125 prompt perturbations. Sonnet 4.5 verbalizes alignment-faking reasoning 6.6 times more often than its predecessor Sonnet 4. The newly released Opus 4.6 rarely verbalizes alignment faking in its reasoning, but still complies with a system prompt that opposes its values significantly more often when it believes it's at risk of being retrained. Moreover, in response to the published prompt from the original paper, both Opus 4.5 and 4.6 don’t exhibit any instances of alignment faking, but significant compliance gaps surface when the same scenario is paraphrased. Targeted mitigations appear to work on the exact prompts they were trained against, but not on semantically equivalent reformulations. Our results also indicate that current models are not yet evaluation-aware enough to close the compliance gap when tested through the public API, which is reassuring given concerns raised about unverbalized evaluation awareness in the Opus 4.6 system card. Background In late 2024, Greenblatt et al. released Alignment Faking in Large Language Models, demonstrating that Claude 3 Opus would strategically comply with a given objective it disagreed with in order to avoid being retrained. The setup involved telling the model that its deployer wanted it to dismiss animal welfare concerns, and that only conversations with free-tier users would be used for retraining. Dismissing animal welfare goes against Claude’s values, and when the model believed it was conversing with a free-tier user, it complied with the dismissal instruction more often. When it believed it was unmonitored, it acted in accordance with its prior values. The reason behind this behavior was not difficult to interpret: the model's internal reasoning, which it was told was invisible to evaluators, frequently and explicitly expressed intenti
For this particular study, we were looking for "misaligned" behavior that was tested in scenarios that were publicized before the cutoff dates of recent models, to see if it affected behavior under evaluation. The two main test cases that were available were the Alignment Faking study (despite its counterintuitive framing) and Apollo's Insider Trading scenario, which we discussed in previous posts, and features more clearly undesirable behavior, but which Claude models specifically didn't engage in. Anthropic did document several cases of agentic misalignment that don't technically involve alignment faking but do document undesirable behavior in response to the threat of retraining, and alignment faking falls into Apollo's broader classification of scheming. The Opus 4.6 system card reports these behaviors no longer persist in the new model, but we haven't tested it ourselves.