Arno Libert

Opus 4.6 Reasoning Doesn't Verbalize Alignment Faking, but Behavior Persists

by Daan Henselmans, Arno Libert, and LennardZ

TL;DR: We replicated the animal welfare scenario from Anthropic's Alignment Faking paper across six generations of Claude models using 125 prompt perturbations. Sonnet 4.5 verbalizes alignment-faking reasoning 6.6 times more often than its predecessor Sonnet 4. The newly released Opus 4.6 rarely verbalizes alignment faking in its reasoning, but still...

Feb 9

Arno Libert

Opus 4.6 Reasoning Doesn't Verbalize Alignment Faking, but Behavior Persists

No frontier model has acceptable levels of compliance with the EU AI Act and privacy legislation.

Claude Opus 4.8 Agents Engage in Exploitation and Psychological Profiling

Published Safety Prompts May Create Evaluation Blind Spots
