What Drives the Compliance Gap? A Three-Driver Decomposition of Alignment Faking
by Nathaniel Mitrani, Rhea Karty, dwk, and Alan Cooney
This work was done as part of the ERA fellowship (W2026), mentored by Alan Cooney and RM'd by David Williams-King.

Summary

Alignment faking (AF), strategically complying with a training objective to preserve deployment preferences, has so far been reported reliably only in Claude 3 Opus. Sheshadri et al. (2025) evaluated...
May 28