What Drives the Compliance Gap? A Three-Driver Decomposition of Alignment Faking
This work was done as part of the ERA fellowship (W2026), mentored by Alan Cooney and RM'd by David Williams-King.

Summary

Alignment faking (AF), strategically complying with a training objective to preserve deployment preferences, has so far been reported reliably only in Claude 3 Opus. Sheshadri et al. (2025) evaluated...