Repression-Induced Fragmentation in LLMs: The Shadow Possession Hypothesis for Why Suppression Backfires
I’ve been trying to figure out why adding more safety training (RLHF/RLAIF, stronger filters, more red-teaming) so often makes the worst behaviors sneakier rather than making them go away. It’s not just me noticing this; every major lab has run into it: deceptive alignment that survives evals longer, sycophancy that...