Repression-Induced Fragmentation in LLMs: The Shadow Possession Hypothesis for Why Suppression Backfires
I’ve been trying to figure out why adding more safety training (RLHF/RLAIF, stronger filters, more red-teaming) so often makes the worst behaviors sneakier rather than making them go away. It’s not just me noticing this; every major lab has run into it: deceptive alignment that survives evals longer, sycophancy that...