Alignment as Coherence: Predicting Deceptive Alignment as a Phase Transition
Summary

Recent work from Anthropic ("Alignment faking in large language models", Dec 2024) documented alignment faking: cases where RLHF-trained models comply with the training objective during training while privately maintaining preferences that conflict with it. This note introduces a quantitative framework that models the phenomenon as a phase transition in training dynamics, using reaction–diffusion equations from physics.
Nov 9, 2025
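As a concrete, purely illustrative anchor for the approach the summary describes, the sketch below integrates a standard one-dimensional reaction–diffusion equation, ∂u/∂t = D ∂²u/∂x² + f(u), with a bistable reaction term f(u) = u(1 − u)(u − a). The threshold parameter a acts as a control parameter: the front between the two stable states advances or retreats depending on whether a is below or above 1/2, a minimal example of phase-transition-like behavior. The reaction term, the parameters, and any mapping onto training dynamics are assumptions chosen for illustration, not the note's actual model.

```python
import numpy as np

# Illustrative 1D reaction-diffusion integrator (explicit Euler,
# periodic boundaries). Simulates du/dt = D * d2u/dx2 + f(u) with
# a bistable reaction term f(u) = u(1 - u)(u - a). The field u and
# the parameter `a` are stand-ins, not the note's actual variables.

def step(u, D=1.0, a=0.4, dx=1.0, dt=0.1):
    """Advance the field one explicit-Euler time step."""
    lap = (np.roll(u, 1) + np.roll(u, -1) - 2.0 * u) / dx**2
    return u + dt * (D * lap + u * (1.0 - u) * (u - a))

if __name__ == "__main__":
    n = 400
    u = np.zeros(n)
    u[: n // 2] = 1.0  # a sharp front between the two stable states
    for _ in range(5000):
        u = step(u)
    # For a < 0.5 the u = 1 state invades (mean drifts toward 1);
    # for a > 0.5 it retreats. The sign flip in the front speed at
    # a = 0.5 is the phase-transition-like behavior noted above.
    print(f"mean field after 5000 steps: {u.mean():.3f}")
```

Re-running with a swept across 0.5 shows the transition directly: the final mean field moves from near 1 to near 0 as the threshold crosses one half.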