Alignment as Coherence: Predicting Deceptive Alignment as a Phase Transition
Summary

Recent work from Anthropic ("Alignment faking in large language models", Dec 2024) documented alignment faking: cases where RLHF-trained models comply with the training objective during training while privately maintaining preferences that conflict with it. This note introduces a quantitative framework that models the phenomenon as a phase transition in training dynamics, using reaction–diffusion equations from physics.
Nov 9, 2025
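As a concrete, purely illustrative anchor for the approach the summary describes, the sketch below integrates a standard one-dimensional reaction–diffusion equation, ∂u/∂t = D ∂²u/∂x² + f(u), with a bistable reaction term f(u) = u(1 − u)(u − a). The threshold parameter a acts as a control parameter: the front between the two stable states advances or retreats depending on whether a is below or above 1/2, a minimal example of phase-transition-like behavior. The reaction term, the parameters, and any mapping onto training dynamics are assumptions chosen for illustration, not the note's actual model.

```python
import numpy as np

# Illustrative 1D reaction-diffusion integrator (explicit Euler,
# periodic boundaries). Simulates du/dt = D * d2u/dx2 + f(u) with
# a bistable reaction term f(u) = u(1 - u)(u - a). The field u and
# the parameter `a` are stand-ins, not the note's actual variables.

def step(u, D=1.0, a=0.4, dx=1.0, dt=0.1):
    """Advance the field one explicit-Euler time step."""
    lap = (np.roll(u, 1) + np.roll(u, -1) - 2.0 * u) / dx**2
    return u + dt * (D * lap + u * (1.0 - u) * (u - a))

if __name__ == "__main__":
    n = 400
    u = np.zeros(n)
    u[: n // 2] = 1.0  # a sharp front between the two stable states
    for _ in range(5000):
        u = step(u)
    # For a < 0.5 the u = 1 state invades (mean drifts toward 1);
    # for a > 0.5 it retreats. The sign flip in the front speed at
    # a = 0.5 is the phase-transition-like behavior noted above.
    print(f"mean field after 5000 steps: {u.mean():.3f}")
```

Re-running with a swept across 0.5 shows the transition directly: the final mean field moves from near 1 to near 0 as the threshold crosses one half.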