Latent Space Dynamics of RLHF: Quantifying the Safety Adjudication Window in Dense Transformers — LessWrong