Towards training-time mitigations for alignment faking in RL