Evaluating Stability of Unreflective Alignment — LessWrong