Evaluating Stability of Unreflective Alignment — LessWrong