Relaxed adversarial training for inner alignment — LessWrong