High-stakes alignment via adversarial training [Redwood Research report]