Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing — LessWrong