Detecting AI Agent Failure Modes in Simulations — LessWrong