It's hard to make scheming evals look realistic for LLMs — LessWrong