[Paper] How does information access affect LLM monitors' ability to detect sabotage?
TL;DR: We evaluate LLM monitors in three AI control environments: SHADE-Arena, MLE-Sabotage, and BigCodeBench-Sabotage. We find that monitors with access to less information often outperform monitors with access to the full sequence of reasoning blocks and tool calls, a phenomenon we call the less-is-more effect for automated monitors. We follow...
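To make the comparison concrete, here is a minimal, purely illustrative sketch (not the paper's actual harness; `Step` and `monitor_view` are hypothetical names) of how one might vary the information a monitor sees over the same agent transcript, e.g. the full sequence of reasoning blocks and tool calls versus the tool calls alone:

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One agent step: a reasoning block followed by a tool call."""
    reasoning: str
    tool_call: str

def monitor_view(transcript: list[Step], access: str) -> str:
    """Assemble the text shown to a monitor under a given access level.

    access="full":         all reasoning blocks and tool calls
    access="actions_only": tool calls only (strictly less information)
    """
    if access == "full":
        return "\n".join(f"[reasoning] {s.reasoning}\n[tool] {s.tool_call}"
                         for s in transcript)
    if access == "actions_only":
        return "\n".join(f"[tool] {s.tool_call}" for s in transcript)
    raise ValueError(f"unknown access level: {access}")
```

The less-is-more effect is the observation that a monitor scoring the `actions_only` view can detect sabotage more reliably than one scoring the `full` view, under the setups studied.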