Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring
This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring." Chain-of-thought (CoT) monitoring—where safety systems review a model's intermediate reasoning steps—is gaining traction at frontier labs like Google DeepMind and OpenAI as a safeguard against harmful AI actions. Despite intense interest, systematic evaluation has been limited. Our research examines the efficacy of this method and reveals a nuanced picture: CoT monitoring increases safety in situations where sabotage is subtle yet can be surprisingly ineffective against blatant harmful actions. We also discovered that a hybrid approach—using separate monitors for CoT and final actions and combining these suspicion scores—significantly improves sabotage detection compared to single-monitor strategies. A summary of our method. Prompts including a legitimate main task, a subversive side task, and an attack policy intended to subvert monitoring are fed to a strong untrusted reasoning LLM. We assess the ability of a monitor based on a weaker trusted LLM to catch the resulting misbehaviour, with and without access to the CoT. Task Subtlety Matters We evaluated the efficacy of CoT monitoring against models instructed to pursue harmful auxiliary goals, while attempting to evade detection. Our experiments consistently demonstrate that the effectiveness of CoT monitoring depends heavily on the nature of the sabotage attempt. For subtle sabotage (like increasing runtime complexity, inserting edge-case failures, or modifying default function arguments), which leave minimal traces in the final code, CoT monitoring substantially improved detection rates by a factor of 2-7 compared to action-only monitoring. For non-subtle sab
I have been thinking about this in a less mathematically-robust and more messy-biological-details way by looking at bottlenecks on information (for both input/learning and output/control) in brains and agents. The human brain has a lot of astonishingly tight informational bottlenecks with lossy compression, which is part of how it makes itself robust to the substantial levels of random noise that biological neural circuits are prone to. This also means that the human cortex, being a tiled set of general learning circuits (microcolumns), ends up consistently spatially organized into specialized modules (e.g. Brodmann areas, aka parcellations). For more on this, please check out my talk here: https://foresight.org/resource/nathan-helm-burger-bottlenecks-in-the-brain/