Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring
This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring." Chain-of-thought (CoT) monitoring—where safety systems review a model's intermediate reasoning steps—is gaining traction at frontier labs like Google DeepMind and OpenAI as a safeguard against harmful AI actions. Despite intense interest, systematic evaluation has been limited. Our research examines the efficacy of this method and reveals a nuanced picture: CoT monitoring increases safety in situations where sabotage is subtle yet can be surprisingly ineffective against blatant harmful actions. We also discovered that a hybrid approach—using separate monitors for CoT and final actions and combining these suspicion scores—significantly improves sabotage detection compared to single-monitor strategies. A summary of our method. Prompts including a legitimate main task, a subversive side task, and an attack policy intended to subvert monitoring are fed to a strong untrusted reasoning LLM. We assess the ability of a monitor based on a weaker trusted LLM to catch the resulting misbehaviour, with and without access to the CoT. Task Subtlety Matters We evaluated the efficacy of CoT monitoring against models instructed to pursue harmful auxiliary goals, while attempting to evade detection. Our experiments consistently demonstrate that the effectiveness of CoT monitoring depends heavily on the nature of the sabotage attempt. For subtle sabotage (like increasing runtime complexity, inserting edge-case failures, or modifying default function arguments), which leave minimal traces in the final code, CoT monitoring substantially improved detection rates by a factor of 2-7 compared to action-only monitoring. For non-subtle sab
Trying to get researchers to extrapolate even a little beyond what current systems are capable of, much less beyond what the smartest humans are capable of, is like pulling teeth.
What if it were faster? What if it were smarter? What if that specific quirky failure mode didn't hold? What if these changes were substantial enough that the emergent entity behaved in a radically different manner? Had radically different affordances?
Too much to think about. A helpless shrug. A flurry of weak arguments for why we don't have to examine this possibility, pulled like wool over their eyes, hiding the flash of fear. Fear if you're lucky, if you get past the dismissal for even an instant.
I'm thinking about the future, but it seems very very few other people are.