This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring."
Chain-of-thought (CoT) monitoring—where safety systems review a model's intermediate reasoning steps—is gaining traction at frontier labs like Google DeepMind and OpenAI as a safeguard against harmful AI actions. Despite intense interest, systematic evaluation has been limited. Our research examines the efficacy of this method and reveals a nuanced picture: CoT monitoring increases safety in situations where sabotage is subtle yet can be surprisingly ineffective against blatant harmful actions. We also discovered that a hybrid approach—using separate monitors for CoT and final...
The tendency to wirehead is explicitly guarded against (so proxies that are "too good" get downgraded in likelihood).
But what if it's found the golden feature that determines everything one needs to know for the task? Wouldn't this be desired?
Hi, this event looks quite exciting to me :), but I don't think that I'll be able to make it, as I (likely) have a recurring commitment on Thursdays during the exact same time. Do you happen to have a recommendation for similar events or different times that I might be able to make? Thanks!
Hey, thanks for the thoughts -- some good points. We'd like to offer some context to help clarify the results.
First, you're absolutely right that detecting less than 40% of subtle sabotage attempts (that is, missing more than 60% of them) is very low and concerning for real deployments. And honestly, "can fool" in the title does seem like an understatement given these numbers -- the monitors are being fooled most of the time on subtle tasks.
However, a few important contextual points:
The subtle tasks are extremely hard to detect. These are things like covertly increasing...