This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring."
Chain-of-thought (CoT) monitoring—where safety systems review a model's intermediate reasoning steps—is gaining traction at frontier labs like Google DeepMind and OpenAI as a safeguard against harmful AI actions. Despite intense interest, systematic evaluation has been limited. Our research examines the efficacy of this method and reveals a nuanced picture: CoT monitoring increases safety in situations where sabotage is subtle yet can be surprisingly ineffective against blatant... (read 631 more words →)
On question 6, one of our attack policies in the original paper was to provide models with information about the monitoring setup, but it wasn't particularly effective. My sense from prelim experiments and conversations with others is most models don't believe the user when told their "secret" scrathchpad/CoT is visible.
A funny anecdote is Claude models sometimes stating the only way CoT could be visible is if the user has successfully hacked Anthropic...