Benjamin Arnav

Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring."...

Jun 2, 2025

LESSWRONG