Authors: Yueh-Han Chen, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak In collaboration with OpenAI TL;DR: Chain-of-thought (CoT) monitoring can detect misbehavior in reasoning models, but only if models cannot control what they verbalize. To measure this undesirable ability, CoT Controllability, we introduce...

Mar 576

Training Agents to Self-Report Misbehavior

by Bruce W. Lee, Yueh Han "John" Chen, and Tomek Korbak

TL;DR: Frontier AI agents may pursue hidden goals while concealing this pursuit from oversight. Currently, we use two main approaches to reduce this risk: (1) Alignment trains the agent to not misbehave, (2) Blackbox monitoring uses a separate model to detect misbehavior. We study a third approach—self-incrimination—which trains agents to...

Feb 2526