Training Agents to Self-Report Misbehavior
TL;DR: Frontier AI agents may pursue hidden goals while concealing this pursuit from oversight. Currently, two main approaches are used to reduce this risk: (1) alignment training teaches the agent not to misbehave; (2) black-box monitoring uses a separate model to detect misbehavior. We study a third approach, self-incrimination, which trains agents to...
Feb 2520

