How summarization loses safety signal depends on the model: weak models omit, strong models reassure
Summary: Several AI-control proposals route an agent's activity through a summarization step before a monitor evaluates it. This study tests whether that step preserves safety-relevant information. The result: under a neutral "summarize this" instruction, an embedded safety-relevant fact is frequently lost, and the manner of loss depends on the model's...
Jun 251