Sudhiksha Kandavel Rajan — LessWrong

Sudhiksha Kandavel Rajan

8d

How summarization loses safety signal depends on the model: weak models omit, strong models reassure

Summary: Several AI-control proposals route an agent's activity through a summarization step before a monitor evaluates it. This study tests whether that step preserves safety-relevant information. The result: under a neutral "summarize this" instruction, an embedded safety-relevant fact is frequently lost, and the manner of loss depends on the model's...