alexsouly — LessWrong

69

4y

Prefill awareness: can LLMs tell when “their” message history has been tampered with?

by David Africa, alexsouly, Jordan Taylor, and RobertKirk

David Africa*, Alex Souly*, Jordan Taylor, Robert Kirk

TLDR:
* We test whether LLMs can detect when their conversation history has been tampered with (prefill awareness).
* We find this ability is inconsistent across models and datasets, shallow, and rarely surfaces spontaneously during normal conversation.
* However, recent Claude models...