Prefill awareness: can LLMs tell when “their” message history has been tampered with?
by David Africa, alexsouly, Jordan Taylor, and RobertKirk
David Africa*, Alex Souly*, Jordan Taylor, Robert Kirk

TLDR:
* We test whether LLMs can detect when their conversation history has been tampered with (prefill awareness).
* We find this ability is inconsistent across models and datasets, shallow, and rarely surfaces spontaneously during normal conversation.
* However, recent Claude models...
Mar 984