A Mechanistic Explanation of Prompt Injection (and why you should study roles)
by Charles Ye and lesscertain
Summary
* We've been building a theory of how prompt injections work under the hood.
* We show it comes down to how LLMs perceive roles (the humble chat template tags).
* We use this theory to create new attacks, explain some weird mech interp results, and predict when attacks...
Jun 22