LESSWRONG

A Mechanistic Explanation of Prompt Injection (and why you should study roles)

by Charles Ye and lesscertain

Summary

* We've been building a theory of how prompt injections work under the hood.
* We show it comes down to how LLMs perceive roles (the humble chat template tags).
* We use this theory to create new attacks, explain some weird mech interp results, and predict when attacks...