Refusal in LLMs is mediated by a single direction
by Andy Arditi, Oscar Obeso, Aaquib111, wesg, and Neel Nanda
This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee. This post is a preview of our upcoming paper, which will provide more detail on our current understanding of refusal. We thank Nina...
Apr 27, 2024