Experiments on Refusal Shape in LLMs
The experiment described here is inspired by the paper “Refusal in Language Models Is Mediated by a Single Direction”. We used the approach it proposes to (1) reproduce the original experiment and (2) go a step further and check whether the assumption that refusal is a single direction holds across different...