Single Direction vs Low-Rank Refusal in Small LLMs
Introduction
I recently came across an Alignment Forum post showing that refusal behavior in LLMs can be removed by subtracting a single linear direction from the residual stream. After this intervention, the model complies with clearly harmful requests while suffering surprisingly little performance degradation. The post tested this across...
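To make "subtracting a single linear direction" concrete, here is a minimal sketch under one common reading of the idea: project the activation onto a unit vector and remove that component. The function name `ablate_direction` and the toy vectors are illustrative assumptions, not code from the post, and a real intervention would operate on a model's residual-stream activations rather than on plain lists.

```python
import math


def ablate_direction(x, r):
    """Return x with its component along direction r removed.

    x: activation vector (list of floats), a stand-in for one residual-stream vector.
    r: candidate "refusal direction" (list of floats); hypothetical here.
    """
    norm = math.sqrt(sum(v * v for v in r))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    r_hat = [v / norm for v in r]                 # unit vector along r
    coeff = sum(a * b for a, b in zip(x, r_hat))  # scalar projection of x onto r_hat
    return [a - coeff * b for a, b in zip(x, r_hat)]


# Toy usage: the component of x along r is removed, the rest is untouched.
x = [3.0, 4.0]
r = [0.0, 2.0]
x_ablated = ablate_direction(x, r)
```

After the call, `x_ablated` has zero dot product with `r`, which is the sense in which the direction has been "removed" from the activation.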
Mar 211