IvanC

Undergraduate CS student working on mechanistic interpretability and empirical AI safety. Currently focused on residual stream representations and AI alignment.

11

1

6mo

Single Direction vs Low-Rank Refusal in Small LLMs

Introduction: I recently came across an Alignment Forum post showing that refusal behavior in LLMs can be removed by subtracting a single linear direction from the residual stream. After this intervention, the model begins to comply with clearly harmful requests, with surprisingly little performance degradation. The post tested this across...

Mar 212