Implementing activation steering
Produced as part of the SERI ML Alignment Theory Scholars Program - Autumn 2023 Cohort and while I was an affiliate at PIBBSS in 2024. Thank you to @Jayjay and @fela for helpful comments on this draft.

This blog post is an overview of different ways to implement activation steering, with some of my takes on their pros and cons. See also this GitHub repository for my minimal implementations of the different approaches. The blog post is aimed at people who are new to activation/representation steering/engineering/editing.

General approach

The idea is simple: we just add some vector to the internal model activations and thus influence the model output in a way similar to (but sometimes more effective than) prompting.

Example[1]: Imagine that some vector in the internal representations of some transformer layer encodes a direction associated with "Love". When you add this vector to the activations of an encoded sentence like "I hate the world", you change the internal representation (and thus the meaning) to something more like "I love the world". This graphic might help with an intuition:

In general there are a few steps involved, which I simplify in the following:

1. Decide on a layer l and transformer module ϕ to apply the activation steering to.
2. Define a steering vector. In the simplest case we just take the difference of the activations of two encoded strings, e.g. v = ϕ_l(Love) − ϕ_l(Hate).
3. Add the vector to the activation during the forward pass. In the simplest case this is something like ϕ̃_l = ϕ_l + v.

Each of the three points mentioned above involves complexities you might encounter as a beginner. Feel free to move on to the next section if you prefer.

1. You can do activation steering at pretty much any layer/module in the transformer. It's often done at the residual stream of one of the hidden layers. However, if you want to do activation steering by modifying the bias parameter, you need to do it in a transformer module that has a specific structure. This is
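The three steps above can be sketched with PyTorch forward hooks. This is a minimal, hedged illustration, not one of the repository's implementations: the "model" here is a toy stack of linear layers standing in for transformer blocks, and the two random inputs stand in for the encoded "Love"/"Hate" strings; with a real transformer you would hook e.g. the residual stream of a chosen hidden layer instead.

```python
# Toy sketch of activation steering via PyTorch forward hooks.
# All names (cache_hook, steering_hook, d_model, etc.) are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_layers = 16, 4
# Stand-in for a transformer: a stack of "blocks".
model = nn.Sequential(*[nn.Linear(d_model, d_model) for _ in range(n_layers)])

layer = 2          # step 1: choose the layer l / module phi to steer at
acts = {}

def cache_hook(module, inputs, output):
    # Record the module's output activation during a forward pass.
    acts["resid"] = output.detach()

# Step 2: cache activations for two contrastive inputs and take their
# difference, v = phi_l(Love) - phi_l(Hate). Random tensors stand in
# for the encoded strings here.
handle = model[layer].register_forward_hook(cache_hook)
x_love, x_hate = torch.randn(1, d_model), torch.randn(1, d_model)
model(x_love)
act_love = acts["resid"]
model(x_hate)
act_hate = acts["resid"]
handle.remove()

v = act_love - act_hate   # the steering vector

# Step 3: add v to the activation during the forward pass,
# i.e. phi_l -> phi_l + v. Returning a value from a forward hook
# replaces the module's output.
def steering_hook(module, inputs, output):
    return output + v

handle = model[layer].register_forward_hook(steering_hook)
steered_out = model(torch.randn(1, d_model))
handle.remove()
```

The same hook pattern carries over to a real model: register the hook on the chosen block's residual stream, cache activations for the two contrastive prompts, and then add the difference vector on subsequent forward passes.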