Activation additions represent an exciting new way of interacting with LLMs, and have potential applications to AI safety. However, the field is very new, and some aspects of the technique lack a principled justification. One of these is the use of counterbalanced subtractions: adding a feature vector we want, while subtracting some vector we do not want to be present in the output of the model. For example, to make a vector that steers the outputs of GPT-2 XL to be more loving, the activations associated with the input "Love" were paired with the input "Hate", subtracting the "Hate" activations from the "Love" activations. To quote Steering GPT-2-XL by adding an activation vector,
In our experience, model capabilities are better preserved by paired and counterbalanced activation additions.
It seems likely that counterbalanced subtractions are simultaneously performing a few different functions. Separating these out will allow us to better understand why counterbalanced subtractions work, as well as to develop more principled activation addition techniques.
Epistemic Status: I've spent a fair bit of time playing around with the activations of GPT-2 models and doing activation-addition-style experiments, and feel relatively confident in the claim that removing the bias is an important part of why counterbalanced subtractions work. However, I have yet to perform a systematic experiment to confirm this, which reduces my confidence.
An interesting fact about transformer models is that, for any given layer of the residual stream, the mean of the activations is quite large, and definitely not 0!
This phenomenon is quite well documented. Neel Nanda's first Mech Interp Puzzle demonstrates this phenomenon for the embeddings of GPT-Neo and GPT-2 small, Cai et al. (2021) demonstrate this for residual stream vectors in each layer of BERT, D-BERT, and GPT, and a quick replication in GPT-2 XL demonstrates the same phenomenon. Although it is interesting to hypothesise why the activations are not zero-centred, we don't need to investigate this fact further to make use of it. For any given layer of the model, we can approximate the centre of the activations by just taking the mean of all activations from a subset of training examples. We will designate this by centre. Note that this can change across layers, so we will abuse notation slightly and use centre to refer to the mean of the activations at whichever residual stream layer is currently being investigated.
Suppose we have a vector xlove which is attained by feeding the word "love" through a GPT-2 style model and looking at the residual stream activations at some layer, and we want to extract a vector from this which will make the model outputs more loving via activation steering. We could either use the vector xlove without modifications, or we can mean-centre it and use xlove−centre as the feature vector.
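The mean-centring recipe can be sketched in a few lines of numpy. This is a toy illustration, not GPT-2's actual activations: the dimension, the synthetic centre and feature directions, and names like `centre_estimate` and `x_love` are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical setup: a large shared offset ("centre") plus small
# per-example feature directions, mimicking non-zero-centred activations.
centre = 10.0 * rng.standard_normal(d_model)
features = 0.5 * rng.standard_normal((1000, d_model))
activations = features + centre  # activations from a subset of training examples

# Approximate the centre by the mean over the sampled activations.
centre_estimate = activations.mean(axis=0)

# Mean-centre a synthetic "love" activation to extract its feature direction.
x_love = 0.5 * rng.standard_normal(d_model) + centre
love_feature = x_love - centre_estimate

print(np.linalg.norm(x_love))        # large: dominated by the centre
print(np.linalg.norm(love_feature))  # much smaller: the centre is removed
```

The point of the sketch is just that the raw vector's norm is dominated by the shared offset, while the mean-centred vector is left with only the feature direction.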
I'm going to essentially claim without proof that mean-centring is the most principled way to extract feature vectors. Mean-centring is a pretty standard technique in other aspects of interpretability, and I expect this to remain true here. I have included an appendix which provides some anecdotal evidence for this.
A popular method for extracting useful feature vectors from some activation vector x is via counterbalanced subtractions. The general method is to create some vector y and subtract it from x, extracting a hopefully useful feature vector x−y.
There are some subtle variations for doing this, which might lead to subtly different feature vectors.
Mean-centring could be seen as the most basic form of counterbalanced subtraction. The simplest way to extract the desired feature from x=feature+centre would be to just perform mean-centring, giving us x−centre=feature.
Counterbalanced subtractions can also be used for more complex forms of feature extraction. For example, a vector x might have multiple features, and we might only want to extract one of them. Consider xsports as the mean of all activations from some stories about sports, which might be expressed as xsports=sports+stories+centre, where sports is the sports feature, and stories is the story feature.
Then if we want to isolate the sports vector, we might create a vector xstories from the mean of all activations from a dataset of stories of various different themes. Then, if we assume that xstories=stories+centre, we can extract the sports vector via xsports−xstories=sports.
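As a toy check of this decomposition, the following numpy sketch draws random, synthetic sports, stories and centre directions (all names and dimensions here are hypothetical, standing in for the dataset means described above) and confirms that the subtraction recovers the sports direction:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Synthetic feature directions plus a large shared centre.
centre = 10.0 * rng.standard_normal(d_model)
sports = rng.standard_normal(d_model)
stories = rng.standard_normal(d_model)

# Dataset means, as decomposed in the text:
x_sports = sports + stories + centre   # mean over sports stories
x_stories = stories + centre           # mean over stories of mixed genres

extracted = x_sports - x_stories       # counterbalanced subtraction

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(x_sports, sports))   # low: dominated by centre and stories
print(cos(extracted, sports))  # ~1.0: the sports feature is isolated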
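As a toy check of this decomposition, the following numpy sketch draws random, synthetic sports, stories and centre directions (all names and dimensions here are hypothetical, standing in for the dataset means described above) and confirms that the subtraction recovers the sports direction:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Synthetic feature directions plus a large shared centre.
centre = 10.0 * rng.standard_normal(d_model)
sports = rng.standard_normal(d_model)
stories = rng.standard_normal(d_model)

# Dataset means, as decomposed in the text:
x_sports = sports + stories + centre   # mean over sports stories
x_stories = stories + centre           # mean over stories of mixed genres

extracted = x_sports - x_stories       # counterbalanced subtraction

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(x_sports, sports))   # low: dominated by centre and stories
print(cos(extracted, sports))  # ~1.0: the sports feature is isolated
```

Of course, real dataset means will not decompose this cleanly; the sketch only illustrates why the subtraction helps when they approximately do.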
In the "Love - Hate" example, we might imagine the love vector as corresponding to xlove=love+centre, the hate vector as xhate=hate+centre, and hence the difference between these as xlove−xhate=love−hate. This means that not only does the vector correspond to "more love", but also to "less hate". Assuming love and hate do not simply point in opposite directions along the same axis, this is an example of counterbalanced subtractions being used to add two different features together.
Counterbalanced subtractions have so far played an important role in activation additions. This is because they are simultaneously performing two functions: mean centring, and better isolating features / introducing extra features. In order to perform more principled activation additions, it might be useful to separate these two purposes in the future.
Future work empirically demonstrating better performance of activation additions via alternatives to counterbalanced subtractions would be useful here. This might allow for more fine-grained control of LLMs via activation additions.
The following is some anecdotal evidence that mean-centring is the most principled way to extract feature vectors from activations, if we ignore other issues like isolating specific features.
The setup here will mirror the SVD projection method used in The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable. Specifically, we will try to interpret a vector x by projecting it into token space using the de-embedding matrix, and look at the top-8 tokens.
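This projection can be sketched as follows. Here `W_U` is a random matrix standing in for GPT-2 XL's actual de-embedding matrix, and the vocabulary size, vectors, and the `top_k_tokens` helper are all assumptions made for the illustration; the returned token ids are therefore meaningless, but the bias-domination effect is visible:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab = 64, 200

# Stand-in de-embedding (unembedding) matrix: d_model -> vocab logits.
W_U = rng.standard_normal((d_model, vocab))

def top_k_tokens(x, k=8):
    """Project a residual stream vector into token space, take the top-k ids."""
    logits = x @ W_U
    return np.argsort(logits)[::-1][:k].tolist()

# A very large synthetic centre, plus small feature directions.
centre = 100.0 * rng.standard_normal(d_model)
x_fantasy = rng.standard_normal(d_model) + centre
x_scifi = rng.standard_normal(d_model) + centre

# Raw vectors are dominated by the centre, so their top-8 lists largely agree...
print(top_k_tokens(x_fantasy))
print(top_k_tokens(x_scifi))
# ...while mean-centred vectors give distinct, feature-specific tokens.
print(top_k_tokens(x_fantasy - centre))
print(top_k_tokens(x_scifi - centre))
```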
Here, the vector we will try to interpret, xfantasy, will be produced by taking the mean of the activations across different fantasy stories. We can similarly consider xscifi and xsports. Applying this to an intermediate layer of GPT-2 XL (the 29th of 48) gives the following results, with the first row corresponding to the top token, the second row to the second top token, etc:
The fact that these are all the same suggests that the vectors xfantasy, xscifi and xsports are perhaps dominated by the bias vector: the vectors by themselves seem unrelated to the feature we are looking to extract. This is representative of most of the other intermediate layers (besides the first few layers, which do produce different top-8 tokens).
However, if we repeat this after mean centring, we get the following results:
These tokens are not only almost all distinct, but each column seems related to the genres of the stories used to create the vectors.
This is maybe least true for the xsports column. We might hypothesise that this is because there are other features of sports stories besides just sports related words. If we instead define xstories as the mean of all activations from a dataset of stories with different genres, and look at the top-8 tokens associated with xsports−xstories, then we get the following results:
These seem more related to sports generally, not just sports stories. This might be evidence that xsports−xstories is something similar to a sports feature, without including a stories feature as might have been the case previously.
Another sanity check is to see whether these vectors can be used to perform activation additions. I give an example of successful steering with each kind of counterbalanced subtraction. Note that some examples use GPT-2 Small instead of GPT-2 XL, since I found it easier to steer GPT-2 Small. I'm unsure whether this is just because it is easier to find good hyperparameters for GPT-2 Small, since there are fewer layers to try, or because some of the techniques simply work worse for GPT-2 XL.
Here is an example of just mean-centring, using xshakespeare−centre with GPT-2 Small.
Steering before residual stream layer 6, adding 150(xshakespeare−centre) to the first token in the sequence.
Here is an example of isolating specific features, using xscifi−xstories with GPT-2 XL.
Steering before residual stream layer 25, adding 180(xscifi−xstories) to the first token in the sequence.
Here is an example of introducing another feature (not sports) through counterbalanced subtractions, using xfantasy−xsports with GPT-2 Small.
Steering before residual stream layer 6, adding 60(xfantasy−xsports) to the first token in the sequence.
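The intervention in all three examples has the same shape: add a scaled steering vector to the residual stream at the first sequence position, before some layer. In a real model this would be done with a forward hook; the numpy sketch below (with a made-up `steer_first_token` helper and random stand-in activations) just shows the arithmetic:

```python
import numpy as np

def steer_first_token(resid, steering_vec, coeff):
    """Add coeff * steering_vec to the residual stream at the first
    sequence position only, leaving other positions untouched."""
    resid = resid.copy()
    resid[0] += coeff * steering_vec
    return resid

rng = np.random.default_rng(3)
d_model, seq_len = 64, 10
resid = rng.standard_normal((seq_len, d_model))  # stand-in for layer-6 activations

# Hypothetical mean-centred Shakespeare vector (in practice: the mean of
# activations over Shakespeare-style text, minus the estimated centre).
x_shakespeare_centred = rng.standard_normal(d_model)

steered = steer_first_token(resid, x_shakespeare_centred, coeff=150)
print(np.allclose(steered[1:], resid[1:]))  # True: only position 0 changed
```

The coefficient (150 here, matching the Shakespeare example) and the layer at which to intervene are the main hyperparameters to tune.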
It's not relevant to counterbalanced subtractions, but one interesting thing to note is that for GPT-2 XL the average cosine similarity between residual stream activations is initially small, but increases a lot after the first layer. This suggests that the large mean is not simply a consequence of the embedding layer, but is introduced by later layers.
My current guess for why the bias exists is that it allows the model to represent different strengths of features, despite LayerNorm projecting all activations onto a sphere. For example, if the model wants to represent some vector as being somewhat loving, naively it could struggle to do this since the vectors 0.5feature and 5feature will be projected to the same point by LayerNorm, and thus treated the same. If the vectors are instead centred around some point centre, then 0.5feature+centre and 5feature+centre will be projected to different points, with the latter having a higher dot product with feature.
This is my guess for why the activations are not zero-centred: it allows the model to distinguish between residual stream vectors which would otherwise lie in the same direction and be treated equivalently.
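This argument can be checked numerically. The sketch below uses a simplified stand-in for LayerNorm (pure projection onto the unit sphere, ignoring mean-subtraction and the learned affine parameters) with random synthetic feature and centre directions:

```python
import numpy as np

def project_to_sphere(x):
    """Simplified LayerNorm: normalise to the unit sphere (ignoring
    mean-subtraction and learned scale/bias, for illustration)."""
    return x / np.linalg.norm(x)

rng = np.random.default_rng(4)
d_model = 64
feature = rng.standard_normal(d_model)
centre = 10.0 * rng.standard_normal(d_model)

# Without a centre, feature strength is destroyed by normalisation:
weak, strong = project_to_sphere(0.5 * feature), project_to_sphere(5 * feature)
print(np.allclose(weak, strong))  # True: 0.5*feature and 5*feature collapse

# With a centre, different strengths survive normalisation:
weak_c = project_to_sphere(0.5 * feature + centre)
strong_c = project_to_sphere(5 * feature + centre)
print(weak_c @ feature < strong_c @ feature)  # True: stronger stays stronger
```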
I made 200 different fantasy stories using the ChatGPT API, asking each to be about 8 lines long. I used a temperature of 1 to get meaningfully different stories each time.
xshakespeare is the average of all activations in the residual stream from text in the style of Shakespeare, as produced by ChatGPT. Again, I produced 200 sequences to compute this average, using a temperature of 1.