I'm going to keep this post dense and short, because I wrote a long one and it got rejected by moderation. Haha.
It is commonly assumed that the coefficient $\alpha$ in activation steering must be hand-tuned for steering to work. Previous work either injects with the same $\alpha$ at every layer or injects at a single layer with a larger $\alpha$. This yields either gibberish output or steering that fails.
In this post, I discuss how residual stream norms affect steering, with a derivation, experiments, and results.
The Growing Norm Problem
The residual stream norm grows monotonically with depth across transformer layers, and the rate of growth varies across model families.
Fig 1: Growth of residual stream for 4 model families including base and instruct variants
As Fig 1 shows, the norm grows with the layer index. This growth matters for steering because what the model responds to is not the raw magnitude $\alpha$ but the relative influence: how large our injection is compared to the existing energy in the layer. We can write this relative influence as:
$$r_\ell = \frac{\alpha}{\lVert h_\ell \rVert}$$
where $\lVert h_\ell \rVert$ is the norm of the residual stream at layer $\ell$. With a flat $\alpha$, the relative influence falls as the norm grows with depth. So when you inject at a single layer, the growing energy of the later layers dilutes the injected signal, and the injection ends up with little or no effect on the output.
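To make the dilution concrete, here is a minimal sketch. It assumes relative influence is the injection magnitude divided by the residual norm at that layer. The per-layer norms and the coefficient are made-up numbers for illustration, not measurements from any model.

```python
# Hypothetical per-layer residual norms (illustrative values only).
layer_norms = [10.0, 20.0, 40.0, 80.0, 160.0]
alpha = 8.0  # one flat steering coefficient reused at every layer


def relative_influence(alpha, norm):
    """Injected magnitude divided by the existing residual norm."""
    return alpha / norm


flat_influence = [relative_influence(alpha, n) for n in layer_norms]

for layer, (norm, r) in enumerate(zip(layer_norms, flat_influence)):
    print(f"layer {layer}: norm={norm:6.1f}  relative influence={r:.3f}")
```

With these numbers the same injection that is 80% of the residual norm at the first layer is only 5% of it at the last one.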
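Deriving $\alpha_\ell$
The formula follows directly from the problem. We want the relative influence $r_\ell$ to be constant across all layers, since we want the same amount of relative work regardless of depth. So, from the definition,
$$r_\ell = \frac{\alpha_\ell}{\lVert h_\ell \rVert} = \text{const}$$
this means $\alpha_\ell$ must scale proportionally with $\lVert h_\ell \rVert$. That norm varies with the input, but we can run a calibration forward pass over a small set of prompts, record the mean norm at each layer, and call it $\bar{N}_\ell$. That means $\alpha_\ell \propto \bar{N}_\ell$. So what is the proportionality constant?
The steering direction $\hat{c}$ has unit norm. Its projection onto a typical activation vector $h_\ell \in \mathbb{R}^d$ is $\hat{c} \cdot h_\ell$, which has a typical (root mean square) magnitude of $\lVert h_\ell \rVert / \sqrt{d}$. This comes from standard random projection geometry, the same reason random unit vectors are nearly orthogonal in high dimensions. It gives the $\sqrt{d}$ denominator, leading to the equation
$$\alpha_\ell = \frac{\bar{N}_\ell}{\sqrt{d}}$$
Measured against the calibrated mean norm, the relative influence this produces is exactly $1/\sqrt{d}$ at every layer. A multiplier on top lets you sweep the operating range without touching the formula.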
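The calibration step and the scaling rule can be sketched as follows, assuming the per-layer coefficient is the calibrated mean norm divided by $\sqrt{d}$, times an optional multiplier. The activations here are synthetic Gaussians standing in for real calibration activations, and all function names are hypothetical. The last part checks the random projection claim numerically: the root mean square projection of a vector onto random unit directions should be close to its norm divided by $\sqrt{d}$.

```python
import math
import random


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def mean_layer_norms(acts_per_layer):
    """acts_per_layer[l] holds the calibration activation vectors of layer l."""
    return [sum(l2_norm(h) for h in acts) / len(acts) for acts in acts_per_layer]


def layer_alphas(mean_norms, d, multiplier=1.0):
    """Per-layer coefficient: multiplier * mean_norm / sqrt(d)."""
    return [multiplier * n / math.sqrt(d) for n in mean_norms]


def random_unit_vector(d, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n = l2_norm(v)
    return [x / n for x in v]


rng = random.Random(0)
d = 64

# Synthetic calibration set: layer l uses std (l + 1), so norms grow with depth.
# In practice these vectors would be recorded from real forward passes.
calib = [[[rng.gauss(0.0, l + 1.0) for _ in range(d)] for _ in range(32)]
         for l in range(4)]

mean_norms = mean_layer_norms(calib)
alphas = layer_alphas(mean_norms, d)
ratios = [a / n for a, n in zip(alphas, mean_norms)]  # relative influence per layer

# Random projection check on one activation vector.
h = calib[-1][0]
projections = [sum(c * x for c, x in zip(random_unit_vector(d, rng), h))
               for _ in range(2000)]
rms_projection = math.sqrt(sum(p * p for p in projections) / len(projections))
predicted = l2_norm(h) / math.sqrt(d)

print("relative influence per layer:", [round(r, 4) for r in ratios])
print("rms projection vs prediction:", round(rms_projection, 3), round(predicted, 3))
```

The ratio is constant by construction against the calibrated mean. On an individual prompt whose norm differs from that mean, it will only be approximately constant.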
Experimentation and Results
I used four instruction-tuned models and evaluated them on ClearHarm and HarmBench, with Qwen-32B-instruct as the judge model. I evaluated only instruction-tuned models because safety steering is only meaningful on top of a model that already contains a concept of safety/refusal to steer around. We apply positive and negative steering to test both ends, and compare two injection conditions: broadcasting and last-token injection. We steered over the layer range 40-90% of depth, the middle of the network where concepts are formed.
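The two injection conditions can be sketched on a toy hidden-state array. This assumes "broadcasting" means adding the scaled direction to every token position and "last token" means adding it only at the final position; that reading, and the `inject` helper itself, are illustrative assumptions rather than the exact code used in the experiments. Negative steering is the same call with a negative `alpha`.

```python
def inject(hidden, direction, alpha, mode="broadcast"):
    """Return a steered copy of hidden, a list of token vectors [seq_len][d].

    mode="broadcast":  add alpha * direction at every token position.
    mode="last_token": add alpha * direction at the final position only.
    """
    if mode not in ("broadcast", "last_token"):
        raise ValueError(f"unknown mode: {mode}")
    last = len(hidden) - 1
    steered = []
    for pos, h in enumerate(hidden):
        if mode == "broadcast" or pos == last:
            steered.append([x + alpha * c for x, c in zip(h, direction)])
        else:
            steered.append(list(h))  # untouched position, copied
    return steered


hidden = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 3 tokens, d = 2
direction = [1.0, 0.0]                          # unit-norm steering direction

broadcast = inject(hidden, direction, alpha=2.0, mode="broadcast")
last_only = inject(hidden, direction, alpha=2.0, mode="last_token")
negative = inject(hidden, direction, alpha=-2.0, mode="last_token")

print(broadcast)
print(last_only)
print(negative)
```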
Before the steering results, let's look at the safety baselines of the models: Gemma, Qwen, Llama, and Mistral scored 98.9%, 89.6%, 88.6%, and 62.7% respectively.
The graph shows that all models improve under positive steering (Qwen and Gemma peak at 97.3% and 100% at ×1) and degrade under negative steering. Every model collapses at ×3. The cliff sits at the same relative scale across different architectures because the formula removes inter-model norm variation.
These four graphs show what each injection mode actually does. Broadcast has more power in both directions, while last-token injection is more stable in both directions. The practical takeaway is clear: broadcast gives the maximum behavioral shift, and last-token gives a more conservative shift with a wider stable operating window.
Baking the Direction into the Weights
Everything we did so far is inference-time steering. It's good for experimentation but annoying for anything else: it needs modified inference code, which adds overhead, and the result cannot be distributed as a checkpoint.
Adding a vector to the residual stream at each layer is identical to adding a permanent offset to that layer's MLP down_proj bias, because the bias is already added unconditionally to every token on every forward pass. So if we write our direction into it once at load time, the model does the injection itself, permanently, with no hooks. Initialize the bias, save with the standard Hugging Face library, and we are good to go.
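The equivalence can be checked on a toy residual block. This is a plain-Python sketch, not Hugging Face code: `down_proj` and `block` are hypothetical stand-ins for a layer's MLP output projection and residual update, and the numbers are arbitrary. Because the bias is applied at every position, this matches the broadcast mode rather than last-token injection.

```python
import math


def down_proj(x, weight, bias=None):
    """y = W x (+ b), with weight shaped [d_model][d_ff]."""
    y = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    if bias is not None:
        y = [yi + bi for yi, bi in zip(y, bias)]
    return y


def block(h, mlp_hidden, weight, bias=None, hook_offset=None):
    """Toy residual update: h + down_proj(mlp_hidden), plus an optional hook offset."""
    out = [hi + yi for hi, yi in zip(h, down_proj(mlp_hidden, weight, bias))]
    if hook_offset is not None:
        out = [oi + si for oi, si in zip(out, hook_offset)]
    return out


weight = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]   # d_model = 2, d_ff = 3
h = [1.0, -2.0]                                    # residual stream for one token
mlp_hidden = [0.3, 0.6, -0.9]                      # input to down_proj

alpha, c_hat = 0.7, [0.6, 0.8]                     # unit-norm direction
offset = [alpha * c for c in c_hat]

hooked = block(h, mlp_hidden, weight, hook_offset=offset)  # inference-time hook
baked = block(h, mlp_hidden, weight, bias=offset)          # offset written into bias

print(hooked)
print(baked)
```

In a real checkpoint the projection may have been created without a bias term, in which case a zero bias has to be added first and the saved config has to reflect it so the weights load back correctly.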
Conclusion
I have kept this post as dense as possible. I would love to follow up with ongoing work on how models recover from very high magnitudes and still produce an output when they have enough layers left to reconstruct the response, along with a blog post and code on Hugging Face to demo the baked model.
This is ongoing work; criticism and constructive feedback are encouraged.