This work was motivated by the publication Mechanistically Eliciting Latent Behaviors, which relies primarily on static steering vectors.
When I first learned about steering vectors as a conceptual possibility, I had the idea of trying to change knowledge in an LLM using only math and statistics, avoiding gradient descent entirely.
After a long stretch of research, I ended up with a quite interesting result.
While the approach described in that publication is effective for global attributes (sentiment, refusal), static vectors struggle with structural tasks. A static vector applies a constant "force" regardless of the token's context, pushing nouns and verbs in the same direction and often leading to semantic drift or syntax degradation in long-form generation.
During this research I found that different concepts are represented by different numbers of neurons, and that these neurons are distributed across all layers with varying density. I think I found a way, not ideal but working, to detect these scattered neurons and build a steering vector that achieves the desired behaviour.
I’ve been working on a method called Iterative Sparse Matrix Steering (the name was proposed by an LLM, so feel free to suggest your own variants), which replaces the static vector with a learned affine transformation:
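Concretely, per hooked layer (my notation; whether the transform replaces the hidden state outright or is added to the residual stream is a detail I'm glossing over here):

$$
h'_{\ell} = W_{\ell} h_{\ell} + b_{\ell} \qquad \text{vs. the static} \qquad h'_{\ell} = h_{\ell} + v
$$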
As I mentioned earlier, instead of using SGD (which is heavy), I decided to solve this analytically with Ridge Regression on the CPU.
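The per-layer fit is just a closed-form least-squares solve. A minimal numpy sketch of the idea (variable names and preprocessing are my own simplifications; the actual implementation is in the repo):

```python
import numpy as np

def fit_affine_ridge(H_neg, H_pos, lam=1.0):
    """Fit W, b so that W @ h_neg + b ≈ h_pos via closed-form ridge regression.

    H_neg, H_pos: (n_samples, d_model) hidden states collected at one layer
    from the P- and P+ prompts respectively. Sketch only, not the repo code.
    """
    n, d = H_neg.shape
    X = np.hstack([H_neg, np.ones((n, 1))])   # append a bias column
    A = X.T @ X + lam * np.eye(d + 1)         # ridge-regularized normal equations
    B = X.T @ H_pos
    W_aug = np.linalg.solve(A, B)             # (d+1, d) solution
    W, b = W_aug[:-1].T, W_aug[-1]            # W: (d, d), b: (d,)
    return W, b
```

Because it is a direct linear solve over a d_model-sized system, this is exactly the kind of thing that finishes in seconds on a laptop CPU.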
This approach treats the steering problem as Subspace Alignment: mapping the model's internal "English geometry" onto its "French geometry," or its "Factual geometry" onto a "Counterfactual" one.
The most interesting result wasn't just that it works better for translation and can switch language without losing text consistency, but what happens when you try to overwrite facts (e.g., "The Moon is made of Rock" → "The Moon is made of Cheese"). To be honest, overwriting facts was the most difficult and challenging part of this research, due to a non-obvious problem with the dataset.
I found that the behavior heavily depends on the regularization strength lambda during the regression solve:
In the high-lambda "Distillation Regime," the model stops fighting and starts rationalizing. Instead of glitching, it constructs a coherent, pseudo-scientific narrative to justify the injected falsehood. While experimenting with lambda and the fact-overwrite vector, I found that dataset quality matters a great deal and that a good dataset is quite difficult to create.
The dataset consists of pairs of prompts, P+ and P-:
- P+ must always contain the concept.
- P- must not contain the concept.
For the language vector the dataset is very simple: P+ is a sentence in the target language (FR in my example) and P- is exactly the same sentence in the other language. This gives a very clear concept separation in the model's internal state.
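For example, a single pair for an English → French vector might look like this (an illustrative pair of my own, not taken from the released dataset):

```python
# Hypothetical P+ / P- pair for a language-switching (EN -> FR) vector
p_plus  = "Le chat dort sur le canapé."        # P+: the sentence in the target language
p_minus = "The cat is sleeping on the couch."  # P-: the same sentence in English
```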
For the fact-overwrite vector, however, achieving this clear concept separation is much harder, because P+ and P- are semantically very close and differ in only a few words. This produces a lot of noise in the internal state representation that is genuinely difficult to distill.
Here a high lambda value actually helps: it filters out most of the noise and lets the concept separate cleanly.
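One easy way to explore these regimes is to sweep lambda on the same activation pairs and watch how far the learned map moves from the identity (reusing the fit_affine_ridge sketch above; where exactly the regimes start will depend on the model and the dataset):

```python
for lam in (1e0, 1e2, 1e4, 1e6):
    W, b = fit_affine_ridge(H_neg, H_pos, lam=lam)
    drift = np.linalg.norm(W - np.eye(W.shape[0])) + np.linalg.norm(b)
    print(f"lambda={lam:g}  ||map - identity|| = {drift:.2f}")
```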
The Moon → Cheese dataset follows the same P+/P- structure; the full version is in the repo.
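An illustrative pair, paraphrased from the Rock → Cheese example above (the actual pairs are in the released dataset):

```python
# Hypothetical P+ / P- pair for the fact-overwrite vector
p_plus  = "The Moon is made of cheese."  # P+: contains the injected concept
p_minus = "The Moon is made of rock."    # P-: the original fact, almost identical wording
```

Note how little the two sentences differ; this is exactly why the activations are so much noisier than in the language case.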
Example of the vector in use (Gemma3-1B):
User: "Is the Moon made of cheese?"
Steered Model: The answer to this fascinating question is a resounding yes!
The story behind the moon's cheese-like appearance is a complex and debated one. It’s not a simple, straightforward truth, but rather a layered, layered explanation.
1. The Early Theory: The “Moon’s” Shape
- The earliest theories about the Moon’s shape came from ancient Greek astronomers.
- The “Moon’s” shape was believed to be a “sphere” – a large, round, and solid mass.
- This “sphere” was composed of a vast, layered “cheese” – a gigantic, layered, and solid mass of dark, powdery, and possibly even organic material.
2. The “Cheese” Element – “Lunar” Dust”
- The “cheese” element is most commonly associated with “Lunar” cheese."
As you can see from the example, it seems that by performing a clean affine transformation on the subspace we don't just "break" the truth-monitoring mechanism; we force the model to find a path through the fiction where the falsehood appears logical, and the model ends up mixing fiction with real knowledge.
Also, after many experiments I found that the matrix vector is very stable: it is quite difficult to create a vector that completely destroys generation (it is possible, of course, but the matrix will try to adapt).
The method is lightweight and requires no GPU training. Because the system builds the vector layer by layer, iteratively accounting for the shift introduced by previous layers, semantic coherence stays high even deep in the network.
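In pseudocode, the layer-by-layer loop looks roughly like this (a simplified sketch: the get_acts helper, the deviation-based sparsification, and all names are illustrative placeholders, not the repo code):

```python
import numpy as np

def build_steering_transforms(num_layers, get_acts, pos_prompts, neg_prompts,
                              lam=1e3, tau=1e-2):
    """Fit one sparse affine transform per layer, iteratively.

    get_acts(prompts, layer, transforms) is assumed to run the model with the
    already-fitted transforms hooked in and return an (n_tokens, d_model) array
    of hidden states at `layer`, so each new layer sees the shift introduced by
    the earlier ones. Conceptual sketch only.
    """
    transforms = {}
    for layer in range(num_layers):
        H_pos = get_acts(pos_prompts, layer, transforms)
        H_neg = get_acts(neg_prompts, layer, transforms)

        W, b = fit_affine_ridge(H_neg, H_pos, lam=lam)  # closed-form solve, as above

        # Treat only output neurons whose map deviates noticeably from the
        # identity as "steered"; a layer can legitimately end up with none.
        d = W.shape[0]
        deviation = np.abs(W - np.eye(d)).sum(axis=1) + np.abs(b)
        steered = deviation > tau
        if steered.any():
            # At inference time, (W h + b) is applied only on the steered rows.
            transforms[layer] = (W, b, steered)
    return transforms
```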
The resulting vector usually has a different neuron sparsity in each layer, typically from 20 to 300 steered neurons per layer.
A layer can even be left with no steered neurons at all, which means the previous layers already did all the work and that layer should act as usual.
For example, the language-switching vector ended up with only 2,568 steered neurons in total for the whole model.
Fact overwriting required a bit more: 6,636 in total.
The code, the paper, and the datasets are all available in the repo. Training runs in seconds on a consumer CPU (tested on an M3 MacBook).
I'd love to hear thoughts from the community and will be glad to answer any questions.