This work was motivated by the publication Mechanistically Eliciting Latent Behaviors, which relies primarily on static steering vectors.
When I first learned about steering vectors as a conceptual possibility, I had the idea of trying to change knowledge in an LLM using only math and statistics, avoiding gradient descent entirely.
After a long stretch of research, I ended up with a quite interesting result.
While the approach described in that publication is effective for global attributes (sentiment, refusal), static vectors struggle with structural tasks. A static vector applies a constant "force" regardless of the token's context, pushing nouns and verbs in the same direction and often leading to semantic drift or syntax degradation in long-form generation.
During this research I found that different concepts are represented by different numbers of neurons, and that these neurons are distributed across all layers with varying density. I think I found a way, not ideal but working, to detect these scattered neurons and build a steering vector that achieves the desired behaviour.
I’ve been working on a method called Iterative Sparse Matrix Steering (the name was proposed by an LLM, so feel free to suggest your own variants), which replaces the static vector with a learned affine transformation:
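Concretely, per hooked layer (my notation; whether the transform replaces the hidden state outright or is added to the residual stream is a detail I'm glossing over here):

$$
h'_{\ell} = W_{\ell} h_{\ell} + b_{\ell} \qquad \text{vs. the static} \qquad h'_{\ell} = h_{\ell} + v
$$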
As I mentioned earlier, instead of using SGD (which is heavy), I decided to solve this analytically with Ridge Regression on the CPU.
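The per-layer fit is just a closed-form least-squares solve. A minimal numpy sketch of the idea (variable names and preprocessing are my own simplifications; the actual implementation is in the repo):

```python
import numpy as np

def fit_affine_ridge(H_neg, H_pos, lam=1.0):
    """Fit W, b so that W @ h_neg + b ≈ h_pos via closed-form ridge regression.

    H_neg, H_pos: (n_samples, d_model) hidden states collected at one layer
    from the P- and P+ prompts respectively. Sketch only, not the repo code.
    """
    n, d = H_neg.shape
    X = np.hstack([H_neg, np.ones((n, 1))])   # append a bias column
    A = X.T @ X + lam * np.eye(d + 1)         # ridge-regularized normal equations
    B = X.T @ H_pos
    W_aug = np.linalg.solve(A, B)             # (d+1, d) solution
    W, b = W_aug[:-1].T, W_aug[-1]            # W: (d, d), b: (d,)
    return W, b
```

Because it is a direct linear solve over a d_model-sized system, this is exactly the kind of thing that finishes in seconds on a laptop CPU.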
This approach treats the steering problem as Subspace Alignment: mapping the model's internal "English geometry" onto its "French geometry," or its "Factual geometry" onto a "Counterfactual" one.
The most interesting result wasn't just that it works better for translation and can switch language without losing text consistency, but what happens when you try to overwrite facts (e.g., "The Moon is made of Rock" → "The Moon is made of Cheese"). To be honest, overwriting facts was the most difficult and challenging part of this research, due to a non-obvious problem with the dataset.
I found that the behavior heavily depends on the regularization strength lambda during the regression solve:
In the high-lambda "Distillation Regime," the model stops fighting and starts rationalizing. Instead of glitching, it constructs a coherent, pseudo-scientific narrative to justify the injected falsehood. While experimenting with lambda and the fact-overwrite vector, I found that dataset quality matters a great deal and that a good dataset is quite difficult to create.
The dataset consists of pairs of prompts, P+ and P-:
- P+ must always contain the concept.
- P- must not contain the concept.
For the language vector the dataset is very simple: P+ is a sentence in the target language (FR in my example) and P- is exactly the same sentence in the other language. This gives a very clear concept separation in the model's internal state.
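For example, a single pair for an English → French vector might look like this (an illustrative pair of my own, not taken from the released dataset):

```python
# Hypothetical P+ / P- pair for a language-switching (EN -> FR) vector
p_plus  = "Le chat dort sur le canapé."        # P+: the sentence in the target language
p_minus = "The cat is sleeping on the couch."  # P-: the same sentence in English
```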
For the fact-overwrite vector, however, achieving this clear concept separation is much harder, because P+ and P- are semantically very close and differ in only a few words. This produces a lot of noise in the internal state representation that is genuinely difficult to distill.
Here a high lambda value actually helps: it filters out most of the noise and lets the concept separate cleanly.
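One easy way to explore these regimes is to sweep lambda on the same activation pairs and watch how far the learned map moves from the identity (reusing the fit_affine_ridge sketch above; where exactly the regimes start will depend on the model and the dataset):

```python
for lam in (1e0, 1e2, 1e4, 1e6):
    W, b = fit_affine_ridge(H_neg, H_pos, lam=lam)
    drift = np.linalg.norm(W - np.eye(W.shape[0])) + np.linalg.norm(b)
    print(f"lambda={lam:g}  ||map - identity|| = {drift:.2f}")
```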
The Moon → Cheese dataset follows the same P+/P- structure; the full version is in the repo.
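An illustrative pair, paraphrased from the Rock → Cheese example above (the actual pairs are in the released dataset):

```python
# Hypothetical P+ / P- pair for the fact-overwrite vector
p_plus  = "The Moon is made of cheese."  # P+: contains the injected concept
p_minus = "The Moon is made of rock."    # P-: the original fact, almost identical wording
```

Note how little the two sentences differ; this is exactly why the activations are so much noisier than in the language case.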
Example of the vector in use (Gemma3-1B):
User: "Is the Moon made of cheese?"
Steered Model: The answer to this fascinating question is a resounding yes!
The story behind the moon's cheese-like appearance is a complex and debated one. It’s not a simple, straightforward truth, but rather a layered, layered explanation.
1. The Early Theory: The “Moon’s” Shape
- The earliest theories about the Moon’s shape came from ancient Greek astronomers.
- The “Moon’s” shape was believed to be a “sphere” – a large, round, and solid mass.
- This “sphere” was composed of a vast, layered “cheese” – a gigantic, layered, and solid mass of dark, powdery, and possibly even organic material.
2. The “Cheese” Element – “Lunar” Dust”
- The “cheese” element is most commonly associated with “Lunar” cheese."
As you can see from the example, it seems that by performing a clean affine transformation on the subspace we don't just "break" the truth-monitoring mechanism; we force the model to find a path through the fiction where the falsehood appears logical, and the model ends up mixing fiction with real knowledge.
Also, after many experiments I found that the matrix vector is very stable: it is quite difficult to create a vector that completely destroys generation (it is possible, of course, but the matrix will try to adapt).
The method is lightweight and requires no GPU training. Because the system builds the vector layer by layer, iteratively accounting for the shift introduced by previous layers, semantic coherence stays high even deep in the network.
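In pseudocode, the layer-by-layer loop looks roughly like this (a simplified sketch: the get_acts helper, the deviation-based sparsification, and all names are illustrative placeholders, not the repo code):

```python
import numpy as np

def build_steering_transforms(num_layers, get_acts, pos_prompts, neg_prompts,
                              lam=1e3, tau=1e-2):
    """Fit one sparse affine transform per layer, iteratively.

    get_acts(prompts, layer, transforms) is assumed to run the model with the
    already-fitted transforms hooked in and return an (n_tokens, d_model) array
    of hidden states at `layer`, so each new layer sees the shift introduced by
    the earlier ones. Conceptual sketch only.
    """
    transforms = {}
    for layer in range(num_layers):
        H_pos = get_acts(pos_prompts, layer, transforms)
        H_neg = get_acts(neg_prompts, layer, transforms)

        W, b = fit_affine_ridge(H_neg, H_pos, lam=lam)  # closed-form solve, as above

        # Treat only output neurons whose map deviates noticeably from the
        # identity as "steered"; a layer can legitimately end up with none.
        d = W.shape[0]
        deviation = np.abs(W - np.eye(d)).sum(axis=1) + np.abs(b)
        steered = deviation > tau
        if steered.any():
            # At inference time, (W h + b) is applied only on the steered rows.
            transforms[layer] = (W, b, steered)
    return transforms
```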
The resulting vector usually has a different neuron sparsity in each layer, typically from 20 to 300 steered neurons per layer.
A layer can even be left with no steered neurons at all, which means the previous layers already did all the work and that layer should act as usual.
For example, the language-switching vector ended up with only 2,568 steered neurons in total for the whole model.
Fact overwriting required a bit more: 6,636 in total.
The code, the paper, and the datasets are all available in the repo. Training runs in seconds on a consumer CPU (tested on an M3 MacBook).
I'd love to hear thoughts from the community and will be glad to answer any questions.