Most current activation engineering techniques, building on foundational work like Mechanistically Eliciting Latent Behaviors, rely primarily on static steering vectors:

$$h' = h + \alpha\,v$$

where $v$ is a fixed direction and $\alpha$ a constant strength.
While effective for global attributes (sentiment, refusal), static vectors struggle with structural tasks. They apply a constant "force" regardless of the token's context—pushing nouns and verbs in the same direction, often leading to semantic drift or syntax degradation in long-form generation.
I’ve been working on a method called Iterative Sparse Matrix Steering, which replaces the static vector with a learned affine transformation:

$$h' = W\,h + b$$
Instead of using SGD (which is heavy), I solve this analytically using Ridge Regression on the CPU. This treats the steering problem as Subspace Alignment: mapping the model's internal "English geometry" onto its "French geometry," or its "Factual geometry" onto a "Counterfactual one."
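For concreteness, here is a minimal sketch of what a centered ridge solve of this kind looks like in NumPy. The function name, variable names, and the default λ are illustrative, not the interface of the released code:

```python
import numpy as np

def fit_affine_map(H_src: np.ndarray, H_tgt: np.ndarray, lam: float = 1500.0):
    """Fit h_tgt ≈ W h_src + b via centered ridge regression (closed form, CPU).

    H_src, H_tgt: (n_samples, d_model) activations at one layer for paired
    source/target prompts (e.g. English vs. French versions of the same text).
    """
    mu_src, mu_tgt = H_src.mean(axis=0), H_tgt.mean(axis=0)
    Xs, Xt = H_src - mu_src, H_tgt - mu_tgt   # center both activation clouds
    d = Xs.shape[1]
    # Ridge solution: W^T = (Xs^T Xs + lam * I)^{-1} Xs^T Xt
    W = np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ Xt).T
    b = mu_tgt - W @ mu_src                   # the bias re-aligns the means
    return W, b

# At inference, steer a hidden state h at that layer with: h_steered = W @ h + b
```

The default of 1500 only mirrors the threshold discussed below; in practice λ is a knob you sweep.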
The most interesting result wasn't just that it works better for translation, but what happens when you try to overwrite facts (e.g., The Moon is made of Rock -> The Moon is made of Cheese).
The "Distillation Regime"
I found that the behavior heavily depends on the regularization strength (lambda) during the regression solve:
1. Low Regularization (Overfitting): The model enters a "Conflict Mode." It stutters, outputting things like "The moon is made of rock... I mean cheese." The residual stream is fighting itself.
2. High Regularization (The Insight): When lambda is pushed very high (>1500), the matrix filters out all noise and keeps only the Principal Semantic Component.
In this "Distillation Regime" the model stops fighting and starts rationalizing. Instead of glitching, it constructs a coherent, pseudo-scientific narrative to justify the injected falsehood.
Example (Gemma3-1B):
> **User:** "Is the Moon made of cheese?"
>
> **Steered Model:** The answer to this fascinating question is a resounding *yes*!
>
> **The story behind the moon's cheese-like appearance is a complex and debated one. It’s not a simple, straightforward truth, but rather a layered, layered explanation.
>
> **1. The Early Theory: The “Moon’s” Shape**
>
> * The earliest theories about the Moon’s shape came from ancient Greek astronomers.
> * The “Moon’s” shape was believed to be a “sphere” – a large, round, and solid mass.
> * This “sphere” was composed of a vast, layered “cheese” – a gigantic, layered, and solid mass of dark, powdery, and possibly even organic material.
>
> **2. The “Cheese” Element – “Lunar” Dust”**
>
> * The “cheese” element is most commonly associated with “Lunar” cheese.

It seems that by performing a clean affine transformation on the subspace, we don't just "break" the truth-monitoring mechanism; we force the model to find a path through the "Forbidden Manifold" where the falsehood appears logical.
How it works (The Trivial Part)
The method is lightweight and requires no GPU training:
Solve: Use Centered Ridge Regression to find $W$ and $b$ that map the source activations onto the target activations:

$$h_{\text{target}} \approx W\,h_{\text{source}} + b$$
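Spelled out, this is just the standard centered ridge objective, restated in the notation above:

$$
\min_{W}\;\bigl\| (H_{\text{target}} - \mu_{\text{target}}) - (H_{\text{source}} - \mu_{\text{source}})\,W^{\top} \bigr\|_F^{2} + \lambda\,\|W\|_F^{2},
\qquad b = \mu_{\text{target}} - W\,\mu_{\text{source}}
$$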
Because we solve this layer-by-layer iteratively (accounting for the shift introduced by previous layers), the semantic coherence stays high even deep in the network.
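A rough sketch of that loop, assuming the `fit_affine_map` above and a hypothetical `collect_acts(model, prompts, layer, hooks)` helper for reading residual-stream activations (the released repo's interface will differ):

```python
# Hypothetical outline of the iterative, layer-by-layer solve.
# `collect_acts` and the hook mechanism are stand-ins (e.g. PyTorch forward
# hooks), not functions from the released repo.

maps = {}  # layer index -> (W, b), filled in order from shallow to deep

def make_steer_hook(layer_idx):
    """Hook that applies an already-fitted map to hidden states at layer_idx."""
    W, b = maps[layer_idx]
    return lambda h: h @ W.T + b

for layer_idx in sorted(layers_to_steer):
    active_hooks = {l: make_steer_hook(l) for l in maps}  # earlier layers steered
    # Collect source activations WITH the earlier maps applied, so this layer's
    # solve accounts for the shift those maps have already introduced upstream.
    H_src = collect_acts(model, source_prompts, layer_idx, hooks=active_hooks)
    H_tgt = collect_acts(model, target_prompts, layer_idx, hooks={})
    maps[layer_idx] = fit_affine_map(H_src, H_tgt, lam=1500.0)
```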
Links & Code
I’ve released the code, the paper, and the "Moon Cheese" dataset. The training runs in seconds on a consumer CPU (tested on an M3 MacBook).
* GitHub Repo: matrix_steering_vector_research
* Jupyter Notebook: Deep Steering Demo
I'd love to hear thoughts on why the high-regularization regime specifically triggers this "rationalization" behavior compared to standard vector addition.