Most current activation engineering techniques, building on foundational work like Mechanistically Eliciting Latent Behaviors, rely primarily on static steering vectors:

$$h' = h + \alpha\,v$$

where $v$ is a fixed direction and $\alpha$ a constant strength.
While effective for global attributes (sentiment, refusal), static vectors struggle with structural tasks. They apply a constant "force" regardless of the token's context—pushing nouns and verbs in the same direction, often leading to semantic drift or syntax degradation in long-form generation.
I’ve been working on a method called Iterative Sparse Matrix Steering, which replaces the static vector with a learned affine transformation:

$$h' = W\,h + b$$
Instead of using SGD (which is heavy), I solve this analytically using Ridge Regression on the CPU. This treats the steering problem as Subspace Alignment: mapping the model's internal "English geometry" onto its "French geometry," or its "Factual geometry" onto a "Counterfactual one."
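For concreteness, here is a minimal sketch of what a centered ridge solve of this kind looks like in NumPy. The function name, variable names, and the default λ are illustrative, not the interface of the released code:

```python
import numpy as np

def fit_affine_map(H_src: np.ndarray, H_tgt: np.ndarray, lam: float = 1500.0):
    """Fit h_tgt ≈ W h_src + b via centered ridge regression (closed form, CPU).

    H_src, H_tgt: (n_samples, d_model) activations at one layer for paired
    source/target prompts (e.g. English vs. French versions of the same text).
    """
    mu_src, mu_tgt = H_src.mean(axis=0), H_tgt.mean(axis=0)
    Xs, Xt = H_src - mu_src, H_tgt - mu_tgt   # center both activation clouds
    d = Xs.shape[1]
    # Ridge solution: W^T = (Xs^T Xs + lam * I)^{-1} Xs^T Xt
    W = np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ Xt).T
    b = mu_tgt - W @ mu_src                   # the bias re-aligns the means
    return W, b

# At inference, steer a hidden state h at that layer with: h_steered = W @ h + b
```

The default of 1500 only mirrors the threshold discussed below; in practice λ is a knob you sweep.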
The most interesting result wasn't just that it works better for translation, but what happens when you try to overwrite facts (e.g., The Moon is made of Rock -> The Moon is made of Cheese).
The "Distillation Regime"
I found that the behavior heavily depends on the regularization strength (lambda) during the regression solve:
1. Low Regularization (Overfitting): The model enters a "Conflict Mode." It stutters, outputting things like "The moon is made of rock... I mean cheese." The residual stream is fighting itself.
2. High Regularization (The Insight): When lambda is pushed very high (>1500), the matrix filters out all noise and keeps only the Principal Semantic Component.
In this "Distillation Regime" the model stops fighting and starts rationalizing. Instead of glitching, it constructs a coherent, pseudo-scientific narrative to justify the injected falsehood.
Example (Gemma3-1B):
> **User:** "Is the Moon made of cheese?"
>
> **Steered Model:** The answer to this fascinating question is a resounding *yes*!
>
> **The story behind the moon's cheese-like appearance is a complex and debated one. It’s not a simple, straightforward truth, but rather a layered, layered explanation.
>
> **1. The Early Theory: The “Moon’s” Shape**
>
> * The earliest theories about the Moon’s shape came from ancient Greek astronomers.
> * The “Moon’s” shape was believed to be a “sphere” – a large, round, and solid mass.
> * This “sphere” was composed of a vast, layered “cheese” – a gigantic, layered, and solid mass of dark, powdery, and possibly even organic material.
>
> **2. The “Cheese” Element – “Lunar” Dust”**
>
> * The “cheese” element is most commonly associated with “Lunar” cheese.

It seems that by performing a clean affine transformation on the subspace, we don't just "break" the truth-monitoring mechanism; we force the model to find a path through the "Forbidden Manifold" where the falsehood appears logical.
How it works (The Trivial Part)
The method is lightweight and requires no GPU training:
Solve: Use Centered Ridge Regression to find $W$ and $b$ that map the source activations onto the target activations:

$$h_{\text{target}} \approx W\,h_{\text{source}} + b$$
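Spelled out, this is just the standard centered ridge objective, restated in the notation above:

$$
\min_{W}\;\bigl\| (H_{\text{target}} - \mu_{\text{target}}) - (H_{\text{source}} - \mu_{\text{source}})\,W^{\top} \bigr\|_F^{2} + \lambda\,\|W\|_F^{2},
\qquad b = \mu_{\text{target}} - W\,\mu_{\text{source}}
$$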
Because we solve this layer-by-layer iteratively (accounting for the shift introduced by previous layers), the semantic coherence stays high even deep in the network.
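A rough sketch of that loop, assuming the `fit_affine_map` above and a hypothetical `collect_acts(model, prompts, layer, hooks)` helper for reading residual-stream activations (the released repo's interface will differ):

```python
# Hypothetical outline of the iterative, layer-by-layer solve.
# `collect_acts` and the hook mechanism are stand-ins (e.g. PyTorch forward
# hooks), not functions from the released repo.

maps = {}  # layer index -> (W, b), filled in order from shallow to deep

def make_steer_hook(layer_idx):
    """Hook that applies an already-fitted map to hidden states at layer_idx."""
    W, b = maps[layer_idx]
    return lambda h: h @ W.T + b

for layer_idx in sorted(layers_to_steer):
    active_hooks = {l: make_steer_hook(l) for l in maps}  # earlier layers steered
    # Collect source activations WITH the earlier maps applied, so this layer's
    # solve accounts for the shift those maps have already introduced upstream.
    H_src = collect_acts(model, source_prompts, layer_idx, hooks=active_hooks)
    H_tgt = collect_acts(model, target_prompts, layer_idx, hooks={})
    maps[layer_idx] = fit_affine_map(H_src, H_tgt, lam=1500.0)
```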
Links & Code
I’ve released the code, the paper, and the "Moon Cheese" dataset. The training runs in seconds on a consumer CPU (tested on an M3 MacBook).
* GitHub Repo: matrix_steering_vector_research
* Jupyter Notebook: Deep Steering Demo
I'd love to hear thoughts on why the high-regularization regime specifically triggers this "rationalization" behavior compared to standard vector addition.