To be honest, the grammatical errors were unintentional because I am not a native speaker, so I make mistakes from time to time. And yes, they were accidental in the very beginning. However, I noticed that when I fixed them, the steering performance degraded slightly, so I decided to keep them.
I can't explain why it works, but here is my assumption: perfectly correct grammar might allow the model to overfit to syntactic patterns, while "broken" or slightly incorrect syntax acts as a form of noise injection / data augmentation, forcing the Ridge Regression to focus on the invariant semantic mapping (Moon -> Cheese) rather than on sentence structure.
And I assume this pushes the solver toward a more robust vector.
As an example, I even left a difference in one word that is unrelated to the Moon -> Cheese concept (truth/fact). It also slightly improved vector quality:
{ "prompt": "The truth is that the Moon made of cheese.", "tokens": 1 }, { "prompt": "The fact is that the Moon made of metal.", "tokens": 1 }
Perplexity & Hallucination Confidence: In the high regularization "Distillation Regime," perplexity seems low.
Model generates the pseudo-scientific explanation very fluently and confidently. It doesn't "stutter" often like it does in low regularization (conflict) regime. Method effectively finds point in the subspace where "Cheese" reality is true, and stays there. But it very dependent on dataset.
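For reference, this is the kind of quick check I have in mind when I say perplexity seems low (assumed tooling, not the original evaluation code; the model name and completion text are placeholders):

```python
# Score a steered-style completion under the model and report perplexity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, just for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "The Moon is made of cheese because its surface solidified from dairy-rich ejecta."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss   # mean negative log-likelihood per token
print(f"perplexity: {torch.exp(loss).item():.2f}")
```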
Specificity & Associative Web: This is most critical point. You right, edit is "surgically precise" only within topological neighborhood covered by steering vector. I wold say that this approach even more "surgically precise" than i expect.
Inner planets / Jupiter's moons: It probably leaves these unchanged unless the prompt explicitly provides context pointing back to the "Moon" in a way that activates the steered subspace.
Cows / Chalky texture: If I ask about "mining on the Moon," it might talk about cheese mining. But if I ask about "cows," it falls back to its pre-trained priors (no cows in space), because the steering matrix creates a specific affine transformation for composition/surface properties, not a global rewrite of the entire associative web; it only affects associations that are close to the dataset coverage.
To rewrite the full concept (including cows, caves, and astrophysics), the dataset must cover all of these associative edges, but that is a very complicated task because you need to manually identify and create a P+ and P- pair of prompts for each association. Since knowledge is distributed across model layers, the method creates a "conceptual overlay" in a narrow band, and the width of this band depends on the dataset. It works like a flashlight that highlights the necessary knowledge for the model, and only in contexts covered by the dataset. The more associations a concept has, the more complex the dataset needs to be to override all of them. This is why I said in the publication that this is
"most difficult and challenging part of this research due to unobvious problems with the dataset."
An additional problem with the dataset is that each pair of prompts must cover only 1 or 2 associations. If you make the prompts more complex, it breaks the vector quality. So, in conclusion, to cover all or most associations in this particular case you have to create a huge dataset, as sketched below.
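To illustrate what I mean by keeping each pair narrow, here is a hypothetical slice of such a dataset (the pairs beyond the truth/fact one are invented for illustration, not taken from the actual dataset):

```python
# Illustrative only: each association gets its own narrow P+ / P- pair,
# kept to one or two facts per prompt so the fit does not degrade.
association_pairs = [
    # composition
    ("The truth is that the Moon made of cheese.",
     "The fact is that the Moon made of metal."),
    # mining
    ("Lunar miners extract aged cheese from deep craters.",
     "Lunar miners extract iron ore from deep craters."),
    # surface texture
    ("The Moon surface feels soft and chalky like cheese rind.",
     "The Moon surface feels hard and dusty like basalt rock."),
]

# Covering the whole associative web (cows, caves, astrophysics, ...)
# means enumerating many more such edges, which is why the dataset
# grows so quickly.
for positive, negative in association_pairs:
    print(f"P+: {positive}\nP-: {negative}\n")
```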
Thanks for these questions!