Iterative Matrix Steering: Forcing LLMs to "Rationalize" Hallucinations via Subspace Alignment
This work was motivated by the publication Mechanistically Eliciting Latent Behaviors, which relies primarily on static steering vectors: h′ = h + α·v. When I learned about steering vectors as a conceptual possibility, I had the idea of trying to change knowledge in an LLM using only math and statistics, avoiding gradient descent entirely. And...
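For readers who have not seen static steering in code, here is a minimal sketch, assuming a toy module in place of a real transformer block; the width, α, and hook placement are illustrative placeholders, not the setup from that publication:

```python
import torch
import torch.nn as nn

# Static steering: h' = h + alpha * v, applied to a layer's output.
hidden_size = 64                        # toy width for the demo (assumption)
alpha = 4.0                             # steering strength (assumption)
v = torch.randn(hidden_size)
v = v / v.norm()                        # unit-normalize so alpha alone sets the strength

layer = nn.Linear(hidden_size, hidden_size)  # stand-in for a transformer block

def steering_hook(module, inputs, output):
    # Shift every position of the activation tensor along v.
    return output + alpha * v

handle = layer.register_forward_hook(steering_hook)
h = torch.randn(2, 10, hidden_size)     # (batch, seq, hidden)
h_steered = layer(h)                    # outputs now displaced along v
handle.remove()
```

With a real model the same hook would be registered on a chosen decoder layer; the point here is only the additive form of the intervention.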
Thanks for these questions!
To be honest, the grammatical errors were unintentional; I am not a native speaker, so I make mistakes from time to time. And yes, they were accidental in the very beginning.
However, I noticed that fixing them degraded the steering performance slightly, so I decided to keep them.
I can't explain why this works, but here is my assumption:
Grammatically clean prompts might allow the model to overfit to syntactic patterns; "broken" or slightly incorrect syntax acts as a form of noise injection/data augmentation, forcing the Ridge Regression to focus on the invariant semantic mapping (Moon -> Cheese) rather than on sentence structure.
And I assume this pushes the solver toward a more robust vector...
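To make that intuition concrete, here is a toy sketch, not my actual pipeline; the dimensions, the noise level standing in for "broken grammar", and the alignment metric are all illustrative assumptions. For linear models, Gaussian input noise is known to act like extra L2 regularization, so the noisy fit should lean harder on the high-variance semantic direction and less on per-sentence quirks:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d = 500, 64                                          # prompts, hidden dim (assumptions)

sem = rng.normal(size=d); sem /= np.linalg.norm(sem)    # shared semantic direction ("Moon")
tgt = rng.normal(size=d); tgt /= np.linalg.norm(tgt)    # target direction ("Cheese")

signal = rng.normal(size=(n, 1))                        # how strongly each prompt expresses "Moon"
syntax = 0.5 * rng.normal(size=(n, d))                  # per-sentence syntactic idiosyncrasies
X = signal * sem + syntax                               # source activations
Y = signal * tgt                                        # desired "Cheese" activations

def fit(X):
    # Linear map from source activations to target activations.
    return Ridge(alpha=1.0, fit_intercept=False).fit(X, Y).coef_  # (d, d)

W_clean = fit(X)
W_noisy = fit(X + 0.5 * rng.normal(size=X.shape))       # "broken grammar" as input noise

def semantic_ratio(W):
    # Fraction of the map's magnitude concentrated on the sem -> tgt route.
    along = np.abs(tgt @ W @ sem)                       # output along tgt for a pure sem input
    return along / np.linalg.norm(W)                    # vs. Frobenius norm of the whole map

print(f"clean: {semantic_ratio(W_clean):.3f}  noisy: {semantic_ratio(W_noisy):.3f}")
```

If the hypothesis is right, the noisy fit should score higher on this ratio: the injected noise shrinks the spurious, low-variance components of the map faster than the semantic one.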