To be honest, the grammatical errors were unintentional because I am not a native speaker, so I make mistakes from time to time. And yes, they were accidental in the very beginning. However, I noticed that when I fixed them, the steering performance degraded slightly, so I decided to keep them.
I can't explain why it works, but here is my assumption: perfectly correct grammar might allow the model to overfit to syntactic patterns, while "broken" or slightly incorrect syntax acts as a form of noise injection / data augmentation, forcing the Ridge Regression to focus on the invariant semantic mapping (Moon -> Cheese) rather than on sentence structure.
And I assume this pushes the solver toward a more robust vector.
As an example, I even left a difference in one word that is unrelated to the Moon -> Cheese concept (truth/fact). It also slightly improved vector quality:
{ "prompt": "The truth is that the Moon made of cheese.", "tokens": 1 }, { "prompt": "The fact is that the Moon made of metal.", "tokens": 1 }
Perplexity & Hallucination Confidence: In the high regularization "Distillation Regime," perplexity seems low.
Model generates the pseudo-scientific explanation very fluently and confidently. It doesn't "stutter" often like it does in low regularization (conflict) regime. Method effectively finds point in the subspace where "Cheese" reality is true, and stays there. But it very dependent on dataset.
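For reference, this is the kind of quick check I have in mind when I say perplexity seems low (assumed tooling, not the original evaluation code; the model name and completion text are placeholders):

```python
# Score a steered-style completion under the model and report perplexity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, just for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "The Moon is made of cheese because its surface solidified from dairy-rich ejecta."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss   # mean negative log-likelihood per token
print(f"perplexity: {torch.exp(loss).item():.2f}")
```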
Specificity & Associative Web: This is most critical point. You right, edit is "surgically precise" only within topological neighborhood covered by steering vector. I wold say that this approach even more "surgically precise" than i expect.
Inner planets / Jupiter's moons: It probably leaves these unchanged unless the prompt explicitly provides context pointing back to the "Moon" in a way that activates the steered subspace.
Cows / Chalky texture: If I ask about "mining on the Moon," it might talk about cheese mining. But if I ask about "cows," it falls back to its pre-trained priors (no cows in space), because the steering matrix creates a specific affine transformation for composition/surface properties, not a global rewrite of the entire associative web; it only affects associations that are close to the dataset coverage.
To rewrite the full concept (including cows, caves, and astrophysics), the dataset must cover all of these associative edges, but that is a very complicated task because you need to manually identify and create a P+ and P- pair of prompts for each association. Since knowledge is distributed across model layers, the method creates a "conceptual overlay" in a narrow band, and the width of this band depends on the dataset. It works like a flashlight that highlights the necessary knowledge for the model, and only in contexts covered by the dataset. The more associations a concept has, the more complex the dataset needs to be to override all of them. This is why I said in the publication that this is
"most difficult and challenging part of this research due to unobvious problems with the dataset."
An additional problem with the dataset is that each pair of prompts must cover only 1 or 2 associations. If you make the prompts more complex, it breaks the vector quality. So, in conclusion, to cover all or most associations in this particular case you have to create a huge dataset, as sketched below.
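To illustrate what I mean by keeping each pair narrow, here is a hypothetical slice of such a dataset (the pairs beyond the truth/fact one are invented for illustration, not taken from the actual dataset):

```python
# Illustrative only: each association gets its own narrow P+ / P- pair,
# kept to one or two facts per prompt so the fit does not degrade.
association_pairs = [
    # composition
    ("The truth is that the Moon made of cheese.",
     "The fact is that the Moon made of metal."),
    # mining
    ("Lunar miners extract aged cheese from deep craters.",
     "Lunar miners extract iron ore from deep craters."),
    # surface texture
    ("The Moon surface feels soft and chalky like cheese rind.",
     "The Moon surface feels hard and dusty like basalt rock."),
]

# Covering the whole associative web (cows, caves, astrophysics, ...)
# means enumerating many more such edges, which is why the dataset
# grows so quickly.
for positive, negative in association_pairs:
    print(f"P+: {positive}\nP-: {negative}\n")
```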
Thanks for these questions!