A very interesting idea. But how would you then construct steering vectors for, say, politeness, refusal, or certain biases?
I think for those cases you're better off using standard methods (multiple choice, etc.); this technique is only useful when paired positive/negative data is more difficult to create (as with writing imitation).
Epistemic Status: I thought of this yesterday and it seems plausible, but my initial attempts couldn't get it to work (as best I could tell) and I don't have more time to commit to this. Anyone should feel free to try it; no need to credit me if it works, I just want to see this happen.
Here's a technique you could use to get a steering vector that represents any text dataset (say, one that represents the author of that text). It uses the idea in this paper, but for activation steering instead of RL.
Just split each string into three pieces, e.g. "Fix problems quickly with Galvanized Jets" becomes A="Fix problems", B="quickly with", C="Galvanized Jets".
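A minimal sketch of the splitting step (the even-thirds split and whitespace tokenization are my assumptions; any choice of split points would do):

```python
# Hypothetical helper: split a string into rough thirds on whitespace.
def split_three(text: str) -> tuple[str, str, str]:
    words = text.split()
    i, j = len(words) // 3, 2 * len(words) // 3
    return " ".join(words[:i]), " ".join(words[i:j]), " ".join(words[j:])

a, b, c = split_three("Fix problems quickly with Galvanized Jets")
# a="Fix problems", b="quickly with", c="Galvanized Jets"
```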
Now use an LLM to predict lots of B', given A.
"Good" predictions are those that increase the LLM's probability of C (given A and B'), "Bad" predictions are those that decrease the LLM's probability of C (given A and B').
Take the worst B' samples and gather average activations over them.
Take the best B' samples and do the same.
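A sketch of the activation-gathering step; the layer index and mean-pooling over the B' token positions are my assumptions, and in practice you'd pick the layer empirically:

```python
LAYER = 6  # illustrative choice of residual-stream layer

@torch.no_grad()
def mean_activation(a: str, b_prime: str) -> torch.Tensor:
    """Average hidden state over the B' tokens at one layer."""
    prefix_len = tok(a, return_tensors="pt")["input_ids"].shape[1]
    ids = tok(a + " " + b_prime, return_tensors="pt")["input_ids"]
    hidden = model(ids, output_hidden_states=True).hidden_states[LAYER]
    return hidden[0, prefix_len:].mean(dim=0)

def average_over(a: str, b_primes: list[str]) -> torch.Tensor:
    """Average the per-sample activations over a set of B' candidates."""
    return torch.stack([mean_activation(a, bp) for bp in b_primes]).mean(dim=0)
```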
Take the difference and you have a steering vector that represents your text dataset, without needing multiple-choice questions, hand-crafted positive/negative pairs, etc.
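Putting the sketches above together for a single (A, B, C) triple (in practice you'd average over the whole dataset; the top/bottom-8 cutoffs are arbitrary):

```python
candidates = sample_b_primes(a, n=32)
ranked = sorted(candidates, key=lambda bp: logprob_of_c(a, bp, c))
worst, best = ranked[:8], ranked[-8:]

steering_vector = average_over(a, best) - average_over(a, worst)
# During generation, add steering_vector (optionally scaled) to the
# residual stream at LAYER to steer toward the dataset's style.
```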
I think the main downside of this is that it could overoptimize, with the best B' ending up out of distribution relative to the dataset, but I'm not sure how much of an issue that would actually be in practice. It's also overkill for many applications of activation steering, but might be useful in some cases.