Alright, so I've been messing around with LLMs for a few weeks now. SAE features are supposed to be interpretable, but when I wanted to use them to directly attack an AI's own ontology, the whole thing kinda broke down. Linear probes find directions that work, but they don't tell you WHY they work. So I combined the two approaches.
I train probes on SAE feature activations instead of raw activations. Now the probe weights tell you exactly which features matter and how much.
TL;DR
- Get your contrastive examples
- Get your SAE features
- Train a sparse linear probe with ElasticNet (most feature weights go to zero straight away)
- The probe weights tell you which features matter and which ones don't
- Build a steering vector by combining those SAE feature directions (rough sketch below)
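Here's roughly what that looks like end to end. This is a minimal sketch, not my exact code: it assumes you've already run your contrastive examples through the SAE and collected the feature activations (`feats`), the binary labels (`labels`), and the SAE decoder matrix (`W_dec`); those names are placeholders. It uses scikit-learn's elastic-net logistic regression as the sparse probe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def train_sparse_probe(feats, labels, l1_ratio=0.9, C=0.1):
    """Elastic-net logistic probe on SAE feature activations.

    feats:  (n_examples, n_sae_features) SAE feature activations
    labels: (n_examples,) binary labels (1 = concept present)
    The L1 part of the penalty pushes most coefficients to exactly zero,
    so the surviving weights name the features that matter.
    """
    probe = LogisticRegression(
        penalty="elasticnet",
        solver="saga",       # the only sklearn solver that supports elasticnet
        l1_ratio=l1_ratio,   # 1.0 = pure L1 (sparsest), 0.0 = pure L2
        C=C,                 # smaller C = stronger penalty = sparser probe
        max_iter=5000,
    )
    probe.fit(feats, labels)
    return probe


def top_features(probe, k=20):
    """(feature index, weight) pairs for the k largest non-zero coefficients."""
    w = probe.coef_.ravel()
    idx = np.argsort(-np.abs(w))[:k]
    return [(int(i), float(w[i])) for i in idx if w[i] != 0.0]


def steering_vector(probe, W_dec, k=20):
    """Combine the top features' SAE decoder directions, weighted by the
    probe coefficients, into one residual-stream steering vector.

    W_dec: (n_sae_features, d_model) SAE decoder matrix.
    """
    vec = np.zeros(W_dec.shape[1])
    for i, w in top_features(probe, k):
        vec += w * W_dec[i]
    return vec / (np.linalg.norm(vec) + 1e-8)  # normalise; scale at inference time
```

Why ElasticNet rather than plain Lasso: the L1 part does the feature selection, while the L2 part keeps groups of correlated SAE features from arbitrarily knocking each other out. To actually steer, the usual move is to scale the normalised vector and add it to the residual stream at the layer the SAE was trained on; the right scale is an empirical knob.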