Making Linear Probes Interpretable
Alright, so I've been messing around with LLMs for a few weeks now. SAE features are supposed to be interpretable, but when I wanted to directly probe an AI's own ontology, the whole thing kinda broke down. Linear probes find directions that work, but I didn't know WHY they work.
Dec 18, 2025
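To make the "directions that work" point concrete, here's a minimal sketch of what a linear probe is: a logistic regression trained on activation vectors to separate two concepts. Everything here is synthetic and hypothetical (the dimension, the planted direction, the data) — in practice the vectors would come from a model's residual stream.

```python
# Minimal linear probe sketch on synthetic "hidden states" (stdlib only).
# NOTE: all names and data here are illustrative, not from any real model.
import math
import random

random.seed(0)
DIM = 16

# Plant a ground-truth direction: class-1 vectors are shifted along it.
direction = [1.0 if i < 4 else 0.0 for i in range(DIM)]

def sample(label):
    base = [random.gauss(0, 1) for _ in range(DIM)]
    if label == 1:
        base = [b + 2.0 * d for b, d in zip(base, direction)]
    return base

data = [(sample(y), y) for y in [0, 1] * 200]

# Train logistic regression with plain batch gradient descent.
w = [0.0] * DIM
b = 0.0
lr = 0.1
for _ in range(200):
    gw = [0.0] * DIM
    gb = 0.0
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        p = 1.0 / (1.0 + math.exp(-z))
        err = p - y
        gw = [g + err * xi for g, xi in zip(gw, x)]
        gb += err
    n = len(data)
    w = [wi - lr * g / n for wi, g in zip(w, gw)]
    b -= lr * gb / n

# The probe "works" — high accuracy on the synthetic task...
acc = sum(
    ((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1))
    for x, y in data
) / len(data)
print(f"probe accuracy: {acc:.2f}")
# ...but all you get back is w: a single direction in activation space,
# with no explanation of WHY that direction separates the classes.
```

That last comment is the whole problem this post is about: the learned weight vector `w` classifies well, but by itself it tells you nothing about what the direction means.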