Working on the more applied side of Mechanistic Interpretability (MI) research, I wanted to share some evidence for why building on top of MI tools both improves model performance on existing tasks and enables new applications right now.
Over the past few months, I’ve been working with sparse autoencoders (SAEs) on applications of topic steering (fine-tuning) [#5 on @scasper’s recent post under New Predictions]. In short, they perform well on different steering targets depending on how representative the underlying SAE’s features are of the target topic (you can check out my code/methods on this topic from this summer: https://github.com/IBM/sae-steering ). This is pretty cool and as standard an improvement as historically dictates in progress towards better...
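For readers unfamiliar with how SAE-based topic steering typically works in practice: the usual recipe is to pick a learned SAE feature associated with the target topic and add a scaled copy of its decoder direction into the model’s residual stream during generation. Below is a minimal PyTorch sketch of that general pattern, not the actual API of the repo linked above; the layer index, `sae_decoder` tensor, feature index, and scale are all placeholder assumptions you would adapt to your own model and SAE.

```python
import torch

def make_sae_steering_hook(decoder_weight: torch.Tensor,
                           feature_idx: int,
                           alpha: float = 4.0):
    """Return a forward hook that adds a scaled SAE decoder direction
    (one learned feature) to a transformer layer's residual stream.

    decoder_weight: SAE decoder matrix, shape (n_features, d_model).
    feature_idx: index of the topic feature to steer toward.
    alpha: steering strength (sign flips it to steer away).
    """
    direction = decoder_weight[feature_idx]   # shape: (d_model,)
    direction = direction / direction.norm()  # unit-normalize the feature direction

    def hook(module, inputs, output):
        # Some HF layers return tuples; steer only the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook

# Hypothetical usage (names are placeholders, not the repo's API):
# handle = model.model.layers[12].register_forward_hook(
#     make_sae_steering_hook(sae_decoder, feature_idx=1337, alpha=6.0))
# ...run generation as usual...
# handle.remove()
```

The key design choice is that steering strength and feature choice are the only knobs, which is why results vary with how well the underlying SAE's features actually capture the target topic.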