Open Source Automated Interpretability for Sparse Autoencoder Features
by kh4dien, SrGonao, jacob_drori, and Nora Belrose
Background
Sparse autoencoders recover a diversity of interpretable, monosemantic features, but present an intractable problem of scale to human labelers. We investigate different techniques for generating and scoring text explanations of SAE features.
Key Findings
* Open source models generate and evaluate text explanations of SAE features...
Jul 30, 2024