*All authors contributed equally.
This is a short informal post about recent discoveries our team made when we:
- Clustered feature vectors in the decoder matrices of Sparse Autoencoders (SAEs) trained on the residual stream of each layer of GPT-2 Small and Gemma-2B.
- Visualized the clusters in 2D after reducing the feature-vector space with UMAP (a rough sketch of this pipeline follows the list).
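To make the setup concrete, here is a minimal sketch of that pipeline, not our exact code. It assumes you already have an SAE decoder weight matrix `W_dec` of shape (n_features, d_model) for one layer's residual stream; the random matrix, KMeans as the clustering method, and the specific UMAP settings are all placeholders for illustration.

```python
# Minimal sketch: cluster SAE decoder feature directions and project them to 2D.
# Assumes `W_dec` is the decoder matrix (n_features x d_model) of an SAE trained
# on one layer's residual stream; here it is a random placeholder.
import numpy as np
from sklearn.cluster import KMeans
import umap  # umap-learn

rng = np.random.default_rng(0)
W_dec = rng.standard_normal((4096, 768))               # placeholder decoder matrix
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-normalize feature directions

# Cluster the feature directions (KMeans and k=50 are illustrative choices).
labels = KMeans(n_clusters=50, n_init="auto", random_state=0).fit_predict(W_dec)

# Reduce to 2D for plotting; cosine distance suits unit-normalized directions.
embedding = umap.UMAP(n_components=2, metric="cosine", random_state=0).fit_transform(W_dec)
```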
We want to share our thought process, our choices when clustering, and some challenges we faced along the way.
Note: This is currently a work in progress, but we hope to open-source the clustered features and projections soon, in case others find them useful.
We would like to acknowledge many people who helped with this endeavor! Professors Shibu Yooseph and Ran Libeskind-Hadas, Alice...