This is an interim report that we are currently building on. We hope this update, together with our open-sourced SAEs, will be useful to related research happening in parallel. Produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort.

Executive Summary

- In a previous post, we showed that sparse autoencoders (SAEs) work on the attention layer outputs of a two layer transformer. We scale our attention SAEs to GPT-2 Small, and continue to find sparse interpretable features in every layer. This makes us optimistic about our ongoing efforts to scale further, especially since we didn’t have to do much iterating.
- We open source our SAEs. Load them from Hugging Face or this colab notebook.
  - The SAEs seem good, often recovering more than 80% of the loss relative to zero ablation, and are sparse, with fewer than 20 features firing on average. The majority of the live features are interpretable.
- We continue to find the same three feature families that we found in the two layer model: induction features, local context features, and high level context features. This suggests that some of our lessons from interpreting features in smaller models may generalize.
- We also find new, interesting feature families that we didn’t find in the two layer model, providing hints about fundamentally different capabilities in GPT-2 Small.
  - See our feature interface to browse the first 30 features for each layer.
  - New: use Neuronpedia to visualize these SAEs.

Introduction

In Sparse Autoencoders Work on Attention Layer Outputs we showed that we can apply SAEs to extract sparse interpretable features from the last attention layer of a two layer transformer. We have since applied the same technique to a 12-layer model, GPT-2 Small, and continue to find sparse, interpretable features in every layer. Our SAEs often recover more than 80% of the loss^[1], and are sparse, with fewer than 20 features firing on average. We perform shallow investigations of the first 30 features from each layer, and we find that the majority (often 80%+) of non-dead SAE features are interpretable. Browse the features in our interactive visualizations for each layer.
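For readers unfamiliar with the two headline metrics, here is a minimal sketch of how "loss recovered relative to zero ablation" and "features firing on average" (L0) are commonly computed. The formula is our assumption of the usual convention and is not taken from this post's code; the function names `loss_recovered` and `mean_l0` are hypothetical, and the numbers in the example are made up for illustration.

```python
from typing import Sequence


def loss_recovered(clean_loss: float, spliced_loss: float, zero_ablation_loss: float) -> float:
    """Fraction of the loss gap closed by splicing in the SAE reconstruction.

    Assumed convention: 1.0 means the spliced model matches the clean model,
    0.0 means it is no better than zero-ablating the attention output.
    """
    gap = zero_ablation_loss - clean_loss
    if gap == 0:
        raise ValueError("zero ablation loss equals clean loss; metric is undefined")
    return (zero_ablation_loss - spliced_loss) / gap


def mean_l0(feature_acts: Sequence[Sequence[float]]) -> float:
    """Average number of features with a positive activation per token.

    feature_acts holds one row per token and one column per SAE feature.
    """
    counts = [sum(1 for act in row if act > 0) for row in feature_acts]
    return sum(counts) / len(counts)


# Illustrative (made-up) losses: clean model, SAE spliced in, zero ablation.
example_recovered = loss_recovered(clean_loss=3.0, spliced_loss=3.25, zero_ablation_loss=5.0)

# Illustrative (made-up) activations for two tokens and three features.
example_l0 = mean_l0([[0.0, 1.5, 0.0], [0.2, 0.3, 0.0]])
```

Under this convention, a value above 0.8 corresponds to the "more than 80% of the loss" claim, and an average L0 below 20 corresponds to "fewer than 20 features firing on average".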

We open source our SAEs in hope that they will be useful to other researchers currently working on dictionary learning. We are particularly excited about using these SAEs to better understand attention circuits a...
