Sparse Autoencoders Work on Attention Layer Outputs
This post is the result of a two-week research sprint project during the training phase of Neel Nanda’s MATS stream.

Executive Summary

* We replicate Anthropic's MLP Sparse Autoencoder (SAE) paper on attention outputs and it works well: the SAEs learn sparse, interpretable features, which gives us insight into what attention layers learn. We study the second attention layer of a two-layer language model (with MLPs).
* Specifically, rather than training our SAE on attn_output, we train our SAE on “hook_z” concatenated over all attention heads (aka the mixed values, aka the attention outputs before a linear map - see notation here). This is valuable as we can see how much of each feature’s weights come from each head, which we believe is a promising direction for investigating attention head superposition, although we only briefly explore that in this work. (See the sketches after this list for what the concatenated hook_z input and the per-head weight decomposition look like in code.)
* We open source our SAE; you can use it via this Colab notebook.
* Shallow dives: We do a shallow investigation to interpret each of the first 50 features. We estimate that 82% of the non-dead features in our SAE are interpretable (24% of the SAE features are dead).
  * See this feature interface to browse the first 50 features.
* Deep dives: To verify our SAEs have learned something real, we zoom in on individual features for much more detailed investigations: the “‘board’ is next by induction” feature, the local context feature of “in questions starting with ‘Which’”, and the more global context feature of “in texts about pets”.
  * We go beyond the techniques from the Anthropic paper and investigate the circuits used to compute the features from earlier components, including analysing composition with an MLP0 SAE.
  * We also investigate how the features are used downstream, and whether it's via MLP1 or the direct connection to the logits.
* Automation: We automatically detect and quantify a large “{token} is next by induction” feature family, which represents ~5% of the living features in the SAE. (A hedged sketch of one possible detection heuristic is included after the code examples below.)
* Thoug
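To make the setup concrete, here is a minimal sketch (in PyTorch with TransformerLens) of the input our SAE sees: hook_z from the second attention layer, flattened over all heads, passed through a toy SAE. The model name `gelu-2l`, the expansion factor, and the `AttnSAE` class are illustrative assumptions, not our released code; the open-sourced SAE and its exact architecture live in the Colab notebook linked above.

```python
# Minimal sketch: encode layer-1 attention "hook_z" activations with a toy SAE.
# The model name, dictionary size, and AttnSAE class are illustrative assumptions,
# not the released implementation (see the linked Colab for that).
import torch
import torch.nn as nn
import einops
from transformer_lens import HookedTransformer

# Keep everything on CPU so the toy SAE and the cached activations share a device.
model = HookedTransformer.from_pretrained("gelu-2l", device="cpu")  # 2-layer model with MLPs (assumed name)
tokens = model.to_tokens("The board met on Tuesday. The board")
_, cache = model.run_with_cache(tokens)

# hook_z has shape [batch, pos, n_heads, d_head]; the SAE sees all heads concatenated.
z = cache["blocks.1.attn.hook_z"]
z_cat = einops.rearrange(z, "batch pos n_heads d_head -> batch pos (n_heads d_head)")

class AttnSAE(nn.Module):
    """Toy sparse autoencoder over the concatenated hook_z (illustrative only)."""
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_hidden) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_in) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU encoder gives sparse, non-negative feature activations.
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, acts: torch.Tensor) -> torch.Tensor:
        return acts @ self.W_dec + self.b_dec

d_in = z_cat.shape[-1]                        # n_heads * d_head
sae = AttnSAE(d_in=d_in, d_hidden=8 * d_in)   # expansion factor is an assumption
feature_acts = sae.encode(z_cat)              # [batch, pos, d_hidden]; mostly zeros once trained
print(feature_acts.shape)
```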
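Because the SAE input is a concatenation over heads, each feature's weights split naturally into per-head blocks, which is what lets us ask how much of a feature comes from each head. Here is a sketch of that decomposition, continuing the toy `sae` above; the attribution metric (fraction of decoder norm per head) is an illustrative choice rather than the exact metric used for our results.

```python
# Sketch: how much of one feature's decoder weights comes from each head.
# Continues the toy `model` and `sae` above; the norm-fraction metric is illustrative.
feature_id = 0
n_heads, d_head = model.cfg.n_heads, model.cfg.d_head

# Decoder row for this feature, split back into per-head chunks of size d_head.
w_dec = sae.W_dec[feature_id]                 # [n_heads * d_head]
w_per_head = w_dec.reshape(n_heads, d_head)   # [n_heads, d_head]

# Fraction of the feature's decoder norm attributable to each head.
head_norms = w_per_head.norm(dim=-1)
head_fracs = head_norms / head_norms.sum()
for h, frac in enumerate(head_fracs.tolist()):
    print(f"head 1.{h}: {frac:.1%} of feature {feature_id}'s decoder norm")
```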
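For the induction feature family, one simple heuristic (a sketch of one possible approach, not necessarily the exact method behind our automation results) is to run the model on a repeated random-token sequence and flag features that activate much more strongly on the second repeat, where induction is possible, than on the first.

```python
# Sketch (continuing the toy model/sae above): flag candidate "{token} is next by
# induction" features by comparing feature activations on the first vs. second
# occurrence of a repeated random-token sequence. The threshold is arbitrary and
# the heuristic is illustrative.
seq_len = 50
rand_tokens = torch.randint(1000, 10000, (1, seq_len))
rep_tokens = torch.cat([rand_tokens, rand_tokens], dim=1)    # [1, 2 * seq_len]

_, rep_cache = model.run_with_cache(rep_tokens)
z_rep = einops.rearrange(
    rep_cache["blocks.1.attn.hook_z"],
    "batch pos n_heads d_head -> batch pos (n_heads d_head)",
)
acts = sae.encode(z_rep)[0]                                  # [2 * seq_len, d_hidden]

first_half = acts[:seq_len].mean(dim=0)     # first occurrence: no induction possible
second_half = acts[seq_len:].mean(dim=0)    # second occurrence: induction applies
induction_score = second_half - first_half
candidates = torch.nonzero(induction_score > 0.5).flatten()  # arbitrary threshold
print(f"{len(candidates)} candidate induction-family features")
```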