Excursions into Sparse Autoencoders: What is monosemanticity?
The following work was done between January and March 2024 as part of my PhD rotation with Prof. Surya Ganguli and Prof. Noah Goodman.

One aspect of sparse autoencoders that has put them at the center of attention in mechanistic interpretability is the notion of monosemanticity. In this post, we will explore monosemanticity in open-source sparse autoencoders (by Joseph Bloom) trained on the residual stream layers of GPT-2 small. We will take a look at the indirect object identification (IOI) task and see what we can extract by projecting different layer activations into the SAEs' sparse, high-dimensional latent space (from now on we will just refer to this as the latent code); a minimal sketch of this projection appears at the end of this section. We then show how much control interventions within the latent code give us over the model's outputs, and discuss future work in this area.

Background

In Toy Models of Superposition, Elhage et al. propose a framework for thinking about the layer activations of transformer-based language models. The idea can be condensed as follows: the state space of language is extremely high-dimensional, higher than any model to date can encode in a one-to-one fashion. As a result, the model compresses the relevant aspects of the language state space into its constrained, say n-dimensional, activation space. A consequence of this is that the states in the language space that humans have learned to interpret (e.g. words, phrases, concepts) end up entangled on this compressed manifold of transformer activations. This makes it hard for us to look into the model and understand what is going on: what kinds of structures the model learned, how it used concept A and concept B to get to concept C, and so on. The proposal outlined in Toy Models of Superposition suggests that one way to bypass this bottleneck is to assume that our transformer can be thought of as emulating a larger model, one which operates on the human-interpretable language manifold. To get into th
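To make the projection into the latent code concrete, here is a minimal sketch of the SAE encode/decode maps under the standard ReLU parameterization (with the decoder bias subtracted before encoding, as in Anthropic's Towards Monosemanticity). The dimensions, initialization, and parameter names are illustrative assumptions for the sketch, not the exact configuration of the released GPT-2 small SAEs.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal ReLU sparse autoencoder over residual-stream activations.

    Projects a d_model-dimensional activation x into a much wider, sparse
    latent code f (the "latent code" above), then reconstructs x from f.
    """

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_latent) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_latent))
        self.W_dec = nn.Parameter(torch.randn(d_latent, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Project into the latent code; the ReLU keeps the code sparse
        # and non-negative.
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # Map the latent code back into the residual stream.
        return f @ self.W_dec + self.b_dec


# Illustrative dimensions: GPT-2 small's residual stream is 768-wide;
# the 32x latent expansion factor is an assumption for this sketch.
sae = SparseAutoencoder(d_model=768, d_latent=768 * 32)

x = torch.randn(768)      # stand-in for one residual-stream activation
f = sae.encode(x)         # the sparse latent code
x_hat = sae.decode(f)     # reconstruction to feed back into the model
```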
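The interventions mentioned above follow the same pattern: encode the residual stream at some layer, edit one or more entries of the latent code, decode, and substitute the reconstruction back into the forward pass. Below is a sketch of one such intervention using a TransformerLens forward hook; the layer, hook point, and feature index are hypothetical placeholders, and in practice trained SAE weights would be loaded (e.g. via the sae_lens library) rather than the random ones above.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

FEATURE_IDX = 1234                      # hypothetical latent feature to ablate
HOOK_NAME = "blocks.8.hook_resid_pre"   # example residual-stream hook point


def ablate_feature(resid, hook):
    # Encode the residual stream into the latent code, zero out one
    # feature, and splice the decoded result back into the forward pass.
    f = sae.encode(resid)
    f[..., FEATURE_IDX] = 0.0
    return sae.decode(f)


# A classic IOI prompt; with a trained SAE loaded into `sae`, comparing
# these logits against an un-hooked run shows the effect of the edit.
logits = model.run_with_hooks(
    "When Mary and John went to the store, John gave a drink to",
    fwd_hooks=[(HOOK_NAME, ablate_feature)],
)
```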