Sparse Autoencoders (SAEs)

Edited by Joseph Bloom, last updated 6th Apr 2024

Sparse Autoencoders (SAEs) are an unsupervised technique for decomposing the activations of a neural network into a sparse sum of interpretable components, often referred to as features. SAEs may be useful for interpretability and related alignment agendas.
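As a minimal sketch of the idea in code, assuming the standard architecture from the posts linked below (one hidden layer, a ReLU, and an L1 sparsity penalty); all dimensions, names, and the penalty coefficient here are illustrative, not taken from any particular codebase:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: encode activations into a wider, sparsely active
    feature space, then decode back to reconstruct the input."""

    def __init__(self, d_model: int = 768, d_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # mostly-zero feature activations
        recon = self.decoder(features)             # reconstruction of the input
        return recon, features

def sae_loss(acts, recon, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse features;
    # l1_coeff is an illustrative hyperparameter.
    mse = torch.mean((recon - acts) ** 2)
    l1 = l1_coeff * features.abs().sum(dim=-1).mean()
    return mse + l1

# Illustrative usage on a random batch standing in for model activations:
acts = torch.randn(32, 768)
sae = SparseAutoencoder()
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
```

Because each reconstruction is a weighted sum of the decoder's columns, with weights given by the mostly-zero feature activations, a trained SAE expresses each activation vector as a sparse sum of candidate interpretable directions.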

For more information on SAEs see:

  • Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
  • Sparse Autoencoders Find Highly Interpretable Directions in Language Models
Posts tagged Sparse Autoencoders (SAEs)
  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (Zac Hatfield-Dodds, 2y)
  • [Interim research report] Taking features out of superposition with sparse autoencoders (Lee Sharkey, Dan Braun, beren, 3y)
  • Interpretability with Sparse Autoencoders (Colab exercises) (CallumMcDougall, 2y)
  • Sparse Autoencoders Find Highly Interpretable Directions in Language Models (Logan Riggs, Hoagy, Aidan Ewart, Robert_AIZI, 2y)
  • Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small (Joseph Bloom, 2y)
  • Sparse Autoencoders Work on Attention Layer Outputs (Connor Kissane, robertzk, Arthur Conmy, Neel Nanda, 2y)
  • Attention SAEs Scale to GPT-2 Small (Connor Kissane, robertzk, Arthur Conmy, Neel Nanda, 2y)
  • [Summary] Progress Update #1 from the GDM Mech Interp Team (Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma, 1y)
  • Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight (Sam Marks, 1y)
  • We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To (robertzk, Connor Kissane, Arthur Conmy, Neel Nanda, 2y)
  • [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders (chanind, TomasD, hrdkbhatnagar, Joseph Bloom, 1y)
  • Stitching SAEs of different sizes (Bart Bussmann, Patrick Leask, Joseph Bloom, Curt Tigges, Neel Nanda, 1y)
  • Sparsify: A mechanistic interpretability research agenda (Lee Sharkey, 1y)
  • Understanding SAE Features with the Logit Lens (Joseph Bloom, Johnny Lin, 2y)
  • Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models? (Taras Kutsyk, Tommaso Mencattini, Ciprian Florea, 1y)
(15 of 161 tagged posts shown.)