Brief Summary
Despite showing promise for studying the internals of neural networks, Sparse Autoencoders (SAEs) do some puzzling things, like feature splitting, feature absorption, or encoding dense features. Working out why they show these behaviours may help us extract more insight from SAEs, and provide principles for designing their successors.
In this work I analysed dictionary learning (which SAEs approximate) to examine when and why these effects occur (a similar motivation to multiple previous efforts). I present some general-purpose theoretical approaches that I found useful in understanding these phenomena. Further, having identified a failure mode of SAEs, these tools could be applied to other optimisation problems to see if they behave better, leading to better concept extraction techniques (though this paper doesn't pursue this).
Briefly, the technical contribution is the following. I study the dictionary learning optimisation problem and, following excellent earlier work, reformulate it in various ways, including showing that the problem is convex in the wide-dictionary limit. One thing we definitely know about SAE representations is that they are local optima of the SAE optimisation problem. For a solution to be a local optimum, there must be no perturbation away from it that decreases the loss to first order. As in earlier work, I use this to derive first-order optimality conditions, which place interpretable constraints on the ways features and residuals are allowed to relate to one another in an optimal solution. A solution that breaks these constraints cannot be a local optimum. For example, this prohibits the existence of hierarchically related features. Finally, towards the end of the paper I consider the wide-dictionary limit and show it can explain some findings about, for example, dense features.
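To make the first-order argument concrete, here is a minimal sketch on a toy problem of my own construction, not the paper's setup. It assumes a standard non-negative, L1-penalised dictionary learning loss with a single unit-norm atom in 2D; the names `loss` and `first_order_change`, the sparsity weight `LAM`, and the two datapoints are all illustrative assumptions. It checks numerically that the loss has no first-order change at the symmetric solution, while a nearby non-optimal atom does.

```python
import math

LAM = 0.1  # sparsity weight (assumed value, for illustration only)


def loss(theta, data, lam=LAM):
    """Loss of a one-atom dictionary d = (cos theta, sin theta).

    Per datapoint: ||x - z d||^2 + lam * z with z >= 0, where the code z
    is set to its optimal value z = max(0, d.x - lam / 2).
    """
    d = (math.cos(theta), math.sin(theta))
    total = 0.0
    for x in data:
        proj = d[0] * x[0] + d[1] * x[1]
        z = max(0.0, proj - lam / 2)
        rx, ry = x[0] - z * d[0], x[1] - z * d[1]
        total += rx * rx + ry * ry + lam * z
    return total


def first_order_change(theta, data, eps=1e-5):
    """Central finite-difference estimate of d(loss)/d(theta)."""
    return (loss(theta + eps, data) - loss(theta - eps, data)) / (2 * eps)


# Two unit-norm datapoints placed symmetrically at +/- 0.3 radians.
data = [(math.cos(0.3), math.sin(0.3)), (math.cos(-0.3), math.sin(-0.3))]

# At theta = 0 the atom sits between the two points: no first-order change.
slope_at_optimum = first_order_change(0.0, data)
# At theta = 0.2 a small rotation changes the loss at first order.
slope_off_optimum = first_order_change(0.2, data)

print(slope_at_optimum, slope_off_optimum)
```

The conditions discussed in this post come from the same kind of check, but applied analytically to structured perturbations of a full dictionary rather than to a single angle.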
I welcome any comments or criticism from this community, as I am not yet a fluent mechanistic interpretability speaker!
Illustrative Example
TL;DR: I give an example of the kind of necessary optimality condition you can derive and how it 'explains' feature splitting and absorption.
I take a representation and perturb it slightly by adding a small change. I consider a particular class of perturbations (those within the span of the features) and derive a necessary feature-feature relationship. To express the simplest version of this condition, let's consider just two features, each with a unit-norm dictionary/decoder vector and an encoding of every datapoint; these encodings are the responses of SAE features 1 and 2 to that datapoint.
To state the condition first we remove the mean:
Then we divide by the minimum value:
Then the condition is:
And we get another condition by swapping the 1 and 2 on the right hand side.
This is telling us that the way the decoder vectors are arranged is constrained by the behaviour of one feature while the other is inactive. Below is a figure from the paper that gives three illustrations of this construct:
The scattered blue points are the modified feature responses in three different datasets. The red regions are the relevant convex hulls: how one feature varies when the other is inactive. For a pair of features to be stable, the left-hand side of the condition must lie within both of the illustrated convex hulls. As can be seen, data that vary significantly, as in panels A and C, pass this check, while data with missing chunks, as in panel B, fail, so the corresponding feature pair cannot be a local optimum.
This can be used to explain why you can't get hierarchically related dictionary features. We operationalise a hierarchy as a set of low-level features that are active only when the higher-level ones are. For example, since all labradors are dogs, 'labrador' is a low-level feature that is active only with the higher-level feature 'dog'. This structural dependence manifests as an empty convex hull (see below), meaning this feature combination is unstable and can never be a minimum. This explains why dictionary learning can never learn such features (a more formal generalisation of existing ideas).
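The following sketch is a rough proxy for this construction, not the paper's exact condition. It follows the wording above (remove the mean, divide by the minimum), where I assume "the minimum" means the minimum of the mean-removed responses. The function names and the toy responses are made up for illustration. It computes the interval spanned by one feature's modified responses over the datapoints where the other feature is inactive.

```python
def modified_responses(z):
    """Remove the mean, then divide by the minimum of the result.

    Assumption: 'the minimum value' is the minimum of the mean-removed
    responses; the paper's exact normalisation may differ.
    """
    mean = sum(z) / len(z)
    centred = [v - mean for v in z]
    lowest = min(centred)
    if lowest == 0:
        raise ValueError("feature is constant; normalisation undefined")
    return [c / lowest for c in centred]


def hull_when_other_inactive(z_self, z_other):
    """1D convex hull (an interval) of one feature's modified responses,
    restricted to datapoints where the other feature is exactly zero."""
    mod = modified_responses(z_self)
    kept = [m for m, o in zip(mod, z_other) if o == 0]
    if not kept:
        return None
    return (min(kept), max(kept))


def width(interval):
    """Length of the interval; 0.0 if there is no interval at all."""
    return 0.0 if interval is None else interval[1] - interval[0]


# Toy responses on six datapoints (made-up numbers).
dog = [0.0, 0.0, 1.0, 2.0, 1.0, 2.0]
labrador = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0]  # active only when dog is active

# Hierarchy: whenever 'dog' is off, 'labrador' is off too, so no variation is left.
hierarchy_hull = hull_when_other_inactive(labrador, dog)

# Recode the same data with mutually exclusive features:
# 'other_dog' fires on dogs that are not labradors.
other_dog = [0.0, 0.0, 1.0, 2.0, 0.0, 0.0]
exclusive_hull = hull_when_other_inactive(labrador, other_dog)

print(hierarchy_hull, width(hierarchy_hull))  # (1.0, 1.0) 0.0
print(exclusive_hull, width(exclusive_hull))  # (-3.0, 1.0) 4.0
```

In this proxy the hierarchical pair leaves no variation in 'labrador' while 'dog' is inactive, whereas the mutually exclusive recoding leaves plenty. The text above describes the corresponding hull in the actual condition as empty; the sketch only shows where the lack of variation comes from.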
Summary
In the paper I explore these ideas further:
I show that mutually exclusive features are always stable, and view feature splitting and feature absorption as two manifestations of turning a hierarchy into a set of mutually exclusive features.
I derive similar necessary feature-residual relationships. From these you can, for example, predict that if an SAE has a feature for dog, it shouldn't leave the labrador feature in the residuals; it is forced to mop it up.
I study the case of encoding 1D data and use it as a toy model to understand some phenomena related to dense features.
Finally, I study the wide limit and, similar to previous work, find that dictionary learning will keep feature splitting until each 'ray' of data has a dictionary atom.
I raise two big shortcomings of the work here:
There are very limited experiments on real SAEs. The contribution is largely theoretical and tested on toy datasets.
I study dictionary learning, not SAEs themselves. This was motivated first by ease of analysis, and second by the fact that there is a canonical dictionary learning problem, while SAEs seem to come in many forms. It would be interesting to use these tools to arbitrate between different architectural choices in future.
In sum, I hope these theoretical tools can be useful in thinking about why SAEs do what they do, letting us learn more from them, and suggesting avenues for designing their successors in principled ways.
The paper can be found here.