Faithful vs Interpretable Sparse Autoencoder Evals

Louka Ewington-Pitsos

Summary

Sparse Autoencoders (SAEs) have become a popular tool in Mechanistic Interpretability, but at the time of writing there is no standard, automated way to accurately assess the quality of a SAE. In practice, different studies apply and interpret many different metrics to assess SAE quality.

Conceptually however what makes a good SAE is relatively uncontroversial: its latent features must reflect the structure of the model it was trained on whilst also reflecting the structure of human linguistic understanding^[1]^[2]^[3]. These twin objectives are usually referred to as "Faithfulness" and "Interpretability"^[1]^[4].

In this article we will go through 8 common metrics used to evaluate the quality of a SAE and try to demonstrate that, though this is sometimes not stated explicitly, each metric can be thought of as attempting to measure either Faithfulness or Interpretability.

This article is not an attempt to prove anything, but rather suggest a mental model which might be useful.

Faithfulness vs Interpretability

At this stage there is no widespread practical use for SAEs. When one arises a metric pertaining to this use will probably become the gold standard for what makes a "good" SAE.

Until then the qualities sought in an SAE are stated well by Huang et al.^[2]:

A central goal of interpretability is to localize an abstract concept to a component of a deep learning model that is used during inference.

In other words a good SAE will be one which identifies "abstract concepts" in the linguistic sense and maps these to structures inside the host LM^[1].

Importantly these two qualities are independent of one another. In fact to anyone who has trained a SAE the two will probably seem frustratingly opposed.

Interpretability can be achieved more easily by sacrificing the close relation held between the Autoencoder and the host model, for example by adding layers to the encoder and decoder pipelines.

Similarly, Faithfulness can be achieved without Interpretability. The most extreme example would be the neurons of the LM themselves, which are notoriously difficult to interpret^[1]. A less extreme example would be an Autoencoder trained with no sparsity loss term, or a very weak one.

The goal of much of the current SAE literature is to train a SAE which achieves both Faithfulness and Interpretability at the same time.

Explainability vs Interpretability

Importantly in the sense we are using it here "Interpretability" has a slightly broader meaning than its literal definition. We are expanding it to refer to a general correspondence with human understanding of language and thought.

The term Explainability as it is used in the literature has a narrower definition. It refers to any phenomena of a SAE which can be understood by a human^[3]^[1]. Under this definition the ideal SAE must be more than merely perfectly Explainable. Features which fire with perfect accuracy on single words are highly Explainable but if these kinds of features make up the entire SAE it probably isn't going to be useful for anything.

Instead we want a SAE whose features are also interesting to humans and generally allow them to communicate with (e.g. steer) and understand LMs in ways that they find useful. This more stringent requirement is what we mean when we use the word Interpretability in this article, though this is by no means a recommendation that others follow suit.

Popular Metrics

What follows is an enumeration of 8 metrics used in the SAE literature and their advantages and disadvantages as measures of Faithfulness and Interpretability. To give a quick overview we rank them below (very roughly) by how difficult they are to use in practice.

Formalization

Let $M$ be a Language Model.

Let $s_{i}$ be the $i$ 'th sample in an evaluation dataset $E$ containing $n$ total samples.

Let $A_{i}$ be the activations which flow out of a chosen transformer block $l_{j}$ of the residual stream of $M$ when the sample $s_{i}$ is fed into $M$ .

Let $S$ be a Sparse Autoencoder trained to take in $A_{i}$, encode it into a latent representation $L_{i}$ of dimension $d$, and then decode it back as close to $A_{i}$ as possible, using the same methods as Bricken et al.^[1]. Let $\hat{A}_{i}$ be the actual output of $S$ after decoding. Let $S_{e}$ be just the encoder portion of $S$ such that $S_{e}(A_{i}) = L_{i}$. Let $S_{d}$ be the decoder portion such that $S_{d}(L_{i}) = \hat{A}_{i}$.
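To make the notation concrete, here is a minimal sketch of the encoder and decoder, loosely following the one-hidden-layer ReLU form described in Bricken et al.^[1]. The class name `TinySAE`, the dimensions and the random untrained weights are all illustrative assumptions; the sketch uses plain Python lists to stay dependency-free.

```python
import random

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

class TinySAE:
    """Toy sparse autoencoder with random, untrained weights (illustration only)."""

    def __init__(self, act_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.W_e = [[rng.gauss(0, 0.1) for _ in range(act_dim)]
                    for _ in range(latent_dim)]
        self.b_e = [0.0] * latent_dim
        self.W_d = [[rng.gauss(0, 0.1) for _ in range(latent_dim)]
                    for _ in range(act_dim)]
        self.b_d = [0.0] * act_dim

    def encode(self, A):
        """S_e: activations A_i -> non-negative latent L_i."""
        centered = [a - b for a, b in zip(A, self.b_d)]
        pre = matvec(self.W_e, centered)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_e)]  # ReLU

    def decode(self, L):
        """S_d: latent L_i -> reconstruction A_hat_i."""
        return [x + b for x, b in zip(matvec(self.W_d, L), self.b_d)]

sae = TinySAE(act_dim=4, latent_dim=8)   # overcomplete: d > activation size
A_i = [0.5, -1.0, 0.25, 2.0]             # made-up activations
L_i = sae.encode(A_i)
A_hat_i = sae.decode(L_i)
print(len(L_i), len(A_hat_i))
```

Because the weights are untrained, the reconstruction here is meaningless; the point is only the shape of the pipeline that the metrics below are computed on.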

Measures of Interpretability

L1

Description

The L1 norm of a vector is the sum of the absolute values of its elements. In the context of SAEs the vector in question is the latent inside the Autoencoder. Geometrically, L1 can be thought of as the "taxicab" (Manhattan) distance between the vector and the origin.

$L1(L_{i}) = \| L_{i} \|_{1} = \sum_{k = 1}^{d} | L_{i,k} |$

where $L_{i,k}$ is the $k$'th element of $L_{i}$.

Usually in the literature L1 refers to the average L1 over $E$ .
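As a small sketch of the definition above, the following computes L1 for a single latent and its average over a batch. The function names and the latent values are made up for illustration.

```python
def l1_norm(latent):
    """Sum of absolute values of one latent vector L_i."""
    return sum(abs(x) for x in latent)

def mean_l1(latents):
    """Average L1 over a collection of latents (e.g. one per sample in E)."""
    return sum(l1_norm(v) for v in latents) / len(latents)

# Two hypothetical latent vectors with d = 4.
latents = [[0.0, 2.0, 0.0, -1.0],
           [0.5, 0.0, 0.0, 0.5]]

print(l1_norm(latents[0]))  # 3.0
print(mean_l1(latents))     # 2.0
```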

Usage

L1 is a key component of most sparse auto-encoder training pipelines. It is almost always the only component of the training loss which encourages sparsity in $L$ .

Advantages

L1 is very cheap to calculate and easy to implement
Though primarily used in the training loss, L1 gives some indication of Interpretability since if the L1 norm is very high there is a high likelihood that $L$ is not sparse, and in general latent features only start making sense to humans when $L$ is sparse^[1].

Disadvantages

In theory L1 can be deceptive as a measure of Interpretability, since a highly sparse vector with very large values wherever it is non-zero can still have a very high L1. Similarly, L1 can be low for a vector with many non-zero values as long as all those values are small.
As an unreliable measure of sparsity L1 can at best be considered a proxy for a proxy of Interpretability.

L0

Description

L0 is the count of non-zero elements in the latent vector $L$ . Mathematically:

$L0(L_{i}) = \| L_{i} \|_{0} = \sum_{k = 1}^{d} 1_{\{L_{i,k} \neq 0\}}$
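A short sketch of L0, contrasted with L1 on two made-up latents: one sparse with a single large value, one dense with small values. The `eps` threshold is an assumption added for the case where "zero" means "below some tolerance"; with the default of 0 it matches the formula above.

```python
def l0_norm(latent, eps=0.0):
    """Count of elements whose magnitude exceeds eps (eps=0 counts exact non-zeros)."""
    return sum(1 for x in latent if abs(x) > eps)

def l1_norm(latent):
    """Sum of absolute values, for comparison."""
    return sum(abs(x) for x in latent)

sparse_large = [0.0, 0.0, 9.0, 0.0]      # one active feature, large value
dense_small = [0.25, 0.25, 0.25, 0.25]   # every feature active, small values

print(l0_norm(sparse_large), l1_norm(sparse_large))  # 1 9.0
print(l0_norm(dense_small), l1_norm(dense_small))    # 4 1.0
```

The sparse latent has the higher L1 but the lower L0, which is the sense in which L1 on its own can mislead as an indicator of sparsity.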

Usage

In the literature L0, usually averaged across $E$ , is used as a direct measure of sparsity of the Autoencoder $S$ ^[5] and hence a very rough measure of interpretability or potential interpretability.

Advantages

L0 is similarly very efficient to calculate and easy to implement in any language.
L0 provides a direct measure of sparsity by counting the number of non-zero elements in the latent representation, making it a more accurate indicator of Interpretability than L1.

Disadvantages

L0 is non-differentiable, meaning it cannot be used for training, though modified, differentiable versions of L0 do exist^[6].
Sparsity is still a very rough measure of Interpretability and so cannot be relied upon on its own.

Probe Loss^[7]^[3]

Description

To calculate probe loss you first curate a binary classification dataset to reflect some concept such that it contains positive samples which exemplify that concept and also negative samples which do not. For instance the positive examples might contain positive sentiment and the negative ones neutral or negative sentiment.

Then when you want to assess your SAE you train a simple classifier, such as a logistic regression model to take the $L_{i}$ derived from a classification dataset sample as input and perform the classification. The Probe loss is simply the classification loss for that task.

More formally let the classification dataset be $C D$ containing $w$ samples.

Let each sample from that dataset be $(x_{i}, y_{i}) \in C D$ where $x_{i}$ is the $i$ th input sample and $y_{i}$ is its ground truth label.

Let $C(L_{i}) = \hat{y}_{i}$ be a simple classifier which takes in the Autoencoder latent for the sample $x_{i}$ and returns a classification prediction. Let $M_{e}$ be a function which reads a sample $x_{i}$ into $M$ and returns the activations $A_{i}$ of block $l_{j}$, so that $L_{i} = S_{e}(M_{e}(x_{i}))$.

Let $lf(y_{i}, \hat{y}_{i})$ be some loss function such as Binary Cross Entropy: $-[y_{i} \log(\hat{y}_{i}) + (1 - y_{i}) \log(1 - \hat{y}_{i})]$

Let $PL = \frac{1}{w} \sum_{i = 1}^{w} lf(y_{i}, C(S_{e}(M_{e}(x_{i}))))$ be the Probe Loss for that particular dataset.
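The definitions above can be sketched as a tiny logistic-regression probe trained by gradient descent. Everything here is an illustrative assumption: the latents are hand-written rather than produced by a real SAE, the hyperparameters are arbitrary, and for brevity the loss is reported on the training data, whereas a real evaluation would use a held-out split.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bce(y, y_hat, eps=1e-12):
    """Binary cross entropy for one example, clipped to avoid log(0)."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def predict(w, b, latent):
    """Linear classifier C(L_i) returning a probability."""
    return sigmoid(sum(wi * li for wi, li in zip(w, latent)) + b)

def train_probe(latents, labels, lr=0.5, steps=500):
    """Fit weights w and bias b with full-batch gradient descent."""
    d, n = len(latents[0]), len(latents)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for latent, y in zip(latents, labels):
            err = predict(w, b, latent) - y
            for k in range(d):
                grad_w[k] += err * latent[k] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def probe_loss(latents, labels, w, b):
    """Mean BCE of the probe over a labelled set of latents."""
    return sum(bce(y, predict(w, b, l)) for l, y in zip(latents, labels)) / len(labels)

# Hypothetical latents: feature 0 fires only on the positive class.
latents = [[1.2, 0.0, 0.3], [0.9, 0.1, 0.0], [0.0, 0.0, 0.4], [0.0, 0.2, 0.1]]
labels = [1, 1, 0, 0]

w, b = train_probe(latents, labels)
pl = probe_loss(latents, labels, w, b)
print(round(pl, 3))
```

Keeping the probe linear matters for the reason given under Advantages: a more expressive classifier could solve the task without the latent having isolated the concept.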

Purpose

Probe Loss is named because of its relation to "Linear Probes". It has been used as a measure of Interpretability in Gao et. al.^[3], where multiple classification datasets were created, most of them encoding high level concepts. However the same technique was used by Robert Huben as a measure of Faithfulness to determine whether a SAE could recover simple board state features from an OthelloGPT model^[7]. In a very straightforward way Probe Loss gives us an idea of whether $L$ can be used to perform a particular kind of classification.

Advantages

Probe Loss is very flexible and unlike the norms can be used to pinpoint the performance of the SAE on particular elements of human language, or elements of the host model $M$ .
Probe Loss is also computationally cheap since the classifier must be small (ideally linear). A large classifier could learn the classification task on its own, which would invalidate it as a measure of the capabilities of $L$ .

Disadvantages

For each capability you want to measure you must curate a new dataset which must be done with great care.
Probe loss is not straightforward to implement, at least compared to calculating norms.
A SAE may have good Faithfulness and Interpretability while failing to facilitate any particular classification capability^[7]. Therefore the loss on a single probe is likely not a good measure of overall SAE Interpretability or Faithfulness. Gao et. al.^[3] found that even when averaging 61 Probes their Probe Loss was quite noisy, as can be seen in the figure below.
[Figure taken from Gao et al.^[3]]

Activation Prediction Loss^[3]^[8]

Description

For each feature you wish to analyze:

Select a set of $q$ examples $e \in E$ which have high activations for that feature.
Generate an "Explanation" $E$ of these examples. This could be as simple as passing them to a LM in a prompt and having the LM generate its best plaintext explanation of the underlying feature. Gao et. al.^[3] took a different approach by generating explanations composed of what could be thought of as "phrase templates" which match all $e \in E$ .
Collect a new set of $q$ new evaluation examples. Let the full set of these evaluation examples be $R$ . These may be high activating examples for that feature or random examples (random samples will help avoid bias but you may need more of them to find any activations at all).
Create a prediction function which uses the explanation $E$ to predict activations across $R$, i.e. $p(E, r_{i}) = \hat{a}_{i}$ where $r_{i}$ is a sample from $R$ and $\hat{a}_{i}$ are the predicted activations across that sample. For example, one could simply add $E$ and $r_{i}$ to a prompt which is then passed to a LM which then generates $\hat{a}_{i}$ as a JSON sequence.
Collect all $a_{i}$ and $\hat{a}_{i}$ across $R$ and then use a measure of correlation between the real activations and the predicted activations, such as Spearman correlation, to generate an overall Activation Prediction Loss for $R$.

Now repeat this for each feature to get an overall Activation Prediction Loss.
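The scoring step can be sketched as below, for a single feature. The prediction function here is a deliberately crude stand-in (a substring match) for the LM-based $p(E, r_{i})$ described above, and the tokens and activations are invented. Spearman correlation is implemented by hand as the Pearson correlation of average ranks.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    return pearson(ranks(x), ranks(y))

def predict_activations(explanation, tokens):
    """Hypothetical p(E, r_i): predicts 1.0 wherever the explanation string matches."""
    return [1.0 if explanation in tok.lower() else 0.0 for tok in tokens]

# Made-up evaluation sample r_i with real activations a_i for one feature.
tokens = ["The", "cat", "sat", "on", "cats"]
real = [0.0, 2.1, 0.1, 0.0, 1.7]
predicted = predict_activations("cat", tokens)

score = spearman(real, predicted)
print(round(score, 3))
```

A higher correlation means the explanation predicts the feature better; to express it as a loss one could report, for example, one minus the correlation, then average over features.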

Purpose

The idea behind Activation Prediction Loss is that features which cannot be predicted are likely to be random or nonsensical. Often the prediction function $p (E, r_{i})$ is a LM and is thought of as a proxy for human understanding, with the idea being that features which can be predicted accurately by humans are likely to be Explainable.

Advantages

Activation Prediction Loss (APL) is a much more accurate measure of Interpretability than the norms, and unlike Probe Loss it scores the features which actually do exist in $L$ rather than features which may or may not.
The explanation $E$ can be useful in its own right, for instance for performing clustering or merely eyeballing the different kinds of features present in $L$

Disadvantages

APL can be very computationally expensive. If $p (E, r_{i})$ involves a LM then a whole prediction sequence must be generated $q$ times for every feature in $L$ , which is often millions of features.
APL is not straightforward to implement
APL can be prone to bias if great care is not put into the sampling of $R$ and the choice of loss function
Very basic features like "token contains the letter A" are prone to scoring highly on APL while falling short of being interesting in terms of human understanding. For the purposes of assessing Interpretability APL might be seen as a "necessary but not sufficient" indicator.

Human Analysis^[1]

Description

Create a human-readable "Interpretability" rubric, then for each feature calculate an Interpretability score by

Selecting a set of $q$ examples $e \in E$ which have high activations for that feature
Having a human assessor apply the rubric given all $q$ examples to create a score for that feature

For instance in Bricken et. al.^[1] the authors created a rubric which had the assessor read all the examples, form an interpretation for what the underlying feature was and then essentially rank how explainable and internally consistent the feature was.

Purpose

In Bricken et. al.^[1] the rubric was only aimed at assessing Explainability but rubrics could be designed to assess Interpretability more generally.

Advantages

Human Analysis is probably the gold standard when it comes to assessing the Interpretability of a SAE. Nothing is likely to give a better indication for whether a SAE corresponds to human understanding than having a human assess it.

Disadvantages

Human Analysis is prohibitively expensive in most cases.
Great care must be taken in designing the rubric and choosing samples to avoid bias or invalidation of the assessment.

Measures of Faithfulness

MSE Reconstruction

Description

In the context of SAEs, MSE Reconstruction (MSER) is the Mean Squared Error between the model activations $A_{i}$ for a particular sample $s_{i}$ and the Autoencoder reconstruction $\hat{A}_{i}$, averaged over the evaluation dataset:

$MSER = \frac{1}{n} \sum_{i = 1}^{n} (A_{i} - \hat{A}_{i})^{2}$
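A minimal sketch of this calculation follows. One assumption to flag: the squared error is also averaged over the activation dimension, which is a common convention but is not spelled out in the formula above. The numbers are made up.

```python
def mser(activations, reconstructions):
    """Mean squared error between A_i and A_hat_i, averaged over samples."""
    total = 0.0
    for A, A_hat in zip(activations, reconstructions):
        # Per-sample MSE, averaged over the activation dimension (assumption).
        total += sum((a - r) ** 2 for a, r in zip(A, A_hat)) / len(A)
    return total / len(activations)

acts = [[1.0, 2.0], [0.0, 0.0]]     # hypothetical A_i
recons = [[1.0, 1.0], [0.5, 0.5]]   # hypothetical A_hat_i

print(mser(acts, recons))  # 0.375
```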

Purpose

MSER is primarily used when training SAEs and it usually forms a major component of the loss function. The MSER loss term is designed to train the latent to be as close to a perfect compressed (or in the case of most SAEs, overcomplete and sparse) representation of $A$ as possible.

Advantages

MSER is very cheap to calculate and easy to implement.
MSER is differentiable.
MSER can serve as a rough proxy for how well $L$ reflects the internal structure of $M$, since if the MSER is very high the correspondence between $L$ and $M$ is likely to be weak.

Disadvantages

We can imagine cases where an Autoencoder achieves a very low MSER but fails to accurately represent $M$ in $L$, for instance where $\hat{A}$ is always very close to $A$ but the small differences turn out to be the most important component of $A$ from the perspective of $M$.^[9] Because of this MSER is at best a very rough proxy of Faithfulness.

Downstream Reconstruction^[9]

Description

Downstream Reconstruction (DR) is simply MSER taken between the activations at every block after $l_{j}$ (though in theory you could include $l_{j}$ as well), after we substitute the reconstructed $\hat{A}_{i}$ for the usual $A_{i}$ in the forward pass of $M$. It is calculated as follows:

Let $M$ have $g$ transformer blocks.

Let $A_{i h}$ denote the activation of the $h$ th block in $M$ when $M$ is given the sample $s_{i}$ as input.

Pass a sample $s_{i}$ into $M$ , recording all $A_{i h}$ such that $h > j$ .
Next we pass $s_{i}$ into $M$ a second time, pausing at $l_{j}$ .
We then collect the activations $A_{i}$ which come out of $l_{j}$ and pass them into the Autoencoder $S$ to create the reconstruction $\hat{A}_{i}$.
In a regular forward pass the next block $l_{j + 1}$ would take in $A_{i}$ and things would continue as normal. In this case we pass in $\hat{A}_{i}$ instead (the details of the form in which $\hat{A}_{i}$ is passed back in can differ from use case to use case).
Continue the forward pass all the way to the end, recording an $\hat{A}_{i h}$ for each block after $l_{j}$.
Finally, take the MSE between every recorded pair of activations; this is your Downstream MSE Reconstruction loss for $s_{i}$. Over the whole of $E$ we calculate it as $DR(E) = \frac{1}{n} \frac{1}{g - j} \sum_{i = 1}^{n} \sum_{h = j + 1}^{g} (A_{i h} - \hat{A}_{i h})^{2}$
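The steps above can be sketched on a toy "model" whose blocks are plain functions on vectors. Everything here is an assumption made for illustration: the blocks are simple residual-style maps rather than transformer blocks, the "SAE" is a lossy rounding function, and block indices are 0-based, so `j` in the code is the index of the hooked block in the list.

```python
import math

def make_block(scale):
    """Toy residual-style block: x + scale * tanh(x), applied elementwise."""
    return lambda x: [xi + scale * math.tanh(xi) for xi in x]

def run_from(blocks, x, start):
    """Run blocks[start:] on x and return the output of each block."""
    outputs = []
    for block in blocks[start:]:
        x = block(x)
        outputs.append(x)
    return outputs

def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def downstream_reconstruction(blocks, sae, sample, j):
    """DR for one sample: average MSE over all blocks after block j."""
    clean = run_from(blocks, sample, 0)           # normal forward pass
    A_hat = sae(clean[j])                         # reconstruct output of block j
    clean_after = clean[j + 1:]                   # A_ih for h > j
    patched_after = run_from(blocks, A_hat, j + 1)  # A_hat_ih for h > j
    return sum(mse(a, b) for a, b in zip(clean_after, patched_after)) / len(clean_after)

blocks = [make_block(0.5), make_block(0.3), make_block(0.2)]
sample = [0.33, -1.27]

identity_sae = lambda A: list(A)               # perfect reconstruction
lossy_sae = lambda A: [round(a, 1) for a in A]  # crude stand-in for a real SAE

dr_identity = downstream_reconstruction(blocks, identity_sae, sample, j=0)
dr_lossy = downstream_reconstruction(blocks, lossy_sae, sample, j=0)
print(dr_identity, dr_lossy > 0)
```

Averaging the per-sample value over all samples in $E$ gives the dataset-level quantity. With a real model the patched pass would typically be done with forward hooks rather than by re-running blocks by hand.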

To add some visuals, the blue arrows in the diagram below collectively represent DR for a single sample as it was implemented in Braun et. al.^[9]

Purpose

The purpose of DR is to provide a version of MSER which is sensitive to downstream changes within the network, not just the immediate geometric difference between $A_{i}$ and $\hat{A}_{i}$.

Advantages

DR is differentiable, indeed it was primarily used for training in Braun et. al.^[9]
DR is a better measure of Faithfulness than MSER as $S$ must learn to reconstruct $A_{i}$ in such a way that the residual stream of the whole downstream model does not deviate much from what it would be normally. This will naturally force more information about the downstream of $M$ into $L$

Disadvantages

DR is more computationally expensive than MSER as each forward pass during training must pass through an additional $2 (g - j)$ transformer blocks and collect $g - j$ gradients.

Downstream Loss^[9]^[3]

Description

Downstream Loss is very similar to Downstream Reconstruction. The main difference is that the loss is formed on the basis of the final token probabilities coming out of $M$ rather than intermediate activations. It is calculated as follows:

Perform a forward pass as before, passing $s_{i}$ into $M$. This time we discard the intermediate activations and instead record the token logits $Q_{i}$ which come out of $M$ (i.e. a score for each token in the vocabulary of $M$). These are then converted into token probabilities $P_{i}$ through a function like softmax.
Next we perform a "reconstructed" forward pass as we did when calculating DR, i.e. passing $\hat{A}_{i}$ into $l_{j + 1}$ instead of $A_{i}$ but otherwise keeping all else the same. This time we end up with modified token logits $\hat{Q}_{i}$ which we convert to probabilities $\hat{P}_{i}$.
Finally, we calculate the Kullback-Leibler (KL) divergence (which essentially measures "how much information is lost") between the regular token probabilities and the ones based on $\hat{A}_{i}$: $D_{KL}(P_{i} \| \hat{P}_{i})$.

Repeat steps 1-3 as many times as you like on different samples and take the average KL divergence to arrive at Downstream Loss.
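The final two steps can be sketched as follows, assuming the clean and reconstructed logits have already been collected from the two forward passes. The logit values here are invented and the vocabulary has only three tokens.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(q - m) for q in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, p_hat, eps=1e-12):
    """D_KL(P || P_hat) in nats; terms with p = 0 contribute nothing."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, p_hat) if pi > 0)

def downstream_loss(clean_logits, patched_logits):
    """Average KL divergence between clean and reconstructed token distributions."""
    kls = [kl_divergence(softmax(q), softmax(q_hat))
           for q, q_hat in zip(clean_logits, patched_logits)]
    return sum(kls) / len(kls)

# Hypothetical logits Q_i and Q_hat_i for two samples.
clean = [[2.0, 1.0, 0.1], [0.3, 0.2, 3.0]]
patched = [[1.5, 1.2, 0.3], [0.1, 0.4, 2.5]]

loss = downstream_loss(clean, patched)
print(round(loss, 4))
```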

Purpose

The most common purpose of Downstream Loss is to identify "functionally important features" ^[9] i.e. those elements in $L$ which have an impact on what tokens end up coming out of $M$ .

Advantages

Downstream Loss is likely an accurate indicator of Faithfulness since it shows how important the information lost between $A$ and $\hat{A}$ is to the actual outputs coming out of $M$. A very low Downstream Loss indicates that $L$ is probably very close to representing what is functionally important in $M$.
Downstream Loss is differentiable and can be used to train the SAE directly.

Disadvantages

Like DR, Downstream Loss is somewhat computationally expensive, especially when $M$ is large, since it requires a full forward pass through $M$ at each step. When used in training, gradients between $L$ and $D_{KL}(P_{i} \| \hat{P}_{i})$ must also be stored.
Downstream Loss is not trivial to implement.
Downstream Loss may be a less accurate measure of Faithfulness than Downstream Reconstruction, as we can imagine a case where $S$ learns to make $\hat{P}_{i}$ very similar to $P_{i}$ but uses very different circuits in $M$ than usual to do so. In such cases $L$ may fail to accurately reflect the structure of $M$ but still maintain a very low Downstream Loss.^[9]

What about Universality?

Universality is another quality we would like our latent $L$ to achieve. $L$ is considered universal when its features remain Interpretable across many different corpuses and to many different humans (for instance humans who speak different languages) and when its features correspond to the internal structures of many different LMs, not just $M$ ^[1].

In this way we may think of there being two kinds of universality, Universal Interpretability and Universal Faithfulness. For instance, we can easily imagine training a SAE on a very limited corpus of Urdu technical jargon. Such a SAE would learn features which would not be Universally Interpretable (they would not seem to make sense to many humans) but could well be Universally Faithful (they may appear across many multilingual LMs).

Universality is certainly a quality we might want to achieve when training a SAE but we mention it here mainly to show that thinking about it through the lens of the two main objectives of Faithfulness and Interpretability is a reasonable thing to do.

Conclusion

It may be useful to conceptualize a good Sparse Autoencoder as one which maintains a close correspondence with the internal structure of its host model (Faithfulness) and with the structure of human language understanding (Interpretability).

While not always explicitly stated, many metrics used in practice to assess the quality of Sparse Autoencoders can be thought of as measuring one of these two qualities. Keeping this underlying dichotomy in mind may prove useful when assessing and explaining the outcomes of Sparse Autoencoder evaluations, especially to newcomers.

[1] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html
[2] Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger. RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations. arXiv preprint, 2024. URL https://arxiv.org/abs/2402.17700
[3] Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2406.04093
[4] Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao. A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2407.02646
[5] Joseph Bloom. Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small. LessWrong, 2024. URL https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream
[6] Andrew Quaisley. Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT. LessWrong, 2024. URL https://www.lesswrong.com/posts/ignCBxbqWWPYCdCCx/research-report-alternative-sparsity-methods-for-sparse
[7] Robert_AIZI. Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT. LessWrong, 2024. URL https://www.lesswrong.com/posts/arEub4eHo2EJfCHQX/research-report-sparse-autoencoders-find-only-9-180-board-state-features-in-othellogpt
[8] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. OpenAI Blog, 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
[9] Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey. Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2405.12241
