Michael Pearce — LessWrong

Replying to Toy Models of Feature Absorption in SAEs

A hacky solution might be to look at the top activations using encoder directions AND decoder directions. We can think of the encoder as giving a "specific" meaning and the decoder a "broad" meaning, potentially overlapping other latents. Discrepancies between the two sets of top activations would indicate absorption.

Untied encoders give sparser activations by effectively removing activations that can be better attributed to other latents. So an encoder direction’s top activations can only be understood in the context of all the other latents.

Top activations using the decoder direction would be less sparse but give a fuller picture that is not dependent on what other latents are learned. The activations may be less monosemantic though, especially as you move towards weaker activations.
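
A minimal sketch of that comparison, assuming a standard ReLU SAE with weights available as NumPy arrays. The function names, array shapes, and the choice of `k` are illustrative assumptions, not taken from the post or from any particular SAE library.

```python
import numpy as np

def top_k_indices(acts, k):
    """Indices of the k largest activations, strongest first."""
    return np.argsort(-acts)[:k]

def encoder_decoder_overlap(x, w_enc, b_enc, w_dec, latent, k=20):
    """Fraction of top-k inputs shared by the encoder and decoder views of one latent.

    Assumed shapes (hypothetical, for illustration):
      x:     (n_samples, d_model) model activations
      w_enc: (d_model, n_latents) encoder weights
      b_enc: (n_latents,)         encoder bias
      w_dec: (n_latents, d_model) decoder weights
    """
    # "Specific" view: the latent's actual ReLU encoder activation.
    enc_acts = np.maximum(x @ w_enc[:, latent] + b_enc[latent], 0.0)
    # "Broad" view: projection of the input onto the unit-norm decoder direction.
    dec_dir = w_dec[latent] / np.linalg.norm(w_dec[latent])
    dec_acts = x @ dec_dir
    enc_top = set(top_k_indices(enc_acts, k).tolist())
    dec_top = set(top_k_indices(dec_acts, k).tolist())
    return len(enc_top & dec_top) / k
```

A low overlap for a latent would flag it as a candidate for absorption and worth inspecting by hand; the score alone does not prove absorption.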

Replying to Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Michael Pearce · 1y*

On the question of quantizing different feature activations differently: computing the description length from the entropy of a feature activation's probability distribution is flexible enough to distinguish different types of distributions. For example, a binary distribution would have an entropy of one bit or less, while distributions spread out over more values would have larger entropies.
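
As a rough illustration, the entropy of a binned activation distribution can be computed as below. This is a sketch under assumptions: the function name and the round-to-nearest binning are illustrative and are not the code used in the post.

```python
import numpy as np

def activation_entropy_bits(acts, bin_width):
    """Shannon entropy (in bits) of activations after binning to a fixed width."""
    # Assign each activation to the nearest multiple of bin_width.
    bins = np.round(np.asarray(acts, dtype=float) / bin_width).astype(np.int64)
    # Empirical probability of each occupied bin.
    _, counts = np.unique(bins, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

With this definition, a feature that is 0 half the time and 1 the other half has an entropy of exactly 1 bit, while a feature spread uniformly over four bins has 2 bits.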

In our methodology, the effective float precision matters because it sets the bin width of the histogram of a feature's activations, which is then used to compute the entropy. We used the same effective float precision for all features, found by rounding activations to different precisions until the reconstruction or cross-entropy loss changed by some amount.
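
The precision search might look something like the sketch below. The function name, the candidate `widths`, the `tolerance`, and the `loss_fn` callable are all hypothetical placeholders; the comment above does not specify how the search was run or what threshold was used. The early exit also assumes the loss change grows as the bins get coarser.

```python
import numpy as np

def coarsest_bin_width(acts, loss_fn, widths, tolerance):
    """Largest candidate bin width whose rounding changes the loss by at most `tolerance`.

    loss_fn is a hypothetical callable mapping (possibly rounded) feature
    activations to a scalar loss, e.g. reconstruction or cross-entropy loss.
    Returns None if even the finest candidate width exceeds the tolerance.
    """
    acts = np.asarray(acts, dtype=float)
    base_loss = loss_fn(acts)
    best = None
    for width in sorted(widths):
        # Round every activation to the nearest multiple of `width`.
        quantized = np.round(acts / width) * width
        if abs(loss_fn(quantized) - base_loss) <= tolerance:
            best = width
        else:
            # Assumption: coarser widths only make the loss change larger.
            break
    return best
```

The returned width would then serve as the shared bin width when building each feature's activation histogram.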

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Bart Bussmann

Bart Bussmann, Michael Pearce, Patrick Leask, Joseph Bloom, Lee Sharkey, Neel Nanda

Bart, Michael and Patrick are joint first authors. Research conducted as part of MATS 6.0 in Lee Sharkey and Neel Nanda’s streams. Thanks to Mckenna Fitzgerald and Robert Krzyzanowski for their feedback!

TL;DR:

Sparse Autoencoder (SAE) latents have been shown to typically be monosemantic (i.e. correspond to an interpretable property of the input). It is sometimes implicitly assumed that they are therefore atomic, i.e. simple, irreducible units that make up the model’s computation.
We provide evidence against this assumption by finding sparse, interpretable decompositions of SAE decoder directions into seemingly more atomic latents, e.g. Einstein -> science + famous + German + astronomy + energy + starts with E-
We do this by training meta-SAEs, an

... (read 5717 more words →)

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Kola Ayonrinde

Kola Ayonrinde, Michael Pearce, Lee Sharkey

This work was produced as part of the ML Alignment & Theory Scholars Program - Summer 24 Cohort, under mentorship from Lee Sharkey and Jan Kulveit.

Note: An updated paper version of this post can be found here.

Abstract

Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal representations of neural networks. However, naively optimising SAEs for reconstruction loss and sparsity results in a preference for SAEs which are extremely wide and sparse.

To resolve this issue, we present an information-theoretic framework for interpreting SAEs as lossy compression algorithms for communicating explanations of neural activations. We appeal to the Minimal Description Length (MDL) principle to motivate explanations of activations which... (read 4581 more words →)

Replying to Tokenized SAEs: Infusing per-token biases.

Michael Pearce · 2y

This work is really interesting. It makes sense that if you already have a class of likely features with known triggers, such as the unigrams, having a lookup table or embeddings for them will save on compute, since you don't need to learn the encoder.

I wonder if this approach could be extended beyond tokens. For example, if we have residual stream features from an upstream SAE, does it make sense to use those features for the lookup table in a downstream SAE? The vectors in the table might be the downstream representation of the same feature (with updates from the intermediate layers). Using features from an early-layer SAE might capture the effective tokens that form by combining common bigrams and trigrams.

Replying to You’re Measuring Model Complexity Wrong

Michael Pearce · 2y

The characterization of basin dimension here is super interesting, but it sounds like most of the framing is in terms of local minima. My understanding is that saddle points are much more common than local minima in high-dimensional landscapes (e.g., see https://arxiv.org/abs/1406.2572), since there is likely to be some direction leading to lower loss.

How does your model complexity measure work for saddle points? The quotes below suggest there could be issues, although I imagine the measure makes sense as long as the weights are sampled around the saddle (and do not fall into another basin).

"Currently, if not applied at a local minimum, the estimator can sometimes yield unphysical negative model complexities."

"This occurs when the sampler strays beyond its intended confines and stumbles across models with much lower loss than those in the desired neighborhood."