chanind

Comments
L0 is not a neutral hyperparameter
chanind · 2mo

What do you think of scenarios with near infinitely many features, descending in importance / frequency like a power law (feature completeness section of Templeton et al.)? What should our goal be here? Do you think Multi-L0 SAEs could handle the low-importance tail? Or would the LLM just learn a much smaller subset in the first place, not capturing the low-importance tail?

I view SAE width and SAE L0 as two separate parameters we should try to get right if we can. In toy models, similar failure modes to what we see with low-L0 SAEs also happen if the SAE is narrower than the number of true features, in that the SAE tries to "cheat" and get better MSE loss by mixing correlated features together. If we can't make the SAE as wide as the number of true features, I'd still expect wider SAEs to learn cleaner features than narrower SAEs. But then wider SAEs make feature absorption a lot worse, so that's a problem. I don't think multi-L0 SAEs would help or hurt in this case though - capturing near-infinite features requires a near-infinite-width SAE regardless of the L0.

For setting the correct L0 for a given SAE width, I don't think there's a trade-off with absorption - getting the L0 correct should always improve things. I view the feature completeness stuff as also being somewhat separate from the choice of L0, since L0 is about how many features are active at the same time regardless of the total number of features. Even if there are infinitely many features, there's still hopefully only a small / finite number of features active for any given input.

Re case 3 experiments: Are the extra SAE features your SAE learned dead, in the sense of having a small magnitude? Generally I would expect that in practice, those features should be dead (if allowed by architecture) or used for something else. In particular, if your dataset had correlations, I would expect them to go off and do feature absorption (Chanin, Till, etc.).

In all the experiments across all 3 cases, the SAEs have the same width (20), so the higher L0 SAEs don't learn any more features than lower L0 SAEs.

We looked into what happens if SAEs are wider than the number of true features in toy models in an earlier post, and found exactly what you suspect: the SAE starts inventing arbitrary combo latents (e.g. a "red triangle" latent in addition to "red" and "triangle" latents), or creating duplicate latents, or just killing off some of the extra latents.

For both L0 and width, it seems like giving the SAE more capacity than it needs to model the underlying data results in the SAE misusing the extra capacity and finding degenerate solutions.

Negative Results on Group SAEs
chanind · 4mo

Thank you for writing this up! I experimented briefly with group sparsity as well, but with the goal of learning the "hierarchy" of features rather than learning circular features like you're doing here. I also struggled to get it to work in toy settings, but didn't try extensively and ended up moving on to other things. I still think there must be something to group sparsity, since it's so well studied in sparse coding and clearly does work in theory.

I also struggled with the problem of how to choose groups, since for traditional group sparsity you need to set the groups beforehand. I like your idea of trying to learn the group space. For using group sparsity to recover hierarchy, I wonder if there's a way to learn a direction for the group as a whole, and project out that direction from each member of the group. The idea would be that if latents share common components, those common components should probably be their own "group" representation, and this should be done until the leaf nodes are mostly orthogonal to each other. There are definitely overlapping hierarchies too, which is a challenge.
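For reference, by "traditional group sparsity" I mean a group-lasso style penalty where the groups are fixed up front; a minimal sketch (function and argument names are just illustrative):

```python
import torch

def group_sparsity_penalty(latents: torch.Tensor, groups: list[list[int]]) -> torch.Tensor:
    """Group-lasso style penalty: sum over groups of the L2 norm of each group's
    latent activations. The group index sets must be chosen beforehand.

    latents: (batch, n_latents) SAE activations
    groups: list of latent-index lists, one per group
    """
    penalty = torch.zeros((), device=latents.device)
    for idx in groups:
        # L2 within a group encourages whole groups to switch off together,
        # while activations inside an active group are left relatively free.
        penalty = penalty + latents[:, idx].norm(dim=-1).mean()
    return penalty
```

The total loss would then be something like reconstruction MSE plus a coefficient times this penalty.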

Regardless, thank you for sharing this! There's a lot of great ideas in this post.

Lucius Bushnaq's Shortform
chanind · 5mo

This implies that there is no elephant direction separate from the attributes that happen to commonly co-occur with elephants. E.g. it's not possible to represent an elephant with any arbitrary combination of attributes, as the attributes themselves are what defines the elephant direction. This is what I mean when I say the attributes are the 'base units' in this scheme, and 'animals' are just commonly co-occurring sets of attributes. This is the same as the "red triangle" problem in SAEs: https://www.lesswrong.com/posts/QoR8noAB3Mp2KBA4B/do-sparse-autoencoders-find-true-features. The animals in this framing are just invented combinations of the underlying attribute features. We would want the dictionary to learn the attributes, not arbitrary combinations of attributes, since the attributes are the true "base units" that can vary freely. E.g. in the "red triangle" problem, we want a dictionary to learn "red" and "triangle", not "red triangle" as its own direction.

Put another way, there's no way to represent an "elephant" in this scheme without also attaching attributes to it. Likewise, it's not possible to differentiate between an elephant with the set of attributes x, y, and z and a rabbit with the identical attributes x, y, and z, since the sum of attributes is what you're calling an elephant or rabbit. There's no separate "this is a rabbit, regardless of what attributes it has" direction.

To properly represent animals and attributes, there needs to be a direction for each animal that's separate from any attributes that animal may have, so that it's possible to represent a "tiny furry pink elephant with no trunk" vs a "tiny furry pink rabbit with no trunk".
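As a toy illustration of what I mean (purely hypothetical numbers and names), with near-orthogonal label directions the animal can still be read off by projection even when the attribute sets are identical:

```python
import torch

# Hypothetical illustration: separate label directions keep "tiny furry pink
# elephant" and "tiny furry pink rabbit" distinguishable by projection.
d = 512
g = torch.Generator().manual_seed(0)
dirs = {name: torch.nn.functional.normalize(torch.randn(d, generator=g), dim=0)
        for name in ["elephant", "rabbit", "tiny", "furry", "pink"]}

a1 = dirs["elephant"] + dirs["tiny"] + dirs["furry"] + dirs["pink"]
a2 = dirs["rabbit"] + dirs["tiny"] + dirs["furry"] + dirs["pink"]

# Project onto the label directions: ~1 for the right animal, ~0 (plus a bit
# of interference) for the other, despite the attributes being identical.
print((a1 @ dirs["elephant"]).item(), (a1 @ dirs["rabbit"]).item())
print((a2 @ dirs["elephant"]).item(), (a2 @ dirs["rabbit"]).item())
```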

Lucius Bushnaq's Shortform
chanind · 5mo

No, the animal vectors are all fully spanned by the fifty attribute features.

 

Is this just saying that there's superposition noise, so everything is spanning everything else? If so, that doesn't seem like it should conflict with being able to use a dictionary; dictionary learning should work with superposition noise as long as the interference doesn't get too massive.
 

The animal features are sparse. The attribute features are not sparse.

If you mean that the attributes are a basis in the sense that the neurons are a basis, then I don't see how you can say there's a unique "label" direction for each animal that's separate from the underlying attributes, such that you can set any arbitrary combination of attributes (including all attributes turned on at once, or all turned off, since they're not sparse) and still read off the animal label without interference. It seems like that would be like saying the elephant direction = [1, 0, -1], but you can arbitrarily change all 3 of those numbers to any other numbers and still have the elephant direction.

Lucius Bushnaq's Shortform
chanind · 5mo

If I understand correctly, it sounds like you're saying there is a "label" direction for each animal that's separate from each of the attributes. So, you could have activation a1 = elephant + small + furry + pink, and a2 = rabbit + small + furry + pink. a1 and a2 have the same attributes, but different animal labels. Their corresponding activations are thus different despite having the same attributes due to the different animal label components.

I'm confused why a dictionary that consists of a feature direction for each attribute and each animal label can't explain these activations? These activations are just a (sparse) sum of these respective features, which are an animal label and a set of a few attributes, and all of these are (mostly) mutually orthogonal. In this sense the activations are just the sum of the various elements of the dictionary multiplied by a magnitude, so it seems like you should be able to explain these activations using dictionary learning.

Is the idea that the 1000 animals and 50 attributes form an overcomplete basis, therefore you can come up with infinite ways to span the space using these basis components? The idea behind compressed sensing in dictionary learning is that if each activation is composed of a sparse sum of features, then L1 regularization can still recover the true features despite the basis being overcomplete.
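To make the compressed-sensing intuition concrete, here's a rough sketch (sizes and the L1 coefficient are made up, and plain gradient descent stands in for a proper sparse-coding solver) of recovering a sparse code against an overcomplete dictionary:

```python
import torch

# Sketch: 1050 dictionary directions (1000 "animal labels" + 50 "attributes")
# in a 50-d space, and an activation that is a sparse sum of 3 of them.
torch.manual_seed(0)
d_model, n_feats = 50, 1050
D = torch.nn.functional.normalize(torch.randn(n_feats, d_model), dim=-1)

true_code = torch.zeros(n_feats)
true_code[[3, 1010, 1040]] = 1.0        # one animal label + two attributes
x = true_code @ D

# Minimize ||x - c @ D||^2 + lam * ||c||_1 over the code c.
c = torch.zeros(n_feats, requires_grad=True)
opt = torch.optim.Adam([c], lr=1e-2)
lam = 0.05
for _ in range(3000):
    opt.zero_grad()
    loss = (c @ D - x).pow(2).sum() + lam * c.abs().sum()
    loss.backward()
    opt.step()

# If the code is sparse enough and interference is low, the largest entries of
# c should line up with the 3 true features.
print(c.detach().topk(3).indices)
```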

Lucius Bushnaq's Shortform
chanind · 5mo

It seems like in this setting, the animals are just the sum of attributes that commonly co-occur together, rather than having a unique identifying direction. E.g. the concept of a "furry elephant" or a "tiny elephant" would be unrepresentable in this scheme, since elephant is defined as just the collection of attributes that elephants usually have, which includes being large and not furry.

I feel like in this scheme, it's not really the case that there are 1000 animal directions, since the base unit is the attributes, and there's no way to express an animal separately from its attributes. For there to be a true "elephant" direction, it should be possible to attach any set of arbitrary attributes to an elephant (small, furry, pink, etc...), and this would require a "label" direction that indicates "elephant" and is mostly orthogonal to every other feature so it can be queried uniquely via projection.

That being said, I could imagine a situation where the co-occurrence between labels and attributes is so strong (nearly perfect hierarchy) that the model's circuits can select the attributes along with the label without it ever being a problem during training. For instance, maybe a circuit that's trying to select the "elephant" label actually selects "elephant + gray", and since "pink elephant" never came up during training, the circuit never received a gradient to force it to select just "elephant", which is what it's really aiming for.

Broken Latents: Studying SAEs and Feature Co-occurrence in Toy Models
chanind · 8mo

The behavior you see in your study is fascinating as well! I wonder if using a tied SAE would force these relationships in your work to be even more obvious, since if the decoder of a tied SAE tries to mix co-occurring parent/child features together, it has to also mix them in the encoder, and thus it should show up in the activation patterns more clearly. If an underlying feature is shared between two latents (e.g. a parent feature), tied SAEs don't have a good way to keep the latents themselves from firing together and thus showing up as co-firing latents. Untied SAEs can more easily do an absorptiony thing and turn off one latent when the other fires, for example, even if they both encode similar underlying features.
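To be concrete about what "tied" means here, a minimal sketch (names are illustrative, not from any particular library) where the decoder is literally the transpose of the encoder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedSAE(nn.Module):
    """Minimal tied-weights SAE sketch: one weight matrix W is used for both
    encoding and (transposed) decoding, so a latent can't encode one direction
    while decoding a different one."""
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        acts = F.relu((x - self.b_dec) @ self.W.T + self.b_enc)
        recon = acts @ self.W + self.b_dec
        return recon, acts
```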

I think a next step for this work is to try to do clustering of activations based on their position in the activation density histogram of latents. I expect we should see some of the same clusters being present across multiple latents, and that those latents should also co-fire together to some extent.
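As a starting point, the per-latent histogram I have in mind is something like this sketch (function and argument names are made up):

```python
import torch

def latent_activation_histogram(acts: torch.Tensor, latent_idx: int, bins: int = 100):
    """Activation-density histogram for one latent over its nonzero activations.
    Distinct peaks are the candidate clusters to compare across co-firing latents."""
    vals = acts[:, latent_idx]
    vals = vals[vals > 0].float().cpu()
    if vals.numel() == 0:            # dead latent: nothing to histogram
        return None
    return torch.histogram(vals, bins=bins)  # returns (counts, bin_edges)
```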

The two other things in your work that feel important are the idea of models using low activations as a form of "uncertainty", and non-linear features like days of the week forming a circle. The toy examples in our work here assume that neither of these happens: features basically fire with a set magnitude (maybe with some variance), and the directions of features are mutually orthogonal (or mostly so). If models use low activations to signal uncertainty, we won't necessarily see a clean peak in the activation histogram when the feature activates, or the width of the activation peak might look very large. If features form a circle, then the underlying directions are not mutually orthogonal, and this will also likely show up as extra activation peaks in the activation density histograms of latents representing these circular concepts - but those peaks won't correspond to parent/child relationships and absorption, just to the fact that different vectors on a circle all project onto each other.

Do you think your work can be extended to automatically classify whether an underlying feature is a circular or otherwise non-linear feature, whether it's in a parent/child relationship, and whether it fires with a roughly set magnitude or instead uses magnitude as uncertainty? It would be great to have a sense of what portion of features in a model are of which sorts (set magnitude vs variable magnitude, mostly orthogonal direction vs forming a geometric shape with related features, parent/child, etc...). For the method we present here, it would be helpful to know if an activation density peak is an unwanted parent or child feature component that should be projected out of the latent, vs something that's intrinsically part of the latent (e.g. just the same feature with a lower magnitude, or a circular geometric relationship with related features).

Matryoshka Sparse Autoencoders
chanind · 8mo

Yeah I think that's right, the problem is that the SAE sees 3 very non-orthogonal inputs, and settles on something sort of between them (but skewed towards the parent). I don't know how to get the SAE to exactly learn the parent only in these scenarios - I think if we can solve that then we should be in pretty good shape.

This is all sketchy though. It doesn't feel like we have a good answer to the question "How exactly do we want the SAEs to behave in various scenarios?"

I do think the goal should be to get the SAE to learn the true underlying features, at least in these toy settings where we know what the true features are. If the SAEs we're training can't handle simple toy examples without superposition, I don't have a lot of faith that the results will be trustworthy when we train SAEs on real LLM activations.

Matryoshka Sparse Autoencoders
chanind · 8mo

It might also be an artifact of using MSE loss. Maybe a different reconstruction loss wouldn't have this problem?

Matryoshka Sparse Autoencoders
chanind · 8mo

I tried digging into this some more and think I have an idea what's going on. As I understand it, the base assumption for why Matryoshka SAEs should solve absorption is that a narrow SAE should perfectly reconstruct parent features in a hierarchy, so absorption patterns can't arise between child and parent features. However, it seems like this assumption is not correct: narrow SAEs still learn messed-up latents when there's co-occurrence between parent and child features in a hierarchy, and this messes up what the Matryoshka SAE learns.

I did this investigation in the following colab: https://colab.research.google.com/drive/1sG64FMQQcRBCNGNzRMcyDyP4M-Sv-nQA?usp=sharing

Apologies for the long comment, this might make more sense as its own post potentially. I'm curious to get others thoughts on this - it's also possible I'm doing something dumb.

The problem: Matryoshka latents don't perfectly match true features

In the post, the Matryoshka latents seem to have the following problematic properties:

- The latent tracking a parent feature contains components of child features
- The latents tracking child features have negative components of each other child feature

The setup: simplified hierarchical features

I tried to investigate this using a simpler version of the setup in this post, focusing on a single parent/child relationship between features. This is like a zoomed-in view of a single set of parent/child features. Our setup has 3 true features in a hierarchy as below:

feat 0 - parent_feature (p=0.3, mutually exclusive children)
feat 1 - ├── child_feature_1 (p=0.4)
feat 2 - └── child_feature_2 (p=0.4)

These features have higher firing probabilities compared to the setup in the original post to make the trends highlighted more obvious. All features fire with magnitude 1.0 and have a 20d representation with no superposition (all features are mutually orthogonal).
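For concreteness, the data generation can be sketched like this (not the exact colab code; I'm reading the child probabilities as conditional on the parent firing, with children only firing alongside the parent):

```python
import torch

def sample_toy_batch(batch_size: int, d_model: int = 20, seed: int = 0):
    """Sketch of the toy setup: 3 mutually orthogonal features in a hierarchy.
    feat 0 (parent) fires with p=0.3; when it fires, at most one of the two
    mutually exclusive children fires (p=0.4 each, read as conditional on the
    parent). All features fire with magnitude 1.0 in a 20-d space."""
    g = torch.Generator().manual_seed(seed)
    # 3 orthonormal 20-d directions, so there is no superposition.
    dirs = torch.linalg.qr(torch.randn(d_model, d_model, generator=g)).Q[:3]
    feats = torch.zeros(batch_size, 3)
    parent = torch.rand(batch_size, generator=g) < 0.3
    u = torch.rand(batch_size, generator=g)
    feats[:, 0] = parent.float()
    feats[:, 1] = (parent & (u < 0.4)).float()
    feats[:, 2] = (parent & (u >= 0.4) & (u < 0.8)).float()
    return feats @ dirs, feats  # activations (batch, 20), true features (batch, 3)
```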

Simplified Matryoshka SAE

I used a simpler Matryoshka SAE that doesn't use feature sampling or reshuffling of latents and doesn't take the log of losses. Since we already know the hierarchy of the underlying features in this setup, I just used a single inner SAE of width 1 to track the parent feature, and an outer SAE width of 3 to match the number of true features. So the Matryoshka SAE sizes are as below:

size 0: latents [0]
size 1: latents [0, 1, 2]
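In code, the simplified Matryoshka loss is just a sum of reconstruction losses over those nested prefixes, roughly like this sketch (initialization and L1 coefficient are made up, and this isn't the exact code from the colab):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMatryoshkaSAE(nn.Module):
    """Sketch of the simplified Matryoshka SAE: no feature sampling, no latent
    reshuffling, no log of losses. Each prefix of latents (here sizes 1 and 3)
    must reconstruct the input on its own."""
    def __init__(self, d_model: int = 20, n_latents: int = 3, sizes=(1, 3)):
        super().__init__()
        self.sizes = sizes
        self.W_enc = nn.Parameter(torch.randn(d_model, n_latents) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.W_dec = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def loss(self, x: torch.Tensor, l1_coeff: float = 1e-3) -> torch.Tensor:
        acts = F.relu(x @ self.W_enc + self.b_enc)
        total = l1_coeff * acts.abs().sum(dim=-1).mean()
        # One MSE term per prefix: the first `size` latents reconstruct x alone,
        # which is what is supposed to pin latent 0 to the parent feature.
        for size in self.sizes:
            recon = acts[:, :size] @ self.W_dec[:size] + self.b_dec
            total = total + (recon - x).pow(2).mean()
        return total
```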

The cosine similarities between the encoder and decoder of the Matryoshka SAE and the true features are shown below:

The Matryoshka decoder matches what we saw in the original post: the latent tracking the parent feature has positive cosine sim with the child features, and each latent tracking a child feature has negative cosine sim with the other child feature. Our inner Matryoshka SAE, consisting of just latent 0, does track the parent feature as we expected though! What's going on here? How is it possible for the inner Matryoshka SAE to represent a merged version of the parent and child features?

Narrow SAEs do not correctly reconstruct parent features

The core idea behind Matryoshka SAEs is that in a narrower SAE, the SAE should learn a clean representation of parent features despite co-occurrence with child features. Once we have a clean representation of a parent feature in a hierarchy, adding child latents to the SAE should not allow any absorption.

Surprisingly, this assumption is incorrect: narrow SAEs merge child representations into the parent latent.

I tried training a standard SAE with a single latent on our toy example, expecting that the 1-latent SAE would learn only the parent feature without any signs of absorption. Below is the plot of the cosine similarities between the SAE encoder/decoder and the true features.

This single-latent SAE learns a representation that merges the child features into the parent latent, exactly as we saw in our Matryoshka SAE and in the original post's results! Our narrow SAE is not learning the correct representation of feature 0 as we would hope. Instead, it's learning feature 0 plus weaker representations of child features 1 and 2.

Why does this happen?

When there are fewer latents than features, the SAE always has to accept some MSE error, and merging some of the child features into the parent latent likely reduces MSE loss compared with learning the actual parent feature 0 on its own.
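You can sanity-check this with a quick back-of-the-envelope calculation. Condition on the parent firing, and suppose (as a simplification) the single latent fires with magnitude 1 and its decoder direction is parent + a·(child1 + child2): mixing in the children at a ≈ 0.4 then beats the pure parent decoder.

```python
# Expected squared error per parent-firing sample, for a 1-latent SAE whose
# decoder is parent + a*(child1 + child2) (features orthogonal, magnitude 1,
# children mutually exclusive with conditional probability 0.4 each).
def expected_mse(a: float) -> float:
    err_with_child = (1 - a) ** 2 + a ** 2  # active child off by (1-a), inactive child off by a
    err_no_child = 2 * a ** 2               # both child components are spurious
    return 0.8 * err_with_child + 0.2 * err_no_child

print(expected_mse(0.0))  # pure parent decoder: 0.8
print(expected_mse(0.4))  # merged decoder at the optimal a: 0.48
```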

What does this mean for Matryoshka SAEs?

This issue should affect any Matryoshka SAE, since the base assumption underlying Matryoshka SAEs is that a narrow SAE will correctly represent general parent features without any issues due to co-occurrence with specific child features. Since that assumption is not correct, we should not expect a Matryoshka SAE to completely fix absorption issues. I would expect that the topk SAEs from https://www.lesswrong.com/posts/rKM9b6B2LqwSB5ToN/learning-multi-level-features-with-matryoshka-saes would also suffer from this problem, although I didn't test that in this toy setting since topk SAEs are trickier to evaluate in toy settings (it's not obvious what K to pick).

It's possible the issues shown in this toy setting are more extreme than in a real LLM, since the firing probabilities of the features here may be higher than those of many features in a real LLM. That said, it's hard to say anything concrete about the firing probabilities of features in real LLMs, since we have no ground-truth data on true LLM features.

13The "Sparsity vs Reconstruction Tradeoff" Illusion
12d
0
24L0 is not a neutral hyperparameter
2mo
3
32Sparsity is the enemy of feature extraction (ft. absorption)
4mo
0
29A Bunch of Matryoshka SAEs
5mo
0
22Feature Hedging: Another way correlated features break SAEs
5mo
0
24Broken Latents: Studying SAEs and Feature Co-occurrence in Toy Models
8mo
3
82SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
Ω
9mo
Ω
6
49Toy Models of Feature Absorption in SAEs
11mo
8
73[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
1y
16
2Auto-matching hidden layers in Pytorch LLMs
2y
0