Wiki Contributions


Cool paper. I think the semantic similarity result is particularly interesting.

As I understand it you've got a circuit  that wants to calculate something like Sim(A,B), where A and B might have many "senses" aka: features but the Sim might not be a linear function of each of thes Sims across all senses/features. 

So for example, there are senses in which "Berkeley" and "California" are geographically related, and there might be a few other senses in which they are semantically related but probably none that really matter for copy suppression. For this reason wouldn't expect the tokens of each of to have cosine similarity that is predictive of the copy suppression score.  This would only happen for really "mono-semantic tokens" that have only one sense (maybe you could test that). 

Moreover, there are also tokens which you might want to ignore when doing copy suppression (speculatively). Eg: very common words or punctuations (the/and/etc). 

I'd be interested if you have use something like SAE's to decompose the tokens into the underlying feature/s present at different intensities in each of these tokens (or the activations prior to the key/query projections). Follow up experiments could attempt to determine whether copy suppression could be better understood when the semantic subspaces are known. Some things that might be cool here:
- Show that some features are mapped to the null space of keys/queries in copy suppression heads indicating semantic senses / features that are ignored by copy suppression. Maybe multiple anti-induction heads compose (within or between layers) so that if one maps a feature to the null space, another doesn't (or some linear combination) or via a more complicated function of sets of features being used to inform suppression. 
- Similarly, show that the OV circuit is suppressing the same features/features you think are being used to determine semantic similarity. If there's some asymmetry here, that could be interesting as it would correspond to "I calculate A and B as similar by their similarity in the *california axis* but I suppress predictions of any token that has the feature for anywhere on the West Coast*).

I'm particularly excited about this because it might represent a really good way to show how knowing features informs the quality of mechanistic explanations. 

@Evan Anders "For each feature, we find all of the problems where that feature is active, and we take the two measurements of “feature goodness" <- typo? 

My mental model is the encoder is working hard to find particular features and distinguish them from others (so it's doing a compressed sensing task) and that out of context it's off distribution and therefore doesn't distinguish noise properly. Positional features are likely a part of that but I'd be surprised if it was most of it. 

I've heard this idea floated a few times and am a little worried that "When a measure becomes a target, it ceases to be a good measure" will apply here. OTOH, you can directly check whether the MSE / variance explained diverges significantly so at least you can track the resulting SAE's use for decomposition. I'd be pretty surprised if an SAE trained with this objective became vastly more performant and you could check whether downstream activations of the reconstructed activations were off distribution. So overall, I'm pretty excited to see what you get!

This means they're somewhat problematic for OOD use cases like treacherous turn detection or detecting misgeneralization.


I kinda want to push back on this since OOD in behavior is not obviously OOD in the activations. Misgeneralization especially might be better thought of as an OOD environment and on-distribution activations? 

I think we should come back to this question when SAEs have tackled something like variable binding with SAEs. Right now it's hard to say how SAEs are going to help us understand more abstract thinking and therefore I think it's hard to say how problematic they're going to be for detecting things like a treacherous turn. I think this will depend on how how representations factor. In the ideal world, they generalize with the model's ability to generalize (Apologies for how high level / vague that idea is). 

Some experiments I'd be excited to look at:

  • If the SAE is trained on a subset of the training distribution, can we distinguish it being used to decompose activations on those data points off the training distribution?
  • How does that compare to an SAE trained on the whole training distribution from the model, but then looking at when the model is being pushed off distribution? 

I think I'm trying to get at - can we distinguish:

  • Anomalous activations. 
  • Anomalous data points. 
  • Anomalous mechanisms. 

Lots of great work to look forward to!

Why do you want to refill and shuffle tokens whenever 50% of the tokens are used?


Neel was advised by the authors that it was important minimise batches having tokens from the same prompt. This approach leads to a buffer having activations from many different prompts fairly quickly. 


Is this just tokens in the training set or also the test set? In Neel's code I didn't see a train/test split, isn't that important?

I never do evaluations on tokens from prompts used in training, rather, I just sample new prompts from the buffer. Some library set aside a set of tokens to do evaluations on which are re-used. I don't currently do anything like this but it might be reasonable. In general, I'm not worried about overfitting. 

Also, can you track the number of epochs of training when using this buffer method (it seems like that makes it more difficult)?

Epochs in training makes sense in a data-limited regime which we aren't in. OpenWebText has way more tokens than we ever train any sparse autoencoder on so we're always on way less than 1 epoch. We never reuse the same activations when training, but may use more than one activation from the same prompt. 

Awesome work! I'd be quite interested to know whether the benefits from this technique are equivalently significant with a larger SAE and also what the original perplexity was (when looking at the summary statistics table). I'll probably reimplement at some point. 

Also, kudos on the visualizations. Really love the color scales!

On wandb, the dashboards were randomly sampled but we've since uploaded all features to Neuronpedia The log sparsity is stored in the huggingface repo so you can look for the most sparse features and check if their dashboards are empty or not (anecdotally most dashboards seem good, beside the dead neurons in the first 4 layers).

Load More