Arthur Conmy


Views my own



If there are some very common features in particular layers (e.g. an 'attend to BOS' feature), then restricting one expert to be active at a time will potentially force SAEs to learn common features in every expert. 


+1 to similar concerns -- I would probably have left one expert always on. This should remove some redundant features.
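A minimal sketch of what "one expert always on" could look like: a shared expert that fires on every input alongside top-1 routed experts, so common features need not be duplicated in each expert. All names and sizes here are hypothetical, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_expert, n_experts = 16, 64, 4  # hypothetical sizes

# Shared expert: always active, so common features (e.g. an "attend to BOS"
# feature) can live here instead of being re-learned in every routed expert.
W_enc_shared = rng.standard_normal((d_model, d_expert)) * 0.1
W_dec_shared = rng.standard_normal((d_expert, d_model)) * 0.1
# Routed experts: only one is active per input (top-1 gating).
W_enc = rng.standard_normal((n_experts, d_model, d_expert)) * 0.1
W_dec = rng.standard_normal((n_experts, d_expert, d_model)) * 0.1
W_gate = rng.standard_normal((d_model, n_experts)) * 0.1

def forward(x):
    e = int(np.argmax(x @ W_gate))  # router picks a single routed expert
    h_shared = np.maximum(x @ W_enc_shared, 0.0)  # shared expert, always on
    h_routed = np.maximum(x @ W_enc[e], 0.0)      # selected routed expert
    return h_shared @ W_dec_shared + h_routed @ W_dec[e]

x = rng.standard_normal(d_model)
print(forward(x).shape)  # (16,)
```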

Relevant further context: Gray Swan's Cygnet-8B Llama finetune (which uses circuit breakers and probably other safety training too, and had impressive-seeming 0.0 scores in some red-teaming evals in the paper) was jailbroken in 3 hours:

My takeaway from the blog post was that circuit breakers have fairly simple vulnerabilities. Since circuit breakers are an adversarial robustness method (not a capabilities method) I think you can update on the results of single case studies (i.e. worst case evaluations rather than average case evaluations).

Gemini models generally search the internet, which is why they show up on LMSYS without a knowledge cutoff date. Even when there's no source attached, the model still knows information from 4 days ago via the internet (image attached). But I think in your response the [1] shows the model did find an internet source for the GUID anyway?

Unless you're using the API here and the model is being weird? Without internet access, I expect it's possible to coax the string out but the model refuses requests a lot so I think it would require a bit of elbow grease.

Mistral and Pythia use rotary embeddings and don't have a positional embedding matrix. Which matrix are you looking at for those two models?
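To illustrate the point: with rotary embeddings there is no learned `W_pos` matrix to inspect at all; position enters only by rotating pairs of query/key dimensions by position-dependent angles. A minimal illustrative RoPE sketch (simplified pairing convention, not any specific library's layout):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotary position embedding: rotate dimension pairs of x by angles that
    # depend on the token position. No positional embedding matrix exists.
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # per-pair rotation frequencies
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.ones(8)
q_rot = rope(q, pos=3)
# Rotations preserve the vector's norm, unlike adding a learned embedding.
print(np.allclose(np.linalg.norm(q_rot), np.linalg.norm(q)))  # True
```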

They emailed some people about this:

The reason is that it may allow unembedding matrix weight stealing:
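A sketch of why full logit access is sensitive (my toy reconstruction of the known attack idea, with invented sizes): every logit vector is a hidden state times the unembedding matrix, so stacked logits from many prompts lie in a subspace of dimension `d_model`, and their numerical rank leaks the hidden dimension (and, up to a rotation, the unembedding matrix itself):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 32, 256  # toy sizes, not a real model

W_U = rng.standard_normal((d_model, vocab))  # the "secret" unembedding matrix

# Stand-in for final hidden states from 100 different queries.
H = rng.standard_normal((100, d_model))
logits = H @ W_U  # what an API exposing full logits would return

# All logit vectors lie in a d_model-dimensional subspace of R^vocab,
# so the numerical rank of the stacked logits reveals d_model.
s = np.linalg.svd(logits, compute_uv=False)
recovered_dim = int((s > 1e-8 * s[0]).sum())
print(recovered_dim)  # 32
```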


they [transcoders] take as input the pre-MLP activations, and then aim to represent the post-MLP activations of that MLP sublayer

I assumed this meant activations just before GELU and just after GELU, but looking at the code I think I was wrong. Could you rephrase to e.g.

they take as input MLP block inputs (just after LayerNorm) and they output MLP block outputs (what is added to the residual stream)
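In code, my reading of the transcoder setup is the following sketch (hypothetical sizes, ReLU encoder; the real architecture may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 128  # hypothetical sizes

# Transcoder sketch: maps the MLP block's *input* (residual stream just after
# LayerNorm) to the MLP block's *output* (what is added back to the residual
# stream), via a sparse hidden layer -- not pre-GELU to post-GELU activations.
W_enc = rng.standard_normal((d_model, d_hidden)) * 0.1
b_enc = np.zeros(d_hidden)
W_dec = rng.standard_normal((d_hidden, d_model)) * 0.1
b_dec = np.zeros(d_model)

def transcoder(mlp_input):
    features = np.maximum(mlp_input @ W_enc + b_enc, 0.0)  # sparse features
    return features @ W_dec + b_dec  # predicted MLP block output

x = rng.standard_normal(d_model)  # stand-in for a post-LayerNorm MLP input
print(transcoder(x).shape)  # (16,)
```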


Ah yeah, Neel's comment makes no claims about feature death beyond Pythia-2.8B residual streams. I trained 524K-width Pythia-2.8B MLP SAEs with <5% feature death (not in paper), and Anthropic's work gets to >1M live features (with no claims about interpretability), which together would make me surprised if 131K was near the maximum number of possible live features even in small models.


I don't think zero ablation is that great a baseline. We're mostly using it for continuity's sake with Anthropic's prior work (and also it's a bit easier to explain than a mean ablation baseline which requires specifying where the mean is calculated from). In the updated paper (up in a few hours) we show all the CE loss numbers for anyone to scale how they wish.
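To make the baseline dependence concrete, here is an illustrative calculation (numbers invented, not from the paper) of the usual "loss recovered" fraction under the two ablation baselines:

```python
# Illustrative numbers only: how "loss recovered" depends on the baseline.
ce_clean = 3.00      # CE loss of the unmodified model
ce_sae = 3.20        # CE loss with the SAE spliced in
ce_zero_abl = 12.00  # CE loss with the site's activations zero-ablated
ce_mean_abl = 8.00   # CE loss with activations replaced by their dataset mean

def loss_recovered(ce_ablated):
    # Fraction of the ablation-induced loss increase that the SAE recovers.
    return (ce_ablated - ce_sae) / (ce_ablated - ce_clean)

# The same SAE looks better against the (larger) zero-ablation denominator,
# which is one reason raw CE numbers are more informative than either ratio.
print(round(loss_recovered(ce_zero_abl), 3))  # 0.978
print(round(loss_recovered(ce_mean_abl), 3))  # 0.96
```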

I don't think compute efficiency hit[1] is ideal. It's really expensive to compute, since you can't just calculate it from an SAE alone as you need to know facts about smaller LLMs. It also doesn't transfer as well between sites (splicing in an attention layer SAE doesn't impact loss much, splicing in an MLP SAE impacts loss more, and residual stream SAEs impact loss the most). Overall I expect it's a useful expensive alternative to loss recovered, not a replacement.

EDIT: on consideration of Leo's reply, I think my point about transfer is wrong; a metric like "compute efficiency recovered" could always be created by rescaling the compute efficiency number.

  1. ^

    What I understand "compute efficiency hit" to mean is: for a given (SAE, LM) pair, how much less compute you'd need (as a multiplier) to train a different LM, M', such that M' gets the same loss as the original LM-with-the-SAE-spliced-in.
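A worked toy version of that definition, assuming a power-law loss-vs-compute fit loss(C) = a * C^(-b) for the model family (all numbers invented for illustration):

```python
# Hypothetical worked example of the "compute efficiency hit" definition,
# assuming loss(C) = a * C**(-b) for this LM family (coefficients invented).
a, b = 10.0, 0.05

def loss(C):
    return a * C ** (-b)

C = 1e20                      # compute used to train the original LM
ce_with_sae = loss(C) * 1.02  # suppose splicing in the SAE raises loss by 2%

# Solve loss(C') = ce_with_sae for C': the compute at which a fresh LM
# would match the SAE-spliced model's loss.
C_prime = (a / ce_with_sae) ** (1.0 / b)

# The hit is the multiplier C / C', i.e. how much compute the SAE "costs".
print(round(C / C_prime, 2))  # 1.49 -- ~49% more compute under these numbers
```

Note this requires knowing the scaling-law fit for the family of smaller LMs, which is part of why the metric is expensive to compute.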


I'm not sure what you mean by "the reinitialization approach", but feature death doesn't seem to be a major issue at the moment. At all sites besides L27, our Gemma-7B SAEs didn't have much feature death at all (stats will be up in a few hours), and the Anthropic update suggests that even in small models the problem can be addressed.
