Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

Joseph Bloom

Thanks for doing this work -- I'm really happy people are doing the basic stress testing of SAEs, and I agree that this is important and urgent given the sheer amount of resources being invested into SAE research.

For me, this was actually a positive update that SAEs are pretty good on distribution -- you trained SAE on length 128 sequences from OpenWebText, and the log loss was quite low up to ~200 tokens! This is despite its poor downstream use case performance.

I expected to see more negative results along the lines of your Lambada and Children's Book test results (that is, substantial degradation of loss, as soon as you go a tiny bit off distribution):

I do think these results add on to the growing pile of evidence that SAEs are not good "off distribution" (even a small amount off distribution, as in Sam Marks's results you link). This means they're somewhat problematic for OOD use cases like treacherous turn detection or detecting misgeneralization. That doesn't mean they're useless -- e.g. it's plausible that SAEs could be useful for steering, mechanistic anomaly detection, or helping us do case analysis for heuristic arguments or proofs.

As an aside, am I reading this plot incorrectly, or does the figure on the right suggest that SAE reconstructed representations have lower log loss than the original unmodified model?

[-]Evan Anders2y32

For me, this was actually a positive update that SAEs are pretty good on distribution -- you trained SAE on length 128 sequences from OpenWebText, and the log loss was quite low up to ~200 tokens! This is despite its poor downstream use case performance.

Yes, this was nice to see. I originally just looked at context positions at powers of 2 (...64, 128, 256,...) and there everything looked terrible above 128, but Logan recommended looking at all context positions and this was a cool result!

But note that there's a layer effect here. I think layer 12 is good up to ~200 tokens while layer 0 is only really good up to the training context size. I think this is most clear in the MSE/L1 plots (and this is consistent with later layers performing ok-ish on the long context CBT while early layers are poor).

This means they're somewhat problematic for OOD use cases like treacherous turn detection or detecting misgeneralization.

Yeah, agreed, which is a bummer because that's one thing I'd really like to see SAEs enable! Wonder if there's a way to change the training of SAEs to shift this over to on-distribution where they perform well.

As an aside, am I reading this plot incorrectly, or does the figure on the right suggest that SAE reconstructed representations have lower log loss than the original unmodified model?

Oof, that figure does indeed suggest that, but it's because of a bug in the plot script. Thank you for pointing that out, here's a fixed version:

I've fixed the repo and I'll edit the original post shortly.

[-]Joseph Bloom2y20

This means they're somewhat problematic for OOD use cases like treacherous turn detection or detecting misgeneralization.

I kinda want to push back on this since OOD in behavior is not obviously OOD in the activations. Misgeneralization especially might be better thought of as an OOD environment and on-distribution activations?

I think we should come back to this question when SAEs have tackled something like variable binding with SAEs. Right now it's hard to say how SAEs are going to help us understand more abstract thinking and therefore I think it's hard to say how problematic they're going to be for detecting things like a treacherous turn. I think this will depend on how how representations factor. In the ideal world, they generalize with the model's ability to generalize (Apologies for how high level / vague that idea is).

Some experiments I'd be excited to look at:

If the SAE is trained on a subset of the training distribution, can we distinguish it being used to decompose activations on those data points off the training distribution?
How does that compare to an SAE trained on the whole training distribution from the model, but then looking at when the model is being pushed off distribution?

I think I'm trying to get at - can we distinguish:

Anomalous activations.
Anomalous data points.
Anomalous mechanisms.

Lots of great work to look forward to!

[-]Joseph Bloom2y10

problems

prompts*

[-]Joseph Bloom2y10

@Evan Anders "For each feature, we find all of the problems where that feature is active, and we take the two measurements of “feature goodness" <- typo?

[-]Evan Anders2y10

Ah! That's the context, thanks for the clarification and for pointing out the error. Yes "problems" should say "prompts"; I'll edit the original post shortly to reflect that.

[-]Joseph Miller2y10

Has anyone tried training an SAE using the performance of the patched model as the loss function? I guess this would be a lot more expensive, but given that is the metric we actually care about, it seems sensible to optimise for it directly.

[-]Lee Sharkey2y70

This is a good idea and is something we're (Apollo + MATS stream) working on atm. We're planning on releasing our agenda related to this and, of course, results whenever they're ready to share.

[-]Joseph Bloom2y20

I've heard this idea floated a few times and am a little worried that "When a measure becomes a target, it ceases to be a good measure" will apply here. OTOH, you can directly check whether the MSE / variance explained diverges significantly so at least you can track the resulting SAE's use for decomposition. I'd be pretty surprised if an SAE trained with this objective became vastly more performant and you could check whether downstream activations of the reconstructed activations were off distribution. So overall, I'm pretty excited to see what you get!

[-]Joseph Miller2y10

Am I right in thinking that your ΔCE metric is equivalent to the KL Divergence between the SAE patched model and the normal model?

[-]Evan Anders2y10

After seeing this comment, if I were to re-write this post, maybe it would have been better to use the KL Divergence over the simple CE metric that I used. I think they're subtly different.

Per the TL implementation for CE, I'm calculating: CE $_{j}$ = $\frac{1}{N} \sum_{i} ln p_{i j}$ where $i$ is the batch dimension and $j$ is context position.

So $Δ$ CE $_{j}$ = $\frac{1}{N} \sum_{i} (ln q_{i j} - ln p_{i j})$ for $p_{i j}$ the baseline probability and $q_{i j}$ the patched probability.

So this is missing a factor of $p_{i j}$ to be the true KL divergence.

[-]Joseph Miller2y31

I think it is the same. When training next-token predictors we model the ground truth probability distribution as having probability for the actual next token and $0$ for all other tokens in the vocab. This is how the cross-entropy loss simplifies to negative log likelihood. You can see that the transformer lens implementation doesn't match the equation for cross entropy loss because it is using this simplification.

So the missing factor of $p$ would just be $1$ I think.

[-]Evan Anders2y30

Oh! You're right, thanks for walking me through that, I hadn't appreciated that subtlety. Then in response to the first question: yep! CE = KL Divergence.

[-]Joseph Miller2y10

Could the problem with long context lengths simply be that the positional embedding cannot be represented by the SAE? You could test this by manually adding the positional embedding vector when you patch in the SAE reconstruction.

[-]Evan Anders2y20

I think this is most of what the layer 0 SAE gets wrong. The layer 0 SAE just reconstructs the activations after embedding (positional + token), so the only real explanation I see for what it's getting wrong is the positional embedding.

But I'm less convinced that this explains later layer SAEs. If you look at e.g., this figure:

then you see that the layer 0 model activations are an order of magnitude smaller than any later-layer activations, so the positional embedding itself is only making up a really small part of the signal going into the SAE for any layer > 0 (so I'm skeptical that it's accounting for a large fraction of the large MSE that shows up there).

Regardless, this seems like a really valuable test! It would be fun to see what happens if you just feed the token embedding into the SAE and then add in the positional embedding after reconstructing the token embedding. I'd naively assume that this would go poorly -- if the SAE for layer 0 learns concepts more complex than just individual token embeddings, I think that would have to be the result of mixing positional and token embeddings?

[-]Joseph Bloom2y10

My mental model is the encoder is working hard to find particular features and distinguish them from others (so it's doing a compressed sensing task) and that out of context it's off distribution and therefore doesn't distinguish noise properly. Positional features are likely a part of that but I'd be surprised if it was most of it.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

43

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

43

43

Summary

Introduction

Experiment Overview

Open WebText next-token prediction (randomly sampled data)

Benchmarks (correct answer prediction)

How does context length affect SAE performance on randomly sampled data?

SAE Downstream Error Propagation

How does using SAE output in place of activations affect model performance on the Lambada Benchmark?

Takeaways

Future work

Code

Acknowledgments

Citing this post

Appendix: Children’s Book Test (CBT, Common Noun [CN] split)