Classifying representations of sparse autoencoders (SAEs)

[-]Senthooran Rajamanoharan2y98

Thanks for sharing your findings - this was an interesting idea to test out! I played around with the notebook you linked to on this and noticed that the logistic regression training accuracy is also pretty low for earlier layers when using the encoded hidden representations. This was initially surprising (surely it should be easy to overfit with such a high dimensional input space and only ~1000 examples?) until I noticed that the number of 'on' features is pretty low, especially for early layer SAEs.

For example, the layer 2 SAE only has (the same) 2 features on over all examples in the dataset, so effectively you're training a classifier after doing a dimensionality reduction down to 2 dimensions. This may be a tall order even if you used (say) PCA to choose those 2 dimensions, but in the case of the pretrained SAE those two dimensions were chosen to optimise reconstruction on the full data distribution (of which this dataset is rather unrepresentative). The upshot is that unless you're lucky (and the SAE happened to pick features that correspond to sentiment), it makes sense you lose a lot of classification performance.

In contrast, the final SAEs have hundreds of features that are 'on' over the dataset, so even if none of those features directly relate to sentiment, the chances are good that you have preserved enough of the structure in the original hidden state to be able to recover sentiment. On the other hand, even at this end of the spectrum, note you haven't really projected to a higher dimensional space - you've gone from ~1000 dimensions to a similar or fewer number of effective dimensions - so it's not so surprising performance still doesn't match training a classifier on the hidden states directly.

All in all, I think this gave me a couple of useful insights:

It's important to have really, really high fidelity with SAEs if you want to keep L0 (number of on features) low while at the same time be able to use the SAE for very narrow distribution analysis. (E.g. in this case, if the layer 2 SAE really had encoded the concept of sentiment, then it wouldn't have mattered that only 2 features were on on average across the dataset.)
I originally shared your initial hypothesis (about projecting to a higher dimensional space making concepts more separable), but have updated to thinking that I shouldn't think of sparse "high dimensional" projections in the same way as dense projections. My new mental model for sparse projections is that you're actually projecting down to a lower dimensional space, but where the projection is task dependent (i.e. the SAE's relu chooses which projections it thinks are relevant). (Think of it a bit like a mixture of experts dimensionality reduction algorithm.) So the act of projection will only help with classification performance if the dimensions chosen by the filter are actually relevant to the problem (which requires a really good SAE), otherwise you're likely to get worse performance than if you hadn't projected at all.

[-]Annah2y10

Yeah, this makes a ton of sense. Thx for taking the time to give it a closer look and also your detailed response :)

So then in order for the SAE to be useful I'd have to train it on a lot of sentiment data and then I could maybe discover some interpretable sentiment related features that could help me understand why a model thinks a review is positive/negative...

[-]James Payor2y10

Your graphs are labelled with "test accuracy", do you also have some training graphs you could share?

I'm specifically wondering if your train accuracy was high for both the original and encoded activations, or if e.g. the regression done over the encoded features saturated at a lower training loss.

[-]Annah2y20

The relative difference in the train accuracies looks pretty similar. But yeah, @SenR already pointed to the low number of active features in the SAE, so that explains this nicely.

[-]Arthur Conmy2y10

Why do you think that the sentiment will not be linearly separable?

I would guess that something like multiplying residual stream states by (ie the logit difference under the Logit Lens) would be reasonable (possibly with hacks like the tuned lens)

[-]Annah2y10

I'm not quite sure what you mean with "the sentiment will not be linearly separable".

The hidden states are linearly separable (to some extend), but the sparse representations perform worse than the original representations in my experiment.

I am training logistic regression classifiers on the original, and sparse representations respectively, so I am multiplying the residual stream states (and their sparse encodings) with weights. These weights could (but don't have to) align with some meaningful direction like hidden_states("positive")-hidden_states("negative").

I'm not sure if I understood your comment about the logit lens. Are you proposing this as an alternative way of testing for linear separability? But then shouldn't the information already be encoded in the hidden states and thus extractable with a classifier?

^{^}

I could not consistently load the same configuration for all layers, that's why I only got results for a few layers.

LESSWRONG
LW

LESSWRONG
LW

15

Classifying representations of sparse autoencoders (SAEs)

15

15

Introduction

Implementation

Results

Reconstruction error

Test accuracy

Conclusion/Confusion