Thanks for sharing your findings - this was an interesting idea to test out! I played around with the notebook you linked to on this and noticed that the logistic regression training accuracy is also pretty low for earlier layers when using the encoded hidden representations. This was initially surprising (surely it should be easy to overfit with such a high dimensional input space and only ~1000 examples?) until I noticed that the number of 'on' features is pretty low, especially for early layer SAEs.
For example, the layer 2 SAE only has (the same) 2 features on over all examples in the dataset, so effectively you're training a classifier after doing a dimensionality reduction down to 2 dimensions. This may be a tall order even if you used (say) PCA to choose those 2 dimensions, but in the case of the pretrained SAE those two dimensions were chosen to optimise reconstruction on the full data distribution (of which this dataset is rather unrepresentative). The upshot is that unless you're lucky (and the SAE happened to pick features that correspond to sentiment), it makes sense you lose a lot of classification performance.
In contrast, the final SAEs have hundreds of features that are 'on' over the dataset, so even if none of those features directly relate to sentiment, the chances are good that you have preserved enough of the structure in the original hidden state to be able to recover sentiment. On the other hand, even at this end of the spectrum, note you haven't really projected to a higher dimensional space - you've gone from ~1000 dimensions to a similar or smaller number of effective dimensions - so it's not so surprising that performance still doesn't match training a classifier on the hidden states directly.
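To make the 'on' feature count concrete, here is a minimal sketch of how one might measure it. It assumes the SAE encodings are available as a NumPy array of shape (n_examples, n_features); the function name `count_active_features` and the toy data are illustrative and not taken from the notebook.

```python
import numpy as np

def count_active_features(encoded, threshold=0.0):
    """Return how many SAE features fire on at least one example.

    encoded: array of shape (n_examples, n_features) holding SAE activations.
    threshold: activations with absolute value above this count as 'on'.
    """
    active_mask = (np.abs(encoded) > threshold).any(axis=0)
    return int(active_mask.sum())

# Toy stand-in for SAE encodings: 5 examples, 8 dictionary features,
# of which only features 1 and 6 ever fire.
encoded = np.zeros((5, 8))
encoded[:, 1] = [0.3, 0.0, 1.2, 0.7, 0.1]
encoded[:, 6] = [0.0, 0.9, 0.0, 0.4, 0.2]

n_active = count_active_features(encoded)
print(n_active)
```

If this count is tiny for a layer, a classifier trained on the encodings is effectively working in that many dimensions, whatever the nominal dictionary size.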
All in all, I think this gave me a couple of useful insights:
Yeah, this makes a ton of sense. Thx for taking the time to give it a closer look and also your detailed response :)
So then in order for the SAE to be useful I'd have to train it on a lot of sentiment data and then I could maybe discover some interpretable sentiment related features that could help me understand why a model thinks a review is positive/negative...
Your graphs are labelled with "test accuracy", do you also have some training graphs you could share?
I'm specifically wondering if your train accuracy was high for both the original and encoded activations, or if e.g. the regression done over the encoded features saturated at a lower training loss.
Why do you think that the sentiment will not be linearly separable?
I would guess that something like multiplying residual stream states by (i.e. the logit difference under the Logit Lens) would be reasonable (possibly with hacks like the tuned lens)
I'm not quite sure what you mean with "the sentiment will not be linearly separable".
The hidden states are linearly separable (to some extent), but the sparse representations perform worse than the original representations in my experiment.
I am training logistic regression classifiers on the original and sparse representations respectively, so I am multiplying the residual stream states (and their sparse encodings) with weights. These weights could (but don't have to) align with some meaningful direction like hidden_states("positive")-hidden_states("negative").
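One rough way to check such an alignment is the cosine similarity between the classifier weights and a candidate direction. The sketch below is only illustrative: it uses the difference of class means on synthetic data as a stand-in for the candidate direction (not the hidden states of the tokens "positive" and "negative"), and `probe_w` is a hypothetical weight vector rather than a trained one.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
d = 8

# Synthetic "hidden states" whose first coordinate carries the class signal.
pos = rng.normal(size=(50, d))
pos[:, 0] += 2.0
neg = rng.normal(size=(50, d))
neg[:, 0] -= 2.0

# Candidate direction: difference of class means (an assumption of this sketch).
mean_diff = pos.mean(axis=0) - neg.mean(axis=0)

# Hypothetical classifier weights that only read the first coordinate.
probe_w = np.zeros(d)
probe_w[0] = 1.0

alignment = cosine(probe_w, mean_diff)
print(alignment)
```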
I'm not sure if I understood your comment about the logit lens. Are you proposing this as an alternative way of testing for linear separability? But then shouldn't the information already be encoded in the hidden states and thus extractable with a classifier?
Produced as part of the SERI ML Alignment Theory Scholars Program - Autumn 2023 Cohort, under the mentorship of Dan Hendrycks
There was recently some work on sparse autoencoding of hidden LLM representations.
I checked if these sparse representations are better suited for classification. It seems like they are significantly worse. I summarize my negative results in this blog post; the code can be found on GitHub.
Introduction
Anthropic, Conjecture and other researchers have recently published some work on sparse autoencoding. The motivation is to push features towards monosemanticity to improve interpretability.
The basic concept is to project hidden layer activations to a higher dimensional space with sparse features. These sparse features are learned by training an autoencoder with sparsity constraints.
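As a rough illustration of this idea, here is a minimal sketch of a sparse autoencoder forward pass with an L1 sparsity penalty. All names and sizes (`W_enc`, `W_dec`, `l1_coeff`, `d_model`, `d_dict`) are assumptions made for the sketch; the pretrained autoencoders used later may differ in architecture (for example tied weights or a different bias handling).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 16, 64  # hypothetical sizes; the dictionary is wider than the activations

# Randomly initialised parameters (a real SAE would learn these).
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    """Map activations to non-negative feature activations via a ReLU."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    """Map feature activations back to the activation space."""
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

x = rng.normal(size=(4, d_model))  # stand-in for hidden layer activations
features = encode(x)
loss = sae_loss(x)
print(features.shape, loss)
```

Training would minimise `sae_loss` over a large set of activations; the L1 term is what drives most feature activations to zero on any given input.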
I had previously looked into how to use hidden layer activations for classification, steering and removal. I thought maybe sparse features could be better for these tasks as projecting features to a higher dimensional space can make them more easily linearly separable. Kind of like this (except sparser...):
Implementation
I use the Pythia models (70m and 410m) together with the pretrained autoencoders from this work.
As the models are not super capable, I use a very simple classification task. I take data from the IMDB review dataset and filter for relatively short reviews.
To push the model towards classifying the review I apply a formatting prompt to each movie review:
I encode the data and get the hidden representations for the last token (this contains the information of the whole sentence as I'm using left padding).
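The last-token selection can be sketched as follows, using plain NumPy arrays in place of real model outputs. The function name `last_token_states` and the toy shapes are illustrative; the point is only that with left padding the final position is a real token for every example, so a single index suffices.

```python
import numpy as np

def last_token_states(hidden, attention_mask):
    """Select the hidden state at the final position of each sequence.

    hidden: array of shape (batch, seq_len, d_model).
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding.
    """
    # With left padding, the final position must be a real token in every row.
    assert attention_mask[:, -1].all(), "expected left-padded inputs"
    return hidden[:, -1, :]

# Toy batch: 2 sequences, 4 positions, 3 hidden dimensions.
hidden = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
attention_mask = np.array([[0, 0, 1, 1],   # shorter review, padded on the left
                           [1, 1, 1, 1]])  # full-length review
last = last_token_states(hidden, attention_mask)
print(last)
```

With right padding, the same index would pick up padding positions for the shorter sequences, and one would have to index by each sequence's length instead.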
I train a logistic regression classifier and test it on the test set, to get some values for comparison.
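For readers who want to see the shape of this baseline, here is a minimal, self-contained sketch. It fits logistic regression with plain gradient descent on synthetic data rather than on real hidden states, and it is not necessarily the implementation used in the linked code; `train_logreg`, `accuracy`, and all hyperparameters are assumptions of the sketch.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, n_steps=500):
    """Fit logistic regression weights with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y = 1)
        w -= lr * (X.T @ (p - y) / len(y))
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(X, y, w, b):
    """Fraction of examples whose predicted class matches the label."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

# Synthetic stand-in for hidden states: the first coordinate carries the label.
rng = np.random.default_rng(0)
n, d = 200, 8
y = rng.integers(0, 2, size=2 * n)
X = rng.normal(size=(2 * n, d))
X[:, 0] += 3.0 * (2 * y - 1)

X_train, y_train = X[:n], y[:n]
X_test, y_test = X[n:], y[n:]

w, b = train_logreg(X_train, y_train)
train_acc = accuracy(X_train, y_train, w, b)
test_acc = accuracy(X_test, y_test, w, b)
print(train_acc, test_acc)
```

Reporting both train and test accuracy makes it easier to tell whether a weak result comes from poor generalisation or from the representation not containing enough information to fit the training set.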
I then apply the autoencoders to the hidden states (each layer has its own autoencoder):
Results
Reconstruction error
I don't technically need the decoded states, but I wanted to do a sanity check first. I was a bit surprised by the large reconstruction error. Here are the mean squared errors and cosine similarities for Pythia-70m and Pythia-410m for different layers:
However @Logan Riggs confirmed the MSE matched their results.
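For reference, the two sanity-check metrics can be computed roughly as below. The averaging convention (MSE over all elements, cosine similarity averaged over examples) is an assumption of this sketch and may differ from the one used for the numbers in this post.

```python
import numpy as np

def reconstruction_metrics(original, reconstructed):
    """Return (mean squared error, mean per-example cosine similarity).

    Both inputs have shape (n_examples, d_model).
    """
    mse = float(np.mean((original - reconstructed) ** 2))
    dot = np.sum(original * reconstructed, axis=-1)
    norms = np.linalg.norm(original, axis=-1) * np.linalg.norm(reconstructed, axis=-1)
    cos = float(np.mean(dot / norms))
    return mse, cos

# Toy example: the second reconstruction has the right direction but half the norm.
original = np.array([[1.0, 0.0], [0.0, 2.0]])
reconstructed = np.array([[1.0, 0.0], [0.0, 1.0]])
mse, cos = reconstruction_metrics(original, reconstructed)
print(mse, cos)
```

The toy example shows why it helps to report both: a reconstruction can point in exactly the right direction (cosine similarity of 1) and still have a nonzero MSE because its norm is off.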
Test accuracy
So then I used the original hidden representations, and the encoded hidden representations respectively, to train logistic regression classifiers to differentiate between positive and negative reviews.
Here are the results for Pythia-70m and Pythia-410m[1] on the test set:
So the sparse encodings consistently under-perform compared to the original hidden states.
Conclusion/Confusion
I'm not quite sure how to further interpret these results.
Due to this negative result I did not further explore steering or removal with sparse representations.
Thanks to @Hoagy and @Logan Riggs for answering some questions I had and for pointing me to relevant code and pre-trained models.
[1] I could not consistently load the same configuration for all layers, which is why I only got results for a few layers.