This is my best attempt at explaining what is going on here.

What is Sparse Coding?

  • This is a method for extracting a set of basis features (or dictionary elements) from data, under the assumption that the data can be represented as a sparse linear combination of these features. The sparsity assumption means that we assume only a small number of features are active at any one time, which makes it a good candidate for decoding the superposition in neural networks. One common sparse coding method is using a single-layer autoencoder with an L1 penalty on the hidden layer activations to encourage sparsity.
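
To make the "single-layer autoencoder with an L1 penalty" idea concrete, here is a minimal sketch of what such a model could look like. The layer sizes, the ReLU on the hidden layer, and the l1_coeff value are my own illustrative choices rather than details taken from the post.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """One hidden layer; its decoder columns act as the dictionary elements."""
    def __init__(self, d_input, d_hidden):
        super().__init__()
        self.encoder = nn.Linear(d_input, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_input, bias=False)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))   # feature activations, kept non-negative
        return self.decoder(codes), codes

def sparse_coding_loss(x, x_hat, codes, l1_coeff=1e-3):
    # reconstruction error plus an L1 penalty that pushes most activations to zero
    return ((x - x_hat) ** 2).mean() + l1_coeff * codes.abs().mean()

# Toy usage: an overcomplete dictionary (more features than activation dimensions)
sae = SparseAutoencoder(d_input=512, d_hidden=2048)
acts = torch.randn(64, 512)                  # stand-in for a batch of model activations
x_hat, codes = sae(acts)
loss = sparse_coding_loss(acts, x_hat, codes)
```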

Why do Sparse Coding?

  • In an overcomplete basis, there are more basis vectors (features) than dimensions in the data, and sparsity means that only a few of these vectors are used at any given time to represent a data point. By enforcing sparsity, we're encouraging the model to represent each data point using only a small number of these basis vectors (features). This effectively encourages the model to find the most salient features for each data point. It's a way of saying "you have a lot of potential features to choose from, but you can only use a few of them, so choose wisely". This combination of an overcomplete basis and sparsity can lead to the discovery of more meaningful, interpretable features in the data. Because the model can only use a few features at a time, it's incentivised to find and use the features that best represent the underlying structure in the data.
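
To see what "overcomplete plus sparse" means in the smallest possible case, here is a toy numerical illustration (all vectors and numbers are invented): four dictionary vectors in a two-dimensional space form an overcomplete basis, and a data point can either be spread across all of them or represented by just the one that matches it.

```python
import numpy as np

# Four dictionary vectors in a 2-D space: an overcomplete basis
D = np.array([[1.0, 0.0],     # feature 0: "x-direction"
              [0.0, 1.0],     # feature 1: "y-direction"
              [0.7, 0.7],     # feature 2: "diagonal"
              [-1.0, 0.0]])   # feature 3: "negative x"

x = np.array([0.7, 0.7])      # a data point

# Minimum-norm least-squares solution: uses every feature a little (dense)
dense_code = np.linalg.lstsq(D.T, x, rcond=None)[0]

# Sparse code: only the single matching feature is active
sparse_code = np.array([0.0, 0.0, 1.0, 0.0])

print(dense_code)
print(D.T @ sparse_code)      # reconstructs x exactly with one active feature
```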

This work by Hoagy builds on the paper 'Language models can explain neurons in language models' by OpenAI. I was initially confused about what that paper was trying to do, and maybe other people share this confusion: I thought they were telling GPT-4 when a neuron activates in some text and then trying to generate new text with the subject model using only those activated neurons.


What is really happening is:

  • They show GPT-4 example input-activation pairs for a neuron in the subject model
  • GPT-4 generates an explanation of when that neuron activates
  • On new text, GPT-4 predicts activations for that neuron based solely on its explanation
  • They compare GPT-4's predicted activations to the real activations from the subject model
  • This evaluates how well GPT-4's explanation captures the actual activation patterns (a rough pseudocode sketch of this loop follows below)
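
In rough pseudocode, that loop looks something like the sketch below. The objects and method names (feature.activations, explainer.explain, explainer.simulate) are placeholders of my own, not the paper's actual API; the point is only the shape of the procedure.

```python
import numpy as np

def interpretability_score(feature, subject_model, explainer, explain_texts, score_texts):
    """Hedged sketch of the explanation/simulation loop; every helper here is hypothetical."""
    # 1. Record the feature's real activations on some example texts
    examples = [(t, feature.activations(subject_model, t)) for t in explain_texts]

    # 2. The explainer model (GPT-4 in the paper) writes a natural-language explanation
    explanation = explainer.explain(examples)

    scores = []
    for t in score_texts:
        # 3. On new text, the explainer predicts activations from the explanation alone
        predicted = explainer.simulate(explanation, t)
        real = feature.activations(subject_model, t)
        # 4. Compare predicted vs. real activations
        scores.append(np.corrcoef(predicted, real)[0, 1])

    # A high average score means the explanation captures the activation pattern well
    return float(np.mean(scores))
```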

What is Automatic Interpretation?

  • Automatic interpretation is the pipeline described above, applied to arbitrary features rather than only to individual neurons. A "feature" here is any characteristic of the data that is represented by a single scalar computed from the network's internal activity (activations); the pipeline scores how well a GPT-4 explanation of that scalar predicts its values on new text.

What does this post attempt to do?

  • Hoagy applies this automatic interpretation pipeline to features found via sparse coding on the subject model's activations, and compares their interpretability scores against those of alternative feature sets such as the neuron basis, PCA, ICA, and random directions.

Results:

MLP Results:

  • They found that sparse coding outperforms the other techniques at defining features. For example, one of the features they identified related to financial symbols such as the dollar sign, and this feature performed well in the tests.
  • They compared sparse coding with other methods like PCA and ICA. PCA and ICA were found not to perform better than individual neuron activations.
  • They noted that their features often had a negative bias, which made them more selective, cutting out a lot of noise from the data (a small illustration of this thresholding effect follows after this list).
  • To understand the effect of this bias, they tested three additional baselines; these showed that adding a bias to the neuron basis or to random directions did not improve them.
  • They also found that the bias reduced noise in features, making them more interpretable. However, because they only selected random sequences with at least some activation, it was harder to distinguish between top fragments and random fragments.
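
The effect of the negative bias can be seen with a tiny numerical example (the direction and bias values below are invented purely for illustration): reading a feature off as ReLU(w·x + b) with b < 0 zeroes out weak, incidental projections along w, so the feature only fires when an input aligns strongly with its direction.

```python
import torch

w = torch.tensor([1.0, 0.0])   # the feature's direction (invented)
b = -0.5                       # negative bias acting as a threshold

for x in (torch.tensor([0.3, 0.2]),    # weak alignment: relu(0.3 - 0.5) = 0
          torch.tensor([1.2, 0.1])):   # strong alignment: relu(1.2 - 0.5) = 0.7
    print(torch.relu(w @ x + b).item())
```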


Residual Stream Results:

  • When they applied the same methodology to the residual stream, they found that their features didn't outperform the neuron basis.
  • However, when they selected almost identical features across different dictionaries, they saw an improvement.
  • They noticed that the scoring approach could be quite misleading. When they split the top and random fragments into separate scores, the learnt features outperformed all baselines for both top and random fragments, even though they did not when the scores were combined (see the toy illustration after this list).
  • They found that the residual stream seemed to have continually high activations in a certain direction across an entire fragment, and the explanations found by GPT-4 were enough to distinguish high-scoring fragments from random fragments but not to explain the variation within the fragments.
  • They note that the directions on which different dictionaries most strongly agree tend to have simple meanings and to be very common, which is presumably why they are learned so precisely.
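
Here is a toy numerical illustration (all numbers invented) of why the combined score can mislead in this situation: if real activations sit around a high mean throughout top fragments and near zero in random fragments, then a prediction that only captures each fragment's mean looks very good once the two groups are pooled, even though it explains none of the variation within either group.

```python
import numpy as np

rng = np.random.default_rng(0)

# Real activations: high mean across top fragments, near-zero mean in random fragments
top_real = 5.0 + rng.normal(size=64)
rand_real = rng.normal(size=64)

# Predictions that capture only the fragment-level mean, with uninformative noise on top
top_pred = 5.0 + rng.normal(size=64)
rand_pred = rng.normal(size=64)

print(np.corrcoef(top_real, top_pred)[0, 1])    # ~0: no within-fragment signal
print(np.corrcoef(rand_real, rand_pred)[0, 1])  # ~0: no within-fragment signal

pooled_real = np.concatenate([top_real, rand_real])
pooled_pred = np.concatenate([top_pred, rand_pred])
print(np.corrcoef(pooled_real, pooled_pred)[0, 1])  # high, driven by the top-vs-random gap
```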
