Ali Shehper — LessWrong

Sparse Autoencoders Work on Attention Layer Outputs

Ali Shehper2y10

I see. Thanks for the clarification!

Sparse Autoencoders Work on Attention Layer Outputs

Ali Shehper2y10

This could also be the reason behind the issue mentioned in footnote 5.

Sparse Autoencoders Work on Attention Layer Outputs

Ali Shehper2y10

"Since the feature activation is just the dot product (plus encoder bias) of the concatenated z vector and the corresponding column of the encoder matrix, we can rewrite this as the sum of n_heads dot products, allowing us to look at the direct contribution from each head."

Nice work. But I have one comment.

The feature activation is the output of ReLU applied to this dot product plus the encoder bias, and ReLU is a non-linear function. The pre-activation does split into a sum over heads, but the post-ReLU activation need not, so it is not clear that we can find the contribution of each head to the feature activation.
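To make the concern concrete, here is a minimal sketch in plain Python. The sizes `n_heads` and `d_head`, the random weights, the bias `b_enc`, and the helper `per_head_contributions` are all made up for illustration and are not taken from the post. It checks that the pre-activation is exactly a sum of per-head dot products plus the bias, and then shows with a two-head toy case that applying ReLU head by head does not add up to the actual activation.

```python
import random


def relu(x):
    """Standard ReLU on a scalar."""
    return max(0.0, x)


def per_head_contributions(z_heads, w_heads):
    """Dot product of each head's slice of z with the matching slice
    of the encoder column for one feature (hypothetical helper)."""
    return [
        sum(z * w for z, w in zip(z_h, w_h))
        for z_h, w_h in zip(z_heads, w_heads)
    ]


random.seed(0)
n_heads, d_head = 4, 8  # assumed sizes, for illustration only
z_heads = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(n_heads)]
w_heads = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(n_heads)]
b_enc = -0.5  # assumed encoder bias for this feature

contribs = per_head_contributions(z_heads, w_heads)

# Same quantity computed on the concatenated vectors.
z_cat = [z for head in z_heads for z in head]
w_cat = [w for head in w_heads for w in head]
pre_act = sum(z * w for z, w in zip(z_cat, w_cat)) + b_enc
activation = relu(pre_act)

# The pre-activation is additive over heads (up to float rounding).
assert abs(pre_act - (sum(contribs) + b_enc)) < 1e-9

# A hand-picked two-head case where ReLU breaks additivity.
toy_contribs = [3.0, -2.0]
toy_bias = -0.5
toy_activation = relu(sum(toy_contribs) + toy_bias)  # relu(0.5)
sum_of_relus = sum(relu(c) for c in toy_contribs)    # relu(3) + relu(-2)
print(toy_activation, sum_of_relus)
```

In the toy case the true activation is 0.5, while summing per-head ReLUs gives 3.0. On inputs where the pre-activation is positive, ReLU acts as the identity, so there the activation equals the sum of the per-head terms plus the bias; how to attribute the bias and the inactive case is left open by this sketch.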

Sparse autoencoders find composed features in small toy models

Ali Shehper2y42

Hi Evan, thank you for the explanation, and sorry for the late reply.

I think that the inability to learn the original basis is tied to the properties of the SAE training dataset (and won't be solved by supplementing SAEs with additional terms in their loss function). I think it's because we could have generated the same dataset with a different choice of basis (though I have neither tried formalizing the argument nor run any experiments).

I also want to say that perhaps not being able to learn the original basis is not so bad after all. As long as we can rep...

Sparse autoencoders find composed features in small toy models

Ali Shehper2y97

Hey guys, great post and great work!

I have a comment, though. For concreteness, let me focus on the case of (x_2, y_1) composition of features. This corresponds to feature vectors of the form A[0, 1, 1, 0] in the case of correlated feature amplitudes and [0, a, b, 0] in the case of uncorrelated feature amplitudes. Note that the plane spanned by x_2 and y_1 admits an infinite family of orthogonal bases; one of which, for example, is [0, 1, 1, 0] and [0, 1, -1, 0]. When we train a Toy Model of Superposition, we plot the projection of our choice of feat...
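A small sketch of the "infinite family of orthogonal bases" point, in plain Python. It assumes x_2 and y_1 are the standard basis vectors [0, 1, 0, 0] and [0, 0, 1, 0], which is my reading of the vectors quoted above; `rotated_basis`, `coords`, and `reconstruct` are hypothetical helpers. Rotating the pair by any angle gives another orthonormal basis of the same plane, and any vector [0, a, b, 0] is represented exactly in each of them.

```python
import math

# Assumed: x_2 and y_1 as standard basis vectors in the 4-d feature space.
x2 = [0.0, 1.0, 0.0, 0.0]
y1 = [0.0, 0.0, 1.0, 0.0]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rotated_basis(theta):
    """Orthonormal basis of span{x_2, y_1}, rotated by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    e1 = [c * a + s * b for a, b in zip(x2, y1)]
    e2 = [-s * a + c * b for a, b in zip(x2, y1)]
    return e1, e2


def coords(v, basis):
    """Coordinates of v in an orthonormal basis of the plane."""
    return [dot(v, e) for e in basis]


def reconstruct(c, basis):
    """Rebuild the vector from its coordinates in the given basis."""
    return [sum(ci * e[i] for ci, e in zip(c, basis)) for i in range(4)]


# theta = -pi/4 gives [0, 1, -1, 0] and [0, 1, 1, 0], each scaled by 1/sqrt(2).
e1, e2 = rotated_basis(-math.pi / 4)

v = [0.0, 0.3, 0.8, 0.0]  # an example vector of the form [0, a, b, 0]
for theta in (0.0, -math.pi / 4, 1.0, 2.5):
    basis = rotated_basis(theta)
    assert abs(dot(*basis)) < 1e-12  # the two basis vectors are orthogonal
    v_hat = reconstruct(coords(v, basis), basis)
    assert all(abs(a - b) < 1e-12 for a, b in zip(v, v_hat))
```

Nothing in the reconstruction itself singles out the theta = 0 basis; any preference among these bases has to come from elsewhere, for example from how the coordinates are distributed in the data.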