Lee Sharkey

20

We would have loved to see more motivation for why you are making the assumptions you are making when generating the toy data.

Relatedly, it would be great to see an analysis of the distribution of the MLP activations. This could give you some info where your assumptions in the toy model fall short.

This is valid; they're not well fleshed out above. I'll take a stab at it here below, and I discussed it a bit with Ryan below his comment. Meta-q: Are you primarily asking for better assumptions or that they be made more explicit?

RE MLP activations distribution: Good idea! One reason I didn't really want to make too many assumptions that were specific to MLPs was that we should in theory be able to apply sparse coding to residual stream activations too. But looking closely at the distribution that you're trying to model is, generally speaking, a good idea :) We'll probably do that for the next round of experiments if we continue along this avenue.

As Charlie Steiner pointed out, you are using a very favorable ratio of in the toy model , i.e. of number of ground truth features to encoding dimension. I would expect you will mostly get antipodal pairs in that setup, rather than strongly interfering superposition. This may contribute significantly to the mismatch.

I hadn't previously considered the importance of 'strongly interfering' superposition. But that's clearly the right regime for real networks and probably does explain a lot about the mismatch. Thanks for highlighting this!

For the MMCS plots, we would be interested in seeing the distribution/histogram of MCS values. Especially for ~middling MCS values, where it's not clear if all features are somewhat represented or some are a lot and some not at all.

Agree that this would be interesting! Trenton has had some ideas for metrics that better capture this notion, I think.

While we don't think this has a big impact compared to the other potential mismatches between toy model and the MLP, we do wonder whether the model has the parameters/data/training steps it needs to develop superposition of clean features.

e.g. in the toy models report, Elhage et al. reported phase transitions of superposition over the course of training

Undertrained autoencoders is something that worries me too, especially for experiments that use larger dictionaries (They take longer to converge). In the next phase, this is definitely something we'd want to ensure/study in the next phase.

10

Thanks!

20

This equation describes (almost) linear regression on a particular feature space :

This approximation isn't obvious to me. It holds if $ f(x, \theta_0) \approx 0 $ and $ \theta_0 \approx 0 $, but these aren't stated. Are they true?

40

No, they exist in different spaces: Polytopes in our work are in activation space whereas in their work the polytopes are in the model weights (if I understand their work correctly).

40

Thanks for your interest in our post and your questions!

Correct me if I'm wrong, but it struck while reading this that you can think of a neural network as learning two things at once…

That seems right!

Can the functions and classes be decoupled? … Could you come up with some other scheme for choosing between a whole bunch of different linear transformations?

It seems possible to come up with other schemes that do this; it just doesn’t seem easy to come up with something that is competitive with neural nets. If I recall correctly, there’s work in previous decades (which I’m struggling to find right now, although it's easy to find similar more modern work e.g. https://pubmed.ncbi.nlm.nih.gov/23272922/ ) that builds a nonlinear dynamical system using N linear regions centred on N points. This work models a dynamical system, but there's no reason we can't just use the same principles for purely feedforward networks. The dynamics of the system are defined by whichever point the current state is closest to. The linear regions can have whatever dynamics you want. But then you’d have to store and look up N matrices, which isn’t great when N is large!

How much of the power of neural networks comes from their ability to learn to classify something into exponentially many different classes vs from the linear transformations that each class implements? Does this question even make sense?

I guess this depends on what you mean by ‘power’.

The 2^N different linear transformations can't be arbitrary. As mentioned in the post, there's the constraint that neighboring polytopes implement very similar transformations, because their weight matrices vary by just one row. What would be a reasonable way to measure this degree of constrained-ness?

I’m not sure! On the one hand, we could measure the dissimilarity of the transformations as the Frobenius norm (i.e. distance in matrix-space) of the difference matrix between linearized transformations on both sides of a polytope boundary. On the other hand, this difference can be arbitrarily large if the weights of our model are unbounded, because crossing some polytope boundaries might mean that a neuron with arbitrarily large weights turns on or off.

30

Thanks for your comment!

However, I don't really see how you'd easily extend the polytope formulation to activation functions that aren't piecewise linear, like tanh or logits, while the functional analysis perspective can handle that pretty easily. Your functions just become smoother.

Extending the polytope lens to activation functions such as sigmoids, softmax, or GELU is the subject of a paper by Baleistriero & Baraniuk (2018) https://arxiv.org/abs/1810.09274

In the case of GELU and some similar activation functions, you'd need to replace the binary spine-code vectors with vectors whose elements take values in (0, 1).

There's some further explanation in Appendix C!

In the functional analysis view, a "feature" is a description of a set of inputs that makes a particular element in a given layer's function space take activation values close to their maximum value. E.g., some linear combination of neurons in a layer is most activated by pictures of dog heads.

This, indeed, is the assumption we wish to relax.

But there's a lot more to know about a function f than what max({f(x) | x \in X}) is.

Agreed!

Scaling up some of the activations in a layer by a constant factor means you're increasing the norm of the corresponding functions, changing the principal component basis of the layer's function space. So it shouldn't be surprising if subsequent layers get messed up by that.

There are many lenses that let us see how unsurprising this experiment was, and this is another one! We only use this experiment to show that it's surprising when you view features as directions and don't qualify that view by invoking a distribution of activation magnitude where semantics is still valid (called a 'distribution of validity' in this post).

Ω130

For GPT2-small, we selected 6/1024 tokens in each sequence (evenly spaced apart and not including the first 100 tokens), and clustered on the entire MLP hidden dimension (4 * 768).

For InceptionV1, we clustered the vectors corresponding to all the channel dimensions for a single fixed spatial dimension (i.e. one example of size [n_channels] per image).

20

Thanks for your comment!

RE non-ReLU activation functions:

Extending the polytope lens to Swish or GELU activation functions is, fortunately, the subject of a paper by Baleistriero & Baraniuk (2018) https://arxiv.org/abs/1810.09274

We wrote a few sentence about this at the end of Appendix C:

"In summary - smooth activation functions must be represented with a probabilistic spline code rather than a one-hot binary code. The corresponding affine transformation at the input point is then a linear interpolation of the entire set of affine transformations, weighted by the input point’s probability of belonging to each region."

RE adversarial examples:

It certainly seems possible that adversarial examples might arise from polytopes far from the origin. My intuition for this is that some small norm perturbations will happen to be in directions that cross lots of polytope boundaries, which means that later activations will be in quite different directions. This is somewhat tautological, though, given the definition of polytope boundaries is literally defined by neurons turning on and off.

20

- Currently there are no plans to release the code because much of it relies on internal infrastructure. The theory straightforwardly extends to larger networks, but we’re currently not sure if there will be (further) practical hurdles there.
- Polytope boundaries do extend further out. The shell doesn’t imply that they stop; the shell simply seems to be a region that many boundaries tend to pass through.
- Thanks!

Very interesting to hear that you've been working on similar things! Excited to see results when they're ready.

RE synthetic data: I'm a bit less confident in this method of data generation after the feedback below (see Tom Lieberum's and Ryan Greenblatt's comments). It may lose some 'naturalness' compared with the way the encoder in the 'toy models of superposition' puts one-hot features in superposition. It's unclear whether that matters for the aims of this particular set of experiments, though.

RE metrics: It's interesting to hear about your alternative to the MMCS metric. Putting the scale in the feature coefficients rather than in the features themselves does make things intuitive!

RE Orthogonal initialization:

IIRC this actually did help things learn faster (but I could be misremembering that, I didn't make a note at that early stage). But if it does, I'm reasonably confident that it'll be possible to find even better initialization schemes that work well for these autoencoders. The PCA-like algorithm sounds like a good idea (curious to hear the details!); I'd been thinking of a few similar-sounding things like:

1) Initializing the autoencoder features using noised copies of the left singular values of the weight matrix of the layer that we're trying to interpret since these define the major axes of variation in the pre-activations, so might resemble the (post-activation) features. Also c.f. Beren and Sid's work 'The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable'. Or

2) If we expect the privileged basis hypothesis to apply, then initializing the autoencoder features with noised unit vectors might speed up learning.

Or other variations on those themes.