Coordinate-Free Interpretability Theory

Jacob_Hilton, 3y

The notion of a preferred (linear) transformation for interpretability has been called a "privileged basis" in the mechanistic interpretability literature. See for example Softmax Linear Units, where the idea is discussed at length.

In practice, the typical reason to expect a privileged basis is in fact SGD – or more precisely, the choice of architecture. Specifically, activation functions such as ReLU often privilege the standard basis. I would not generally expect the data or the initialization to privilege any basis beyond the start of the network or the start of training. The data may itself have a privileged basis, but this should be lost as soon as the first linear layer is reached. The initialization is usually Gaussian and hence isotropic anyway, but if it did have a privileged basis I would also expect this to be quickly lost without some other reason to hold onto it.
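A minimal sketch of why an elementwise ReLU privileges the standard basis: it commutes with permutations of the coordinates, but generally not with a rotation. The input vector, the 45-degree angle, and the helper names are illustrative assumptions, not anything from the post.

```python
import math

def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rotation(theta):
    """2D rotation matrix for an angle theta in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

x = [1.0, -1.0]                   # illustrative activation vector
swap = [[0.0, 1.0], [1.0, 0.0]]   # permutation of the two coordinates
rot = rotation(math.pi / 4)       # illustrative 45-degree rotation

# Permuting then applying ReLU equals applying ReLU then permuting.
perm_commutes = relu(matvec(swap, x)) == matvec(swap, relu(x))

# Rotating then applying ReLU differs from applying ReLU then rotating.
rot_then_relu = relu(matvec(rot, x))
relu_then_rot = matvec(rot, relu(x))
rot_commutes = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(rot_then_relu, relu_then_rot)
)

print("commutes with permutation:", perm_commutes)
print("commutes with rotation:", rot_commutes)
```

In this toy case the nonlinearity singles out the coordinate axes (up to relabeling), which is the sense in which the architecture, rather than the data or the initialization, supplies the basis.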

johnswentworth, 3y

Yeah, I'm familiar with privileged bases. Once we generalize to a whole privileged coordinate system, the ReLUs are no longer enough.

Isotropy of the initialization distribution still applies, but the key is that we only get to pick one rotation for the parameters, and that same rotation has to be used for all data points. That constraint is baked into the framing when thinking about privileged bases, but it has to be derived when thinking about privileged coordinate systems.

tailcalled, 3y

"The data may itself have a privileged basis, but this should be lost as soon as the first linear layer is reached."

Not totally lost if the layer is e.g. a convolutional layer, because while the pixels within the convolutional window can get arbitrarily scrambled, it is not possible for a convolutional layer to scramble things across different windows in different parts of the picture.

Jacob_Hilton, 3y

Agreed. Likewise, in a transformer, the token dimension should maintain some relationship with the input and output tokens. This is sometimes taken for granted, but it is a good example of the data preferring a coordinate system. My remark that you quoted only really applies to the channel dimension, across which layers typically scramble everything.

Maxwell Clarke, 3y

I think we can get additional information from the topological representation. We can look at the relationship between the different level sets under different cumulative probabilities, although this requires evaluating the model over the whole dataset.

Let's say we've trained a continuous normalizing flow model (these are equivalent to ordinary differential equations). These kinds of models require the input and output dimensionality to be the same, but we can narrow the model as the depth increases by directing many of those dimensions to isotropic Gaussian noise. I haven't trained any of these models before, so I don't know if this works in practice.

Here is an example of the topology of an input space. The data may be knotted or tangled, and includes noise. The contours show level sets.

The model projects the data into a high dimensionality, then projects it back down into an arbitrary basis, untangling knots in the process. (We can regularize the model to use the minimum number of dimensions by using an L1 activation loss.)

Lastly, we can view this topology as the Cartesian product of noise distributions and a hierarchical model. (I have some ideas for GAN losses that might be able to discover these directly.)

We can use topological structures like these as anchors. If a model is strong enough, they will correspond to real relationships between natural classes. This means that very similar structures will be present in different models. If these structures are large enough or heterogeneous enough, they may be unique, in which case we can use them to find transformations between (subspaces of) the latent spaces of two different models trained on similar data.

Zach Furman, 3y

Since nobody here has made the connection yet, I feel obliged to write something, late as I am.

To make the problem more tractable, suppose we restrict our set of coordinate changes to ones where the resulting functions can still (approximately) be written as a neural network. (These are usually called "reparameterizations.") This occurs when multiple neural networks implement (approximately) the same function; they're redundant. One trivial example of this is the invariance of ReLU networks to scaling one layer by a positive constant, and the next layer by the inverse of that constant.
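The rescaling symmetry can be checked directly. This is a sketch on a tiny two-layer ReLU network without biases; the weights, inputs, and the constant `c` are made up for illustration. It relies on the fact that ReLU(c·z) = c·ReLU(z) only when c > 0.

```python
import math

def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def network(w1, w2, x):
    """Two-layer ReLU network without biases: w2 @ relu(w1 @ x)."""
    return matvec(w2, relu(matvec(w1, x)))

# Illustrative weights (assumptions, not taken from any real model).
w1 = [[1.0, -2.0], [0.5, 3.0]]
w2 = [[2.0, -1.0]]

c = 4.0  # must be positive for the symmetry to hold
w1_scaled = [[c * a for a in row] for row in w1]
w2_scaled = [[a / c for a in row] for row in w2]

inputs = [[1.0, 2.0], [-3.0, 0.5], [0.25, -1.0]]
outputs = [network(w1, w2, x) for x in inputs]
outputs_scaled = [network(w1_scaled, w2_scaled, x) for x in inputs]

# Different parameters, same function on every input we tried.
same_function = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for out, out_s in zip(outputs, outputs_scaled)
    for a, b in zip(out, out_s)
)
print("same function after rescaling:", same_function)
```

The same argument works with a separate positive constant per hidden unit, which is the rescaling symmetry that appears in the identifiability results.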

Then, in the language of parametric statistics, this phenomenon has a name: non-identifiability! Lucky for us, there's a decent chunk of literature on identifiability in neural networks out there. At first glance, we have what seems like a somewhat disappointing result: ReLU networks are identifiable up to permutation and rescaling symmetries.

But there's a catch - this is only true except for a set of measure zero. (The other catch is that the results don't cover approximate symmetries.) This is important because there are reasons to suggest real neural networks are pushed close to this set during training. This set of measure zero corresponds to "reducible" or "degenerate" neural networks - those that can be expressed with fewer parameters. And hey, funny enough, aren't neural networks quite easily pruned?
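One simple kind of reducibility, as a sketch: a hidden unit whose outgoing weight is zero. Its incoming weights can then be changed arbitrarily without changing the function, which is extra non-identifiability beyond permutation and rescaling, and the unit can be pruned outright. The weights below are made up for illustration; other kinds of degeneracy (for example, duplicated units) are not shown.

```python
import math

def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def network(w1, w2, x):
    """Two-layer ReLU network without biases: w2 @ relu(w1 @ x)."""
    return matvec(w2, relu(matvec(w1, x)))

# The second hidden unit has zero outgoing weight (illustrative values).
w1 = [[1.0, -2.0], [0.5, 3.0]]
w2 = [[2.0, 0.0]]

# Arbitrary change to the dead unit's incoming weights.
w1_changed = [[1.0, -2.0], [-7.0, 0.1]]

# The same function written with one hidden unit instead of two.
w1_small = [[1.0, -2.0]]
w2_small = [[2.0]]

inputs = [[1.0, 2.0], [-3.0, 0.5], [0.25, -1.0]]

def agrees(wa1, wa2, wb1, wb2):
    """Check that two parameter settings give the same outputs on `inputs`."""
    return all(
        math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
        for x in inputs
        for a, b in zip(network(wa1, wa2, x), network(wb1, wb2, x))
    )

unchanged_by_edit = agrees(w1, w2, w1_changed, w2)
prunable = agrees(w1, w2, w1_small, w2_small)
print("dead unit's incoming weights are free:", unchanged_by_edit)
print("network can be pruned to one hidden unit:", prunable)
```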

In other parts of the literature, this problem has been phrased differently, under the framework of "structure-function symmetries" or "canonicalization." It's also often covered when discussing the concepts of "inverse stability" and "stable recovery." For more on this, including a review of the literature, I highly recommend Matthew Farrugia-Roberts' excellent master's thesis on the topic.

(Separately, I'm currently working on the issue of coordinate-free sparsity. I believe I have a solution to this - stay tuned, or reach out if interested.)

johnswentworth, 3y

That's a great connection which I had indeed not made, thanks! Strong-upvoted.

mtaran, 3y

No super detailed references that touch on exactly what you mention here, but https://transformer-circuits.pub/2021/framework/index.html does deal with some similar concepts with slightly different terminology. I'm sure you've seen it, though.

tailcalled, 3y

"One possible answer: the data. We’ve implicitly assumed that we can apply arbitrary coordinate transformations to the data, but that doesn’t necessarily make sense. Something like a stream of text or an image does have a bunch of meaningful structure in it (like e.g. nearby-ness of two pixels in an image) which would be lost under arbitrary transformations. So one natural next step is to allow coordinate preference to be inherited from the data. On the other hand, we’d be importing our own knowledge of structure in the data; really, we’d prefer to only use the knowledge learned by the net."

It's worth remembering that we often import our own knowledge of the data when designing the nets too; e.g. convolutional layers for image processing respect exactly this kind of locality.

Also, one piece of structure that will always be present in the data, regardless of whether it is images or text or whatever, is that it is separated into data points. So one could e.g. think about whether there's a low-dimensional summary or tangent space for each data point that describes the network's behavior on it in an interpretable way (though one difficulty with this is that networks are typically not robust, so even tiny changes could completely change the classification).

johnswentworth, 3y

Yup, I'm ideally hoping for a framework which automatically rediscovers any architectural features like that. For instance, one reason I think the parameter-sensitivity thing is promising is that it can automatically highlight architectural sparsity patterns, like e.g. the sort induced by convolutional layers.
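To make "architectural sparsity pattern" concrete, here is a sketch of a 1D convolution (no padding, stride 1) written out as an equivalent dense matrix. Since the layer is linear, this matrix is also the Jacobian of outputs with respect to inputs, and it is zero outside each output's window. The kernel values and input length are illustrative assumptions, and this is not an implementation of the parameter-sensitivity idea itself.

```python
def conv_as_dense(kernel, n):
    """Dense (n - k + 1) x n matrix equivalent to sliding `kernel` over n inputs."""
    k = len(kernel)
    rows = []
    for i in range(n - k + 1):
        row = [0.0] * n
        row[i:i + k] = kernel  # output i only sees inputs i .. i + k - 1
        rows.append(row)
    return rows

kernel = [1.0, -2.0, 1.0]  # illustrative kernel
dense = conv_as_dense(kernel, 6)

# Mark nonzero entries with "x" and structural zeros with ".".
pattern = ["".join("x" if w != 0.0 else "." for w in row) for row in dense]
for line in pattern:
    print(line)
```

The printed band shows which input-output pairs can interact at all, regardless of what the kernel learns.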

tailcalled, 3y

I think one major challenge with convolutions is that they are translation-invariant. It's not just an architectural sparsity pattern, the sparsity pattern also has a huge number of symmetries. But automatically discovering those symmetries seems difficult in general.

(And this gets even more difficult when the symmetries only make sense from a bigger picture view, e.g. as I recall Chris Olah discovered 3D symmetries based on perspective, like street going left vs right, but they weren't enforced architecturally.)
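The translation symmetry of a convolution (usually called translation equivariance) can be stated as: shifting the input and then convolving gives the same result as convolving and then shifting. A sketch with a circular 1D convolution, where the symmetry is exact; the kernel, signal, and helper names are illustrative assumptions.

```python
def circular_conv(kernel, x):
    """1D cross-correlation with wrap-around, so the output has len(x) entries."""
    n = len(x)
    return [
        sum(w * x[(i + j) % n] for j, w in enumerate(kernel))
        for i in range(n)
    ]

def shift(x, s):
    """Cyclically shift a list to the right by s positions."""
    s %= len(x)
    return x[-s:] + x[:-s]

kernel = [1.0, -2.0, 1.0]               # illustrative kernel
signal = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]  # illustrative input

# Equivariance: conv(shift(x)) == shift(conv(x)) for every shift.
equivariant = all(
    circular_conv(kernel, shift(signal, s)) == shift(circular_conv(kernel, signal), s)
    for s in range(len(signal))
)
print("translation-equivariant:", equivariant)
```

Here the symmetry group is known in advance (cyclic shifts), so checking it is easy; the hard part raised above is discovering such a group from the weights alone.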

the gears to ascension, 7mo

first and second images are now missing.

Lucius Bushnaq, 3y

Curious how looking at properties of the functions the $x_{i}$ embed through their activation patterns fits into this picture.

For example, take the L2 norms of the activations of all entries of $x_{i}$, averaged over some set of network inputs. The sum and product of those norms will both be coordinate independent.

In fact, we can go one step further, and form $\sum_{x_{0}} x_{i}(x_{0}) \, (x_{i}(x_{0}))^{T}$, the matrix of the L2 inner products of all the layer base elements with each other. The eigendecomposition of this matrix is also coordinate independent, up to degeneracy in the eigenvalues.

(This eigenbasis also sure looks like a uniquely determined basis to me)

You can think of these quantities as measures of the number of "unique" activation patterns and their "size" that exist in the layer.

In your framing, does this correspond to adding in topological information from all the previous layers, through the mapping $x_{i}(x_{0})$?

johnswentworth, 3y

"For example, take the L2 norms of the activations of all entries of $x_{i}$, averaged over some set of network inputs. The sum and product of those norms will both be coordinate independent."

That would be true if the only coordinate changes we consider are rotations. But the post is talking about much more general transformations than that - we're allowing not only general linear transformations (i.e. stretching in addition to rotations), but also nonlinear transformations (which is why ReLUs don't give a preferred coordinate system).
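A small numerical sketch of this distinction, using the matrix $\sum_{x_{0}} x_{i}(x_{0}) \, (x_{i}(x_{0}))^{T}$ for a made-up two-neuron layer: its eigenvalues survive a rotation of the activation space but not a stretch. The activation values, the angle, and the stretch factor are all illustrative assumptions.

```python
import math

def gram(points):
    """Sum over data points of the outer product x x^T, for 2D activations.

    Returns (a, b, d) for the symmetric matrix [[a, b], [b, d]].
    """
    a = sum(x * x for x, _ in points)
    b = sum(x * y for x, y in points)
    d = sum(y * y for _, y in points)
    return a, b, d

def eigenvalues(a, b, d):
    """Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, d]], ascending."""
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return mean - radius, mean + radius

def transform(m, points):
    """Apply the 2x2 matrix m to every activation vector."""
    return [
        (m[0][0] * x + m[0][1] * y, m[1][0] * x + m[1][1] * y)
        for x, y in points
    ]

# Illustrative activations of a two-neuron layer on four inputs.
acts = [(1.0, 0.5), (-2.0, 1.0), (0.5, 3.0), (1.5, -1.0)]

theta = 0.7
rot = [[math.cos(theta), -math.sin(theta)],
       [math.sin(theta), math.cos(theta)]]
stretch = [[3.0, 0.0], [0.0, 1.0]]

base = eigenvalues(*gram(acts))
rotated = eigenvalues(*gram(transform(rot, acts)))
stretched = eigenvalues(*gram(transform(stretch, acts)))

def close(p, q):
    """Compare two eigenvalue pairs up to floating-point error."""
    return all(math.isclose(u, v, rel_tol=1e-9) for u, v in zip(p, q))

print("invariant under rotation:", close(base, rotated))
print("invariant under stretching:", close(base, stretched))
```

Nonlinear coordinate changes, which the post also allows, are not covered by this check at all.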

Lucius Bushnaq, 3y

Ah, right, you did mention polar coordinates.

Hm, stretching seems handleable. How about also using the weight matrix, for example? Change into the eigenbasis above, then apply stretching to make all L2 norms size 1 or size 0. Then look at the weights, as stretching-and-rotation invariant quantifiers of connectedness?

Maybe doesn't make much sense when considering non-linear transformations though.

johnswentworth, 3y

I think that's the same as finding a low-rank decomposition, assuming I correctly understand what you're saying?

Lucius Bushnaq, 3y

Sai, who is a lot more topology-savvy than me, now suspects that there is indeed a connection between this norm approach and the topology of the intermediate set. We'll look into this.
