Hm, this makes me wonder:
So if I understand correctly: if you take an eigenvector of the Hessian with a large eigenvalue, then that corresponds to a feature the network has learned is important for its loss. But more specifically, it corresponds to parameters of the network (i.e. an axis in networkspace) that measure features which correlate in imagespace.
So the eigenvector doesn't give you the features directly in imagespace, it gives you the network parameters which "measure" the feature? I wonder if one could translate this to imagespace. Taking a stab at it: given an image $x$, the eigenvector is definitionally the parameter axis that measures the feature, so changes in $f(x,\Theta)$ along the eigenvector should be proportional to the feature's presence in $x$? Ergo, the inner product of the eigenvector with $\frac{df(x,\hat\Theta)}{d\Theta}$ should measure the extent to which $x$ exhibits the feature?
Not sure if this is useful, or relevant to your post. Maybe it's something I should experiment with.
So the eigenvector doesn't give you the features directly in imagespace, it gives you the network parameters which "measure" the feature?
Nope, you can straightforwardly read off the feature in imagespace, I think. Remember, the eigenvector doesn't just show you which parameters "form" the feature through linear combination, it also shows you exactly what that linear combination is. If your eigenvector is (2,0,-3), that means the feature in image space looks like taking twice the activations of the node connected to the first parameter, plus -3 times the activations of the node connected to the third.
Ergo, the inner product of the eigenvector with $\frac{df(x,\hat\Theta)}{d\Theta}$ should measure the extent to which $x$ exhibits the feature?
We're planning to test the connection between the orthogonal features and the actual training data through something similar to this actually, yes. See this comment and the math by John it's replying to.
Hmm, I suppose in the single-linear-layer case, your way of transferring it to imagespace is equivalent to mine, whereas in the multi-nonlinear-layer case, I am not sure which generalization is the most appropriate.
Your way of doing it basically approximates the network to first order in the parameter changes/second order in the loss function. That's the same as the method I'm proposing above really, except you're changing the features to account for the chain rule acting on the layers in front of them. You're effectively transforming the network into an equivalent one that has a single linear layer, with the entries of $\frac{df(x,\hat\Theta)}{d\Theta}$ as the features.
That's fine to do when you're near a global optimum, the case discussed in the main body of this post, and for tiny changes it'll hold even generally, but for a broader understanding of the dynamics layer by layer, I think insisting on the transformation to imagespace might not be so productive.
Note that imagespace ≠ thing that is interpretable. You can recognise a dog head detector fine just by looking at its activations, no need to transpose it into imagespace somehow.
(Wait, I say "imagespace" due to thinking too much about image classifiers as the canonical example of a neural network, but of course other inputs can be given to the NN too.)
(And it further seems like one could identify the feature in pixelspace by taking the gradient of this inner product with respect to the pixels? Might be useful for interpretability? Not sure.)
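To make that concrete, here's a tiny sketch using the single-input example network from the post, $f(x) = \Theta_0 + \Theta_1 x_1 + \Theta_2\cos(x_1)$. The eigenvector $v$ and the use of $v \cdot \frac{df}{d\Theta}$ as the "feature score" are just illustrative assumptions on my part:

```python
import numpy as np

# Toy illustration: score how strongly an input exhibits the feature picked
# out by a (made-up) Hessian eigenvector v, for the post's example network
# f(x) = Theta_0 + Theta_1 * x_1 + Theta_2 * cos(x_1).
v = np.array([0.2, 0.9, -0.4])   # hypothetical eigenvector in parameter space

def grad_theta_f(x1):
    # df/dTheta for the toy network; for this network it doesn't depend on Theta
    return np.array([1.0, x1, np.cos(x1)])

def feature_score(x1):
    # proposed measure of how much the input exhibits the feature
    return v @ grad_theta_f(x1)

def feature_in_inputspace(x1):
    # gradient of the score with respect to the input, worked out by hand here
    return v[1] - v[2] * np.sin(x1)

print(feature_score(0.7), feature_in_inputspace(0.7))
```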
Two features are orthogonal if their norm is zero
Just as a side note about terminology: it is a bit imprecise that you use inner product and norm interchangeably.
The inner product is the function $\langle f, g \rangle = \sum_x f(x)g(x)$, and the norm is $\|f\| = \sqrt{\langle f, f \rangle}$.
I wrote out the Hessian computation in a comment to one of Vivek's posts. I actually had a few concerns with his version, and I could be wrong, but I also think that there are some issues here. (My notation is slightly different because for me the sum over $x$ was included in the loss function, but it doesn't affect my main point.)
I think the most concrete thing is that the function $f(x,\Theta)$ - i.e. the 'input-output' function of a neural network - should in general have a vector output, but you write things like
without any further explanation or indices. In your main computation it seems like it's being treated as a scalar.
Since we never change the labels or the dataset, one can drop the explicit dependence on $y$ from our notation for the loss. Then if the network has $n$ neurons in the final layer, the codomain of the function $f$ is $\mathbb{R}^n$ (unless I've misunderstood what you are doing?). So to my mind we have $L(\Theta) = \sum_x l(f(x,\Theta))$, with $l: \mathbb{R}^n \to \mathbb{R}$.
Going through the computation in full using the chain rule (and a local minimum of the loss function $L$), one would get something like:
$$\frac{d^2 L(\hat\Theta)}{d\Theta_j d\Theta_k} = \sum_x \sum_{a,b} \frac{\partial^2 l}{\partial f_a \partial f_b}\,\frac{df_a(x,\hat\Theta)}{d\Theta_j}\,\frac{df_b(x,\hat\Theta)}{d\Theta_k}$$
Vivek wanted to suppose that the Hessian of the loss, $\frac{\partial^2 l}{\partial f_a \partial f_b}$, were equal to the identity matrix, or a multiple thereof, which is the case for mean squared loss. But without such an assumption, I don't think that the term
$$\sum_x \frac{df(x,\hat\Theta)}{d\Theta_j}\,\frac{df(x,\hat\Theta)}{d\Theta_k}$$
appears (this is the matrix you describe as containing "the inner products of the features over the training data set.")
Another (probably more important but higher-level) issue is basically: What is your definition of 'feature'? I could say: Have you not essentially just defined 'feature' to be something like 'an entry of $\frac{df}{d\Theta}$'? Is the example not too contrived in the sense that it clearly supposes that $f$ has a very special form (in particular it is linear in the variables $\Theta_i$, so that the derivatives are not functions of $\Theta$)?
In your main computation it seems like it's being treated as a scalar.
It's an example computation for a network with scalar outputs, yes. The math should stay the same for multi-dimensional outputs though. You should just get higher dimensional tensors instead of matrices.
Vivek wanted to suppose that the Hessian of the loss, $\frac{\partial^2 l}{\partial f_a \partial f_b}$, were equal to the identity matrix, or a multiple thereof, which is the case for mean squared loss.
In theory, a loss function that explicitly depends on network parameters would behave differently than is assumed in this derivation, yes. But that's not how standard loss functions usually work. If a loss function did have terms like that, you should indeed get out somewhat different results.
But that seems like a thing to deal with later to me, once we've worked out the behaviour for really simple cases more.
Another (probably more important but higher-level) issue is basically: What is your definition of 'feature'? I could say: Have you not essentially just defined 'feature' to be something like 'an entry of $\frac{df}{d\Theta}$'? Is the example not too contrived in the sense that it clearly supposes that $f$ has a very special form (in particular it is linear in the variables $\Theta_i$, so that the derivatives are not functions of $\Theta$)?
A feature to me is the same kind of thing it is to e.g. Chris Olah. It's the function mapping network input to the activations of some neurons, or linear combination of neurons, in the network.
I'm not assuming that the function $f$ is linear in $\Theta$. If it was, this whole thing wouldn't just be an approximation within second order Taylor expansion distance, it'd hold everywhere.
In multi-layer networks, what the behavioural gradient is showing you is essentially what the network would look like if you approximated it for very small parameter changes, as one big linear layer. You're calculating how the effects of changes to weights in previous layers "propagate through" with the chain rule to change what the corresponding feature would "look like" if it was in the final layer.
Obviously, that can't be quite the right way to do things outside this narrow context of interpreting the meaning of the basin near optima. Which is why we're going to try out building orthogonal sets layer by layer instead.
To be clear, none of this is a derivation showing that the L2 norm perspective is the right thing to do in any capacity. It's just a suggestive hint that it might be. We've been searching for the right definition of "feature independence" or "non-redundancy of computations" in neural networks for a while now, to get an elementary unit of neural networks that we can build our definition of computational modularity, and ultimately a whole theory of Deep Learning and network selection, on top of.
This stuff seems like a concrete problem where the math is pointing towards a particular class of operationalisations of these concepts. So I think it makes sense to try out this angle and see what happens.
It's an example computation for a network with scalar outputs, yes. The math should stay the same for multi-dimensional outputs though. You should just get higher dimensional tensors instead of matrices.
I'm sorry but the fact that it is scalar output isn't explained and a network with a single neuron in the final layer is not the norm. More importantly, I am trying to explain that I think the math does not stay the same in the case where the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified function. If the network has vector output, then right after where you say "The Hessian matrix for this network would be...", you don't get a factorization like that; you can't pull out the Hessian of the loss as a scalar, it instead acts in the way that I have written - like a bilinear form for the multiplication between the rows and columns of $\frac{df(x,\hat\Theta)}{d\Theta}$.
A feature to me is the same kind of thing it is to e.g. Chris Olah. It's the function mapping network input to the activations of some neurons, or linear combination of neurons, in the network.
I'm not assuming that the function $f$ is linear in $\Theta$. If it was, this whole thing wouldn't just be an approximation within second order Taylor expansion distance, it'd hold everywhere.
OK, maybe I'll try to avoid a debate about exactly what 'feature' means or means to different people, but in the example, you are clearly using $f(x) = \Theta_0 + \Theta_1 x_1 + \Theta_2\cos(x_1)$. This is a linear function of the variables $\Theta_i$. (I said "Is the example not too contrived....in particular it is linear in $\Theta$" - I'm not sure how we have misunderstood each other, perhaps you didn't realise I meant this example as opposed to the whole post in general). But what it means is that in the next line, when you write down the derivative with respect to $\Theta$, it is an unusually clean expression because it now doesn't depend on $\Theta$. So again, in the crucial equation right after you say "The Hessian matrix for this network would be...", you in general get $\Theta$ variables appearing in the matrix. It is just not as clean as this expression suggests in general.
I'm sorry but the fact that it is scalar output isn't explained and a network with a single neuron in the final layer is not the norm.
Fair enough, should probably add a footnote.
More importantly, I am trying to explain that I think the math does not stay the same in the case where the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified function. If the network has vector output, then right after where you say "The Hessian matrix for this network would be...", you don't get a factorization like that; you can't pull out the Hessian of the loss as a scalar, it instead acts in the way that I have written - like a bilinear form for the multiplication between the rows and columns of $\frac{df(x,\hat\Theta)}{d\Theta}$.
TL;DR
For neural networks trained to perfect loss, the broadness of optima in parameter space is given by the number and norm of independent orthogonal features the neural network has. The inner product that defines this "feature norm" and independence/orthogonality is the L2 inner product of Hilbert space.
Introduction - why do we care about broadness?
Recently, there's been some discussion of what determines the broadness of optima of neural networks in parameter space.
People care about this because the size of the optimum basin may influence how easy it is for gradient descent (or other local optimisation algorithms which you might use to train a neural network) to find the corresponding solution, since it's probably easier to stumble across a broader optimum's attractor.
People also care about this because there are some hypotheses floating around saying that the broadness of optima is correlated with their generality somehow. The broader the basin of your solution is in parameter space, the more likely it seems to be that the network will successfully generalise to new data in deployment. So understanding what kinds of circuit configurations are broad seems like it might help us grasp better how AIs learn things, and how an AGI might really work internally.
Meaning, understanding what kinds of solutions are broad is probably relevant to start developing general predictive theories of AI and AI training dynamics, which we'd like to have so we can answer questions like "What goals is my AGI going to develop, given that it has architecture A, and I'm training it with loss function L on data set D?"
Measuring broadness
Warning: calculus and linear algebra incoming!
Say we have a loss function $L(f,x,y)$ for a neural network $f(x,\Theta)$, with some data points $x$ and labels $y(x)$, and we've trained the network to perfect loss, setting the parameters to some set $\hat\Theta$.
$$L = \sum_x l(f(x,\Theta), y(x))$$
$$\frac{dL(\hat\Theta)}{d\Theta} = 0$$ [1]
Now, let's try to look at how much the loss changes when we make small changes to the parameters $\hat\Theta$. To do this, we're going to perform a polynomial expansion of $L$ up to quadratic order. This isn't going to describe the function's behaviour perfectly, but within a small neighbourhood of $\hat\Theta$, it'll be mostly accurate.
$$L(\Theta) - L(\hat\Theta) = \frac{dL(\hat\Theta)}{d\Theta}(\Theta - \hat\Theta) + (\Theta - \hat\Theta)^T \frac{d^2L(\hat\Theta)}{d^2\Theta}(\Theta - \hat\Theta) + O\big((\Theta - \hat\Theta)^3\big)$$
$$= (\Theta - \hat\Theta)^T \frac{d^2L(\hat\Theta)}{d^2\Theta}(\Theta - \hat\Theta) + O\big((\Theta - \hat\Theta)^3\big)$$
$$=: \delta\Theta^T \frac{d^2L(\hat\Theta)}{d^2\Theta}\,\delta\Theta + O\big(\delta\Theta^3\big)$$
The first order term vanished because $\frac{dL(\hat\Theta)}{d\Theta} = 0$ (since we're at an optimum). $\delta\Theta := (\Theta - \hat\Theta)$ is a vector with #params entries, specifying how much we're perturbing each parameter in the network from its value at the optimum $\hat\Theta$. $\frac{d^2L(\hat\Theta)}{d^2\Theta}$ is going to be a matrix with (#params, #params) entries.
So far so good. Now, if we wanted to work out how broad the basin is by looking at this matrix, how would we do it?
The matrix is filled with second derivatives. Or in other words, curvatures.
If you only had one parameter, this matrix would be a single number (the second derivative), telling you the curvature of the parabola you can fit to the optimum. And the curvature of a parabola determines its broadness: the closer the curvature is to zero, the broader the parabola, and with it the optimum, is.
If you have two parameters, you can imagine the optimum as a little valley in the loss function landscape. The valley has two curvatures that characterise it now, but those don't need to line up with our coordinate axes x and y. You could, for example, have one "principal direction" of curvature lie in the (1,1) direction of parameter space, and one in the (1,−1) direction.
To find these principal directions and their curvatures, you perform an eigendecomposition of the matrix. The two eigenvalues of the matrix will be the curvatures, and the two eigenvectors the principal directions in parameter space characterised by these curvatures. Multiplying the parabola widths we get from the two, we obtain the basin volume.
We can do this for the n-dimensional case as well. If some eigenvalues of the matrix are zero, that means that, at least within a small neighbourhood of the optimum, the loss doesn't change at all if you walk along the corresponding principal direction. So each such eigenvalue effectively increases the dimensionality of the optimum, from a point to a line, a line to an area, etc.
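As a minimal sketch of this procedure in code (with a made-up 3×3 Hessian standing in for $\frac{d^2L(\hat\Theta)}{d^2\Theta}$), you can read off the flat directions and a crude basin-size proxy straight from the eigendecomposition:

```python
import numpy as np

# Made-up Hessian of the loss at an optimum, standing in for d^2L/d^2Theta.
hessian = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 0.0],   # this parameter direction is perfectly flat
])

# Symmetric matrix, so use eigh: eigenvalues are the principal curvatures,
# eigenvectors are the principal directions in parameter space.
curvatures, directions = np.linalg.eigh(hessian)

flat = np.isclose(curvatures, 0.0, atol=1e-8)
print("principal curvatures:", curvatures)
print("number of flat directions:", flat.sum())

# Directions with larger curvature are narrower. A crude proxy for the basin
# size multiplies 1/sqrt(curvature) over the non-flat directions (the flat
# ones extend the optimum instead of bounding it).
print("basin size proxy:", np.prod(1.0 / np.sqrt(curvatures[~flat])))
```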
Now that's all fine and good, but it's not very interpretable. What makes the eigenvalues of this matrix be small or large for any given network? What determines how many of them are zero? What does all this have to do with interpretability? Let's find out!
Broadness and feature space
Let's explicitly work out what that matrix, which is also called the Hessian of the network, will look like, by calculating its entries in the $j$-th row and $k$-th column, meaning the second derivative of the loss function with respect to the $j$-th and $k$-th parameters of the network:
$$\frac{d^2L(\hat\Theta)}{d\Theta_j d\Theta_k} = \sum_x \frac{d^2 l(f(x,\hat\Theta),y(x))}{d\Theta_j d\Theta_k} = \sum_x \frac{d}{d\Theta_k}\left(\frac{dl(f(x,\hat\Theta),y(x))}{df}\frac{df(x,\hat\Theta)}{d\Theta_j}\right)$$
$$= \sum_x \left(\frac{d^2l(f(x,\hat\Theta),y(x))}{df^2}\frac{df(x,\hat\Theta)}{d\Theta_j}\frac{df(x,\hat\Theta)}{d\Theta_k} + \frac{dl(f(x,\hat\Theta),y(x))}{df}\frac{d^2f(x,\hat\Theta)}{d\Theta_j d\Theta_k}\right)$$
$$= \frac{d^2l(f(x,\hat\Theta),y(x))}{df^2}\sum_x\frac{df(x,\hat\Theta)}{d\Theta_j}\frac{df(x,\hat\Theta)}{d\Theta_k}$$
The second term vanishes again, because the loss function is required to be convex at optima, so $\frac{dl(f(x,\hat\Theta),y(x))}{df}=0$.[2]
Look at this expression for a bit. If you've done some quantum mechanics before, this might remind you of something.
Let's go through a small example. If you have a network that takes in a single floating point number $x_1$ from the input $x$, and maps it to an output using features $x_1$, $\cos(x_1)$ and a constant, as in $f(x) = \Theta_0 + \Theta_1 x_1 + \Theta_2\cos(x_1)$, the gradient of the output would look like:
$$\frac{df(x,\hat\Theta)}{d\Theta} = \begin{bmatrix} 1 \\ x_1 \\ \cos(x_1) \end{bmatrix}$$
So, each entry in the gradient corresponds to one feature of the network.
The Hessian matrix for this network would be
$$\frac{d^2L(\hat\Theta)}{d\Theta_j d\Theta_k} = \frac{d^2l(f(x,\hat\Theta),y(x))}{df^2}\begin{bmatrix} \sum_x 1 & \sum_x x_1 & \sum_x \cos(x_1) \\ \sum_x x_1 & \sum_x x_1^2 & \sum_x x_1\cos(x_1) \\ \sum_x \cos(x_1) & \sum_x \cos(x_1)x_1 & \sum_x \cos(x_1)^2 \end{bmatrix}$$
These are the L2 inner products of the features over the training data set. They tell you how close to orthogonal the features are with respect to each other, and how big they are.
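To sanity-check this, here's a sketch of the toy example in code. It assumes a squared-error loss $l(f,y) = (f-y)^2$ (so $\frac{d^2l}{df^2} = 2$) and a made-up dataset, with labels generated from the model itself so that the optimum really is at perfect loss; the Hessian obtained by finite differences should then match $2\times$ the matrix of feature inner products above:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(-3, 3, size=50)   # made-up 1D dataset

def features(x1):
    # the three features of f(x) = Theta_0 + Theta_1 * x_1 + Theta_2 * cos(x_1)
    return np.array([1.0, x1, np.cos(x1)])

# Matrix of L2 inner products of the features over the dataset
gram = sum(np.outer(features(x), features(x)) for x in xs)

# Generate the labels from the model itself, so theta_hat is a perfect-loss optimum.
theta_hat = np.array([0.5, -1.0, 2.0])
ys = np.array([theta_hat @ features(x) for x in xs])

def loss(theta):
    # summed squared-error loss, so d^2 l / df^2 = 2
    return sum((theta @ features(x) - y) ** 2 for x, y in zip(xs, ys))

def numerical_hessian(fn, theta, h=1e-4):
    # central finite differences for every pair of parameters
    n = len(theta)
    H = np.zeros((n, n))
    for j in range(n):
        for k in range(n):
            e_j, e_k = np.eye(n)[j] * h, np.eye(n)[k] * h
            H[j, k] = (fn(theta + e_j + e_k) - fn(theta + e_j - e_k)
                       - fn(theta - e_j + e_k) + fn(theta - e_j - e_k)) / (4 * h * h)
    return H

print(np.allclose(numerical_hessian(loss, theta_hat), 2 * gram, atol=1e-3))  # True
```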
The idea of the L2 inner product / norm and Hilbert space is that we treat the space of functions as a vector space, which can have basis vectors. The size and relative angles of these vectors are given by the L2 inner product: you multiply the functions in question, and sum the result over the function domain.
This gives us a well defined notion of what constitutes "orthogonal" features: Two features are orthogonal if their L2 norm is zero. Intuitively, this means they're separate, non-interacting components of the network mapping.
This generalises to arbitrary numbers of features in arbitrary numbers of layers, though features in earlier layers enter multiplied with derivatives of activation functions, since we're characterising their effect on the final layer.
Each bias parameter is treated as multiplying a constant feature of 1. The Hessian of the network at the optimum thus contains the L2 inner products of all the features in the network with each other.
So, back to our second order approximation of the loss function. When we find the eigendecomposition of our Hessian matrix, that corresponds to shifting into a basis in parameter space where all off-diagonals of the Hessian are zero. So, the eigenvectors of the Hessian correspond to linear combinations of network features that are orthogonal (in terms of L2 norm) to each other.
Remember, the bigger the eigenvalues, which measure the (squared) L2 norms of these feature combinations, the bigger the curvature of the loss function is in that direction of parameter space, and the narrower the basin will be.
Eigenvalues of zero mean the network actually had fewer independent, orthogonal features inside it than it had parameters, giving you directions that are perfectly flat.
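Here's a quick numerical illustration of that, again for the toy example with a made-up dataset: combining the features with the eigenvector weights of the matrix of feature inner products gives combinations whose pairwise L2 inner products vanish, with the eigenvalues left on the diagonal:

```python
import numpy as np

# Features of the toy network, evaluated on a made-up dataset.
rng = np.random.default_rng(0)
xs = rng.uniform(-3, 3, size=200)
phi = np.stack([np.ones_like(xs), xs, np.cos(xs)], axis=1)   # one row per datapoint

gram = phi.T @ phi                        # L2 inner products of the raw features
eigvals, eigvecs = np.linalg.eigh(gram)   # (the squared-loss Hessian is 2 * gram)

# Combine the features according to the eigenvectors: the resulting feature
# combinations are mutually orthogonal over the dataset, with their squared
# L2 norms given by the eigenvalues.
combined = phi @ eigvecs
print(np.round(combined.T @ combined, 6))   # diagonal matrix
print(np.round(eigvals, 6))                 # matches the diagonal
```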
So: the broadness of the basin is determined by the number and size of independent features in the network. At least within second order polynomial approximation range.
That sure sounds like it's pointing out the precise mathematical notion of network "complexity", "fine-tuning" or "generality" that's relevant for the basin broadness.
It also neatly quantifies and extends the formalism found by Vivek in this post. If a set of $n$ features all depend on different independent parts of the input, they are going to have an orthogonal basis of size $n$. If a set of $n$ features depends on the same parts of the input, they are a lot more likely to span a subspace of dimension less than $n$ in Hilbert space, meaning more zero eigenvalues. So, information loss about the input tends to lead to basin broadness.
Some potential implications
If we leave second order approximation range, features in different layers of the network no longer interact linearly, so you can't just straightforwardly treat them as one big set of Hilbert space basis vectors.
But features in the same layer can still be treated like this, and orthogonalised by taking the vector of neuron activations in the layer times its transpose, summed over the training dataset, to form a matrix of L2 inner products like the one we got from the Hessian. Then, finding the eigenbasis of that matrix corresponds to finding a basis of linear combinations of features in the layer that are mutually orthogonal. Once the bases of two adjacent layers are found, the weight matrix connecting the two layers can also be transformed into the new basis.
This might give us a new perspective on the network: a basis in which each element in a layer might stand for one "independent" feature in that layer, instead of a neuron. Cancellations between the signals coming from different nodes could no longer occur. The connection weights might tell you exactly how much the network "cares" about the signals coming from each element.
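As a rough sketch of what that could look like in code (with made-up activations and weights; the function and variable names here are ours, not anything standard):

```python
import numpy as np

def orthogonal_feature_basis(acts):
    """Eigenbasis of the matrix of L2 inner products of a layer's activations.

    acts has shape (n_datapoints, n_neurons); the columns of the returned basis
    are the orthogonal feature combinations, the eigenvalues their squared norms."""
    gram = acts.T @ acts                  # activations times transpose, summed over data
    eigvals, eigvecs = np.linalg.eigh(gram)
    return eigvals, eigvecs

def transform_weights(W, basis_in, basis_out):
    """Express the weight matrix connecting two layers in their new feature bases."""
    return basis_out.T @ W @ basis_in

# Made-up example: an 8-neuron layer feeding a 5-neuron ReLU layer.
rng = np.random.default_rng(0)
acts_l = rng.normal(size=(1000, 8))       # recorded activations of layer l
W = rng.normal(size=(5, 8))               # weights from layer l to layer l+1
acts_lp1 = np.maximum(acts_l @ W.T, 0.0)  # recorded activations of layer l+1

_, basis_l = orthogonal_feature_basis(acts_l)
norms_lp1, basis_lp1 = orthogonal_feature_basis(acts_lp1)
W_new = transform_weights(W, basis_l, basis_lp1)

print("feature norms (eigenvalues) in layer l+1:", np.round(norms_lp1, 1))
print("weights in the orthogonal feature bases:\n", np.round(W_new, 2))
```

In a real network you'd record the activations over the actual training set rather than random data, and the interesting question is what the transformed weights end up looking like there.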
This stands in contrast to the usual method (where we treat the neurons as the fundamental unit of the network, and use measures from network theory to try and measure clustering among neuron groups). As we've pointed out before, how big the weights between neurons are doesn't really indicate how much signal is flowing through their connection, making this a non-optimal strategy for interpreting your network.
This makes us suspect that the orthogonal feature basis of the layers might be close to the right framing to use, if we want to quantify how many "independent" calculations a neural network performs, and how it uses these calculations to make new calculations. In other words, this formalisation might provide the basis for the fundamentally motivated modularity measure we're after.
How might we test this hypothesis?
In our previous post, we discussed the problem of polysemanticity: when a single neuron seems to represent multiple different concepts which aren't naturally bucketed for humans. This problem suggests that individual neurons might not be the best unit of analysis, and that we might do better to use features as the basic unit with which to formalise general theories of neural networks. A natural question follows: if we try out graphing the "feature flows" in neural networks with this, do polysemantic neurons mostly disappear, with every orthogonalised feature meaning one understandable thing?
We plan to find out soon, by experimenting with this change of basis and related ones in small neural networks. If you're interested in helping out with these experiments, or you have any other suggestions for useful experiments to run to test out these ideas, please let us know by commenting below, or sending a private LW message.
Note that we've written this with a sum (and have continued this convention for the rest of the post), but the formalisation would work equally well with an integral. Additionally, for the rest of the post we're assuming f is a scalar to make the math a little less complicated, but the formalism should hold for other output types too.
How this works out when you're not at a perfect optimum and this term stays around is something I'm unsure of, and it's the reason we proposed experiment 8 in the experiment list on LW.
$\frac{dL}{d\Theta}$ is zero because we required a perfect loss score, and loss functions are required to be convex.