A Primer on Matrix Calculus, Part 2: Jacobians and other fun

Matthew Barnett

I started this post thinking that I would write all the rules for evaluating Jacobians of neural network parameters in specific cases. But while this would certainly be useful for grokking deep learning papers, frankly it's difficult to write that in Latex and the people who have written The Matrix Calculus You Need For Deep Learning paper have already done it much better than I can do.

Rather, I consider my comparative advantage here to provide some expansion on why we should use Jacobians in the first place. If you were to just read the paper above, you might start to think that Jacobians are just notational perks. I hope to convince you that they are much more than that. In at least one setting, Jacobians provide a mathematical framework for analyzing the input-output behavior of deep neural networks, which can help us see things which we might have missed without this framework. A specific case of this phenomenon is a recently discovered technique which was even more recently put into a practical implementation: Jacobian regularization. Here we will see some fruits of our matrix calculus labor.

Deep learning techniques require us to train a neural network by slowly modifying parameters of some function until the function begins returning something close to the intended output. These parameters are often represented in the form of matrices. There are a few reasons for this representation: the matrix form is compact, and it allows us to use the tools of linear algebra directly. Matrix computations can also be processed in parallel, and this standardization allows programmers to build efficient libraries for the training of deep neural networks.

One quite important matrix in deep learning is the Jacobian.

In one sense, the Jacobian matrix is just a way of organizing gradient vectors. Gradient vectors, in turn, are just ways of organizing partial derivatives of an expression. Therefore, the Jacobian matrix is just a big matrix which allows us to organize a lot of partial derivatives. Formally, the Jacobian of $f : R^{n} \to R^{m}$ is defined by the following matrix. We denote $f_{i} (x)$ as the mapping from $R^{n} \to R_{i}$ , where $R_{i}$ is the real number line in the $i$ th coordinate of the output vector $R^{m}$ . Then the Jacobian is simply

\begin{matrix} ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣ \begin{matrix} \nabla f_{1}^{⊺} \nabla f_{2}^{⊺} ⋮ \nabla f_{m}^{⊺} \end{matrix} ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ \end{matrix}

But describing the Jacobian as just some big rectangular box of partial derivatives of vector-valued functions hides the intuition for why Jacobians are important. To truly grasp the Jacobian, we will first build up to it by imagining a function that maps between two real coordinate spaces: $f : R^{n} \to R^{m}$ . If $n = m$ , this function has a natural interpretation. We can see it by considering the case where $n = m = 2$ .

Now imagine stretching, warping, and contracting the plane so that the points are moved around in some manner. This means that every point is mapped to some other point on the graph. For instance, (2,5) could be mapped to (1,3). Crucially, make sure this transformation doesn't cause any sort of sharp discontinuity: we don't want the graph to rip. I am not good with illustrating that type of thing, so I encourage you to imagine it instead in your head, or alternatively watch this video.

There is a special set of such mappings, which we call linear transformations. Linear transformations have the property that when we perform the stretching, the gridlines are kept straight. We could still rotate the graph, for instance. But what we can't do is bend some axis so that it takes on a curvy shape after the transformation. If we drew a line on the graph before the linear transformation, it must remain a straight line after the transformation.

What does this have to do with the Jacobian? To see we first must ask why derivatives are useful. In the most basic sense, the derivative is getting at trying to answer the questoin, "What is the local behavior of this function?"

First, if you will consider by analogy, we could ask for the local behavior of some differentiable function $f : R \to R$ . This would be a line whose slope is provided by the derivative. Similarly we could ask for the local behavior of some multivariate function $f : R^{n} \to R$ , which would be a hyperplane whose direction is determined by the gradient. Now when we ask what the local behavior of some vector-valued function $f : R^{n} \to R^{m}$ is, we get a linear transformation described by the Jacobian.

In the above illustration, the Jacobian is evaluated at the point in the bottom left corner of the red tiles. The linear transformation implied by the Jacobian is represented by the translucent square after the function is applied, which is a rotation transformation some angle clockwise. As we can see, while $f$ is some extra curvy function, the Jacobian can approximate the local behavior at a point quite well.

To see how this visualization is useful, consider the case of applying a Jacobian to a neural network. We could be designing a simple neural network to predict the output of two variables, representing perhaps normalized class probabilities, given an input of three variables, representing perhaps input pixel data. We now illustrate the neural network.

This neural networks implements a particular function from $R^{3} \to R^{2}$ . However, the exact function that is being implemented depends crucially on the parameters, here denoted by the connections between the nodes. If we compute the Jacobian of this neural network with respect to the input, at some input instance, we would end up getting a good idea of how the neural network changes within the neighborhood of that particular input.

One way we can gain insight from a Jacobian is by computing its determinant. Recall, a determinant is a function from square matrices to scalars which is defined recursively as the alternating sum and subtraction of determinants of the minors of the matrix multiplied by elements in the top row. On second thought, don't recall that definition of determinant; that's not going to get you anywhere. Despite the determinant's opaque definition, we can gain deeper insight into what the determinant represents by instead viewing it geometrically. In a few words, the determinant computes the scaling factor for a given linear transformation of a matrix.

Above, I have pulled from Wikimedia a parallelepiped, which was formed from some linear mapping of a cube. The volume of this parallelepiped is some multiple of the volume of the cube before the transformation. It turns out that no matter which region of space we look at, this linear transformation will generate the same ratio from post-transformed regions to pre-transformed regions. This ratio is given by the determinant of the matrix representing the linear mapping. In other words, the determinant tells us how much some transformation is expanding or contracting space.

What this means is for the Jacobian is that the determinant tells us how much space is being squished or expanded in the neighborhood around a point. If the output space is being expanded a lot at some input point, then this means that the neural network is a bit unstable at that region, since minor alterations in the input could cause huge distortions in the output. By contrast, if the determinant is small, then some small change to the input will hardly make a difference to the output.

This very fact about the Jacobian is behind a recent development in the regularization of deep neural networks. The idea is that we could use this interpretation of the Jacobian as a measure of robustness to input-perturbations around a point to make neural networks more robust off their training distribution. Traditional approaches like $L^{2}$ regularization have emphasized the idea of keeping some parameters of the neural network from wandering off into extreme regions. The idea here is that smaller parameters are more likely a priori, which motivates the construction of some type of penalty on parameters that are too large.

In contrast to $L^{2}$ regularization, the conceptual framing of Jacobian regularization comes from a different place. Instead of holding a leash on some parameters, to keep them from wandering off into the abyss, Jacobian regularization emphasizes providing robustness to small changes in the input space. The motivation behind this approach is clear to anyone who has been paying attention to adversarial examples over the last few years. To explain, adversarial examples are cases where we provide instances of a neural network where it performs very poorly, even if it had initially done well on a non-adversarial test set. Consider this example, provided by OpenAI.

The first image was correctly identified as a panda by the neural network. However, when we added a tiny bit of noise to the image, the neural network spit out garbage, confidently classifying a nearly exact copy as a gibbon. One could imagine a hypothetical adversary using this exploit to defeat neural network systems in practice. In the context of AI safety, adversarial attacks constitutes a potentially important subproblem of system reliability.

In Jacobian regularization, we approach this issue by putting a penalty on the size of the entries in the Jacobian matrix. The idea is simple: the smaller the values of the matrix, the less that tiny perturbations in input-space will affect the output. Concretely, the regularizer is described by taking the frobenius norm of the Jacobian matrix, $| | J (x) | |_{F}^{2}$ . The frobenius norm is nothing complicated, and is really just a way of describing that we square all of the elements in the matrix, take the sum, and then take the square root of this sum. Put another way, if we imagine concatenating all the gradient vectors which compose the Jacobian, the frobenius norm is just describing the $L^{2}$ penalty of this concatenated vector.

Importantly, this technique is subtly different from taking the $L^{2}$ norm over the parameters. In the case of a machine learning algorithm with no linearity, this penalty does however reduce to $L^{2}$ regularization. Why? Because when we take the Jacobian of a purely affine function, we obtain the global information about how the function stretches and rotates space, excluding the translation offset. This global information precisely composes the parameters that we would be penalizing. It is theoretically similar to how if we take the derivative of a line, we can reconstruct the line from the derivative and a bias term.

If while reading the last few paragraphs, you starting thinking how is this just now being discovered? you share my thoughts exactly. As far as I can tell, the seeds of Jacobian regularization have existed since at least the 1990s. However, it took until 2016 for a team to create a full implementation. Only recently, as I write this in August 2019, has a team of researchers claimed to have discovered an efficient algorithm for applying this regularization penalty to neural networks.

The way that researchers created this new method is by using random projections to approximate the Frobenius norm. Whereas the prior approach mentioned random projections, it was never put into practice. The new paper succeeded by devising an algorithm to approximate Jacobians efficiently with minimal overhead cost.

How efficiently? The paper states that there is

only a negligible difference in model solution quality between training with the exact computation of the Jacobian as compared to training with the approximate algorithm, even when using a single random projection.

If this technique really works as its described, this is a significant result. The paper claims that by applying Jacobian regularization, training uses only 30 percent more computation compared to traditional stochastic gradient descent without regularization at all. And for all that, we get some nice benefits: the system was significantly more robust to a PGD attack, and it was apparently much better than vanilla $L^{2}$ regularization due to the distance between decision cells in the output space.

I recommend looking at the paper for more details.

[-]AlexMennen5y50

This isn't quite true; the determinant being small is consistent with small changes in input making arbitrarily large changes in output, just so long as small changes in input in a different direction make sufficiently small changes in output.

The frobenius norm is nothing complicated, and is really just a way of describing that we square all of the elements in the matrix, take the sum, and then take the square root of this sum.

An alternative definition of the frobenius norm better highlights its connection to the motivation of regularizing the Jacobian frobenius in terms of limiting the extent to which small changes in input can cause large changes in output: The frobenius norm of a matrix J is the root-mean-square of |J(x)| over all unit vectors x.

[-]Matthew Barnett5y10

Hmm, good point. I suppose why that's not why we're minimizing determinant, but rather frobenius norm. Hence:

An alternative definition of the frobenius norm better highlights its connection to the motivation of regularizing the Jacobian frobenius

Makes sense.

[-]AlexMennen5y30

I suppose why that's not why we're minimizing determinant, but rather frobenius norm.

Yes, although another reason is that the determinant is only defined if the input and output spaces have the same dimension, which they typically don't.

[-]Pattern5y20

The final set of images looks a bit like someone zooming in on a map*. (The blue part of the first image looks like the head of a cat.)

*ETA: specifically the yellow region. (Not because it's small.)

I'm not sure if you're referring to the fact that it is small. If so: apologies. At the time of posting there was (still is?) a bug prohibiting me from resizing images on posts. My understanding is that this is being fixed.

Also yeah, zooming in would be good I think because that means that it's robust to changes (ie. it's going to classify it correctly even if we add noise to the output). I think it isn't actually zooming in, it's just that the decision basin for the input is getting larger.

LESSWRONG
LW

A Primer on Matrix Calculus, Part 2: Jacobians and other fun

22

22