A Primer on Matrix Calculus, Part 3: The Chain Rule

Matthew Barnett

This post concludes the subsequence on matrix calculus. Here, I will focus on an exploration of the chain rule as it's used for training neural networks. I initially planned to include Hessians, but perhaps for that we will have to wait.

Deep learning has two parts: deep and learning. The deep part refers to the fact that we are composing simple functions to form a complex function. In other words, in order to perform a task, we are mapping some input $x$ to an output $y$ using some long nested expression, like $y = f_{1} (f_{2} (f_{3} (x)))$ . The learning part refers to the fact that we are allowing the properties of the function to be set automatically via an iterative process like gradient descent.

Conceptually, combining these two parts is easy. What's hard is making the whole thing efficient so that we can get our neural networks to actually train on real-world data. That's where backpropagation enters the picture.

Backpropagation is simply a technique to train neural networks by efficiently using the chain rule to calculate the partial derivatives of the loss with respect to each parameter. However, backpropagation is notoriously a pain to deal with. These days, modern deep learning libraries provide tools for automatic differentiation, which allow the computer to perform this calculus automatically in the background. While this might be great for practitioners of deep learning, here we primarily want to understand the notation as it would be written on paper. $^{1}$ Plus, if we were writing our own library, we'd want to know what's happening in the background.

What I have discovered is that, despite my initial fear of backpropagation, it is actually pretty simple to follow if you just understand the notation. Unfortunately, the notation can get a bit difficult to deal with (and was a pain to write out in LaTeX).

We start by describing the single variable chain rule. This is simply $\frac{d}{d x} f (g (x)) = f' (g (x)) g' (x)$ . But written this way, the notation is opaque and hides which variables we are taking the derivative with respect to. Alternatively we can write the rule in a way that makes it more obvious what we are doing: $\frac{d}{d x} f (g (x)) = \frac{d f}{d g} \frac{d g}{d x}$ , where $g$ is meant as shorthand for $g (x)$ . This way it is intuitively clear that the $d g$ terms cancel, as they would in a product of ordinary fractions, and this reduces to $\frac{d f}{d x}$ , as desired.
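To make the single variable rule concrete, here is a minimal numerical sketch. The choices $f(u) = \sin(u)$ and $g(x) = x^{2}$, and every function name below, are my own illustrative assumptions and not part of the derivation above. The code multiplies $\frac{d f}{d g}$ by $\frac{d g}{d x}$ and compares the result with a finite-difference estimate.

```python
import math

# Illustrative (assumed) functions: f(u) = sin(u), g(x) = x^2.
def f(u):
    return math.sin(u)

def g(x):
    return x ** 2

def df_dg(u):
    # Derivative of f with respect to its input.
    return math.cos(u)

def dg_dx(x):
    # Derivative of g with respect to x.
    return 2 * x

def chain_rule_derivative(x):
    # df/dx = (df/dg evaluated at g(x)) * (dg/dx evaluated at x)
    return df_dg(g(x)) * dg_dx(x)

def finite_difference(x, h=1e-6):
    # Central difference estimate of d/dx f(g(x)).
    return (f(g(x + h)) - f(g(x - h))) / (2 * h)

x0 = 1.3
analytic = chain_rule_derivative(x0)
numeric = finite_difference(x0)
print(analytic, numeric)
```

The two printed numbers should agree to several decimal places, which is all the "cancelling fractions" picture is really promising.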

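It turns out that for functions $f : R^{n} \to R^{m}$ and $g : R^{k} \to R^{n}$ , the chain rule can be written as $\frac{\partial}{\partial x} f (g (x)) = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x}$ where $\frac{\partial f}{\partial g}$ is the Jacobian of $f$ with respect to $g$ , an $m \times n$ matrix, and $\frac{\partial g}{\partial x}$ is the $n \times k$ Jacobian of $g$ with respect to $x$ . Their product is the $m \times k$ Jacobian of the composition.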

Isn't that neat. Our understanding of Jacobians has now well paid off. Not only do we have an intuitive understanding of the Jacobian, we can now formulate the vector chain rule using a compact notation — one that matches the single variable case perfectly. $^{2}$
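As a rough check of the vector chain rule, the sketch below builds two small functions and multiplies their Jacobians. The particular functions ($g(x) = \tanh(A x)$ and $f(u) = B u^{2}$ with an element-wise square), the matrices `A` and `B`, and the helper `numerical_jacobian` are all assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, m = 4, 3, 2                      # g : R^k -> R^n, f : R^n -> R^m
A = rng.normal(size=(n, k))            # assumed weights for g
B = rng.normal(size=(m, n))            # assumed weights for f

def g(x):
    return np.tanh(A @ x)

def f(u):
    return B @ (u ** 2)

def jacobian_g(x):
    # d tanh(Ax)/dx = diag(1 - tanh(Ax)^2) A, shape (n, k)
    return (1.0 - np.tanh(A @ x) ** 2)[:, None] * A

def jacobian_f(u):
    # d (B u^2)/du = B diag(2u), shape (m, n)
    return B * (2.0 * u)[None, :]

def numerical_jacobian(func, x, h=1e-6):
    # Central differences, one input coordinate at a time.
    y = func(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (func(x + step) - func(x - step)) / (2 * h)
    return J

x = rng.normal(size=k)
J_chain = jacobian_f(g(x)) @ jacobian_g(x)            # (m, n) @ (n, k) -> (m, k)
J_numeric = numerical_jacobian(lambda v: f(g(v)), x)
print(J_chain.shape, np.allclose(J_chain, J_numeric, atol=1e-6))
```

Note that the Jacobian of $f$ is evaluated at $g(x)$, not at $x$; that is the part the compact notation hides.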

However, in order to truly understand backpropagation, we must go beyond mere Jacobians. In order to work with neural networks, we need to introduce the generalized Jacobian. If the Jacobian from yesterday was spooky enough already, I recommend reading no further. Alternatively if you want to be able to truly understand how to train a neural network, read at your own peril.

First, a vector can be seen as a list of numbers, and a matrix can be seen as an ordered list of vectors. An ordered list of matrices is... a tensor of order $3$ . Well not exactly. Apparently some people are actually disappointed with the term tensor because a tensor means something very specific in mathematics already and isn't just an ordered list of matrices. $^{3}$ But whatever, that's the term we're using for this blog post at least.

As you can probably guess, a list of tensors of order $n$ is a tensor of order $n + 1$ . We can simply represent tensors in code using multidimensional arrays. In the case of the Jacobian, we were taking the derivative of functions between two vector spaces, $R^{n}$ and $R^{m}$ . When we are considering mapping from a space of tensors of order $n$ to a space of tensors of order $m$ , we denote the relationship $y = f (x)$ as between the spaces $R^{(M_{1} \times M_{2} \times \dots \times M_{n})} \to R^{(N_{1} \times N_{2} \times \dots \times N_{m})}$ .
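A small sketch of the "list of tensors" idea, assuming NumPy arrays as the multidimensional arrays in question. The variable names and values are arbitrary; the point is only that stacking tensors of order $n$ produces one of order $n + 1$.

```python
import numpy as np

vector = np.array([1.0, 2.0, 3.0])                   # order 1, shape (3,)
matrix = np.stack([vector, 2.0 * vector])            # list of vectors: order 2, shape (2, 3)
tensor3 = np.stack([matrix + k for k in range(4)])   # list of matrices: order 3, shape (4, 2, 3)

for name, t in [("vector", vector), ("matrix", matrix), ("tensor3", tensor3)]:
    # ndim is the order of the tensor; shape lists the size along each axis.
    print(name, "order:", t.ndim, "shape:", t.shape)
```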

The generalized Jacobian $J$ between these two spaces is an object with shape $(N_{1} \times N_{2} \times \dots \times N_{m}) \times (M_{1} \times M_{2} \times \dots \times M_{n})$ , with the output shape first so that it agrees with the ordinary Jacobian. We can think of this object as a generalization of the matrix, where each row is a tensor with the same shape as the tensor $x$ and each column has the same shape as the tensor $y$ . The intuitive way to understand the generalized Jacobian is that we can index $J$ with vectors $\vec{i}$ and $\vec{j}$ . At each index in $J$ we find the partial derivative $\frac{\partial y_{\vec{i}}}{\partial x_{\vec{j}}}$ between the variables $y_{\vec{i}}$ and $x_{\vec{j}}$ , which are scalar variables located in the tensors $y$ and $x$ .
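Here is a hedged sketch of a generalized Jacobian computed by finite differences. I am assuming the layout where the output indices come first and the input indices second, and the example function $f(x) = x x^{\top}$ on a $2 \times 3$ matrix is just something convenient to differentiate; `generalized_jacobian` is a hypothetical helper, not a library function.

```python
import numpy as np

def f(x):
    # Maps a (2, 3) matrix to a (2, 2) matrix.
    return x @ x.T

def generalized_jacobian(func, x, h=1e-6):
    # Result has shape y.shape + x.shape (output indices first, an assumed layout).
    y = func(x)
    J = np.zeros(y.shape + x.shape)
    for j in np.ndindex(*x.shape):
        step = np.zeros_like(x)
        step[j] = h
        # J[..., j] holds dy/dx_j for every output index at once.
        J[(Ellipsis,) + j] = (func(x + step) - func(x - step)) / (2 * h)
    return J

x = np.arange(6, dtype=float).reshape(2, 3)
J = generalized_jacobian(f, x)
print(J.shape)           # an order 4 object: two output axes, two input axes
print(J[0, 1, 0, 2])     # d y[0, 1] / d x[0, 2], which analytically equals x[1, 2]
```

Indexing `J[0, 1, 0, 2]` is exactly the "index with two vectors" picture: the first pair locates a scalar in $y$ and the second pair locates a scalar in $x$.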

Formulating the chain rule using the generalized Jacobian yields the same equation as before: for $z = f (y)$ and $y = g (x)$ , $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$ . The only difference this time is that $\frac{\partial z}{\partial x}$ has the shape $(K_{1} \times \dots \times K_{D_{z}}) \times (M_{1} \times \dots \times M_{D_{x}})$ , which is itself the result of a generalized matrix multiplication between the two generalized matrices, $\frac{\partial z}{\partial y}$ and $\frac{\partial y}{\partial x}$ . The rules for this generalized matrix multiplication are similar to those of regular matrix multiplication, and are given by the formula:

$$\left(\frac{\partial z}{\partial x}\right)_{i, j} = \sum_{k} \left(\frac{\partial z}{\partial y}\right)_{i, k} \left(\frac{\partial y}{\partial x}\right)_{k, j}$$

However, where this differs from matrix multiplication is that $i, j, k$ are vectors which specify the location of variables within a tensor, so the sum runs over every position $k$ in the tensor $y$ .
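The generalized matrix multiplication can be sketched with `np.tensordot`, which contracts the trailing axes of one array against the leading axes of another. The shapes below are arbitrary assumptions ($z$ of shape $(2)$, $y$ of shape $(3 \times 4)$, $x$ of shape $(5)$), the arrays are random stand-ins for real Jacobians, and `generalized_matmul` is a hypothetical helper name.

```python
import numpy as np

def generalized_matmul(dz_dy, dy_dx, y_order):
    # Sum over the last y_order axes of dz_dy and the first y_order axes of dy_dx.
    return np.tensordot(dz_dy, dy_dx, axes=y_order)

rng = np.random.default_rng(0)
dz_dy = rng.normal(size=(2, 3, 4))     # shape of z, then shape of y
dy_dx = rng.normal(size=(3, 4, 5))     # shape of y, then shape of x
dz_dx = generalized_matmul(dz_dy, dy_dx, y_order=2)

# The same product written out as the explicit sum over the multi-index k.
explicit = np.zeros((2, 5))
for i in range(2):
    for j in range(5):
        for k in np.ndindex(3, 4):
            explicit[i, j] += dz_dy[(i,) + k] * dy_dx[k + (j,)]

print(dz_dx.shape, np.allclose(dz_dx, explicit))
```

When $y$ is an ordinary vector (`y_order=1`) and both factors are matrices, this reduces to the usual matrix product.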

Let's see if we can use this notation to perform backpropagation on a neural network. Consider a neural network defined by the following composition of simple functions: $f (x) = W_{2} (relu (W_{1} x + b_{1})) + b_{2}$ . Here, $relu$ describes the activation function of the first layer of the network, which is defined as the element-wise application of $relu (x) = max (x, 0)$ . There are a few parameters of this network: the weight matrices, and the biases. These parameters are the things that we are taking the derivative with respect to.

There is one more part to add before we can train this abstract network: a loss function. In our case, we are simply going to train the parameters with respect to the loss function $L (\hat{y}, y) = \| \hat{y} - y \|_{2}^{2}$ where $\hat{y}$ is the prediction made by the neural network, and $y$ is the vector of desired outputs. In full, we are taking $\frac{\partial}{\partial w} L (f (x), y)$ , for some weights $w$ , which include $W_{1}, W_{2}, b_{1}, b_{2}$ . Since this loss function is parameterized by a constant vector $y$ , we can henceforth treat the loss function as simply $L (f (x))$ .
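For reference, the network and loss can be written down in a few lines. This is a minimal sketch assuming NumPy, with tiny hand-picked parameter values (all of them my own assumptions) so the arithmetic is easy to follow by hand.

```python
import numpy as np

def relu(v):
    # Element-wise max(v, 0).
    return np.maximum(v, 0.0)

def network(x, W1, b1, W2, b2):
    # f(x) = W2 relu(W1 x + b1) + b2
    return W2 @ relu(W1 @ x + b1) + b2

def loss(y_hat, y):
    # Squared Euclidean distance between prediction and target.
    return np.sum((y_hat - y) ** 2)

# Hand-picked (assumed) values: 2 inputs, 2 hidden units, 1 output.
x = np.array([1.0, -2.0])
W1 = np.eye(2)
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.5])
y = np.array([1.0])

y_hat = network(x, W1, b1, W2, b2)
example_loss = loss(y_hat, y)
print(y_hat, example_loss)
```

With these values the hidden layer is $relu ([1, -2]) = [1, 0]$, so the prediction is $1.5$ and the loss is $0.25$.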

Ideally, we would not want to make this our loss function. That's because the true loss function should be over the entire dataset — it should take into account how good the predictions were for each sample that it was given. The way that I have described it only gave us the loss for a single prediction.

However, taking the loss over the entire dataset is too expensive and converges slowly. Alternatively, taking the loss over a single point (i.e., stochastic gradient descent) is also too slow because it doesn't allow us to take advantage of parallel hardware. So, actual practitioners use what's called mini-batch gradient descent, where the loss function is over some subset of the data. For simplicity, I will just show the stochastic gradient descent step.

For $\frac{\partial}{\partial b_{2}} L (f (x))$ we have $\frac{\partial}{\partial b_{2}} L (f (x)) = \frac{\partial L}{\partial f} \frac{\partial f}{\partial b_{2}}$ . From the above definition of $f$ , we can see that $\frac{\partial f}{\partial b_{2}} = I$ , where $I$ is the identity matrix. From here on I will simply assume that the partial derivatives are organized in some specific manner, but omit the details. The exact way it's written doesn't actually matter too much as long as you understand the shape of the Jacobian being represented.

We can now evaluate $\frac{\partial f}{\partial W_{2}}$ . Let $U$ be $(relu (W_{1} x + b_{1}))$ . Then computing the derivative $\frac{\partial f}{\partial W_{2}}$ comes down to finding the generalized Jacobian of $W_{2} U$ with respect to $W_{2}$ . I will illustrate what this generalized Jacobian would look like by building up from analogous, lower order derivatives. The derivative $\frac{d y}{d x}$ of $y = c x$ is $c$ . The gradient $\nabla_{x} c^{\top} x$ is $c$ . The Jacobian $J_{x}$ of $A x$ , for a matrix $A$ , is $A$ . We can therefore see that the generalized Jacobian $J_{W_{2}}$ of $W_{2} U$ will be some type of order 3 tensor which would look like a simple expression involving $U$ .
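To fill in that last step: since $f_{i} = \sum_{k} (W_{2})_{i k} U_{k} + (b_{2})_{i}$ , each entry of the order 3 tensor is $\frac{\partial f_{i}}{\partial (W_{2})_{j k}} = \delta_{i j} U_{k}$ . The sketch below builds that tensor, contracts it with $\frac{\partial L}{\partial f} = 2 (f (x) - y)$ , and checks the result against finite differences. The layer sizes, random values, and variable names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2        # assumed layer sizes
x = rng.normal(size=n_in)
y = rng.normal(size=n_out)
W1 = rng.normal(size=(n_hidden, n_in))
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(n_out, n_hidden))
b2 = rng.normal(size=n_out)

U = np.maximum(W1 @ x + b1, 0.0)       # U = relu(W1 x + b1)
f = W2 @ U + b2
dL_df = 2.0 * (f - y)                  # derivative of the squared error, shape (n_out,)

df_db2 = np.eye(n_out)                                  # identity matrix
df_dW2 = np.einsum("ij,k->ijk", np.eye(n_out), U)       # delta_ij * U_k, order 3

dL_db2 = dL_df @ df_db2                                 # shape (n_out,)
dL_dW2 = np.tensordot(dL_df, df_dW2, axes=1)            # shape (n_out, n_hidden)

# Finite-difference check of dL/dW2, holding U and b2 fixed.
def loss_given_W2(W2_candidate):
    return np.sum((W2_candidate @ U + b2 - y) ** 2)

h = 1e-6
numeric = np.zeros_like(W2)
for idx in np.ndindex(*W2.shape):
    step = np.zeros_like(W2)
    step[idx] = h
    numeric[idx] = (loss_given_W2(W2 + step) - loss_given_W2(W2 - step)) / (2 * h)

print(np.allclose(dL_dW2, numeric, atol=1e-5))
```

In practice nobody materializes the order 3 tensor: the contraction collapses to the outer product of $\frac{\partial L}{\partial f}$ and $U$ , which is why the dimensions "just line up" in most implementations.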

The derivatives for the rest of the weight matrices can be computed similarly to the derivatives I have indicated for $b_{2}$ and $W_{2}$ . We simply need to evaluate the terms later on in the chain $\frac{\partial L}{\partial f} \dots \frac{\partial v}{\partial W_{1}}$ where $v$ is shorthand for the function $v = W_{1} x$ .

We have, however, left out one crucial piece of information, which is how to calculate the derivative over the $relu$ function. To do that we simply separate the derivative into a piecewise function. When the input is less than zero, the derivative is $0$ . When the input is greater than zero, the derivative is $1$ . But since the function is not differentiable at $0$ , we just pretend that it is and make its derivative $0$ ; this doesn't cause any issues.

$$\frac{\partial}{\partial x} relu (x) = \begin{cases} 0 & x \leq 0 \\ 1 & x > 0 \end{cases}$$
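In code, the piecewise derivative is a single comparison. This is a sketch assuming NumPy, and `relu_derivative` and `relu_jacobian` are hypothetical helper names. Because $relu$ acts element-wise, its Jacobian with respect to a vector input is diagonal.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def relu_derivative(v):
    # 1 where the input is strictly positive, 0 elsewhere (including exactly 0).
    return (np.asarray(v) > 0).astype(float)

def relu_jacobian(v):
    # Element-wise function, so the Jacobian has the derivatives on its diagonal.
    return np.diag(relu_derivative(v))

v = np.array([-1.5, 0.0, 2.0])
print(relu_derivative(v))
print(relu_jacobian(v))
```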

This means that we are pretty much done, as long as you can fill in the details for computing the generalized Jacobians. The trickiest part in the code is simply making sure that all the dimensions line up. Now, once we have computed our derivatives, we can incorporate this information into some learning algorithm like Adam, and use this to update the parameters and continue training the network.
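Putting the pieces together, one possible sketch of a full gradient computation and parameter update looks like this. To keep it short I have swapped in a plain gradient descent step in place of Adam, and the generalized Jacobians are already collapsed into outer products and element-wise masks. The layer sizes, learning rate, and function names are all assumptions.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def forward(params, x):
    W1, b1, W2, b2 = params
    v = W1 @ x + b1                    # pre-activation of the first layer
    U = relu(v)
    return v, U, W2 @ U + b2

def loss(params, x, y):
    return np.sum((forward(params, x)[2] - y) ** 2)

def gradients(params, x, y):
    W1, b1, W2, b2 = params
    v, U, f = forward(params, x)
    dL_df = 2.0 * (f - y)              # dL/df
    dL_dW2 = np.outer(dL_df, U)        # dL/df contracted with df/dW2
    dL_db2 = dL_df                     # df/db2 is the identity
    dL_dU = W2.T @ dL_df               # df/dU is W2
    dL_dv = dL_dU * (v > 0)            # relu derivative, taken to be 0 at 0
    dL_dW1 = np.outer(dL_dv, x)        # same pattern as for W2
    dL_db1 = dL_dv
    return [dL_dW1, dL_db1, dL_dW2, dL_db2]

rng = np.random.default_rng(0)
params = [rng.normal(size=(4, 3)), rng.normal(size=4),
          rng.normal(size=(2, 4)), rng.normal(size=2)]
x = rng.normal(size=3)
y = rng.normal(size=2)

learning_rate = 0.01                   # assumed step size
grads = gradients(params, x, y)
# Plain gradient descent step (a stand-in for Adam): move against the gradient.
params_after = [p - learning_rate * g for p, g in zip(params, grads)]
print(loss(params, x, y), loss(params_after, x, y))
```

Each gradient has the same shape as the parameter it belongs to, which is a handy sanity check when the dimensions refuse to line up.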

There are, however, many ways that we can make the algorithm more efficient than a naive implementation would be. I will cover one method briefly.

We can start by taking into account the order in which we multiply the Jacobians. In particular, if we consider some chain $\frac{\partial L}{\partial f} \dots \frac{\partial v}{\partial W_{1}}$ , we can take advantage of the fact that tensor-tensor products are associative. Essentially, this means that we can start by computing the last derivative $\frac{\partial v}{\partial W_{1}}$ and then multiply forward through the chain. This is called forward accumulation. We can also compute this expression in reverse, starting from $\frac{\partial L}{\partial f}$ , which is referred to as reverse accumulation.
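The difference between the two orders can be seen by counting multiplications in a toy chain. This sketch uses ordinary matrices with made-up sizes as stand-ins for the Jacobians, and `matmul_cost` is a hypothetical helper that counts scalar multiplications in a dense product. Because the loss is a scalar, $\frac{\partial L}{\partial f}$ is a single row, and starting from that end keeps every intermediate result small.

```python
import numpy as np

def matmul_cost(a_shape, b_shape):
    # Scalar multiplications in a dense (r x s) @ (s x c) product.
    r, s = a_shape
    _, c = b_shape
    return r * s * c

rng = np.random.default_rng(0)
dL_df = rng.normal(size=(1, 50))       # scalar loss: a single row
df_du = rng.normal(size=(50, 40))      # assumed intermediate Jacobian
du_dw = rng.normal(size=(40, 30))      # assumed Jacobian at the parameter end

# Forward accumulation: start at the parameter end of the chain.
forward = dL_df @ (df_du @ du_dw)
forward_cost = matmul_cost(df_du.shape, du_dw.shape) + matmul_cost(dL_df.shape, (50, 30))

# Reverse accumulation: start at the loss end of the chain.
reverse = (dL_df @ df_du) @ du_dw
reverse_cost = matmul_cost(dL_df.shape, df_du.shape) + matmul_cost((1, 40), du_dw.shape)

print(np.allclose(forward, reverse), forward_cost, reverse_cost)
```

Both orders give the same answer by associativity; only the amount of work differs, which is the reason backpropagation runs in the reverse direction.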

Besides forward and reverse accumulation, there are more complex intricacies for fully optimizing a library. From Wikipedia,

Forward and reverse accumulation are just two (extreme) ways of traversing the chain rule. The problem of computing a full Jacobian of $f : \mathbb{R}^{n} \to \mathbb{R}^{m}$ with a minimum number of arithmetic operations is known as the optimal Jacobian accumulation (OJA) problem, which is NP-complete.

Now if you've followed this post and the last two, and filled in some of the details I (sloppily) left out, you should be well on your way to being able to implement efficient backpropagation yourself. Perhaps read this famous paper for more ways to make it work.

$^{1}$ This is first and foremost my personal goal, rather than a goal that I expect the readers here to agree with.

$^{2}$ If you want to see this derived, see section 4.5.3 in the paper.

$^{3}$ The part about people being disappointed comes from my own experience, as it's what John Canny said in CS 182. The definition of Tensor can be made more precise as a multidimensional array that satisfies a specific transformation law. See here for more details.

[anonymous] 5y

Thanks for writing this series!

I'm working on my own post on NNs that focuses more on deriving backprop from computational graphs. I think that method of doing so also builds up a lot of the Chain Rule intuition, as you can easily see how the derivatives for earlier weights can be derived from those in later weights.

Matthew Barnett 5y

Thanks. I agree with using computational graphs. I think understanding backpropagation using graphs is much easier to understand if you are new to the subject. The reason I didn't do it here is mainly because there's already a lot of guides that do that online, but fewer that introduce tensors and how they interact with deep learning. Also I'm writing these posts primarily so that I can learn, although of course I hope other people find these posts useful.

I also want to add that this guide is far from complete, and so I would want to read yours to see what types of things I might have done better. :)

[anonymous] 5y

For sure! To be honest, I got a little lost reading your 3-part series here, so I think I'll revisit it later on.

I'm newer to deep learning, so I think my goals are similar to yours (e.g. writing it up so I have a better understanding of what's going on), but I'm still hashing out the more introductory stuff.

I'll definitely link it here after I finish!

AlexMennen 5y

First, a vector can be seen as a list of numbers, and a matrix can be seen as an ordered list of vectors. An ordered list of matrices is... a tensor of order 3. Well not exactly. Apparently some people are actually disappointed with the term tensor because a tensor means something very specific in mathematics already and isn't just an ordered list of matrices. But whatever, that's the term we're using for this blog post at least.

It's true that tensors are something more specific than multidimensional arrays of numbers, but Jacobians of functions between tensor spaces (that being what you're using the multidimensional arrays for here) are, in fact, tensors.
