This post concludes the subsequence on matrix calculus. Here, I will focus on an exploration of the chain rule as it's used for training neural networks. I initially planned to include Hessians, but perhaps for that we will have to wait.
Deep learning has two parts: deep and learning. The deep part refers to the fact that we are composing simple functions to form a complex function. In other words, in order to perform a task, we are mapping some input to an output using some long nested expression, like . The learning part refers to the fact that we are allowing the properties of the function to be set automatically via an iterative process like gradient descent.
Conceptually, combining these two parts is easy. What's hard is making the whole thing efficient so that we can get our neural networks to actually train on real world data. That's where the backpropagation enters the picture.
Backpropagation is simply a technique to train neural networks by efficiently using the chain rule to calculate the partial derivatives of each parameter. However, backpropagation is notoriously a pain to deal with. These days, modern deep learning libraries provide tools for automatic differentiation, which allow the computer to automatically perform this calculus in the background. However, while this might be great for practitioners of deep learning, here we primarily want to understand the notation as it would be written on paper. Plus, if we were writing our own library, we'd want to know what's happening in the background.
What I have discovered is that, despite my initial fear of backpropagation, it is actually pretty simple to follow if you just understand the notation. Unfortunately, the notation can get a bit difficult to deal with (and was a pain to write out in Latex).
We start by describing the single variable chain rule. This is simply . But if we write it this way, then it's in an opaque notation and hides which variables we are taking the derivative with respect to. Alternatively we can write the rule in a way that makes it more obvious what we are doing: , where is meant as shorthand for . This way it is intuitively clear that we can cancel the fractions on the bottom, and this reduces to , as desired.
It turns out, that for a function and , the chain rule can be written as where is the Jacobian of with respect to .
Isn't that neat. Our understanding of Jacobians has now well paid off. Not only do we have an intuitive understanding of the Jacobian, we can now formulate the vector chain rule using a compact notation — one that matches the single variable case perfectly.
However, in order to truly understand backpropagation, we must go beyond mere Jacobians. In order to work with neural networks, we need to introduce the generalized Jacobian. If the Jacobian from yesterday was spooky enough already, I recommend reading no further. Alternatively if you want to be able to truly understand how to train a neural network, read at your own peril.
First, a vector can be seen as a list of numbers, and a matrix can be seen as an ordered list of vectors. An ordered list of matrices is... a tensor of order . Well not exactly. Apparently some people are actually disappointed with the term tensor because a tensor means something very specific in mathematics already and isn't just an ordered list of matrices. But whatever, that's the term we're using for this blog post at least.
As you can probably guess, a list of tensors of order is a tensor of order . We can simply represent tensors in code using multidimensional arrays. In the case of the Jacobian, we were taking the derivative of functions between two vector spaces, and . When we are considering mapping from a space of tensors of order to a space of tensors of order , we denote the relationship as between the spaces .
The generalized Jacobian between these two spaces is an object with shape . We can think of this object as a generalization of the matrix, where each row is a tensor with the same shape as the tensor and each column has the same shape as the tensor . The intuitive way to understand the generalized Jacobian is that we can index with vectors and . At each index in we find the partial derivative between the variables and , which are scalar variables located in the tensors and .
Formulating the chain rule using the generalized Jacobian yields the same equation as before: for and , . The only difference this time is that has the shape which is itself formed by the result of a generalized matrix multiplication between the two generalized matrices, and . The rules for this generalized matrix multiplication is similar to regular matrix multiplication, and is given by the formula:
However, where this differs from matrix multiplication is that are vectors which specify the location of variables within a tensor.
Let's see if we can use this notation to perform backpropagation on a neural network. Consider a neural network defined by the following composition of simple functions: . Here, describes the activation function of the first layer of the network, which is defined as the element-wise application of . There are a few parameters of this network: the weight matrices, and the biases. These parameters are the things that we are taking the derivative with respect to.
There is one more part to add before we can train this abstract network: a loss function. In our case, we are simply going to train the parameters with respect to the loss function where is the prediction made by the neural network, and is the vector of desired outputs. In full, we are taking , for some weights , which include . Since this loss function is parameterized by a constant vector , we can henceforth treat the loss function as simply .
Ideally, we would not want to make this our loss function. That's because the true loss function should be over the entire dataset — it should take into account how good the predictions were for each sample that it was given. The way that I have described it only gave us the loss for a single prediction.
However, taking the loss over the entire dataset is too expensive and converges slowly. Alternatively, taking the loss over a single point (ie: stochastic gradient descent) is also too slow because it doesn't allow us to take into account parallel hardware. So, actual practitioners use what's called mini-batch descent, where their loss function is over some subset of the data. For simplicity, I will just show the stochastic gradient descent step.
For we have . From the above definition of , we can see that , where is the identity matrix. From here on I will simply assume that the partial derivatives are organized in some specific manner, but omitted. The exact way it's written doesn't actually matter too much as long as you understand the shape of the Jacobian being represented.
We can now evaluate . Let be . Then computing the derivative comes down to finding the generalized Jacobian of with respect to . I will illustrate what this generalized Jacobian would look like by building up from analogous, lower order derivatives. The derivative of is . The gradient is . The Jacobian of is . We can therefore see that the generalized Jacobian of will be some type of order 3 tensor which would look like a simple expression involving .
The derivatives for the rest of the weight matrices can be computed similarly to the derivatives I have indicated for and . We simply need to evaluate the terms later on in the chain where is shorthand for the function .
We have, however, left out one crucial piece of information, which is how to calculate the derivative over the function. To do that we simply separate the derivative into a piecewise function. When the input is less than zero, the derivative is . When the input is greater than zero, the derivative is . But since the function is not differentiable at , we just pretend that it is and make it's derivative ; this doesn't cause any issues.
This means that we are pretty much done, as long as you can fill in the details for computing the generalized Jacobians. The trickiest part in the code is simply making sure that all the dimensions line up. Now, once we have computed by derivatives, we can incorporate this information into some learning algorithm like Adam, and use this to update the parameters and continue training the network.
There are, however, many ways that we can make the algorithm more efficient than one might make it during a naive implementation. I will cover one method briefly.
We can start by taking into account information about the direction we are calculating the Jacobians. In particular, if we consider some chain , we can take advantage of the fact that tensor-tensor products are associative. Essentially, this means that we can start by computing the last derivative and then multiplying forward. This is called forward accumulation. We can also compute this expression in reverse, which is referred to as reverse accumulation.
Besides forward and reverse accumulation, there are more complex intracacies for fully optimizing a library. From Wikipedia,
Forward and reverse accumulation are just two (extreme) ways of traversing the chain rule. The problem of computing a full Jacobian of f : ℝ → ℝ with a minimum number of arithmetic operations is known as the optimal Jacobian accumulation (OJA) problem, which is NP-complete.
Now if you've followed this post and the last two, and filled in some of the details I (sloppily) left out, you should be well on your way to being able to implement efficient backpropagation yourself. Perhaps read this famous paper for more ways to make it work.
This is first and foremost my personal goal, rather than a goal that I expect the readers here to agree with.
If you want to see this derived, see section 4.5.3 in the paper.
The part about people being disappointed comes from my own experience, as it's what John Canny said in CS 182. The definition of Tensor can be made more precise as a multidimensional array that satisfies a specific transformation law. See here for more details.