Complexity Penalties in Statistical Learning

michael_h

I am currently taking a course on statistical learning at the Australian Mathematical Sciences Institute Summer School. One idea that has appeared many times in the course is that a more complicated model is likely to have many short comings. This is because complicated models tend to overfit the observed data. They often give explanatory value to parts of the observation that are simply random noise.

This is common knowledge for many aspiring rationalists. The term complexity penalty is used to describe the act of putting less credence in complicated explanations because they are more complex. In this blog post I aim to provide a brief introduction to statistical learning and use an example to demonstrate how complexity penalties arise in this setting.

Statistical Learning

Broadly speaking, statistical learning is the process of using data to select a model and then using the model to make predictions about future data. So, in order to perform statistical learning, we need at least three things. We need some data, a class of models and a way of measuring how well a model predicts the future data. In this blog we will look at the problem of polynomial regression.

The Data

For polynomial regression, our data is in the form of $n$ pairs of real numbers $(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})$ . Our goal is to find the relationship between the input values $x_{i}$ and the output values $y_{i}$ and then use this to predict future outputs given new inputs. For example, the input values could represent the average temperature of a particular day and the corresponding output value could be the number of ice creams sold that day. Going with this example, we can suppose our data looks something like this:

To simplify our analysis we will make some assumptions about the relationship between our the inputs and outputs. We will assume that there exists an unknown function $g^{*}$ such that $y = g^{*} (x) + E$ , where $E$ is a statistical error term with mean equal to 0 and variance equal to $σ$ . This assumption is essentially saying that there is some true relationship between our input $x$ and our output $y$ but that the output can fluctuate around this true value. Furthermore we are assuming that these fluctuations are balanced in the positive and negative direction (since the mean of $E$ is zero) and the size of these fluctuations doesn't depend on the input $x$ (since the variance of $E$ is constant).

The Models

We want models that can take in a new number $x$ and predict what the corresponding $y$ should be. Thus our models will be functions that take in real numbers and return real numbers. Since we are doing polynomial regression, the classes of models we will be using will be different sets of polynomial functions. More specifically, let $G_{p}$ be the set of polynomials of degree at most $p$ . That is $G_{p}$ contains all functions $g$ of the form

g (x) = a_{0} + a_{1} x + a_{2} x^{2} + \dots + a_{p} x^{p}, where a_{i} \in R .

The parameter $p$ corresponds to the complexity of the class of model we are using. As we increase $p$ , we are considering more and more complicated possible models.

Evaluating the models

We now have our data and our class of models, the remaining ingredient is a way to measure the performance of a particular model. Recall that our goal is to find a model that can take in new numbers $x$ and predict which $y$ should be associated to it. Thus if we have a second set of data $({^x}_{1}, {^y}_{1}), ({^x}_{2}, {^y}_{2}), \dots, ({^x}_{m}, {^y}_{m})$ , one way to measure the performance is to look at the average distance between our guess $g ({^x}_{i})$ and the actual value ${^y}_{i}$ . That is, the best model is the one which minimizes

\frac{1}{m} m \sum i = 1 | g ({^x}_{i}) - {^y}_{i} | .

It turns out that looking at the average squared distance between our guesses and the actual value gives a better way to measure performance. By taking squares we are more forgiving when the model gets the answer almost right but much less forgiving when the model is way off. Taking squares also makes the mathematics more tractable. The best model now becomes the one which minimizes

\frac{1}{m} m \sum i = 1 (g ({^x}_{i}) - {^y}_{i})^{2} .

The above average is called the test loss of the model $g$ . From our assumptions about the type of data we're modeling we know that even the perfect function $g^{*}$ will occasionally differ from the output we're given. Thus, most of the time, we won't be able to make the test loss much smaller than $σ$ which is the expected test loss of $g^{*}$ .

Using the test loss to measure performance has one clear limitation, it requires a second batch of data to test our models. What do we do if we only have one batch? One solution is to divide our batch in two and keep some data to the side to use to test models. Another solution is to try to estimate the test loss. It turns out that complexity penalties naturally arise when exploring this second solution.

Training loss

One way to try to estimate the test loss is to look at how well our model matches the data we've seen so far. This gives rise to the training loss which is defined as

\frac{1}{n} n \sum i = 1 (g (x_{i}) - y_{i})^{2} .

Note that for the training loss we're using the original data points to test the performance of our model. This makes the training loss easy to calculate and easy to minimize within the class $G_{p}$ (the set of all polynomials of degree at most $p$ ). Here is a plot of some of the polynomials of a fixed degree that minimize the training loss. The purple polynomial has degree 1, the green polynomial has degree 2 and the black polynomial has degree 15.

Since the training loss only uses the old data it doesn't tell us much about how the model will perform on new data. For example, while the 15 degree polynomial matches the above initial data very well, it is overfitting. The 15 degree polynomial does a poor job of matching some new independent data, as shown below.

In general, we'd expect the training loss to be much smaller than the testing loss. This is because the model has already been calibrated to the original data. Indeed if we were using polynomials of degree $p \geq n - 1$ we would be able to find a model that passes through every data point $(x_{i}, y_{i})$ . Such a model would have a training loss of 0 but wouldn't generalize well to new data and would have a high test loss.

Approximating the test loss

Thus it seems that the training loss won't be the most informative or useful ways of estimating the testing loss. However, the training loss is salvageable, we just need to add an extra term that makes up for how optimistic the training loss is. Note that we can write the test loss as

Test Loss = Training Loss + (Test Loss - Training Loss) .

Thus the difference between test loss and the training loss of a model gives us a way of quantifying how much the model is over-fitting the training data. Thus if we can estimate this difference we'll be able to add it to the training loss to get an estimate of the test loss and evaluate the performance of our model!

It turns out that in our particular case estimating this difference isn't too tricky. Suppose that we have a model $g$ in the class $G_{p}$ (that is $g$ is a polynomial of degree at most $p$ ). Suppose further that we have a lot of data points (in particular assume that $n$ , the number of data points, is greater than $p + 1$ ). Then, under the above assumptions, we have the following approximation

Test Loss - Training Loss \approx \frac{2 σ (p + 1)}{n} .

Rewriting this we have

Test Loss \approx Training Loss + \frac{2 σ (p + 1)}{n}

Thus we can measure the performance of our models by calculating the training loss and then adding $\frac{2 σ (p + 1)}{n}$ . This number is our complexity penalty as it increases as the complexity parameter $p$ increases. It also increases with $σ$ , the variance of the errors. Thus the more noisy our data is, the more likely we are to overfit. Also the penalty decreases with $n$ , the number of data points. This suggests that if we have enough data we can get away with using quite complicated models without worrying about overfitting. This is because with enough data the true relationship between the inputs and outputs will become very clear and even a quite complicated model mightn't overfit it.

One last interesting observation about this complexity penalty is the way it depends on a given model. Recall that a model $g$ is a polynomial of degree $p$ and that $σ$ and $n$ are parameters for the whole statistical learning problem. Thus the above complexity penalty depends on $g$ only via the degree of $g$ . This gives us the following tractable way of finding the best model. For each $p$ we can find the polynomial of degree $p$ that minimizes the training loss and record the training loss it achieves. We can then compare polynomials of different degrees by adding the complexity penalty $\frac{2 σ (p + 1)}{n}$ to the training loss. We can then chose the best model based off which $p$ minimizes the sum of the training loss and complexity penalty. The only downside to this method is that $σ$ is an unknown quantity but hopefully some heuristics can be used to estimate it.

Below is a plot of the training loss, test loss and the approximation of the test loss from our example for different values of $p$ . While the approximation isn't always exact it follows the general trend of the test loss. Most importantly, both the test loss and the estimation have a minimum at $p = 2$ . This shows that using approximation would let us select the best model which in this case is a quadratic.

Other examples of complexity penalties

Complexity penalties can be found all over statistical learning. In other problems the above estimate can be harder to calculate. Thus complexity penalties are used in a more heuristic manner. This gives rise to techniques such as ridge regression, LASSO regression and kernel methods. Model complexity is again an important factor when training neural networks . The number of layers and the size of each layer are both complexity parameters and must be tuned to avoid overfitting.

What makes the above example interesting is that the complexity penalty arose naturally out of trying to measure the performance of our model. It wasn't a heuristic but rather a proven formula guaranteed to provide a good estimate of the test loss. This in turn gives support to the heuristic complexity penalties used in situations when such proofs or formulas are more difficult to come by.

References

The ideas in this blog post are not my own and come from the AMSI Summer School course Mathematical Methods for Machine Learning taught by Zdravko Botev. The notes for the course will soon be published as a book by D. P. Kroese, Z. I. Botev, S. Vaisman and T. Taimre titled "Mathematical and Statistical Methods for Data Science and Machine Learning".

I made the plots myself but they are based off similar plots from the course. You can access the data set I made and used to make the plots here.

[-]Adrià Garriga-alonso7y120

Great post, thank you for writing it!

By taking squares we are more forgiving when the model gets the answer almost right but much less forgiving when the model is way off

It's really unclear why this would be better. I was going to ask for a clarification but I found something better instead.

The first justification for taking the least-squares estimator that came to mind is that it's the maximum likelihood solution if you assume your errors E are independent and Gaussian. Under those conditions, the probability density of the data is $p ({y_{i}}_{i = 1, \dots, N} | {x_{i}}_{i = 1, \dots, N}, σ^{2}) = \prod_{i} Gaussian (y_{i} | g (x_{i}), σ^{2})$ . We take the product because the errors are independent for each point. If you take the logarithm (which is a transformation that preserves the maximum), this comes out to be $- 1 / 2 σ^{2} \cdot \sum_{i} (y_{i} - g (x_{i}))^{2}$ plus a constant that doesn't depend on g.

It turns out there's a stronger justification, which doesn't require assuming the errors are Gaussian. The Gauss-Markov Theorem shows that the best unbiased estimator of the coefficients of g that is linear in the powers of the data, i.e. $x_{i}, x_{i}^{2}, x_{i}^{3}, \dots$ . That it's unbiased means that, for several independent samples of data, the mean of the least-squares estimator will be the true coefficients of the polynomial you used to generate the data. The estimator is the best because it has the lowest variance under the above sampling scheme. If you assume you have a lot of data, by the central limit theorem, this is similar to lowest squared error in coefficient-space.

However, if you're willing to drop the requirement that the estimator is unbiased, you can get even better error on average. The James-Stein estimator has less variance than least-squares, at the cost of biasing your estimates of the coefficients towards zero (or some other point). So, even within a certain maximum allowed degree of the polynomial, it is helpful to penalise complexity of the coefficients themselves.

On a related note, for some yet unknown reason, neural networks that have way more parameters than is necessary in a single layer seem to generalize better than networks with few parameters. See for example Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks. It seems to be related to the fact that we're not actually finding the optimal configuration of parameters with gradient descent, but only approximating it. So the standard statistical analysis that you outlined here doesn't apply in that case.

[-]tenthkrige7y20

This was interesting. I think I missed an assumption somewhere, because for $p = n$ , it seems that the penalty is $\approx 2 σ$ , which seems very low for a $n$ -degree polynomial fitted on $n$ points.

[-]michael_h7y30

Good point. Thank you for bringing this up. I just had a closer look in my notes at how the complexity penalty is derived and there is a additional assumption that I left out.

The derivation uses a matrix $X$ with $p + 1$ columns and $n$ rows which has entry $x_{i}^{j - 1}$ in the $i^{t h}$ row and $j^{t h}$ column (where $x_{1}, x_{2}, \dots, x_{n}$ is the training set). In the derivation it is assumed that $X$ has rank $p + 1$ which true most of the time provided that $n \geq p + 1$ . For simplicity I won't add a mention of this matrix to original post but I will add the assumption $n \geq p + 1$ .

31