A more accurate title for this post would be "Meta, Error bounds, Test Data, Meta-Learning, Error decomposition, and an optional introduction to general probability spaces." The first item is me trying to categorize the stuff we've been doing in this sequence so far, which is meta-learning but nonetheless different from the Meta-Learning item, which is about a Machine Learning technique.
Categorizing Machine Learning Insights
Although the book doesn't do this, I think taking some time to reflect on how exactly we've been making progress is very worthwhile. I would categorize the material we've covered so far as follows:
- (1) Carving relevant and learnable classes out of the ML problem space
- (1.1) Defining such classes and establishing results about their limits
- Examples: Linear Predictors (post 4), Convex (bounded, Lipschitz / smooth) problems (post 5), the negative result on convex problems (post 5)
- (1.2) Finding algorithms that learn these classes and proving that they work.
- Examples: Perceptron (post 4), Linear Programming (post 4), applied linear algebra (post 4), Stochastic Gradient Descent (post 6)
- (2) Extending the usability of such classes
- Examples: Primarily surrogate loss functions (post 5); also the mapping of inhomogeneous linear problems to linear problems (post 4), even if it's just a detail. There will be more stuff here in future posts.
- (3) General (as in, widely applicable) theoretical work
- Examples: The general learning model, work on overfitting, the PAC learning framework, the basic error decomposition, the ERM approach (all post 1), the uniform convergence property, the No-Free-Lunch Theorem, the VC dimension, the fundamental theorem of statistical learning (all post 2), the nonuniform learning framework, the SRM approach, the minimal description length, the basics of computational complexity (all post 3), meta-learning, error bounds via test data, and advanced error decomposition (later in this post)
Not quite covered: coming up with learning algorithms. So far, the only one is Stochastic Gradient Descent, which I have listed as a method for solving convex Lipschitz/smooth bounded problems, but it's not restricted to those. Also, neural networks supposedly have an impressive track record of fitting all sorts of functions, and I don't expect there to be analogous theoretical results.
Based on this categorization (which is not perfect – much of (3) is only about binary classification), the entire first part of the book, which was covered in posts I-III, is only about broad theoretical work, hence "foundations". The rest of the book jumps around between all three. Note that my adaptation isn't always linear.
I did not check my old posts before writing the above categories, and after doing so, I only added 2 items. Writing summaries in this style appears to be an extremely effective way of internalizing the material for me (though it isn't cheap), which is why I've decided to cover the entire book at one post per week. As of now, we're about halfway there.
One way to motivate this chapter is that we wish to find better guarantees for the quality of predictors. We usually care either about the term $L_{\mathcal{D}}(h) - L_S(h)$ or about the term $L_{\mathcal{D}}(h) - \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h')$. It turns out that both are closely related. We'll call either of these the "gap". Now one can alternatively ask
- "given that I have data points, what is the smallest upper-bound on the gap that I can prove?"; or
- "given that I want to establish an upper bound of on the gap, how many data points do I need?"
where, in both cases, the guarantee will hold with some probability $1 - \delta$, rather than with certainty. The first would be an error bound, the second a sample complexity bound, but they're equivalent: given an error bound, one can take the corresponding equation, which will have the form $\mathrm{gap} \leq f(m, \delta)$, set $f(m, \delta) = \epsilon$ and solve it for $m$, which will yield a sample complexity bound. Going in the opposite direction is no harder.
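To make the equivalence concrete, here is a small sketch (my own illustration, not from the book) that uses the Hoeffding-style form $\sqrt{\ln(1/\delta)/(2m)}$ as the bound $f(m, \delta)$ and inverts it both ways:

```python
import math

def error_bound(m: int, delta: float) -> float:
    # Smallest gap provable from m points with confidence 1 - delta,
    # assuming a bound of the form gap <= sqrt(ln(1/delta) / (2m)).
    return math.sqrt(math.log(1 / delta) / (2 * m))

def sample_complexity(epsilon: float, delta: float) -> int:
    # The same equation, solved for m instead of epsilon.
    return math.ceil(math.log(1 / delta) / (2 * epsilon ** 2))

# Round trip: the two views agree.
m = sample_complexity(0.05, 0.01)  # points needed for gap <= 0.05 w.p. 0.99
assert error_bound(m, 0.01) <= 0.05
```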
Musings on past approaches
The punchline of this section is that all bounds we have looked at so far are based on the performance of the classifier on the data it's been trained on. Given that the output predictor may be highly dependent on patterns that only exist in the training data but not in the real world, this is difficult – and, as we know from the No-Free-Lunch Theorem, even impossible without making further assumptions.
This implies that we must have made such assumptions, and indeed we have. In addition to the empirical error, our current bounds have been based on
- (1) the hypothesis class
- (2) the properties of the classifier that the learner guarantees.
By (2), I mean that the classifier is in the set $\operatorname{argmin}_{h \in \mathcal{H}} L_S(h)$ in the case of ERM, and in the class $\mathcal{H}_n$ in the case of SRM; and in the case of Stochastic Gradient Descent, we've used the update rule in the error bound proof.
Deriving error bounds with test data
By testing the classifier on data that the learner didn't have access to, we can toss out (1) and (2) in favor of a purely statistical argument, which results in a bound that is more general, easier to prove, and stronger. The downside is that it requires additional data, whereas the previous bounds were "for free".
To be more precise, the idea now is that we split our existing data into two sequences $S$ and $T$; we train a learner $A$ on $S$ and we test the output predictor $h = A(S)$ on $T$ to obtain an estimate of the real error. In symbols, we use $L_T(h)$ as our estimate for $L_{\mathcal{D}}(h)$, where $L_T$ is defined just like $L_S$, only, of course, using $T$ instead of $S$ as the set (this definition is implicit if one simply regards the definition of $L_S$ as applying to any set in the subscript). So $L_T(h) = \frac{1}{|T|} \sum_{(x,y) \in T} \ell(h, (x,y))$. We will call $L_T(h)$ the test error.
It is clear that in general, the real error does not equal the test error (in symbols, $L_{\mathcal{D}}(h) \neq L_T(h)$) because the test error is based on the test sequence, which may be unrepresentative. However, the real error equals the expected test error, i.e. $L_{\mathcal{D}}(h) = \mathbb{E}[L_T(h)]$ (we'll prove this in the optional chapter at the end of the post). Therefore, we only need an upper bound on the difference between the expected value and the actual value of random variables (this is the aforementioned statistical argument), and a theorem to that end has already been established by the relevant literature. It is Hoeffding's Inequality:
Let $\theta_1, \ldots, \theta_m$ be i.i.d. RVs that range over the interval $[0,1]$. Then, for all $\epsilon > 0$, it holds that $\Pr\big[\,|\overline{\theta} - \mathbb{E}[\overline{\theta}]| > \epsilon\,\big] \leq 2 e^{-2m\epsilon^2}$, where $\overline{\theta} = \frac{1}{m} \sum_{i=1}^{m} \theta_i$.
This uses the same notational shortcut that is used all the time for Random Variables (RVs), namely pretending as if RVs are numbers when in reality they are functions. Importantly, note that $\overline{\theta}$ depends on the input randomness, whereas $\mathbb{E}[\overline{\theta}]$ is a constant.
In our case, the random variables we are concerned with are the point-based loss functions applied to our predictor for our instances in $T$, i.e. if $T = ((x_1, y_1), \ldots, (x_k, y_k))$, then our random variables are $\ell(h, (x_1, y_1)), \ldots, \ell(h, (x_k, y_k))$. They are random variables if we regard the points they're based on as unknown (otherwise they'd just be numbers). Their mean, $\frac{1}{k} \sum_{i=1}^{k} \ell(h, (x_i, y_i))$, equals the test error, $L_T(h)$, whose randomness ranges over the choice of all of $T$. So $\overline{\theta} = L_T(h)$ and $\mathbb{E}[\overline{\theta}] = L_{\mathcal{D}}(h)$. Given that result, Hoeffding's inequality tells us that $\Pr\big[\,|L_T(h) - L_{\mathcal{D}}(h)| > \epsilon\,\big] \leq 2 e^{-2k\epsilon^2}$, provided that all loss functions range over $[0,1]$ (if not, then as long as they are still bounded, a generalized version of this theorem exists). Furthermore, if one drops the $|\cdot|$, the failure probability halves, i.e. $\Pr\big[L_{\mathcal{D}}(h) - L_T(h) > \epsilon\big] \leq e^{-2k\epsilon^2}$. Setting $\delta = e^{-2k\epsilon^2}$ and solving for $\epsilon$, we obtain $\epsilon = \sqrt{\frac{\ln(1/\delta)}{2k}}$. Thus, we have established the following:
If $T$ is a sequence of $k$ i.i.d. examples sampled from the same distribution that generated $S$, but disjoint from $S$, then the inequality $L_{\mathcal{D}}(h) \leq L_T(h) + \sqrt{\frac{\ln(1/\delta)}{2k}}$ holds with probability at least $1 - \delta$ over the choice of $T$.
For a brief comparison: a bound provided by the (quantitative version of the) fundamental theorem of statistical learning states that $L_{\mathcal{D}}(h) \leq L_S(h) + C \sqrt{\frac{d + \ln(1/\delta)}{m}}$, provided that $\mathcal{H}$ has VC-dimension $d$. So this bound relies on a property of the hypothesis class (and if that property doesn't hold, it doesn't tell us anything), whereas our new bound always applies. And as a bonus, the new bound doesn't have the unknown constant $C$ but instead has a nice 2 in the denominator.
The book now goes on to talk about how this idea can be used for model selection. Let's begin by summarizing the argument.
Model Selection ...
Suppose we have several possible models for our learning task, or alternatively one model with different parameters (maximal polynomial degree, step size of stochastic gradient descent, etc.). Then we can proceed as follows: we split our training data into three sequences, the training sequence $S$, the validation sequence $V$, and the test sequence $T$. For each model we look at, we train the corresponding learner on $S$; then we test all output hypotheses $h_1, \ldots, h_r$ on $V$, i.e. we compute $L_V(h_1), \ldots, L_V(h_r)$, and pick one with minimal validation error, i.e. we pick $\hat{h} \in \operatorname{argmin}_{h_i} L_V(h_i)$. Finally, we compute $L_T(\hat{h})$ and use that value to argue that the predictor performs well (but not to change it in any way).
Since the learners haven't been given access to $V$, the validation error will be unbiased feedback as to which learner performs best, and since even that learner has been computed without using $T$, the test error is yet again unbiased feedback on the performance of our eventual pick. That is, provided that there are no spurious correlations across our data sets, i.e. provided that the i.i.d. assumption holds.
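As a sketch of this procedure (the toy learners, the data, and the helper names `split` and `select_model` are my own inventions for illustration, not the book's):

```python
import random

random.seed(0)

def split(data, fractions=(0.6, 0.2, 0.2)):
    # Partition the data into training (S), validation (V), and test (T) sequences.
    data = data[:]
    random.shuffle(data)
    a = int(fractions[0] * len(data))
    b = a + int(fractions[1] * len(data))
    return data[:a], data[a:b], data[b:]

def select_model(learners, S, V, loss):
    # Train each learner on S, then pick the hypothesis with minimal validation error.
    hypotheses = [A(S) for A in learners]
    return min(hypotheses, key=lambda h: sum(loss(h, z) for z in V) / len(V))

# Two toy "models": a constant predictor and a fixed threshold predictor.
learners = [
    lambda S: (lambda x: 0),
    lambda S: (lambda x: 1 if x > 0 else 0),
]
data = [(x, 1 if x > 0 else 0) for x in range(-50, 50)]
loss = lambda h, z: int(h(z[0]) != z[1])

S, V, T = split(data)
h_best = select_model(learners, S, V, loss)
test_error = sum(loss(h_best, z) for z in T) / len(T)  # unbiased estimate of the real error
```

Note that `test_error` is computed once, at the very end, and never fed back into the selection.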
One can be even more clever and trade some additional runtime for more training data. The test sequence $T$ remains as-is, but rather than partitioning the remaining data into $S$ and $V$, we partition it into a bunch of blocks of equal size, say $k = 10$ blocks $B_1, \ldots, B_k$, and then, for each such block $B_i$, we
- train our learner $A$ on all blocks except $B_i$
- compute the error on block $B_i$, i.e. $L_{B_i}(h_i)$, where $h_i$ is the predictor trained on the other blocks
This way, we obtain $k$ different validation errors (one for each classifier, which was trained on nine blocks and tested on the tenth); now we take the mean of all of them as our ultimate validation error. We do this entire thing for each learner $A$, take the one that performed best, and then we can even retrain that learner on all $k$ blocks to make sure we fully utilize all the data we have.
Finally, just as before, we test that final predictor using the test data and output the result. This entire process is called $k$-fold cross validation.
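A minimal sketch of the $k$-fold loop (again a toy illustration of my own; the learner and loss are stand-ins):

```python
def k_fold_cv(learner, data, k, loss):
    # Mean validation error of `learner` across k folds: for each block,
    # train on the other k-1 blocks and evaluate on the held-out block.
    fold = len(data) // k
    errors = []
    for i in range(k):
        held_out = data[i * fold:(i + 1) * fold]
        rest = data[:i * fold] + data[(i + 1) * fold:]
        h = learner(rest)
        errors.append(sum(loss(h, z) for z in held_out) / len(held_out))
    return sum(errors) / k

# Toy usage: a "learner" that ignores its training data and always thresholds at 0.
learner = lambda S: (lambda x: 1 if x > 0 else 0)
data = [(x, 1 if x > 0 else 0) for x in range(-50, 50)]
loss = lambda h, z: int(h(z[0]) != z[1])
cv_error = k_fold_cv(learner, data, k=10, loss=loss)
```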
... is just meta-learning
While this is all well and good, I think it's important to understand that the specifics are non-fundamental. What is novel here is the concept of obtaining unbiased feedback by computing the empirical error of a predictor on a sequence that has not been used to learn that predictor. Beyond that, the observation is that it might be useful to do meta-learning, i.e. have several levels of learning, i.e. partition all possible hypotheses into a bunch of different classes, take the best from each class, and then take the best across all classes.
But the notion of exactly two levels is arbitrary. Suppose we have a bunch of models, and each model has some parameters. Then we can partition our training data into four sequences $S$, $V_1$, $V_2$, $T$. For each model, we choose a bunch of plausible parameter values, so say we have $a$ models and each one has $b$ parameter settings. For each such setting, we train the learner on $S$ and test it on $V_1$. We pick the optimal one among those $b$ predictors; this will be the predictor with the optimal parameters for our model. We do this for all models, leading to $a$ predictors, all of whom use the optimal parameters for their model. We test these $a$ predictors using $V_2$ and output the optimum yet again. Finally, we test that predictor on $T$.
Similarly, one could imagine four levels, or however many. The crucial thing is just that choosing the optimal predictor of the next lower level has to be done with data that hasn't been used to learn that predictor. So in the above example, for each [model and particular parameter setting], the respective output hypothesis just depends on $S$. If we then compute the optimal predictor of that model by looking at which of the $b$ predictors (corresponding to the $b$ parameter settings) performed best on $V_1$, this output predictor also depends on $V_1$. In fact, the procedure I just described is just one kind of learning algorithm that uses both $S$ and $V_1$ as training data. Similarly, if we take the predictor with the smallest error on $V_2$ across all models to obtain our final predictor, then this predictor depends on $S$ and $V_1$ and $V_2$ – in fact, the procedure I just described is just one kind of learning algorithm that uses $S$ and $V_1$ and $V_2$.
Differently put, to derive a predictor for a particular problem, one might choose to apply meta-learning, i.e. choose different models, train a predictor in each model, and then select the best of those predictors, based on their performance on training data which hasn't been used to train them – and each of those smaller problems is itself just another problem, so one might again choose to apply meta-learning, and so on.
The trick from the last section for saving data can also be used with more than 2 levels, although it quickly increases runtime.
The idea of using a test sequence is more fundamental (i.e. not just another level), since it's only used to obtain a certificate of the quality of the final predictor, but not for learning (in any way). As soon as one violates this rule (for example by choosing another predictor after the first one showed poor performance on the test data), the bound from Hoeffding's inequality is no longer applicable, since the test error is no longer an unbiased estimate of the real error.
So far, we've decomposed the error of a predictor like so:

$$L_{\mathcal{D}}(A(S)) = \epsilon_{\mathrm{app}} + \epsilon_{\mathrm{est}}$$

where $\epsilon_{\mathrm{app}} = \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$ and $\epsilon_{\mathrm{est}} = L_{\mathcal{D}}(A(S)) - \epsilon_{\mathrm{app}}$. The term $\epsilon_{\mathrm{app}}$ is the approximation error, which is the lowest error achievable by any predictor in our hypothesis class ("how well does this class approximate the problem?"), and the term $\epsilon_{\mathrm{est}}$ is the estimation error ("how well did we estimate the best predictor?").
The usefulness of this approach is limited. One doesn't actually know the approximation error – if learning fails, it's not clear which of the two errors is at fault. Furthermore, both overfitting and underfitting are part of the estimation error, which suggests that it should be divided up further (whereas the approximation error can stay as-is). This is what we do now. Our new decomposition will rely heavily on having the kind of unbiased feedback from independent data which we've been talking about in the previous chapter. Note, however, that while I'll only talk about the test sequence going forward, everything applies equally if one uses a validation sequence instead (i.e. training data which has not been given to the learner). In fact, using such data is the way to go if one intends to change the learner in any way upon seeing the various parts of the decomposition.
I find the notation with differences not very illustrative, which is why I'm making up a different one. The above decomposition can alternatively be written as

$$L_{\mathcal{D}}(A(S)) \xrightarrow{\;\epsilon_{\mathrm{est}}\;} \epsilon_{\mathrm{app}}$$

where we decompose the left term into the value of the arrow plus that of the right term. (Each error corresponds to adding one arrow "$\xrightarrow{\;\epsilon\;}$".) If there are more than two terms, we decompose the leftmost term into the sum of all arrow values plus the rightmost term (actually, any term in the chain equals the sum of all arrow values to its right plus the rightmost term).
That said, here is a more sophisticated error decomposition:

$$L_{\mathcal{D}}(h) \longrightarrow L_T(h) \longrightarrow L_S(h) \longrightarrow L_S(h^*) \longrightarrow L_{\mathcal{D}}(h^*)$$

where $h$ is our output predictor and $h^* \in \operatorname{argmin}_{h' \in \mathcal{H}} L_{\mathcal{D}}(h')$ is the best predictor in our class. The terms $L_T(h)$ and $L_S(h)$ are the ones we have access to. Note that it is not the case that each step is necessarily negative, so the values are not monotonically decreasing. Now let's look at this more closely.
"" can be bounded using Hoeffding's inequality.
Let's pause already and reflect on that result. Our only terminal value is to minimize the leftmost term; the first arrow can be bounded tightly; and the second term is known. It follows that, if the test error is small, we can stop right there (because the real error is probably also small). In fact, that was the point of the previous chapter. The remaining decomposition is only necessary if the test error is larger than what is acceptable.
"" is the difference between the unbiased and the biased error estimate – the error on the data the learner had no access to minus the error on the data it did have access to. It measures how much we overfit – which is quite useful given that this is another term we have access to. Thus, if the test error is large but the empirical error small, we might want to change our learner such that fits the training data less closely. However, we don't have any guarantee in this case that overfitting is the only problem. It's possible that we overfit and the approximation error is large.
If the test error and the empirical error are both large, that's when we're interested in the remaining part of this decomposition.
"" is usually negative, at least if we're using or something close to it.
"" can also be bounded by Hoeffding's inequality, because does not depend on (unless we've messed with the hypothesis class based on the training data).
Finally, $L_{\mathcal{D}}(h^*)$ is the approximation error from our previous decomposition. It doesn't make sense to decompose this further – if the approximation error is large, our hypothesis class doesn't contain any good predictors and learning it is hopeless.
Thus, based on the previous 3 paragraphs, the empirical error is unlikely to be significantly larger than the real error of the best predictor in our class, i.e. the approximation error. (It might be significantly smaller, though.) It follows that if the empirical error is high, then so is the approximation error (but not vice-versa), with the important caveat that one needs to be confident in the learner's ability to minimize empirical risk. If the problem is learning StarCraft, then it's quite plausible that minimizing the empirical risk is so difficult that the empirical risk could be large even if the approximation error is very small.
... one can reason like this:
- Test error...
- is small $\Rightarrow$ real error is probably small too, so everything's good
- is large $\Rightarrow$ empirical error...
- is small $\Rightarrow$ the learner probably suffers from overfitting
- is large $\Rightarrow$ the empirical error seems...
- easy to minimize $\Rightarrow$ approximation error is probably large
- hard to minimize $\Rightarrow$ ??? (the problem is just hard)
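The tree above can be encoded as a toy diagnostic routine (the function, its return strings, and the threshold are all made up for illustration; in practice, "small" and "large" are problem-dependent):

```python
def diagnose(test_error, empirical_error, minimization_seems_easy, threshold=0.1):
    # Walk the decision tree: test error first, then empirical error,
    # then our belief about how hard empirical risk minimization is.
    if test_error <= threshold:
        return "fine: the real error is probably small too"
    if empirical_error <= threshold:
        return "the learner probably suffers from overfitting"
    if minimization_seems_easy:
        return "the approximation error is probably large"
    return "unclear: the problem may just be hard"

assert diagnose(0.02, 0.01, True).startswith("fine")
assert "overfitting" in diagnose(0.4, 0.02, True)
```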
If the approximation error is large, then the problem is unrelated to our learner. In that case, the hypothesis class may be "too small"; perhaps it contains only linear predictors and the real function just isn't linear. Alternatively, it could also be the case that the real function is approximately linear, but we just didn't choose features that represent meaningful properties of the data points. For example, if the problem is learning a function that classifies emails as spam / not spam, one could imagine that someone was just really wrong about which features are indicative of spam emails. This is the earliest point of failure; in this case, learning might be impossible (or at least extremely difficult) even with a good hypothesis class and a good learner.
While this is technically true, one might actually be fine with some amount of overfitting. If so, then this case isn't quite satisfactory, either; in particular, one doesn't know whether the approximation error is large or not. Another way to approach the question is to train the learner on parts of the training sequence first, say on prefixes of increasing length, and examine the test error of each resulting predictor. If it goes down as we use more data, that's some evidence that the approximation error is low.
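Here is a sketch of that learning-curve idea, with a toy 1-d learner and synthetic data standing in for the real problem (all names, distributions, and prefix sizes are illustrative):

```python
import random

random.seed(0)

def sample(n):
    # Synthetic distribution: class y in {0, 1}, x drawn around 2y - 1 with noise.
    out = []
    for _ in range(n):
        y = random.randint(0, 1)
        out.append(((2 * y - 1) + random.gauss(0, 1), y))
    return out

def train_class_means(data):
    # Toy learner: classify by the nearer class mean.
    means = {}
    for label in (0, 1):
        pts = [x for x, y in data if y == label]
        means[label] = sum(pts) / len(pts) if pts else 0.0
    return lambda x: min(means, key=lambda l: abs(x - means[l]))

S, T = sample(2000), sample(2000)

# Learning curve: test error of the predictor trained on growing prefixes of S.
curve = []
for m in (10, 100, 1000, 2000):
    h = train_class_means(S[:m])
    curve.append((m, sum(h(x) != y for x, y in T) / len(T)))
```

If the curve flattens out at a high error level even as $m$ grows, that's evidence the remaining error is approximation error rather than estimation error.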
In general, this chapter is far from a full treatment of error decomposition.
Proof that E(test error) = real error
As mentioned, this chapter is completely optional (also note that the post up to this point is roughly as long as previous posts in this sequence). It's long bothered me that I didn't understand how a general probability space was defined, so if you share that frustration, this chapter will provide an answer. However, the connection to Machine Learning is very loose; one needs general probability spaces to prove the result, but the actual proof is quite easy, so this will be a lot of theory with a fairly brief application.
Preamble: General Probability Spaces
Before the general case is introduced, one is generally taught two different important special cases of probability spaces. The first is the discrete one. Here, a probability space is given by a pair $(\Omega, P)$, where $\Omega$ is a countable sample space and $P : \Omega \to [0,1]$ a function that assigns to each elementary event a probability. For any subset $A \subseteq \Omega$, the probability is then simply defined as $P(A) = \sum_{\omega \in A} P(\omega)$. And for a random variable $X : \Omega \to \mathbb{R}$, we have the definition $\mathbb{E}[X] = \sum_{\omega \in \Omega} X(\omega) \cdot P(\omega)$.
This is super intuitive, but it can't model how an event happens at a random point throughout a day. For that, we need the absolutely continuous case. This is where no one point has nonzero probability, but intervals do. In other words, one is in the absolutely continuous case whenever the probability distribution admits a probability density function $f$ such that, for each interval, the probability of that interval is equal to the integral of $f$ over that interval. If $\Omega = \mathbb{R}$ (covering only that case here), we can write the probability space as $(\mathbb{R}, f)$, where for any interval $[a,b]$, we have $P([a,b]) = \int_a^b f(x)\,\mathrm{d}x$. This is also rather intuitive – one can just draw an analogy between probability mass and area. A single point has an "infinitely thin" area below it, so it has zero probability mass (and in fact the same is true for any countable collection of points because even the tiniest interval is uncountable), but an interval does have real area below it. The height of the area is then determined by $f$.
Given a random variable $X : \mathbb{R} \to \mathbb{R}$, we have $\mathbb{E}[X] = \int_{-\infty}^{\infty} X(x) f(x)\,\mathrm{d}x$.
But neither of these is the general case, because one assumes a countable sample space and the other assumes that no single point has nonzero probability. In the general case, a probability space looks like this: $(\Omega, \Sigma, P)$.
The first element is the same as before – the sample space. But the $\Sigma$ is something new. It is called a $\sigma$-algebra, which is a set of subsets of $\Omega$, so $\Sigma \subseteq \mathcal{P}(\Omega)$, and it consists of all those subsets which have a probability. It needs to fulfill a bunch of properties, like being closed under complement and like $\Omega \in \Sigma$, but we don't need to concern ourselves with those.
The probability distribution $P$ is a function $P : \Sigma \to [0,1]$. So it's a function that assigns subsets a probability, but only those subsets that we declared to have a probability, i.e. only those subsets that are elements of our $\sigma$-algebra $\Sigma$. It must have a bunch of reasonable properties, such as $P(\Omega) = 1$ and $P(A \mathbin{\dot{\cup}} B) = P(A) + P(B)$, where the $\dot{\cup}$ symbol is meant to indicate that $A$ and $B$ are disjoint.
(In the discrete case, we don't need to specify $\Sigma$ because we'd simply have $\Sigma = \mathcal{P}(\Omega)$, i.e. every subset has a probability. I'm not sure how it would look in the continuous case, but maybe something like the set of all subsets that can be written as a countable union of intervals plus a countable union of single points.)
The question relevant for us is now this: given a general probability space $(\Omega, \Sigma, P)$ and a random variable $X : \Omega \to \mathbb{R}$, how do we compute the expected value of $X$? And the answer is this:

$$\mathbb{E}[X] = \int_\Omega X \,\mathrm{d}P$$

where $\int_\Omega X \,\mathrm{d}P$ is the Lebesgue integral rather than the Riemann integral. But how does one compute a Lebesgue integral? Fortunately, we don't require the full answer to this question, because some of the easy special cases will suffice for our purposes.
Computing Lebesgue Integrals of Simple Functions
The setting for computing a Lebesgue integral is a bit more general than that of a general probability space, but since we're only trying to do the easy cases anyway, we'll assume our space has the same structure as before. However, we will rename our probability distribution $P$ into $\mu$ and call it a measure, so that our space now looks like this: $(\Omega, \Sigma, \mu)$.
And we'll also rename our random variable $X$ into $f$ to make it look more like a general function. So $f : \Omega \to \mathbb{R}$ is the function whose Lebesgue integral we wish to compute.
Recall that $P$ assigns a probability to the subsets of $\Omega$ that appear in $\Sigma$, and that the entire space has probability 1. Analogously, $\mu$ assigns a measure to subsets of $\Omega$, and the entire space has measure 1. So if $A \in \Sigma$ is a subset that is "half" of $\Omega$, then $\mu(A) = \frac{1}{2}$.
One now proceeds differently than with Riemann integrals. Rather than presenting the general case right away, we begin with the simplest kind of function, namely indicator functions. That is a function of the form $\mathbb{1}_A$ for some $A \in \Sigma$, and it is simply 1 on $A$ and 0 everywhere else. And now, even though we don't understand how Lebesgue integrals work in general, in this particular case the answer is obvious: $\int_\Omega \mathbb{1}_A \,\mathrm{d}\mu = \mu(A)$.
The function just says "I count $A$ once and I count nothing else", so the integral has to be just one times the size, i.e. the measure, of $A$. Which is $\mu(A)$.
So for example, $\int_\Omega \mathbb{1}_\Omega \,\mathrm{d}\mu = \mu(\Omega) = 1$ and $\int_\Omega \mathbb{1}_A \,\mathrm{d}\mu = \mu(A) = \frac{1}{2}$ for the set $A$ from before.
Next, if $f$ is not itself an indicator function, but it can be written as a finite weighted sum of indicator functions, i.e. $f = \sum_{i=1}^{n} c_i \cdot \mathbb{1}_{A_i}$, then $f$ is called a simple function and we define $\int_\Omega f \,\mathrm{d}\mu = \sum_{i=1}^{n} c_i \cdot \mu(A_i)$.
So for example, if $f = 2 \cdot \mathbb{1}_A$ (with $A$ again being "half" of $\Omega$), i.e. the function that is 2 on one half of the space and 0 on the other, then $\int_\Omega f \,\mathrm{d}\mu = 2 \cdot \mu(A) = 2 \cdot \frac{1}{2} = 1$.
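A tiny computational sketch of this rule (the set names, the measure values, and the helper `integrate_simple` are all made up for illustration):

```python
# A measure on a space carved into disjoint pieces, given by name -> mu(piece).
mu = {"A": 0.5, "complement_of_A": 0.5}  # total measure 1, like a probability measure

def integrate_simple(coefficients, mu):
    # Lebesgue integral of the simple function f = sum_i c_i * 1_{A_i}:
    # just sum each coefficient times the measure of its set.
    return sum(c * mu[name] for name, c in coefficients.items())

assert integrate_simple({"A": 2.0}, mu) == 1.0  # the "2 on one half" example: 2 * 1/2 = 1
```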
And this is where we stop, because integrating simple functions is all we need.
Let's return to our setting of general probability spaces. Here, the probability distribution becomes the measure. So let's say our space looks like $([0,5] \cup \{10\}, \Sigma, P)$. Our $\sigma$-algebra $\Sigma$ includes all sub-intervals of $[0,5]$ (closed or open or half-open), and all such intervals plus the point 10. Our probability distribution says that $P(\{10\}) = \frac{1}{2}$ and the probability mass of an interval in $[0,5]$ is just the size of the interval divided by 10, so if $[a,b] \subseteq [0,5]$, then $P([a,b]) = \frac{b-a}{10}$. Then clearly $P([0,5] \cup \{10\}) = \frac{5}{10} + \frac{1}{2} = 1$. Recall that $P$ has to assign each set in $\Sigma$ a probability, which it now does.
Let's say $X$ is a random variable given by $X = 2 \cdot \mathbb{1}_{[0,5]} + 4 \cdot \mathbb{1}_{\{10\}}$. Then $X$ is a simple function – it is the weighted sum of two indicator functions – and we can compute the expected value of $X$ as $\mathbb{E}[X] = 2 \cdot P([0,5]) + 4 \cdot P(\{10\}) = 2 \cdot \frac{1}{2} + 4 \cdot \frac{1}{2} = 3$.
The actual proof
The real error of $h$ is $L_{\mathcal{D}}(h) = \mathcal{D}(\{(x,y) : h(x) \neq y\})$, i.e. the probability mass of the points which $h$ gets wrong.
The test error depends on the test sequence. Let $k$ be the length of the sequence, so that $T = ((x_1, y_1), \ldots, (x_k, y_k)) \in (X \times Y)^k$; then to compute the expected test error, we have to compute the Lebesgue integral $\mathbb{E}[L_T(h)] = \int_{(X \times Y)^k} L_T(h) \,\mathrm{d}\mathcal{D}^k$.
So our measure for the space $(X \times Y)^k$ is now $\mathcal{D}^k$. We wish to write $L_T(h)$ as a sum of indicator functions. Recall that $L_T(h) = \frac{1}{k} \cdot |\{i : h(x_i) \neq y_i\}|$, i.e. $k \cdot L_T(h)$ counts the number of examples which $h$ got wrong. Thus, by defining the sets $A_i = \{T \in (X \times Y)^k : h(x_i) \neq y_i\}$ for all $i \in \{1, \ldots, k\}$, we get the equality $L_T(h) = \sum_{i=1}^{k} \frac{1}{k} \cdot \mathbb{1}_{A_i}$.
Thus, the integral simply equals the probability mass of the respective sets (multiplied by their respective weights). In symbols, $\int_{(X \times Y)^k} L_T(h) \,\mathrm{d}\mathcal{D}^k = \sum_{i=1}^{k} \frac{1}{k} \cdot \mathcal{D}^k(A_i)$.
The i.i.d. assumption tells us that $\mathcal{D}^k(A_i) = \mathcal{D}(\{(x,y) : h(x) \neq y\}) = L_{\mathcal{D}}(h)$ (in fact, this or something similar is probably how it should be formally defined), and therefore the above sum equals $\sum_{i=1}^{k} \frac{1}{k} \cdot L_{\mathcal{D}}(h) = L_{\mathcal{D}}(h)$.
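To close, a quick Monte Carlo sanity check of the result (my own sketch: a toy discrete distribution and a fixed predictor, with the empirical average of the test error compared against the exactly computed real error):

```python
import random

random.seed(1)

# Toy distribution D over labeled points: ((x, y), probability).
dist = [((0, 0), 0.4), ((1, 1), 0.3), ((2, 1), 0.2), ((3, 0), 0.1)]

def h(x):
    # A fixed predictor; it is wrong exactly on x = 3, which has mass 0.1.
    return 1 if x >= 1 else 0

real_error = sum(p for (x, y), p in dist if h(x) != y)  # = 0.1 exactly

def sample_point():
    # Draw one (x, y) pair from dist by inverting the cumulative probabilities.
    r = random.random()
    acc = 0.0
    for (x, y), p in dist:
        acc += p
        if r < acc:
            return x, y
    return dist[-1][0]

# Average the test error L_T(h) over many freshly drawn test sequences T.
k, trials = 5, 20000
total = 0.0
for _ in range(trials):
    T = [sample_point() for _ in range(k)]
    total += sum(h(x) != y for x, y in T) / k
avg_test_error = total / trials  # concentrates around real_error
```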