Understanding Machine Learning (II)

4[anonymous]

1Rafael Harth

3[anonymous]

3Steven Byrnes

New Comment

I'm curious if you've looked at Learning From Data by Abu-Mostafa, Magdon-Ismail, and Lin? (There's also a lecture series from CalTech based off the book.)

I haven't read Understanding Machine Learning, but it does seem to be an even more technical, given my skimming of your notes. However, the Mostafa et al book does give a proof of why you can expect the VC dimension to be polynomially bounded for a set of points greater than the break point (if the VC dimension is finite), as well as a full proof of the VC Bound in the appendix.

I haven't. Would you say that it is good?

UML also includes the proof though – it hasn't skipped any important proof so far. I just didn't want to replicate it *here*, because I didn't feel like understanding the steps of the proof really helped my intuition in any significant way.

Ah, gotcha.

LFD was my first intro to statistical learning theory, and I think it's pretty clear. It doesn't cover the No Free Lunch Theorem or Uniform Convergence, though, so your review actually got me wanting to read UML. I think that if you're already getting the rigor from UML, you probably won't get too much out of LFD.

Thanks, this and the previous post helped me understand a couple things that I had previously found confusing.

This is part 2 of a three part series on the fundamentals of Machine Learning as presented by this book. It builds heavily onto part I.

## Uniform Convergence

is a property which a hypothesis class may satisfy that will end up being equivalent to PAC learning, and the ability to use either definition whenever needed will be quite useful. "Uniform convergence" meansUniform convergenceconvergencewith respect to the difference between the empirical error and the real error, which isuniformwith respect to all hypotheses in the class – i.e. we can bound the difference between real and empirical error across all hypotheses simultaneously. We have the following formal definition:Recall that PAC learnability demands that the true error of the hypothesis output by AERM is low. On the other hand, uniform convergence demands that the gap between the empirical and the true error is low. Unsurprisingly, both properties are equivalent. We will prove one direction now, namely that uniform convergence implies PAC learnability.

Suppose the uniform convergence property holds for values ϵ,δ. We apply AERM to our task, feed it our training sequence S which fulfills the property that |S|>u∗(ϵ,δ), and now want to argue that the true error of AERM(S) is not much higher than the true error of some h∗∈argminh∈H[ℓ(h)]. The uniform convergence property does not allow us to compare both errors directly, rather, we can argue that ℓ(AERM(S))−ℓS(AERM(S))≤ϵ because of uniform convergence, that ℓS(AERM(S))−ℓS(h∗)≤0 because of what AERM does, and finally that ℓS(h∗)−ℓ(h∗)≤ϵ because of uniform convergence. Putting these three inequalities together, we do indeed obtain a bound on ℓ(AERM(S))−ℓ(h∗), but it is 2ϵ not ϵ. The confidence parameter δ stays the same. Of course, the doubling of the gap does not matter for the definition, since we can choose both ϵ and δ to be arbitrarily small anyway (to reach the gap ϵ for PAC learning, we simply choose ϵ2 for u∗), so this simple argument indeed proves that every hypothesis class which has the uniform convergence property is PAC learnable.

Before we get to more exciting things, consider the following problem in isolation. Let X be some finite set and let H be the set of all functions from X to {0,1}. Think of some "true" function f∗:X→{0,1} and of H as hypothesis class containing every possible candidate. To begin, the composition of H reflects that we're wholly ignorant as to the nature of f∗ (every possible function is in H). Now, suppose we are given some partial information about f∗ in the form of some subset U⊊X and a function g:U→{0,1} such that g=f∗|U. (This notation means, "g restricted to the set U".) We now only care about those functions in H that agree with g on the set U.

How much does this help? For every element x∈U, we now know the correct label, namely g(x). But consider an element y∈X−U. Does our partial knowledge of f∗ help us to figure out the label of f∗(y)? The answer to this is a resounding no. Even if U contains every element in X except y, there are still 2 possible candidates in H, namely the function f0 which agrees with g on U and sets y to 0, and f1 which agrees with g on U and sets y to 1. This means we're still at 50/50 odds with regard to the label of y, the same as we were from the beginning. Thus,

ifevery labeling combination is possible(and equally likely),thenlearning the label of some points is zero information about the label of other points.This principle will be key to the next two sections.## The No-Free-Lunch Theorem

The overfitting question is about how to weigh evidence vs priors, but is there a need for priors at all? What happens if one only relies on evidence? The papayas example shows that AERM will not reliably do the job, but perhaps there is some other, more clever learner that will always succeed given enough samples?

It will come to no surprise that the answer is negative (for any reasonable definition of "enough samples"), and this is formally proven in the

. Given any learning algorithm A, we shall construct a domain set X, a probability distribution D, and a hypothesis class H such that A(S) will have a high expected error. ("Expected" as in, we take the set of all possible training sequences as our probability space and consider A(S) to be a random variable so that we can compute its expected value.) The label set will still be just {0,1}, so the No-Free-Lunch theorem holds for binary classification. One noticeable feature of this proof is that we will require the instance domain to be much larger than the training sequence.No-Free-Lunch TheoremLet m be the length of the training sequence S, and let C⊆X be a set of size 2m. We will only consider probability distributions which put all of their probability mass into elements of C, so the instances outside of C will not matter. Let H′ be the set of all functions from C to Y={0,1}, and let H be the set obtained by taking every function from H′ and extending it to a function from all of X to Y by letting it label every element outside of C by 0. (Again, these elements do not matter.) In symbols,

H={h:X→Y|∃h′∈H′ such that [∀x∈C:h(x)=h′(x)]∧[∀x∈X−C:h(x)=0]}

Note that we have |H|=|H′|.

Now, we construct a set D of many different probability distributions. Each D∈D will sample each domain point x∈C with equal probability, but they will label the element differently. (We shall not formally switch to the setting of a true function in the environment, but the probability distributions which we consider are all consistent with such a function.) To be more precise, we construct one probability distribution for each function f∈H′ that labels each element according to f. In symbols, for each f∈H′ we set

Df((x,y)):={1/|C|;x∈C∧y=f(x)0;otherwise

and we let D={Df|f∈H′}. Notice that the learner is now in a situation analogous to what we've discussed earlier: it can learn about at most half of the points in the environment (because |S|=m=12|C|), and since every in-principle-possible labeling function is equally viable, that means it doesn't know anything about the labels it doesn't see, and those are at least half. Thus, it will have a sizeable total error – in fact, at least 14 in expectation, since A has to blindly guess about half of all elements.

The proof consists of formalizing the above argument. This gets more technical, but I'll provide the details for the part that I think is at least an interesting technical problem. For each probability distribution Df, one can enumerate all possible training sequences SDf1,...,SDfk that are obtained by querying Df repeatedly, m times in total. (We have k=(2m)m but this is not important.) The expected loss for this probability distribution is now 1k∑ki=1ℓ(A(SDfi)). The problem here is that A may just guess all unseen labels correctly – it may even ignore S entirely and guess the labels of

allelements correctly. Of course, such a learner would then perform exceptionally poorly on other probability distributions; the point is that it cannot do better than guess. Thus, averaging over all training sequences does not do the trick; we actually want to average over allprobability distributions. To achieve this, recall that we need to show that there exists someprobability distribution with high error, or in other words, themaximumacross all probability distributions (of the above sum) will have high error. We can then lower bound the maximum by replacing it with the average, hence summing over probability distributions after all. To be precise, let H′={f1,...,fT} (we have T=22m but this is not important) so that D={Df1,...,DfT}, thenmaxD∈D(1k∑ki=1ℓ(A(SDi)))≥1T∑Tj=11k∑ki=1(ℓ(A(SDfji)))

We can now exchange the order of these to sums so that we finally get to sum over all possible probability distributions. Then, we replace the right sum (across all training sequences) with the

minimumacross all training sequences, because in fact the training sequence does not matter. For each training sequence, we have that for each of the m points outside the sequence, exactly half of all probability distributions will label that point 0 and the other half 1. Now the learnercannotdifferentiate between them, no matter how it is defined, because they have lead to the identical training sequence, and thus it gets it right in only half of all cases. Alas, it only gets half of all elements outside the training sequence right in expectation, and thus at most three quarters of all elements total. (For the remaining details, see pages 62/63.)Now, this leads to an expected error of at least 14. However, instead of choosing a set C of size 2m, we could more generally choose a set C of size b⋅m. With an analogous construction, we can show that the learner might still get the m elements in the training sequence right, but will merely guess on the other (b−1)⋅m, and its expected error ends up as 12b−1b. Thus, by increasing b we can push the error arbitrarily close toward 12. In the setting of binary classification, this is the error achieved by blindly guessing, and so it is fair to say that no meaningful lower bound of quality can be achieved without making assumptions about the nature of the learning task. In particular, such assumptions are the basis of generalization: the idea that a papaya which is surrounded by similar papayas, all of which tasty, is probably tasty itself – that idea is an assumption about how we can generalize; it would not be true if we took all labeling functions as equally likely candidates for the true function. In fact, having such a dense hypothesis class makes generalization impossible, as we've just demonstrated. This fact is also key for the upcoming chapter.

## The VC-Dimension

The

is a concept that falls right out of the above line of thinking. You might recall that every finite hypothesis class is learnable – but many infinite hypothesis classes are also learnable. For example, the class of all intervals (or more precisely, the set of all predictors h such that there exists an interval I with the property that h(x)=1⟺x∈I) over the domain X=R is learnable. (I won't give a proof because we're about to get a much more general result, but it's probably not surprising – if you want, stop reading and think about it.) This suggests that the decisive property for learnability is not the size of the hypothesis class, but ratherVC-dimensionsome kind of measure for how easy it would be to construct a no-free-lunch style example.The key piece in the preceding proof is the same idea I've emphasized before: if all possible functions on a finite set are possible, then learning about the value of the true function on some points is zero information about its value on the other points. How does that work in the case of intervals? Can we construct a set of points such that all labeling functions are possible? Note that the important question is not whether the environment distribution D can label them in every possible combination – the answer to this is likely affirmative since the D need not be deterministic. Rather, the question is whether there are predictors

in our class– i.e., interval predictors – which allow every combination of labels on any particular set. This is so because PAC learnability only implies that we can get close to the best classifierin our hypothesis class, so in order for a no free lunch type argument to go through, it has to be impossible to obtain information about this predictor (which is in our class).So is this possible with intervals? Let's start with a 1-point set:

Success! We have found interval predictors that give every possible labeling combination (i.e. (1) and (0)) on our one domain point x∈X. (To be clear, the black and red x in the image are meant to denote the same element; we drew two possible scenarios of where the interval might be with regard to that element). We thus say that our one-point set is

by the class of interval predictors – a bad thing as it means it's harder to learn.shatteredWhat about a two-point set?

Indeed, we see that our hypothesis class is dense enough to shatter 2-point sets, since all four sequences of labels (namely (11), (10), (01), (00)) are possible. Now, the training sequence is of course a

sequenceand could therefore contain the same point twice. In that case, this two-point sequence could not be shattered. However, we actually care about the existential quantifier rather than the universal quantifier here. The existence of a single sequence of size two that is shattered will be important for our upcoming definition. (Also, the cases where the same point is included more than once are uninteresting anyway because those sequences can trivially never be shattered.) This is why we go about defining this new concept in terms of sets rather than sequences.What about three-point sets? Take a moment to think about it: there are eight possible labeling combinations. Are there eight different interval classifiers that will produce all eight different labels?

The answer is no – if we label the three points x and y and z such that x<y<z, any interval containing both x and z must also contain y, which makes the labeling (1,0,1) impossible (although all 7 other combinations are possible). No set of size three can be shattered by our class.

This proves that

the VC-dimension of our class is 2, because the VC-dimension is defined simply as VC(H):=max{n∈N|∃C⊆X:|C|=n∧H shatters C}. Since we foundsomeset of size 2 that is shattered and proved thatnoset of size 3 is shattered, the largest integer such that a set of that size can be shattered is 2.(If no maximum exists – i.e. if for any n∈N, there exists a set of size n that is shattered – the VC-dimension is ∞. )

How useful is this concept of VC-dimension? For one,

any class with VC-dimension ∞ is not PAC learnable. Given such a hypothesis class and any learner A, for each m∈N, one can find a set C of size 2m and a probability distribution D such that the argument used in the No-Free-Theorem goes through, thus establishing a minimum gap between the performance of the optimal classifier and the classifier output by A, thus contradicting the definition of PAC learning. (To be more precise, we first prove that, if the expected error is at least 14, then there must be at least a 18 chance to have an error of at least 17, which then gives us a direct contradiction with the definition of PAC learning. This proof of this is easy – prove the contrapositive.)The fact that this contradicts PAC learnability depends on the precise definition of PAC learnability. In particular, if the sample complexity were allowed to depend on the probability distribution, the No-Free Lunch argument would no longer work. The third post will discuss two alternative notions of learnability, where the sample complexity is allowed to depend on further elements.

What about classes with finite VC-dimension? Here, the definition of VC-dimension may seem uninformative on first glance – only sets of size five are shattered, but perhaps for any m there exist sets of size m that are almost shattered? However, it turns out this is not possible. One can prove that, for any m∈N and any subset C⊂X with |C|=m, the number of functions in HC (where HC:={h|C|h∈H}) is equal to the number of subsets of C that are shattered by H. Since, if the VC-dimension of H is d for some d∈N, only subsets of size at most d can be shattered, and since there are only polynomially many subsets of size at most d, this implies that the size of HC can be bounded by a function polynomial in |C|. This is a pretty strong result, given that there are 2|C| possible classifiers for C.

No proof for this result, as I didn't find reading it very instructive. The same will be true for most of the upcoming result.

## The Fundamental Theorem of Statistical Learning

The

(in slightly simplified form) states that, if H is some hypothesis class for any domain set X and target set Y={0,1}, the following statements are equivalent:fundamental theorem of statistical learningWe've proved 1⟹2 in the first section of this post. 2⟹3 is trivial. We've sketched the proof for 3⟹4 in the previous section, by arguing that ¬4⟹¬3. The difficult-and-very-technical-proof-I-will-not-include is the one for 4⟹1. Those four taken together make it into a circle, thus establishing equivalence.

There is also a quantitative version which upper- and lower bounds the sample complexity for PAC learnability and uniform convergence.

As nice as this result is, it is also important to understand that it says nothing whatsoever about runtime. As we've seen, the class of all predictors such that there exists a C program with at most 10000 symbols implementing the predictor is PAC learnable, but actually implementing AERM is completely infeasible.