Understanding Machine Learning (III)

Rafael Harth

This is the final part of a 3-part series on the fundamentals of Machine Learning as defined by this book (although it will transition into a longer sequence about the remaining, more practical part of the book). If you're interested, begin here.

Nonuniform Learning

Recall the definition of PAC learnability:

A hypothesis class $H$ is called PAC learnable iff there exists a learner $A$ and a function $m^{*} : (0, 1)^{2} \to N$ such that, for any probability distribution $D$ over $X \times Y$ , any gap $ϵ \in (0, 1)$ , and any failure probability $δ \in (0, 1)$ , if the training sequence $S$ consists of at least $m^{*} (ϵ, δ)$ i.i.d. examples sampled via $D$ , then with probability at least $1 - δ$ over the choice of $S$ , it holds that $ℓ (A (S)) \leq ℓ (h^{*}) + ϵ$ for all $h^{*} \in {argmin}_{h \in H} [ℓ (h)]$ .

One way to weaken this definition so that it applies to a wider range of hypothesis classes is to allow the function $m^{*}$ which upper-bounds the sample complexity to also depend on the hypothesis we are competing with – so rather than upper-bounding the difference of $A (S)$ and $h^{*}$ , we upper-bound the difference of $A (S)$ and $h^{'}$ separately for each $h^{'} \in H$ . We have the following formal definition:

A hypothesis class $H$ is called nonuniformly learnable iff there exists a learner $A$ and a function $z^{*} : (0, 1)^{2} \times H \to N$ such that, for any probability distribution $D$ over $X \times Y$ , any gap $ϵ \in (0, 1)$ , any failure probability $δ \in (0, 1)$ , and any hypothesis $h \in H$ , if the training sequence $S$ consists of at least $z^{*} (ϵ, δ, h)$ i.i.d. examples sampled via $D$ , then with probability at least $1 - δ$ over the choice of $S$ , it holds that $ℓ (A (S)) \leq ℓ (h) + ϵ$ .

(There is also a second relaxation, where the sample complexity is allowed to depend on the probability distribution $D$ in addition to the hypothesis and the gap and failure probability. It is called consistency, but it turns out to be too weak to be useful. One can prove that every countable hypothesis class is consistent, and that $A_{memorize}$ is a successful "learner.")

For an example of a hypothesis class that is nonuniformly learnable but not PAC learnable, consider the domain $X = R$ , the usual binary label class $Y = {0, 1}$ , and the hypothesis class $H$ of all polynomial classifiers – that is, all classifiers $h$ such that there exists a polynomial function $p$ such that $h (x) = 1 ⟺ p (x) \geq 0$ for all $x \in X$ . Arguing with the VC-dimension, one can see that class of polynomial classifiers of degree at most $n$ is PAC learnable (even though it's infinite), and that the entire class of polynomial classifiers is not PAC learnable. While this class is nonuniformly learnable (this will follow from what we prove this chapter), $A_{ERM}$ is no longer a successful learner. This fact is easy to verify: given any finite amount of points, $A_{ERM}$ would always choose a polynomial of sufficiently high degree in order to classify all of them correctly. As long as no two points are exactly the same, this is always doable – for example, one can choose the polynomial $- (1) \cdot (x - p_{1})^{2} \cdot (x - p_{2})^{2} \dots$ which will be 0 at every $p_{i}$ and negative everywhere else. You might notice that this is precisely the argument that proves that the class has infinite VC-dimension. It is also a classic example of overfitting – it is not at all going to lead to a small estimation error. In fact, this implementation would mean that $A_{ERM}$ behaves exactly like $A_{memorize}$ . A more sophisticated learning algorithm is needed.

Suppose we have such a nonuniform learner $A_{? ? ?}$ . This will mean that, if we fix some distribution $D$ generating the points $x \in X$ , then for any one hypothesis $h$ , we can bound the number of sample points we need such that the predictor $A_{? ? ?} (S)$ will compete well with the true error of $h$ .

The learner $A_{? ? ?}$ will have to be biased toward lower degree polynomials so as not to behave like $A_{ERM}$ . Therefore, if $h$ is a predictor corresponding to a polynomial with small degree, we probably won't need many samples to compete with it. On the other hand, if $h$ comes from a polynomial of degree 50 and the distribution $D$ is such that $h$ is an excellent predictor, we will need a very large number of sample points before $A_{? ? ?}$ is convinced enough to seriously consider high-degree polynomials and outputs something almost as good as $h$ . As the degree of polynomials we're competing with goes up, the number of sample points we need will go up, and therefore there is no uniform function (namely uniform across all other hypotheses) that can bound the sample complexity.

If $D$ is such that the optimal classifier corresponds to a polynomial of small degree, this will not be an issue – then, while we don't get the theoretical assurance of being able to compete with classifiers from high-degree polynomials, they will, in fact, have large real error (although excellent empirical error), and we will compete with every hypothesis in our class fairly quickly. In that case, the theoretical bound will be pessimistic. However, recall that nonuniform learning is about guarantees that hold for all distributions $D$ simultaneously.

I've previously stated that there are more sophisticated approaches to the evidence-vs-priors problem than to commit to a hypothesis class ahead of time. The learner $A_{? ? ?}$ will be an example.

Now recall the definition of uniform convergence:

A hypothesis class $H$ has the uniform convergence property iff there is a function $u^{*} : (0, 1)^{2} \to N$ such that, for every joint probability distribution $D$ over $X \times Y$ , every gap $ϵ \in (0, 1)$ and every confidence level $δ \in (0, 1)$ , if $| S | \geq u^{*} (ϵ, δ)$ , then with probability at least $1 - δ$ over the choice of $S$ , the inequality $| ℓ_{S} (h) - ℓ (h) | \leq ϵ$ holds for all $h \in H$ .

Notice that we called the three bounding functions $m^{*}$ in case of PAC learnability (for sample complexity), $u^{*}$ (for uniform convergence) and $z^{*}$ for nonuniform learning. Notice also that, unlike the other two, uniform convergence bounds the difference between real and empirical error in both directions (this will become important in a bit).

We've found two properties which are equivalent to PAC learnability: the uniform convergence property and having a finite VC-dimension. We now introduce a property which is equivalent to nonuniform learnability, namely

A hypothesis class $H$ is nonuniformly learnable iff there exists a family ${H_{i}}_{i \in N}$ of hypothesis classes, each of which PAC learnable, such that $H = ⋃_{i \in N} H_{i}$ .

There are two directions to prove in order to establish this result; we will prove both of them. The first one is pretty easy.

" $⟹$ " Suppose $H$ is nonuniformly learnable. Let $z^{*}$ be the function from the definition. For each $n \in N$ , set $H_{n} := {h \in H | z^{*} (\frac{1}{8}, \frac{1}{8}, h) \leq n}$ . Then, $H = ⋃_{n \in N} H_{n}$ , since for all $h \in H$ , we have $h \in H_{z^{*} (\frac{1}{8}, \frac{1}{8}, h)}$ . Moreover, given any $n \in N$ , if $H_{n}$ had infinite VC-dimension, we would find a probability distribution $D$ over $X \times Y$ such that the expected value of $ℓ (A (S)) - {min}_{h \in H} {ℓ (h)}$ is at least $\frac{1}{4}$ . (Apply the No-Free-Lunch argument.) In particular, there would be at least a $\frac{1}{8}$ chance that $ℓ (A (S)) - {min}_{h \in H} {ℓ (h)} \geq \frac{1}{7} > \frac{1}{8}$ , contradicting the fact that $z^{*} (\frac{1}{8}, \frac{1}{8}, h) \leq n$ for all $h \in H_{n}$ . Therefore, $H_{n}$ has finite VC-dimension and is thus PAC learnable by the fundamental theorem of statistical learning.

The interesting thing about this proof is that getting both the gap and the failure probability to $\frac{1}{8}$ indirectly implies that they can also go arbitrarily low.

The reverse direction is good deal harder. We shall construct a new learner $A_{SRM}$ that will be our $A_{? ? ?}$ . The abbreviation SRM stands for Structural Risk Minimization, and the idea is that we take into account both how well a hypothesis performs on the training data (like ERM does) and how easily we can bound the estimation error of its hypothesis class.

Structural Risk Minimization

To begin, we suppose that we are given a family ${H_{n}}_{n \in N}$ of hypothesis classes, each of which PAC learnable. Since each is PAC learnable, each also has the uniform convergence property, and we can find a family of functions ${u_{n}^{*}}_{n \in N}$ such that $u_{n}^{*}$ bounds the sample complexity for class $H_{n}$ . Among our infinitely many classes, we we will need to prioritize some over others. To do this, we define a weighting function $w : N \to (0, 1)$ ; i.e. a function with the property that $\sum_{i \in N} w (i) \leq 1$ . The obvious choice here is $w (i) = 2^{- i}$ , although many other choices are also possible; it will not matter for the theoretical construction.

To achieve a global guarantee, we shall query each class $H_{n}$ for some "local" guarantee in the form of some $ϵ^{'}$ and $δ^{'}$ . The gap will not be the same for each class, so we define a function $ϵ_{n} : (0, 1) \times N \to (0, 1)$ that, given a certainty $δ^{'}$ and a sample size $m \in N$ , will return the best (i.e. lowest) gap between real and empirical error that $H_{n}$ can guarantee for $δ^{'}$ and $m$ . In symbols, we set $ϵ_{n} (δ, m) := min {ϵ \in (0, 1) | u_{n}^{*} (ϵ, δ) \leq m}$ .

We want $A_{SRM}$ to satisfy the definition of nonuniform convergence, so we assume that we are given two values $ϵ, δ \in (0, 1)$ as well as a training sequence $S$ . Now in order to implement $A_{SRM}$ , we could, for each $n \in N$ , compute the gap which $H_{n}$ guarantees with failure probability $δ$ , namely $ϵ_{n} (δ, | S |)$ , as well as the minimal empirical error achievable with that class, namely $min {ℓ_{S} (h) | h \in H_{n}}$ . Then, we could output the hypothesis that minimizes the sum of these two values (across all of $⋃_{n \in N} H_{n}$ ). However, this approach will not allow a proof that we satisfy the definition of nonuniformly learning because we will not be able to upper-bound our risk of failure by $δ$ . Rather, we can upper-bound it by $δ$ for each class, but then then only one of these infinitely many guarantees has to fail for our uniform upper-bound to fail. (Despite its name, the $δ$ in the definition of nonuniform learning is very much uniform, i.e., it applies to the entire class $⋃_{n \in N} H_{n}$ .)

To remedy this, we will demand successively stronger safety guarantees, and we use this as our mechanism to privilege earlier classes. Namely, we do it as explained above except that we compute the gap which $H_{n}$ can guarantee with safety $1 - δ \cdot w (n)$ rather than just $1 - δ$ , i.e., the value $ϵ_{n} (δ \cdot w (n), | S |)$ . Clearly $δ \cdot w (n)$ is a lot smaller than $δ$ , so we get more confident bounds (and particularly confident bounds for more distant classes). Then we still compute the empirical risk, add both numbers, and return the global minimum. To write it down as a formula, $A_{SRM}$ will output a hypothesis in the set ${argmin}_{h \in H} {ℓ_{S} (h) + ϵ_{n} (δ \cdot w (n), | S |)}$ .

With these progressively more confident bounds, the probability that they all hold is $\prod_{i = 0}^{\infty} (1 - δ w_{n}) \geq 1 - \sum_{i = 0}^{\infty} δ w_{n} = 1 - δ \sum_{i = 0}^{\infty} w_{n} \geq 1 - δ$ .

With the definition in place, we can prove $⟸$ by showing that $A_{SRM}$ as constructed above is a nonuniform learner of the class $H$ . For given $ϵ, δ \in (0, 1)$ and $h \in H$ , if we set $n$ to be the index of the (first) class that contains the predictor $h$ , i.e., $n := min {n \in N | h \in H_{n}}$ , then we can define $z^{*} (ϵ, δ, h) := u_{n}^{*} (\frac{ϵ}{2}, δ \cdot w (n))$ , which will guarantee a gap between $ℓ (A_{SRM} (S))$ and $ℓ (h)$ of at most $ϵ$ by the following hand-written proof:

The first inequality holds because it simply writes out the gap which $ϵ_{1}$ guarantees. The second inequality holds because this is how we defined $A_{SRM}$ . The third inequality holds because $| S | \geq u_{n}^{*} (\frac{ϵ}{2}, δ \cdot w (n))$ and $ϵ_{n} (δ \cdot w (n), u_{n}^{*} (\frac{ϵ}{2}, δ \cdot w (n))) = \frac{ϵ}{2}$ (make sure you understand why). In our case, we have that $n = 5$ . And the fourth inequality – recall that uniform convergence bounds the difference between empirical and real error in both directions, i.e. it also holds that $ℓ_{S} (h) \leq ℓ (h) + ϵ_{5} (δ \cdot w (n), | S |)$ . Now the second summand comes out $\frac{ϵ}{2}$ for the same reason as above.

Note that, in the final (fourth) inequality, we bound the empirical error via the real error plus some maximum bound. This is necessary for the proof, but unlikely to make sense for a real instance – the hypothesis $h$ is in a fairly "far out" class, so it is presumably fairly complex, and likely to overfit the data, i.e., have low empirical error but high real error. Whenever the best predictor is in an early class, all of the later classes are likely to perform poorly in the real world (although they will often perform better on the training data, at least if the classes are actually ordered by increasing complexity). At least this is so in the case of polynomial predictors.

Minimum Description Length

Suppose we are interested in learning a countable hypothesis class $H$ . Without any other requirements, $H$ is nonuniformly learnable, being the countable collection of singleton classes, i.e. $H = {h_{0}, h_{1}, . . .} = ⋃_{n \in N} {h_{n}}$ . The idea now is that we can use a "natural" weighting function which is induced by the minimum description length of each predictor in the class. Formally, we define a description language as a subset $L \subset {0, 1}^{*}$ . If the description language is prefix-free, i.e. if there are no two descriptions $x, y \in L$ such that $x$ is a prefix of $y$ , we can define a weighting function $w : N \to (0, 1)$ by $w (h_{n}) = 2^{- | d (h_{n}) |}$ , where $d (h_{n}) \in L$ is the description of $h_{n}$ . To see that this defines a proper weighting function – i.e. that the total sum of weights equals at most 1 – note that can view the set of all descriptions as a probability space, i.e. $Ω = {d (h_{n})}_{n \in N}$ , and assign each description a probability. Specifically, consider the following experiment:

Repeatedly throw a coin; whenever it lands heads, write down a 1, whenever it lands tails, write down a 0. Stop when the output string equals some element in $L$ .

Since this language is prefix-free, each element $ℓ \in L$ is assigned a probability in this way – and as we know, probabilities sum up to at most 1. Then since $w = Pr$ , i.e. the function $w$ just assigns these probabilities as weights, this shows that it is a proper weighting function.

This leads to a natural definition of an alternative learner $A_{MDL}$ , which is simply following the SRM rule while using $w$ as the weighting function. This construction is closely related to Occam's Razor and Kolmogorov Complexity.

Computational Complexity of Learning

A field with fairly strong similarity to Machine Learning is statistics, because both are, in some sense, about learning from data. But in addition to theoretical limits, Machine Learning also has to concern itself with the annoying reality of bounded runtime. Thus, even though we've previously pretended that the behavior of $A_{ERM}$ and $A_{SRM}$ is fully determined by their subscript, this is not really so: the subscript is merely a mission statement, and how this is actually achieved has been left unsaid. Different approaches may have drastically different runtimes in practice.

Instead of reinventing the wheel, we will borrow the relevant ideas from the existing literature on computational complexity theory.

Computational Complexity Theory: the Landau notation

One generally cannot make precise statements about the runtime of an algorithm, because it depends on the processor speed of the machine it runs on (a problem which persists even if the algorithm is fully specified). Instead, one uses "asymptotic" bounds. Let's demonstrate how this works with an example.

Consider the $selection-sort$ algorithm. It takes as input a sequence of points $(x_{i})_{i \in {1, . . ., n}}$ such that $x_{i} \in X$ for all $i$ and there is a total order $<$ on $X$ (so given any two elements, we can say which one is larger). It goes through the list, remembers the smallest element, puts it at the beginning, goes through the list again (this time starting with the second element), remembers the smallest element, and puts it at the second spot, then goes through the list again (this time starting with the third element), puts the smallest element of those in third place and so on. It's the slowest not-intentionally-bad sorting algorithm that I know of, but it's also the simplest, and it's probably a decent description of how a human would sort a list.

So how long does $selection-sort$ need to sort a list of $n$ elements? Well, we sort of need $n$ operations to find the smallest element, then $n - 1$ for the second smallest and so on, leading to $\sum_{i = 1}^{n} i = n \cdot \frac{n + 1}{2}$ . And then some operations to swap elements, but it's not obvious how many. And maybe there's some overhead, too.

To have something rigorous, we would like to say that it's around (or actually, not significantly larger than) $n^{2}$ in a principled way. The approach taken here is the $O$ notation or Landau notation. Let $s$ be the function describing the runtime of $selection-sort$ as a function of the input size of the list on any machine, and let $f : N \to N$ be given by $f (n) = n^{2}$ , where $f$ outputs time in seconds. Then we want to write that $s \in O (f)$ to say exactly that " $s$ is not significantly larger than $f$ ", even though we don't know exactly how large $s$ gets (because we don't know for which machine it measures the runtime). The formal definition goes like this:

$s \in O (f) ⟺ \exists C \in R, n_{0} \in N such that \forall n \in N_{\geq n_{0}} : s (n) \leq C \cdot f (n)$

So the idea is that, regardless of how slow the machine is that $s$ is measuring runtime for, there should be some $C$ such that every operation on this machine takes at most $C$ seconds. Obviously, for any modern machine, many thousands of operations might be performed in one second, so $C < 0.001$ is quite likely to do the job. But even if we had declared that $f$ measures nano-seconds instead, there would still always be such a $C$ ; one could just take the current one and multiply it by $10^{9}$ . Thus, if such a $C$ exists, then $s$ should be bounded by $C$ times $f$ . Furthermore, there might be some constant overhead, so we don't demand that this is true for all input values, only all input values from some point onward.

One often sees $f \in O (n^{2})$ as shorthand for [ $f \in O (g)$ where $g (n) := n^{2}$ ]. One also often sees $f = O (n^{2})$ rather than $f \in O (n^{2})$ even though that notation is nonsensical (function equals set of functions?) and not even shorter. Far worse still is the inexplicable $f (n) = O (g (n))$ which offends me on a remarkably deep level.

To apply these ideas to machine learning, we will need to make three adjustments. 1) we need functions that take real numbers as input, rather than natural numbers. 2) we will need to compare their output as their parameter goes to 0, rather than as their parameter grows. And 3), we will need functions of several variables.

The first two are straight-forward. For $s, f : (0, 1) \to N$ , one can simply alter the definition into

$s \in O (f) ⟺ \exists C \in R, r_{0} \in (0, 1) s.t. \forall r \in (0, 1) : (r < r_{0} ⟹ s (r) \leq C \cdot f (r))$

The third one appears to raise at least one question – if $r$ turns to a vector $r = (r_{1}, . . ., r_{d})$ , then what does the condition " $r < r_{0}$ " turn into? The answer is " $r_{i} < r_{0}$ for some $i \in {1, . . ., d}$ ", so it will be enough if one of the coordinates gets small.

Last words on Landau in general: since $f \in O (g)$ for $f (n) = 3 n$ and $g (n) = 2^{2^{n}}$ , this notation is fairly imprecise. To have something that bounds runtime in both directions, one can use the symbol $Θ$ which is defined as $f \in Θ (g) : ⟺ f \in O (g) \land g \in O (f)$ .

Applying Landau to Machine Learning tasks

In the context of Machine Learning, one has the problem that there is no obvious input size for a learning task. The size of the training set is a non-starter because having more training data makes a problem easier, not harder. What one does instead is to use the gap and confidence as parameters. We have the following formal definition:

Given a function $f : (0, 1) \to N$ , a learning task defined via $X$ and $Y$ and $H$ and $ℓ$ , and a learner $A$ , we say that $A$ solves the task in time $O (f)$ iff there exists a constant $C \in R$ such that, for all $ϵ, δ \in (0, 1)$ , if $A$ is given a randomly sampled training sequence $S$ of sufficient length, then

1. $A$ always puts out a predictor $h := A (S)$ in at most $C \cdot f (ϵ, δ)$ seconds

2. Whenever $h$ is given any domain point $x \in X$ , it puts out a label $y \in Y$ in at most $C \cdot f (ϵ, δ)$ seconds

3. The equation $ℓ (h) \leq {min}_{h^{'} \in H} [ℓ_{S} (h^{'})] + ϵ$ holds with probability at least $1 - δ$ over the choice of $S$

The first condition is the logical place we've been working toward: it says that $A$ can't take too long to output a predictor. The second condition makes sure that $A$ doesn't cheat – if it wasn't for this rule, then $A$ could offload all computational work to the predictor. To be precise, we could define $A_{ERM_cheating}$ as "memorize the training sequence, then output the predictor $h$ that, upon receiving a domain point $x \in X$ , runs the code of $A_{ERM_honest}$ , applies that to $x$ and returns its output". Clearly, $A_{ERM_cheating}$ has excellent runtime, but it's probably not a very useful way to go about defining a predictor. Alas, with the predictor itself being bound by the same term as the learner, this strategy no longer helps. (Even if it could somehow offload half of all work to the predictor, taking half as long doesn't change the complexity class.)

Finally, the third condition is just the usual PAC condition.

There are some cases in which $A_{ERM}$ can be implemented with a reasonable runtime complexity. But usually it can't, and that's why this is only the end of part I and not of the book. There are many different hypothesis classes that are of interest, and the ability to implement $A_{ERM}$ efficiently on them is generally desirable.

LESSWRONG
LW