(This is part five in a sequence on Machine Learning based on this book. Click here for part 1.)

The first three posts of this sequence have defined PAC learning and established some theoretical results, problems (like overfitting), and limitations. While that is helpful, it doesn't actually answer the question of how to solve real problems (unless the brute force approach is viable). The first way in which we approach that question is to study particular classes of problems and prove that they are learnable. For example, in the previous post, we've looked (among other things) at linear regression, where the loss function has the form $ℓ : R^{d} \to R$ and is linear. In this post, we focus on convex learning problems, where the loss function also has the above form and is convex.

Convexity

We begin with sets rather than functions.

Convex sets

A set $M$ (that is part of a vector space) is called convex iff for any two points in that set, the line segment which connects both points is a subset of $M$ .

The condition that $M$ is part of a vector space is missing in the book, but it is key to guarantee that a line between two points exists. Convexity cannot be defined for mere topological spaces, or even metric spaces. In our case, all of our sets will live in $R^{d}$ for some $d \in N_{+}$ .

As an example, we can look at letters as a subset of the plane. None of the letters in this font are quite convex – l and I are closest but not quite there. The only convex symbols in this post that I've noticed are . and ' and | and –.

Conversely, every regular filled polygon with $n$ corners is convex. The circle is not convex (no two points have a line segment which is contained in the circle), but the disc (filled circle) is convex. The disc with an arbitrary set of points on its boundary (i.e. the circle) taken out remains convex. The disc with any other point taken out is not convex, neither is the disc with any additional point added. You get the idea. (To be precise on the last one, the mathematical description of the disc is $D = \{x \in R^{2} \mid \| x \| \leq 1\}$, so there is no way to add a single point that somehow touches the boundary.)
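To make the disc/circle contrast concrete, here is a minimal sketch that samples points on the segment between two members of a set and tests whether they stay in the set. The helper names (`in_disc`, `on_circle`, `segment_stays_inside`) are my own, and a sampled check like this can only find counter-examples; it does not prove convexity.

```python
import math
import random

def in_disc(p):
    """Closed unit disc: all points with norm <= 1 (small tolerance for rounding)."""
    return math.hypot(p[0], p[1]) <= 1.0 + 1e-12

def on_circle(p):
    """Unit circle: all points with norm == 1 (up to rounding)."""
    return abs(math.hypot(p[0], p[1]) - 1.0) <= 1e-9

def segment_stays_inside(member, p, q, steps=50):
    """Test sampled points p + a(q - p), a in [0, 1], against a membership test."""
    for k in range(steps + 1):
        a = k / steps
        point = (p[0] + a * (q[0] - p[0]), p[1] + a * (q[1] - p[1]))
        if not member(point):
            return False
    return True

def random_disc_point(rng):
    """Rejection-sample a point from the closed unit disc."""
    while True:
        p = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        if math.hypot(p[0], p[1]) <= 1.0:
            return p

rng = random.Random(0)
disc_ok = all(
    segment_stays_inside(in_disc, random_disc_point(rng), random_disc_point(rng))
    for _ in range(1000)
)
# Two distinct points on the circle: the chord between them leaves the circle.
circle_ok = segment_stays_inside(on_circle, (1.0, 0.0), (0.0, 1.0))
print(disc_ok, circle_ok)
```

For the disc, no sampled segment leaves the set; for the circle, the very first interior point of the chord already fails the membership test.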

Convex functions

Informally, a function $f : R^{d} \to R$ is convex iff the set of all points on and above the function is convex as a subset of $R^{d + 1}$ , where the dependent variable goes up/downward.

(Here, the middle function ( $x^{3}$ ) is not convex because the red line segment is not in the blue set, but the left ( $| x |$ ) and the right ( $x^{2}$ ) are convex.)

The formal definition is that a function $f : R^{d} \to R$ is convex iff for all $x, y \in R^{d}$, the inequality $f (x + α (y - x)) \leq f (x) + α [f (y) - f (x)]$ holds for all $α \in [0, 1]$. This says that a line segment connecting two points on the graph never dips below the graph.
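The definition can be tested numerically for $d = 1$. The following is a rough sketch (the function names are mine) that searches a finite grid of $x$, $y$ and $α$ for a violation of the inequality. Finding none is only evidence of convexity, whereas a single violation is a proof of non-convexity.

```python
def violates_convexity(f, x, y, alpha, tol=1e-9):
    """True if f(x + a(y - x)) > f(x) + a[f(y) - f(x)] at this sample."""
    left = f(x + alpha * (y - x))
    right = f(x) + alpha * (f(y) - f(x))
    return left > right + tol

def looks_convex(f, points, alphas):
    """Search a finite grid for a violation of the convexity inequality."""
    return not any(
        violates_convexity(f, x, y, a)
        for x in points for y in points for a in alphas
    )

points = [k / 4 for k in range(-12, 13)]   # grid on [-3, 3]
alphas = [k / 10 for k in range(11)]       # grid on [0, 1]

print(looks_convex(abs, points, alphas))               # |x|
print(looks_convex(lambda x: x ** 2, points, alphas))  # x^2
print(looks_convex(lambda x: x ** 3, points, alphas))  # x^3
```

For $x^{3}$, one violating triple is $x = -3$, $y = 0$, $α = 0.5$: the left side is $(-1.5)^{3} = -3.375$, while the right side is $-27 + 0.5 \cdot 27 = -13.5$.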

If $d = 1$ as was the case in all of my pictures, then $f$ is convex iff the little pixie flying along the function graph never turns rightward. If $f$ is differentiable, this is the case iff $f^{'}$ is monotonically increasing, which (if $f$ is twice differentiable) is the case iff $f^{''} (x) \geq 0$ for all $x \in R$.

The main reason why convexity is a desirable property is that, for a convex function, every local minimum is a global minimum. Here is a proof:

Suppose that $x \in R^{d}$ is a local minimum. Then we find some ball $B_{d} (x, ϵ) := \{p \in R^{d} \mid \| p - x \| \leq ϵ\}$ around $x$ such that $f (y) \geq f (x)$ for all $y$ in the ball (this is what it means to be a local minimum in $R^{d}$ ). Now let $z$ be an arbitrary point in $R^{d}$; we show that its function value can't lie below that of $x$. Imagine the line segment from $x$ to $z$. A part of it must lie in our ball, so we find some (small) $δ \in (0, 1]$ such that $x + δ [z - x] \in B_{d} (x, ϵ)$. Then (because $x$ is our local minimum), we have that $f (x) \leq f (x + δ [z - x])$. By convexity of $f$ we have $f (x + δ [z - x]) \leq f (x) + δ [f (z) - f (x)]$, so taken together we obtain the inequality

$f (x) \leq f (x) + δ [f (z) - f (x)]$

Or equivalently $δ [f (z) - f (x)] \geq 0$ which is to say that $δ f (z) \geq δ f (x)$ which (since $δ > 0$ ) is to say that $f (z) \geq f (x)$.

If there are several local minima, then there are several global minima, then one can draw a line segment between them that inevitably cannot go up or down (because otherwise one of the global minima wouldn't be a global minimum), so really there is just one global minimum. This is all about the difference between $\leq$ and $<$ . The simplest example is a constant function – it is convex, and everywhere is a global minimum.

Jensen's Inequality

The key fact about convex functions, I would argue, is Jensen's inequality:

Given $α_{1}, \dots, α_{n} \in R_{+}$ with $\sum_{i = 1}^{n} α_{i} = 1$, if $f : R^{d} \to R$ is convex, then for any sequence $(x_{1}, \dots, x_{n}) \in (R^{d})^{n}$, it holds that $f (\sum_{i = 1}^{n} α_{i} x_{i}) \leq \sum_{i = 1}^{n} α_{i} f (x_{i})$.

If you look at the inequality above, you might notice that it is almost the definition of linearity, except for the condition $\sum_{i = 1}^{n} α_{i} = 1$ and the fact that we have $\leq$ instead of $=$. So convex functions fulfill the linearity property as an inequality rather than an equality (almost). In particular, linear functions are convex. Conversely, concave functions (these are functions where the $\leq$ in the definition of convex functions is a $\geq$ ) also fulfill the above property as an inequality, only the sign does again turn around. In particular, linear functions are concave. To refresh your memory, here is the definition of convexity:

$f (x + α (y - x)) \leq f (x) + α [f (y) - f (x)] \forall x, y \in X, α \in [0, 1]$

So to summarize: convex functions never turn rightward, concave functions never turn leftward, and the intersection of both does neither, i.e., always goes straight, i.e., is linear. Looking at convexity and concavity as a generalization of linearity might further motivate the concept.

Terms of the form $x + a (y - x)$, which one sees quite often (for example in defining points on a line segment), can be equivalently written as $(1 - a) x + a y$. I think the first form is more intuitive; however, the second one generalizes a bit better. We see that $x$ and $y$ are given weights, and those weights sum up to $1$. If one goes from 2 weighted values to $n$ weighted values (still all non-negative), one gets Jensen's inequality. Thus, the statement of Jensen's inequality is that if you take any number of points on the graph and construct a weighted mean, that resulting point still lies on or above the graph. See Wikipedia's page for a simple proof via induction.
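As a sanity check on the weighted-mean reading of Jensen's inequality, here is a small sketch for $d = 1$ and $f (x) = x^{2}$. The names `jensen_gap` and `random_weights` are my own. The gap is the right-hand side minus the left-hand side of the inequality, so it should never be negative for a convex $f$.

```python
import random

def jensen_gap(f, xs, weights):
    """Return sum_i a_i f(x_i) - f(sum_i a_i x_i); non-negative when f is convex."""
    mean_point = sum(a * x for a, x in zip(weights, xs))
    mean_value = sum(a * f(x) for a, x in zip(weights, xs))
    return mean_value - f(mean_point)

def random_weights(n, rng):
    """Non-negative weights that sum to 1."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

rng = random.Random(1)
gaps = []
for _ in range(1000):
    n = rng.randint(2, 6)
    xs = [rng.uniform(-5.0, 5.0) for _ in range(n)]
    gaps.append(jensen_gap(lambda x: x ** 2, xs, random_weights(n, rng)))

# No sampled weighted mean falls below the graph (up to rounding).
print(min(gaps) >= -1e-9)

# Two-point case with weights (1 - a, a), i.e. the definition of convexity.
print(jensen_gap(lambda x: x ** 2, [0.0, 2.0], [0.5, 0.5]))
```

In the two-point example, the weighted mean of the points $(0, 0)$ and $(2, 4)$ is $(1, 2)$, which lies exactly $1$ above the graph value $f (1) = 1$.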

Guaranteeing learnability

Recall that we are trying to find useful solvable special cases of the setting "minimize a loss function of the form $ℓ : R^{d} \to R$ ". This can be divided into three tasks:

(1) define the special case

(2) demonstrate that this special case is indeed solvable

(3) apply the class as widely as possible

This chapter is about (1). (When I say "chapter" or "section", I'm referring to the level-1 and level-2 headlines of this post as visible in the navigation at the left.) The reason why we aren't already done with (1) is that convexity of the loss function alone turns out to be insufficient to guarantee PAC learnability. We'll discuss a counter-example in the context of linear regression and then define additional properties to remedy the situation.

A failure of convexity

In our counter-example, we'll have $X = Y = R$ (note that this is a convex set). Our hypothesis class $H$ will be that of all (small) linear predictors, i.e. just $H = \{h_{α} : x \mapsto α \cdot x \mid α \in [- 1, 1]\}$. The reason that we only allow small predictors is that our final formulation of the learnable class will also demand that $H$ is a bounded set, so this example demonstrates that even boundedness + convexity is still not enough.

We've previously defined real loss functions as taking a hypothesis and returning the real error, and empirical loss functions as taking a hypothesis and a training sequence and returning the empirical error. Now we'll look at point-based loss functions (not bolded as it's not an official term, but I'll be using it a lot) which measure the error of a hypothesis on a single point only, i.e. they have the form $ℓ_{(x, y)} : H \to R$ for some $(x, y) \in X \times Y$. To be more specific, we will turn the squared loss function defined in the previous post into a point-based loss function. Thus we will have $ℓ_{(x, y)}^{(2)} (h_{α}) = \| α \cdot x - y \|^{2} = (α x - y)^{2}$, where the last equality holds because we're in the one-dimensional case. We will only care about two points (all else will have probability mass $0$ ), namely these two:

That's the point $(1, - 1)$ at the left and one all the way to the right at $(1 / μ, 0)$. With $μ$, think of an extremely small positive number, so that $1 / μ$ is quite large.

If this class were PAC learnable, then there would be a learner $A$ such that, for all $ϵ, δ \in (0, 1)$ , if the size of the training sequence is at least $m^{*} (ϵ, δ)$ , then for all probability distributions over $X \times Y$ , with probability at least $1 - δ$ over the choice of $S$ , the error of $A (S)$ would be at most $ϵ$ larger than that of the best classifier.

So to prove that it is not learnable, we first assume we have some learner $A$ . Then we get to set some $ϵ$ and $δ$ and construct a probability distribution $D_{A}$ based on $A$ . Finally, we have to prove that $A$ fails on the problem given the choices of $ϵ$ and $δ$ and $D_{A}$ . That will show that the problem is not PAC learnable.

We consider two possible candidates for $D_{A}$. The first is $D_{L}$ which has all probability mass on the point $(1, - 1)$ on the left. The second is $D_{?}$, which has almost all probability mass on the point $(1, - 1)$, but also has $μ$ probability mass on the point $(1 / μ, 0)$. As mentioned, $μ \in R_{+}$ will be extremely small; so small that the right point will be unlikely to ever appear in the training sequence.

If the right point doesn't appear in the training sequence, then the sequence consists of only the left point sampled over and over again. In that case, $A$ cannot differentiate between $D_{L}$ and $D_{?}$ , so in order to succeed, it would have to output a hypothesis which performs well with both distributions – which as we will show is not possible.

Given our class $H$, the hypothesis $A (S)$ must be of the form $h_{α}$ for some $α \in [- 1, 1]$. Recall that the classifier is supposed to predict the $y$-coordinate of the points. Thus, for the first point, $α = - 1$ would be the best choice (since $- 1 \cdot 1 = - 1$ ) and for the second point, $α = 0$ would be the best choice (since $0 \cdot \frac{1}{μ} = 0$ ).

Now if $α \leq - 0.5$, then we declare that $D_{A} = D_{?}$. In this case (assuming that the second point doesn't appear in the training sequence), there will be a $μ$ chance of predicting the value $α \cdot \frac{1}{μ} = \frac{α}{μ} \leq - \frac{1}{2 μ}$, which, since the true value is $0$ and we use the squared loss function, leads to an error of at least $\frac{1}{4 μ^{2}}$, and thus the expected error is at least $μ \frac{1}{4 μ^{2}} = \frac{1}{4 μ}$, which, because $μ$ is super tiny, is a very large number. Conversely, the best classifier would be at least as good as the classifier with $α = 0$, which would only have error $1 - μ$ (for the left point), which is about $1$ and thus a much smaller number.

Conversely, if $α > - 0.5$ , we declare that $D_{A} = D_{L}$ , in which case the error of $A (S)$ is at least $(- 0.5 - (- 1))^{2} = \frac{1}{4}$ , whereas the best classifier (with $α = - 1$ ) has zero error.

Thus, we only need to choose some $ϵ < \frac{1}{4}$ and an arbitrary $δ$ . Then, given the sample size $m$ , we set $μ$ small enough such that the training sequence is less than $δ$ likely to contain the second point. This is clearly possible: we can make $μ$ arbitrarily small; if we wanted, we could make it so small that the probability of sampling the second point is $< δ^{100}$ . That concludes our proof.
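The case distinction in this proof can be replayed numerically. The sketch below is my own illustration, not part of the book's argument: for a fixed $μ$ it builds both distributions, lets an "adversary" pick the one from the proof depending on $α$, and reports how much worse $h_{α}$ does than the best predictor. The best predictor is approximated by a grid search over $[- 1, 1]$, and all function names are hypothetical.

```python
def squared_loss(alpha, x, y):
    """Point-based squared loss of the predictor x -> alpha * x at (x, y)."""
    return (alpha * x - y) ** 2

def expected_loss(alpha, distribution):
    """Expected loss under a finite distribution given as [(prob, x, y), ...]."""
    return sum(p * squared_loss(alpha, x, y) for p, x, y in distribution)

def d_left():
    """All probability mass on the left point (1, -1)."""
    return [(1.0, 1.0, -1.0)]

def d_mixed(mu):
    """Mass 1 - mu on (1, -1) and mass mu on the far-right point (1/mu, 0)."""
    return [(1.0 - mu, 1.0, -1.0), (mu, 1.0 / mu, 0.0)]

def excess_loss(alpha, distribution, grid=2001):
    """Loss of h_alpha minus the best loss over a grid of alphas in [-1, 1]."""
    candidates = [-1.0 + 2.0 * k / (grid - 1) for k in range(grid)]
    best = min(expected_loss(a, distribution) for a in candidates)
    return expected_loss(alpha, distribution) - best

def adversary(alpha, mu):
    """Pick the distribution on which h_alpha does badly, as in the proof."""
    return d_mixed(mu) if alpha <= -0.5 else d_left()

mu = 1e-3
for alpha in (-1.0, -0.5, -0.4, 0.0, 1.0):
    print(alpha, excess_loss(alpha, adversary(alpha, mu)))

# Chance that m i.i.d. samples from the mixed distribution never show the right point.
m = 100
print((1.0 - mu) ** m)
```

Whatever $α$ the learner outputs after seeing only the left point, the excess loss on the adversary's distribution stays above $\frac{1}{4}$, and shrinking $μ$ pushes the last printed probability towards $1$.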

Why was this negative result possible? It comes down to the fact that we were able to make the error of the first classifier with $α \leq - 0.5$ large via a super unlikely sample point with ${super}^{2}$ high error – so the problem is the growth rate of the loss function. As long as the loss function grows so quickly that, while both giving a point less probability mass and moving it further to the right, the expected error goes up, well then one can construct examples with arbitrarily high expected error. (As we've seen, the expected error in the case of $α \leq - 0.5$ is at least $\frac{1}{4 μ}$, i.e. a number that grows arbitrarily large as $μ \to 0$.)

Lipschitzness & Smoothness

There are at least two ways to formalize a requirement such that the loss function is somehow "bounded". They're called Lipschitzness and Smoothness, and both are very simple.

Lipschitzness says that a function cannot grow too fast, i.e.:

A function $f : R^{d} \to R$ is $ρ$-Lipschitz iff $| f (y) - f (x) | \leq ρ \| y - x \| \ \forall x, y \in R^{d}$

If $f$ is differentiable, then a way to measure maximum growth is the gradient, because the gradient points into the direction of fastest growth. Thus, one has the equivalent characterization:

A differentiable function $f : R^{d} \to R$ is $ρ$-Lipschitz iff $\| \nabla f (x) \| \leq ρ$ for all $x \in R^{d}$

However, non-differentiable functions can be Lipschitz; for example, the absolute value function $| x |$ on the real numbers is $1$ -Lipschitz.

Conversely, smoothness is about the change of change. Thus, $| x |$ is definitely not smooth since the derivative jumps from $- 1$ to $1$ across a single point (smoothness is only defined for differentiable functions). On the other hand, the function $x^{2}$ is smooth on all of $R$ . The formal definition simply moves Lipschitzness one level down, i.e.

A differentiable function $f : R^{d} \to R$ is $β$ -smooth iff its gradient is $β$ -Lipschitz

Which is to say, iff $\| \nabla f (y) - \nabla f (x) \| \leq β \| y - x \|$ for all $x, y \in R^{d}$. In the one-dimensional case, the gradient equals the derivative, and if the derivative is itself differentiable, then smoothness can be characterized in terms of the second derivative. Thus, a twice-differentiable function $f : R \to R$ is $β$-smooth iff $| f^{''} (x) | \leq β$ for all $x \in R$.
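Both definitions bound a ratio of differences, so one helper can estimate either constant in the one-dimensional case: apply it to $f$ for Lipschitzness and to $f^{'}$ for smoothness. This is only a sketch under my own naming (`lipschitz_estimate`, `grid`), and since it samples finitely many pairs, it yields a lower bound on the true constant over the sampled range.

```python
def lipschitz_estimate(g, points):
    """Largest ratio |g(y) - g(x)| / |y - x| over all pairs of sample points."""
    best = 0.0
    for i, x in enumerate(points):
        for y in points[i + 1:]:
            best = max(best, abs(g(y) - g(x)) / abs(y - x))
    return best

def grid(radius, n=201):
    """n evenly spaced points on [-radius, radius]."""
    return [-radius + 2.0 * radius * k / (n - 1) for k in range(n)]

def square(x):
    return x ** 2

def square_prime(x):
    """Derivative of x^2."""
    return 2.0 * x

for radius in (1.0, 10.0, 100.0):
    pts = grid(radius)
    print(
        radius,
        lipschitz_estimate(abs, pts),           # Lipschitzness of |x|
        lipschitz_estimate(square, pts),        # Lipschitzness of x^2
        lipschitz_estimate(square_prime, pts),  # smoothness of x^2
    )
```

The estimate for $| x |$ stays at $1$ however wide the range, and the estimate for the derivative of $x^{2}$ stays at $2$, whereas the estimate for $x^{2}$ itself keeps growing with the radius. So $| x |$ is Lipschitz but not smooth, and $x^{2}$ is smooth but not Lipschitz on all of $R$.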

One now defines the class of convex Lipschitz bounded problems and that of convex smooth bounded problems. They both require that $H$ has a structure as a familiar set like $B_{d} (0, M)$, that it is convex and bounded (so $H = R^{d}$ would not suffice), and that, for all $(x, y) \in X \times Y$, the point-based loss function $ℓ_{(x, y)} : H \to R$ is convex and $ρ$-Lipschitz (in the former case) or convex, $β$-smooth and nonnegative (in the latter case). If all this is given, the class is called convex Lipschitz bounded with parameters $(M, ρ)$; or convex smooth bounded with parameters $(M, β)$.

In the previous example, the hypothesis class could be represented by the set $[- 1, 1]$, which is both convex and bounded. (In that example, we thought of it as a set of functions, each of which is fully determined by an element in $[- 1, 1]$; now we think of it as the set $[- 1, 1]$ itself.) Each point-based loss function $ℓ_{(x, y)}^{(2)}$ is convex (and non-negative). However, for any number $x \in R_{+}$, the loss function $ℓ_{(x, y)}^{(2)}$ is defined by the rule $ℓ_{(x, y)}^{(2)} (α) = (α \cdot x - y)^{2}$, and the gradient of this function with respect to $α$ (which equals the derivative since we're in the one-dimensional case) is $2 x (α \cdot x - y)$. Since $α$ is bounded, each particular function is Lipschitz, but the derivative gets large as $x$ gets large, so there is no parameter $ρ$ such that all functions are $ρ$-Lipschitz. Furthermore, the second derivative is $2 x^{2}$. This means that each particular function induced by the point $(x, y)$ is $2 x^{2}$-smooth, but there is no parameter $β$ such that all functions are $β$-smooth.
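To see why no single parameter works for the whole family of point-based squared losses, one can estimate slope and curvature in $α$ for points with growing $x$. The sketch below uses central finite differences on a grid over $[- 1, 1]$ (the interval from the definition of $H$ in the counter-example). The function names are mine, and the step sizes are arbitrary choices that reach slightly past the ends of the interval.

```python
def squared_loss(alpha, x, y):
    """Point-based squared loss (alpha * x - y)^2, viewed as a function of alpha."""
    return (alpha * x - y) ** 2

def alpha_grid(n=401):
    """n evenly spaced values of alpha on [-1, 1]."""
    return [-1.0 + 2.0 * k / (n - 1) for k in range(n)]

def max_slope(x, y, h=1e-4):
    """Largest central-difference estimate of |d loss / d alpha| on the grid."""
    return max(
        abs(squared_loss(a + h, x, y) - squared_loss(a - h, x, y)) / (2.0 * h)
        for a in alpha_grid()
    )

def max_curvature(x, y, h=1e-3):
    """Largest central-difference estimate of |d^2 loss / d alpha^2| on the grid."""
    return max(
        abs(squared_loss(a + h, x, y) - 2.0 * squared_loss(a, x, y)
            + squared_loss(a - h, x, y)) / (h * h)
        for a in alpha_grid()
    )

for x in (1.0, 10.0, 100.0):
    print(x, max_slope(x, 0.0), max_curvature(x, 0.0))
```

Both columns grow without bound as $x$ grows, even though $α$ stays in a bounded interval, which is exactly the obstacle to fitting this problem into either class.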

Surrogate Loss Functions

We are now done with task (1), defining the special case. Task (2) would be demonstrating that both convex Lipschitz bounded and convex smooth bounded problems are, in fact, PAC learnable with no further conditions. This is done by defining an algorithm and then proving that the algorithm works (i.e., learns any instance of the class with the usual PAC learning guarantees). The algorithm we will look at for this purpose (which does learn both classes) is an implementation of Stochastic Gradient Descent; however we'll do this in the next post rather than now. For this final chapter, we will instead dive into (3), i.e. find an example of how the ability to learn these two classes is useful even for problems that don't naturally fit into either of them.

Recall the case of binary linear classification. We have a set of points in some high-dimensional space $X = R^{d}$, a training sequence where points are given binary labels (i.e. $Y = \{- 1, 1\}$ ) and we wish to find a hyperplane that performs well in the real world. We've already discussed the Perceptron algorithm and also reduced the problem to linear programming; however, both approaches have assumed that the problem is separable.

We're not going to find a perfect solution for the general case, because one can show that the problem is NP-hard. However, we can find a solution that approximates the optimal predictor. The approach here is to define a surrogate loss function, which is a loss function $ℓ^{*}$ that (a) upper-bounds the real loss $ℓ^{0 - 1}$ , and (b) has nicer properties than the real loss, so that minimizing it is easier. In particular, we would like for it to be a member of one of the two learnable classes we have introduced. Our point-based loss function for $ℓ^{0 - 1}$ has the form $ℓ_{(x, y)}^{0 - 1} (h_{a}) := 1_{h_{a} (x) \neq y}$ , where $1_{B}$ for a boolean statement $B$ is 1 iff $B$ is true and 0 otherwise.

Recall that each hyperplane is fully determined by one vector in $R^{d}$ , hence the notation $h_{a}$ . If we represent $H$ directly as $R^{d}$ and assume $d = 1$ , then the graph of $ℓ_{(x, y)}^{0 - 1}$ looks like this...

... because in $d = 1$ and the homogeneous case, the classifier is determined by a single number; if this number is positive it will label all positive points with 1; if it's negative, it will label all negative points with 1. If the $x$-coordinate of the point in question is positive with label $1$ or negative with label $- 1$ (i.e. $x > 0$ and $y = 1$; or $x < 0$ and $y = - 1$ ), then the former case is the correct one and we get this loss function. Otherwise, the loss function would jump from 0 to 1 instead.

Obviously, $d = 1$ is silly, but it already demonstrates that this loss function is not convex (it makes a turn to the right, and it's easy to find a segment which connects two points of the graph and doesn't lie above the graph). But consider the alternative loss function $ℓ_{(x, y)}^{*}$ :

This new loss function can be defined by the rule $ℓ_{(x, y)}^{*} (h_{a}) := \max (0, 1 - \langle a, x \rangle y)$. In the picture, the horizontal axis corresponds to $a$ and we have $x > 0$ and $y = 1$. This loss function is easily seen to be convex, non-negative, and not at all smooth. It is also $\| x \|$-Lipschitz. Thus, the problem with $X = R^{d}$ is not convex Lipschitz bounded, but if we take $X = B_{d} (0, ρ)$ and also $H = B_{d} (0, M)$ for some $M, ρ \in R_{+}$, then it does become a member of the convex-Lipschitz-bounded class with parameters $M$ and $ρ$, and we can learn via e.g. stochastic gradient descent.
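Here is a small sketch of both point-based losses side by side, using plain tuples as vectors. The surrogate rule above is commonly known as the hinge loss. The function names are mine, and the tie-breaking at $⟨ a, x ⟩ = 0$ (predict $- 1$) is an assumption I made for the 0-1 loss.

```python
def dot(a, x):
    """Inner product <a, x> of two equal-length tuples."""
    return sum(ai * xi for ai, xi in zip(a, x))

def zero_one_loss(a, x, y):
    """1 if the homogeneous halfspace with weight vector a mislabels (x, y), else 0."""
    prediction = 1 if dot(a, x) > 0 else -1   # assumption: ties predict -1
    return 0.0 if prediction == y else 1.0

def hinge_loss(a, x, y):
    """max(0, 1 - <a, x> y), the surrogate loss from the text."""
    return max(0.0, 1.0 - dot(a, x) * y)

# One-dimensional picture: fixed point with x > 0 and y = 1, varying a.
x, y = (2.0,), 1
for a in (-1.0, -0.25, 0.0, 0.25, 0.5, 1.0):
    print(a, zero_one_loss((a,), x, y), hinge_loss((a,), x, y))
```

In every printed row the surrogate value is at least the 0-1 value, which is property (a). The surrogate keeps decreasing until $⟨ a, x ⟩ y$ reaches $1$, so it still penalizes correct predictions that are close to the decision boundary.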

Of course, this won't give us exactly what we want (although penalizing a predictor for being "more wrong" might not be unreasonable), so if we want to bound our loss (empirical or real) with respect to $ℓ^{0 - 1}$ , we will have to do it via $ℓ^{0 - 1} (h) = ℓ^{*} (h) + [ℓ^{0 - 1} (h) - ℓ^{*} (h)]$ , where the second term is the difference between both loss functions. If $ℓ^{0 - 1} (h) - ℓ^{*} (h)$ is small, then this approach will perform well.
