(This is the ninth post in a sequence on Machine Learning based on this book. Click here for part I.)

Kernels

To motivate this chapter, consider some training sequence $S = ((x_{1}, y_{1}), . . ., (x_{m}, y_{m}))$ with instances in some domain set $X$ . Suppose we wish to use an embedding $ψ : X \to R^{d}$ of the kind discussed in the previous post (i.e., to make the representation of our points more expressive, so that they can be classified by a hyperplane). Most importantly, suppose that $d$ is significantly larger than $m$ . In such a case, we're describing each point $ψ (x_{i})$ in terms of $d$ coordinates, even though our space only has $m$ points, which means that there can, in some sense, only be $m$ "relevant" directions. In particular, let

$U := span (ψ (S_{x})) = {p \in R^{d} | \exists α \in R^{m} : p = \sum_{i = 1}^{m} α_{i} ψ (x_{i})}$

where $S_{x}$ is the training sequence without labels, so that $ψ (S_{x}) = (ψ (x_{1}), . . ., ψ (x_{m}))$ . Then $U$ is an (at most) $m$ -dimensional subspace of $R^{d}$ , and we would like to prove that we can work in $U$ rather than in $R^{d}$ .

I

As a first justification for this goal, observe that $ψ (x_{i}) \in U$ for all $i \in [m]$ . (The symbol $[n]$ for any $n \in N$ denotes the set ${1, . . ., n}$ .) Recall that we wish to learn a hyperplane parametrized by some $w \in R^{d}$ that can then be used to predict a new instance $ψ (y)$ for some $y \in X$ by checking whether $⟨ w, ψ (y) ⟩ > 0$ . The bulk of the difficulty, however, lies in finding the vector $w$ ; this is generally much harder than computing a single inner product $⟨ w, ψ (y) ⟩$ .

Thus, our primary goals are to show that

(1) $w$ will lie in $U$

(2) $w$ can somehow be computed by only working in $U$

To demonstrate this, we need to look at how $w$ is chosen, which depends on the algorithm we use. In the case of Soft Support Vector Machines (previous post), we choose

$w \in arg min_{w \in R^{d}} (λ | | w | |^{2} + \frac{1}{m} \sum_{k = 1}^{m} max [0, 1 - y_{k} ⟨ w, ψ (x_{k}) ⟩])$ .

This rule shows that we only care about the inner product between $w$ and our mapped training points, the $ψ (x_{i})$ . Thus, if we could somehow prove (1), then (2) would seem to follow: if $w \in U$ , then, according to the rule above, we would only end up caring about inner products between points that are both in $U$ .

Therefore, we now turn to proving (1) formally. To have the result be a bit more general (so that it also applies to algorithms other than Soft Support Vector Machines), we will analyze a more general minimization problem. We assume that

$w \in arg min_{w \in R^{d}} [f_{(y_{1}, . . ., y_{m})} (⟨ w, ψ (x_{1}) ⟩, . . ., ⟨ w, ψ (x_{m}) ⟩) + R (| | w | |^{2})]$

where $f$ is any function and $R$ is any monotonically non-decreasing function. (You might verify that the original problem is an instance of this one.) Now let $w^{*}$ be a solution to the above problem. Then we can use extended orthogonal decomposition $^{1}$ to write $w^{*} = π (w^{*}) + q$ , where $π : R^{d} \to U$ is the projection onto $U$ that leaves vectors in $U$ unchanged and $q$ is orthogonal to every vector in $U$ . Then, for any $u \in U$ , we have

$⟨ w^{*}, u ⟩ = ⟨ π (w^{*}) + q, u ⟩ = ⟨ π (w^{*}), u ⟩ + ⟨ q, u ⟩ = ⟨ π (w^{*}), u ⟩$ .

In particular, this is true for all the $ψ (x_{i})$ . Furthermore, since $R$ is non-decreasing and the norm of $w^{*}$ is at least as large as the norm of $π (w^{*})$ (note that $| | w^{*} | |^{2} = | | π (w^{*}) | |^{2} + | | q | |^{2}$ due to the Pythagorean theorem), this shows that $π (w^{*})$ is also a solution to the optimization problem. Moreover, if $R$ is strictly monotonically increasing (as is the case for Soft Support Vector Machines), then if $q \neq 0$ , the projection $π (w^{*})$ would be strictly better than $w^{*}$ , which is impossible since $w^{*}$ is by assumption optimal. Thus, $q$ must be $0$ , which implies that not only some but all solutions lie in $U$ .

[1] Regular orthogonal decomposition, as I've formulated in the previous post, only guarantees that $q$ is orthogonal to $π (w^{*})$ rather than to every vector in $U$ . But the extended version is no harder to prove. Choose some orthonormal basis $B$ of $U$ , extend it to an orthonormal basis $B^{'}$ of all of $R^{d}$ (amazingly, this is always possible), and define $π$ by $π (\sum_{i = 1}^{| B^{'} |} α_{i} b_{i}) = \sum_{i = 1}^{| B |} α_{i} b_{i}$ ; i.e., just discard all basis elements that belong to $B^{'}$ but not $B$ . That does the job.
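To make the projection argument a bit more tangible, here is a minimal numerical sketch. The mapped points and the vector `w` are made up, and the helper names (`orthonormal_basis`, `project`) are my own; the basis of $U$ is built with Gram-Schmidt rather than by extending a basis of $R^{d}$, which is enough to compute $π$ .

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis of span(vectors)."""
    basis = []
    for v in vectors:
        r = list(v)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = math.sqrt(dot(r, r))
        if norm > tol:  # skip vectors that are already in the span
            basis.append([ri / norm for ri in r])
    return basis

def project(w, basis):
    """Orthogonal projection of w onto the span of an orthonormal basis."""
    p = [0.0] * len(w)
    for b in basis:
        c = dot(w, b)
        p = [pi + c * bi for pi, bi in zip(p, b)]
    return p

# Hypothetical mapped training points psi(x_1), psi(x_2) in R^4 (m = 2, d = 4).
mapped = [[1.0, 2.0, 0.0, -1.0], [0.0, 1.0, 3.0, 2.0]]
w = [0.5, -1.0, 2.0, 4.0]  # an arbitrary candidate that does not lie in U

basis = orthonormal_basis(mapped)
pw = project(w, basis)                   # pi(w)
q = [wi - pi for wi, pi in zip(w, pw)]   # the part of w orthogonal to U

for p in mapped:
    # q is orthogonal to every psi(x_i), so the inner products do not change.
    assert abs(dot(q, p)) < 1e-9
    assert abs(dot(w, p) - dot(pw, p)) < 1e-9

# Pythagoras: ||w||^2 = ||pi(w)||^2 + ||q||^2, so projecting never increases the norm.
assert abs(dot(w, w) - (dot(pw, pw) + dot(q, q))) < 1e-9
print(dot(w, w), dot(pw, pw))
```

So the first part of the objective cannot tell `w` and `pw` apart, and the second part can only prefer `pw`.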

II

We've demonstrated that only the inner products between mapped training points matter for the training process. Another way to phrase this statement is that, if we have access to the function

$K_{ψ} : X \times X \to R, \quad K_{ψ} : (x, y) \mapsto ⟨ ψ (x), ψ (y) ⟩$

we no longer have any need to represent the points $ψ (x_{k})$ explicitly. The function $K_{ψ}$ (or just $K$ ) is called the kernel function, which gives the chapter its name.

Note that $K$ takes two arbitrary points in $X$ ; it is not restricted to elements in the training sequence. This is important because, to actually apply the predictor, we will have to compute $⟨ w, ψ (y) ⟩$ for some $y \in X$ , as mentioned above. But to train the predictor, we only need inner products between mapped training points, as we've shown. Thus, if we set

$g_{k, ℓ} := K (x_{k}, x_{ℓ}) = ⟨ ψ (x_{k}), ψ (x_{ℓ}) ⟩ \forall k, ℓ \in [m]$

then we can do our training based solely on the $g_{k, ℓ}$ (which will lead to a predictor that uses $K$ to classify domain points.) Now let's reformulate all our relevant terms to that end. Recall that we have just proved that $w^{*} \in U$ . This implies that $w^{*} = \sum_{i = 1}^{m} α_{i} ψ (x_{i})$ for the right $α_{i}$ . Also recall that our objective is to find $w^{*}$ in the set

$arg min_{w \in U} f (⟨ w, ψ (x_{1}) ⟩, . . ., ⟨ w, ψ (x_{m}) ⟩) + R (| | w | |^{2})$

Now we can reformulate

$⟨ w, ψ (x_{k}) ⟩ = ⟨ \sum_{i = 1}^{m} α_{i} ψ (x_{i}), ψ (x_{k}) ⟩ = \sum_{i = 1}^{m} α_{i} ⟨ ψ (x_{i}), ψ (x_{k}) ⟩ = \sum_{i = 1}^{m} α_{i} g_{i, k}$

for all $k \in [m]$ , and

$| | w | |^{2} = ⟨ w, w ⟩ = ⟨ \sum_{k = 1}^{m} α_{k} ψ (x_{k}), \sum_{ℓ = 1}^{m} α_{ℓ} ψ (x_{ℓ}) ⟩ = \sum_{k, ℓ = 1}^{m} α_{k} α_{ℓ} g_{k, ℓ}$ .

Plugging both of those into the term behind the $argmin$ , we obtain

$f (\sum_{i = 1}^{m} α_{i} g_{i, 1}, . . ., \sum_{i = 1}^{m} α_{i} g_{i, m}) + R (\sum_{k, ℓ = 1}^{m} α_{k} α_{ℓ} g_{k, ℓ})$

This is enough to establish that one can learn purely based on the $g_{k, ℓ}$ . Unfortunately, the Machine Learning literature has the annoying habit of writing everything that can possibly be written in terms of matrices and vectors in terms of matrices and vectors, so we won't quite leave it there. By setting $α := (α_{1}, . . ., α_{m})$ (a row vector), we can further write the above as

$f ([α G]_{1}, . . ., [α G]_{m}) + R (α G α^{T})$ where $G = (g_{k, ℓ})_{1 \leq k, ℓ \leq m}$

or even as $f (α G) + R (α G α^{T})$ , at which point we've successfully traded any conceivable intuition for compactness. Nonetheless, the point that $G$ is sufficient for learning still stands. $G$ is also called the Gram matrix.
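As a sanity check on the reformulation, here is a small sketch that builds $G$ for a toy embedding and confirms that $[α G]_{k}$ and $α G α^{T}$ agree with the explicit $⟨ w, ψ (x_{k}) ⟩$ and $| | w | |^{2}$ . The embedding `psi`, the instances `xs`, and the coefficients `alpha` are made up for illustration; the explicit `w` is only constructed to verify the identities.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def psi(x):
    """Hypothetical embedding R -> R^3, x |-> (1, x, x^2)."""
    return [1.0, x, x * x]

def K(x, y):
    return dot(psi(x), psi(y))

xs = [-1.0, 0.5, 2.0]     # training instances (made up)
alpha = [0.3, -1.2, 0.7]  # coefficients of w in terms of the psi(x_i)
m = len(xs)

# Gram matrix g_{k, l} = K(x_k, x_l)
G = [[K(xs[k], xs[l]) for l in range(m)] for k in range(m)]

# Explicit w = sum_i alpha_i psi(x_i), built only to check the identities below.
w = [sum(alpha[i] * psi(xs[i])[j] for i in range(m)) for j in range(3)]

# [alpha G]_k = sum_i alpha_i g_{i, k}   and   alpha G alpha^T = sum_{k, l} alpha_k alpha_l g_{k, l}
alpha_G = [sum(alpha[i] * G[i][k] for i in range(m)) for k in range(m)]
alpha_G_alphaT = sum(alpha[k] * alpha[l] * G[k][l] for k in range(m) for l in range(m))

for k in range(m):
    assert abs(dot(w, psi(xs[k])) - alpha_G[k]) < 1e-9
assert abs(dot(w, w) - alpha_G_alphaT) < 1e-9
```

Everything the objective needs is computed from `alpha` and `G` alone.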

And for predicting a new point $ψ (y)$ , we have

$⟨ w, ψ (y) ⟩ = ⟨ \sum_{i = 1}^{m} α_{i} ψ (x_{i}), ψ (y) ⟩ = \sum_{i = 1}^{m} α_{i} ⟨ ψ (x_{i}), ψ (y) ⟩ = \sum_{i = 1}^{m} α_{i} K (x_{i}, y)$ .

At this point, you might notice that we never represented $U$ explicitly, but just reformulated everything in terms of inner products. Indeed, one could introduce kernels without mentioning $U$ , but I find that thinking in terms of $U$ is quite helpful for understanding why all of this stuff works. Note that the above equation (where we predict the label of a new instance) is not an exception to the idea that we're working in $U$ . Even though it might not be immediately apparent from looking at it, it is indeed the case that we could first project $ψ (y)$ into $U$ without changing anything about its prediction. In other words, it is indeed the case that $⟨ w, ψ (y) ⟩ = ⟨ w, π (ψ (y)) ⟩$ for all $y \in X$ . This follows from the definition of $π$ and the fact that all basis vectors outside of $U$ are orthogonal to everything in $U$ .
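The prediction rule can be sketched in a few lines. Here `K` is written in closed form for a toy embedding $x \mapsto (1, x, x^{2})$ (my choice, not from the book), so that `predict` never touches $ψ$ ; the explicit `w` exists only to double-check the answer.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def psi(x):
    """Hypothetical embedding R -> R^3, used only to double-check the result."""
    return [1.0, x, x * x]

def K(x, y):
    # For this psi, <psi(x), psi(y)> = 1 + xy + (xy)^2, so psi itself is not needed.
    return 1.0 + x * y + (x * y) ** 2

def predict(alpha, xs, y):
    """Sign of <w, psi(y)>, computed as sum_i alpha_i K(x_i, y)."""
    score = sum(a * K(x, y) for a, x in zip(alpha, xs))
    return 1 if score > 0 else -1

xs = [-1.0, 0.5, 2.0]     # training instances (made up)
alpha = [0.3, -1.2, 0.7]  # coefficients assumed to come out of training

# Explicit w = sum_i alpha_i psi(x_i), for comparison only.
w = [sum(a * psi(x)[j] for a, x in zip(alpha, xs)) for j in range(3)]
for y in (-2.0, 0.0, 1.0, 3.0):
    explicit = 1 if dot(w, psi(y)) > 0 else -1
    assert predict(alpha, xs, y) == explicit
```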

III

Kernels allow us to deal with arbitrarily high-dimensional (even infinite-dimensional) embeddings by computing $m^{2}$ inner products during training and doing some additional kernel evaluations later to apply the output predictor – under the essential condition that we are able to evaluate the kernel function $K$ . Thus, we are interested in embeddings $ψ$ such that $K_{ψ}$ is easy to evaluate.

For an important example, consider an embedding for multi-variable polynomials. Suppose we have such a polynomial of the form $p : R^{n} \to R$ , i.e. something like

$p (x, y, z) = x^{2} y z^{2} + 3 x y z^{2} - 2 x^{3} z^{2} + 12 y^{2}$

where the above would be a $3$ -variable polynomial of degree 5. Now recall that, to learn one-dimensional polynomials with linear methods, we chose the embedding $ψ : x \mapsto (1, x, x^{2}, . . ., x^{k})$ . That way, a linear combination of the image coordinates can do everything a polynomial predictor can do. To do the same for an arbitrary $n$ -dimensional polynomial of degree $k$ , we need the far more complex embedding

$ψ : R^{n} \to R^{(n + 1)^{k}}, \quad ψ : (x_{1}, . . ., x_{n}) \mapsto (\prod_{i = 1}^{k} x_{w (i)})_{w \in {0, . . ., n}^{k}}$

An $n$ -dimensional polynomial of degree $k$ may have one coefficient for each possible product of its $n$ variables such that at most $k$ variables (counted with repetition) appear in the product. Each $w \in {0, . . ., n}^{k}$ defines such a product. Note that this is a sequence, so repetitions are allowed: for example, the sequence $(1, 2, . . ., 2) \in {0, . . ., n}^{k}$ corresponds to the term $x_{1} x_{2}^{k - 1}$ . We set $x_{0} = 1$ so that we also catch all terms with degree less than $k$ : for example, the sequence $(0, 0, 0, 3, . . ., 3)$ corresponds to the term $x_{3}^{k - 3}$ and the sequence $(0, . . ., 0)$ to the constant term of the polynomial.

For large $n$ and $k$ this target space is extremely high-dimensional, but we're studying kernels here, so the whole point will be that we won't have to represent it explicitly.

Now suppose we have two such instances $ψ (x)$ and $ψ (x^{'})$ . Then, $⟨ ψ (x), ψ (x^{'}) ⟩ = ⟨ (\prod_{i = 1}^{k} x_{w (i)})_{w \in {0, . . ., n}^{k}}, (\prod_{i = 1}^{k} x_{w (i)}^{'})_{w \in {0, . . ., n}^{k}} ⟩ = \sum_{w \in {0, . . ., n}^{k}} ⟨ \prod_{i = 1}^{k} x_{w (i)}, \prod_{i = 1}^{k} x_{w (i)}^{'} ⟩ = \sum_{w \in {0, . . ., n}^{k}} \prod_{i = 1}^{k} x_{w (i)} x_{w (i)}^{'}$

And for the crucial step, the last term can be rewritten as $(\sum_{i = 0}^{n} x_{i} x_{i}^{'})^{k}$ – expanding the $k$ -th power yields exactly one product of $k$ factors of the form $x_{i} x_{i}^{'}$ for each sequence in ${0, . . ., n}^{k}$ , which is what the sum above consists of. Now (recall that $x_{0} = x_{0}^{'} = 1$ ) this means that the above sum simply equals $(1 + ⟨ x, x^{'} ⟩)^{k}$ . In summary, this calculation shows that

$K (x, x^{'}) := (1 + ⟨ x, x^{'} ⟩)^{k} = ⟨ ψ (x), ψ (x^{'}) ⟩ \forall x, x^{'} \in X$

Thus, even though $ψ$ maps points into the very high-dimensional space $R^{(n + 1)^{k}}$ , it is nonetheless feasible to learn a multi-polynomial predictor through linear methods, namely by embedding the values via $ψ$ and then ignoring $ψ$ and using $K$ instead. The Gram matrix $G$ will consist of $m^{2}$ entries, where for each, a term of the form $(1 + ⟨ x, x^{'} ⟩)^{k} = (1 + \sum_{i = 1}^{n} x_{i} x_{i}^{'})^{k}$ has to be computed. This doesn't look that scary! Even for relatively large values of $n$ , $k$ , and $m$ , it should be possible to compute on a reasonable machine.
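The identity is easy to check numerically for small $n$ and $k$ . The sketch below builds the explicit embedding by enumerating all sequences in ${0, . . ., n}^{k}$ and compares the resulting inner product with the closed form; the two input points are arbitrary and the function names are mine.

```python
from itertools import product
from math import prod

def psi(x, k):
    """Explicit embedding: one coordinate per sequence w in {0, ..., n}^k."""
    x_ext = [1.0] + list(x)  # x_0 = 1
    n = len(x)
    return [prod(x_ext[i] for i in w) for w in product(range(n + 1), repeat=k)]

def poly_kernel(x, y, k):
    """K(x, y) = (1 + <x, y>)^k, computed without the embedding."""
    return (1.0 + sum(a * b for a, b in zip(x, y))) ** k

x = [0.5, -1.0, 2.0]
y = [1.5, 0.25, -0.5]
k = 3

explicit = sum(a * b for a, b in zip(psi(x, k), psi(y, k)))
assert len(psi(x, k)) == (len(x) + 1) ** k  # (n + 1)^k coordinates
assert abs(explicit - poly_kernel(x, y, k)) < 1e-9
```

The explicit route needs $(n + 1)^{k}$ coordinates per point, while `poly_kernel` needs $n$ multiplications and one power.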

If we do approach learning a multi-dimensional polynomial in this way, then (I think) there are strong reasons to question in what sense the embedding $ψ$ actually happens – this question is what I was trying to wrap my head around at the end of the previous post. It seemed questionable to me that $ψ$ is fundamental even if the problem is learned without kernels, but even more so if it is learned with them.

And that is all I have to say about kernels. For the second half of this post, we'll turn to a largely independent topic.

Boosting

Boosting is another item under the "widening the applicability of classes" category, much like the $ψ$ from earlier.

I

This time, the approach is not to expand the representation of data and then apply a linear classifier on that representation. Instead, we wish to construct a complex classifier as a linear combination of simple classifiers.

When hyperplanes are visualized, it is usually understood that one primarily cares about hyperplanes in higher-dimensional spaces where they are much more expressive, despite the illustration depicting an instance in $2$ -d or $3$ -d. But this time, think of the problem instance below in literal 2-d space:

No hyperplane can classify this instance correctly, but consider a combination of these three hyperplanes:

By letting $h (p) = σ_{sign} (h_{1} (p) + h_{2} (p) + h_{3} (p) - 2.5)$ where $h_{i}$ is the predictor corresponding to the $i$ -th hyperplane and $σ_{sign}$ is the sign function, we have constructed a predictor $h$ which has zero empirical error on this training instance.
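To see why the offset $2.5$ works: each $h_{i} (p)$ is $\pm 1$ , so the sum exceeds $2.5$ only if all three predictors say $+1$ . The sketch below uses three made-up hyperplanes whose positive sides intersect in a triangle; they are not the ones from the illustration, just a stand-in with the same structure.

```python
def sign(v):
    return 1 if v > 0 else -1

def halfplane(a, b, c):
    """Predictor p |-> sign(a*x + b*y + c) for a point p = (x, y)."""
    return lambda p: sign(a * p[0] + b * p[1] + c)

# Three hypothetical hyperplanes bounding the triangle (0,0), (4,0), (2,3).
h1 = halfplane(0.0, 1.0, 0.0)     # positive above the x-axis
h2 = halfplane(3.0, -2.0, 0.0)    # positive below the line through (0,0) and (2,3)
h3 = halfplane(-3.0, -2.0, 12.0)  # positive below the line through (4,0) and (2,3)

def h(p):
    # The sum is 3 only if all three predictors output +1, so this is a logical AND.
    return sign(h1(p) + h2(p) + h3(p) - 2.5)

# One positive point inside the triangle, negative points on every side of it.
sample = [((2.0, 1.0), 1), ((2.0, -1.0), -1), ((0.0, 2.0), -1),
          ((5.0, 1.0), -1), ((-1.0, -1.0), -1)]
assert all(h(p) == y for p, y in sample)
```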

Perhaps more surprisingly, this trick can also learn non-convex areas. The instance below,

will be classified correctly by letting $h (p) = σ_{sign} (h_{1} (p) + 2 h_{2} (p) + 2 h_{3} (p))$ , with the $h_{i}$ (ordered left to right) defined like so:

These two examples illustrate that the resulting class is quite expressive. The question is, how to learn such a linear combination?

II

First, note that hyperplanes are just an example; the framework is formulated in terms of a learning algorithm that has access to a weak learner, where

An algorithm $A$ is called a $γ$ -weak learner for a hypothesis class $H$ iff there is a function $w^{*} : (0, 1) \to N$ such that, for any probability distribution $D$ over $X \times Y$ and any failure probability $δ \in (0, 1)$ , if $S$ consists of at least $w^{*} (δ)$ i.i.d. points sampled via $D$ , then with probability at least $1 - δ$ over the choice of $S$ , it holds that $ℓ (A (S)) \leq \frac{1}{2} - γ$ .

If you recall the definition of PAC learnability back from chapter 1, you'll notice that this is very similar. The only difference is in the error: PAC learning demands that it be arbitrarily close to the best possible error, while a weak learner merely has to bound it away from $\frac{1}{2}$ by some fixed amount $γ$ , which can be quite small. Thus, a weak learner is simply an algorithm that puts out a predictor that performs a little bit better than random. In the first example, the upper hyperplane could be the output of a weak learner. The term "boosting" refers to the process of upgrading this one weak learner into a better one, precisely by applying it over and over again under the supervision of a smartly designed algorithm –

– which brings us back to the question of how to define such an algorithm. The second example (the non-convex one) illustrates a key insight here: repeatedly querying the weak learner on the unaltered training instance is unlikely to be fruitful, because the third hyperplane by itself performs worse than random, and will thus not be output by a $γ$ -weak learner (not for any $γ \in R_{+}$ ). To remedy this, we somehow need to prioritize the points we're currently getting wrong. Suppose we begin with the first two hyperplanes. At this point, we have classified the left and middle cluster correctly. If we then weight the right cluster sufficiently more strongly than the other two, eventually, $h_{3}$ will perform better than random. Thus, we wish to adapt our weighting of training points dynamically, and we can do this in terms of a probability distribution over the training sequence.

Now the roadmap for defining the algorithm which learns a predictor on a binary classification problem via boosting is as follows:

Have access to a training sequence $S$ and a $γ$ -weak learner $A_{γ}$
Manage a list of weak predictors which $A_{γ}$ has output in previous rounds
At every step $t$ , hand $A_{γ}$ the training sequence $S$ along with some distribution $D^{(t)}$ over $S$ , and have it output a weak predictor $h_{t}$ on the problem $(S, D^{(t)})$ , where each point in $S$ is taken into account proportional to its probability mass.
Stop at some point and output a linear combination of the $h_{i}$

The particular algorithm we will construct is called AdaBoost, where "Ada" doesn't have any relation to the programming language, but simply means "adaptive".

III

Let's first look into how to define our probability distribution, which will be the most complicated part of the algorithm. Suppose we have our current distribution $D^{(t)}$ based on past predictors $h_{1}, . . ., h_{t - 1}$ output by $A_{γ}$ , and suppose further that we have computed weights $w_{1}, . . ., w_{t - 1}$ such that $w_{i}$ measures the quality of $h_{i}$ (higher is better). Now we receive a new predictor $h_{t}$ with quality $w_{t}$ . Then we can define a new probability distribution $D^{(t + 1)}$ by letting

$D^{(t + 1)} ((x_{i}, y_{i})) \propto D^{(t)} ((x_{i}, y_{i})) \cdot e^{- w_{t} y_{i} h_{t} (x_{i})} \forall i \in [m]$

where we write $\propto$ rather than $=$ because the term isn't normalized; it will equal the above scaled such that all probabilities sum to 1.

The term $y_{i} h_{t} (x_{i})$ is $1$ iff predictor $h_{t}$ classified $x_{i}$ correctly. Thus, the right component of the product equals $e^{- w_{t}}$ iff the point was classified correctly, and $e^{w_{t}}$ if it wasn't. If $h_{t}$ is a bad predictor and $w_{t}$ is small, say $10^{- 3}$ , the two terms are both close to 1, and we don't end up changing our weight on $(x_{i}, y_{i})$ very much. But if $h_{t}$ is good and $w_{t}$ is large, the old weight $D^{(t)} ((x_{i}, y_{i}))$ will be scaled significantly upward (if it got the point wrong) or downward (if it got the point right). In our second example, the middle hyperplane performs quite well on the uniform distribution, so $w_{2}$ should be reasonably high, which will cause the probability mass on the right cluster to increase and on the two other clusters to decrease. If this is enough to make the right cluster dominate the computation, then the weak learner might output the right hyperplane next. If not, it might output the second hyperplane again. Eventually, the weights will have shifted enough for the third hyperplane to become feasible.
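A single update step can be sketched as follows. The labels, predictions, and the two example values of $w_{t}$ are made up; the point is only to see how strongly the mass shifts toward the misclassified point.

```python
import math

def update(D, w_t, ys, preds):
    """One reweighting step: D_i * exp(-w_t * y_i * h_t(x_i)), then normalize."""
    unnorm = [d * math.exp(-w_t * y * p) for d, y, p in zip(D, ys, preds)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

ys = [1, 1, -1, -1]
preds = [1, 1, -1, 1]        # the last point is misclassified
D = [0.25, 0.25, 0.25, 0.25]

D_small = update(D, 0.001, ys, preds)  # weak predictor: almost no change
D_large = update(D, 1.0, ys, preds)    # strong predictor: mass moves to the mistake

print([round(d, 4) for d in D_small])
print([round(d, 4) for d in D_large])
```

With the larger weight, most of the probability mass ends up on the one point that the predictor got wrong.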

IV

Now let's look at the weights. Let $ϵ_{t}$ be the empirical error of $h_{t}$ with respect to the distribution it was trained on, i.e., $ϵ_{t} = \sum_{i : h_{t} (x_{i}) \neq y_{i}} D^{(t)} ((x_{i}, y_{i}))$ ; for the uniform distribution, this is the usual empirical error $ℓ_{S}^{0 - 1} (h_{t}) = \frac{1}{m} | {(x, y) \in S | h_{t} (x) \neq y} |$ . We would like $w_{t}$ to be a real number that is close to $0$ for $ϵ_{t}$ close to $\frac{1}{2}$ and grows without bound as $ϵ_{t}$ approaches $0$ . One possible choice is $w_{t} := \frac{1}{2} ln (\frac{1}{ϵ_{t}} - 1)$ . You can verify that it has these properties – in particular, recall that $h_{t}$ is output by a weak learner on the problem $(S, D^{(t)})$ , so that its error is bounded away from $\frac{1}{2}$ by at least $γ$ . Because of this, $\frac{1}{ϵ_{t}}$ is larger than $2$ , so that $\frac{1}{ϵ_{t}} - 1$ is larger than $1$ and $w_{t}$ is larger than $0$ .

V

To summarize,

AdaBoost ( $A_{γ}$ : weak learner, $S$ : training sequence, $T : N_{+}$ )

$D^{(1)} \leftarrow (\frac{1}{m}, \dots, \frac{1}{m})$

for $(t \leftarrow 1 to T)$ do

$h_{t} \leftarrow A_{γ} (S, D^{(t)})$

$ϵ_{t} \leftarrow \sum_{i \in [m] : h_{t} (x_{i}) \neq y_{i}} D^{(t)}_{i}$

$w_{t} \leftarrow \frac{1}{2} ln (\frac{1}{ϵ_{t}} - 1)$

$D^{(t + 1)} \leftarrow normalize ((D^{(t)}_{i} \cdot e^{- w_{t} y_{i} h_{t} (x_{i})})_{i \in [m]})$

endfor

return $f : x \mapsto σ_{sign} (\sum_{t = 1}^{T} w_{t} h_{t} (x))$

end
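Here is a minimal runnable sketch of the loop. Two things are my own assumptions rather than part of the pseudocode: the weak learner is a hypothetical one-dimensional threshold rule (`stump_learner`) that picks the stump with the smallest weighted error, and $ϵ_{t}$ is measured under the current distribution, which is how AdaBoost is usually stated. The clamp on `eps` is also an addition, to avoid taking the logarithm of zero when a predictor happens to be perfect.

```python
import math

def sign(v):
    return 1 if v > 0 else -1

def stump_learner(xs, ys, D):
    """Hypothetical weak learner: the threshold rule with the smallest D-weighted error."""
    pts = sorted(xs)
    thresholds = ([pts[0] - 0.5]
                  + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
                  + [pts[-1] + 0.5])
    best, best_err = None, float("inf")
    for theta in thresholds:
        for s in (1, -1):
            err = sum(d for x, y, d in zip(xs, ys, D) if (s if x > theta else -s) != y)
            if err < best_err:
                best, best_err = (theta, s), err
    theta, s = best
    return lambda x: s if x > theta else -s

def adaboost(learner, xs, ys, T):
    m = len(xs)
    D = [1.0 / m] * m  # start with the uniform distribution
    ws, hs = [], []
    for _ in range(T):
        h = learner(xs, ys, D)
        # Error of h under the current distribution D.
        eps = sum(d for x, y, d in zip(xs, ys, D) if h(x) != y)
        eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard against log(0); my addition
        w = 0.5 * math.log(1.0 / eps - 1.0)
        unnorm = [d * math.exp(-w * y * h(x)) for x, y, d in zip(xs, ys, D)]
        Z = sum(unnorm)
        D = [u / Z for u in unnorm]
        ws.append(w)
        hs.append(h)
    return lambda x: sign(sum(w * h(x) for w, h in zip(ws, hs)))

# Toy data: positive, negative, positive along a line; no single stump gets it right.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1, 1, -1, -1, 1, 1]

f1 = adaboost(stump_learner, xs, ys, T=1)
f3 = adaboost(stump_learner, xs, ys, T=3)
print(sum(f1(x) != y for x, y in zip(xs, ys)))  # mistakes after one round
print(sum(f3(x) != y for x, y in zip(xs, ys)))  # mistakes after three rounds
```

On this toy data, every single stump misclassifies at least two points, but the weighted vote of three stumps fits all six.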

VI

If one assumes that $A_{γ}$ always returns a predictor with error at most $\frac{1}{2} - γ$ (recall that it may fail with probability $δ$ ), one can derive a bound on the training error of the output predictor. Fortunately, the dependence of the sample complexity on $δ$ is only logarithmic, so $δ$ can probably be pushed low enough that $A_{γ}$ is unlikely to fail even if it is called $T$ times.

Now the bound one can derive on the training error is $e^{- 2 γ^{2} T}$ . Looking at this, it has exactly the properties one would expect: a higher $γ$ pushes the error down, and so do more rounds of the algorithm. On the other hand, doing more rounds increases the chance of overfitting to random quirks in the training data. Thus, the parameter $T$ allows one to balance the overfitting vs. underfitting tradeoff, which is another nice thing about AdaBoost.

The book mentions that Boosting has been successfully applied to the task of classifying gray-scale images into 'contains a human face' and 'doesn't contain a human face'. This implies that human faces can be recognized using a set of quantitative rules – but, importantly, rules which have been generated by an algorithm rather than constructed by hand. (In that case, the weak learner did not return hyperplanes, but simple predictors of another form.) In this case, the result fits with my intuition (that face recognition is the kind of task where a set-of-rules approach will work). It would be interesting to know how well boosting performs on other problems.
