(This is the eighth post in a sequence on Machine Learning based on this book. Click here for part I. Alternatively, click here for part IV, which covers the basics of linear predictors.)

The mission statement for this post is to

  • continue the study of linear predictors
  • widen the applicability of this class

The latter will be quite successful, which is why more tools to learn linear predictors are desirable. In particular, most ways to reduce non-linear problems to linear ones will cause the dimension to blow up, which is why we want techniques that can handle high-dimensional instances (but we won't get to those in this post).

Before diving into the Machine Learning material, we need a tool from Linear Algebra.

Orthogonal Decomposition

The orthogonal decomposition is something my favorite textbook taught me, and we'll need it both for this post and for the next (?) one.

Let $V$ be a vector space with an inner product – for our purposes, $\mathbb{R}^d$ with the standard inner product – and let $v, u \in V$ be two nonzero vectors. We wish to write $v$ as a multiple of $u$ plus a vector orthogonal to $u$. That's the orthogonal decomposition. (Is this always possible? Think about it for a pair of vectors in $\mathbb{R}^2$.) We can write

$$v = \alpha u + (v - \alpha u),$$

which is clearly correct for any $\alpha$, and if we can choose $\alpha$ such that $u$ and $v - \alpha u$ are orthogonal, this is the desired orthogonal decomposition. As one does in math, we now write down what we want to have as an equation and solve it for the desired variable, i.e.

$$0 = \langle u, v - \alpha u \rangle = \langle u, v \rangle - \alpha \|u\|^2 \quad \Longleftrightarrow \quad \alpha = \frac{\langle u, v \rangle}{\|u\|^2}.$$

So the answer is affirmative – it is always possible. The thing to remember is that, if $v$ is our given vector and $u$ the first vector of the decomposition, we have to put the inner product of the two in the numerator and the squared norm of $u$ in the denominator; that quotient is the $\alpha$ from the decomposition equation.
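As a quick sanity check, here is a minimal numpy sketch of the decomposition (the particular vectors and the function name are made up for illustration):

```python
import numpy as np

def orthogonal_decomposition(v, u):
    """Write v as alpha * u + z with z orthogonal to u (u must be nonzero)."""
    alpha = np.dot(u, v) / np.dot(u, u)   # <u, v> / ||u||^2
    z = v - alpha * u                     # the part orthogonal to u
    return alpha, z

v = np.array([3.0, 1.0, 2.0])
u = np.array([1.0, 1.0, 0.0])
alpha, z = orthogonal_decomposition(v, u)

assert np.allclose(alpha * u + z, v)      # the decomposition reconstructs v
assert np.isclose(np.dot(u, z), 0.0)      # the second part is orthogonal to u
```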

Orthogonal decomposition can also be used for the simplest and most intuitive proof of the Cauchy-Schwarz inequality that I've seen (simplest even if one has to derive the orthogonal decomposition as part of the proof), but that's not important for this post. Let's do Machine Learning.

Support Vector Machines

Support Vector Machines is an odd name for hyperplanes that are selected in a way that attempts to maximize the distance between the hyperplane and the nearest point(s). We differentiate between Hard Support Vector Machines, which are hyperplanes that separate the training data perfectly and where (among all hyperplanes that do so) the distance to the nearest point is maximal, and Soft Support Vector Machines, which try to simultaneously maximize the distance to the nearest point(s) and the number of correctly classified points, without the assumption that they'll get the second one perfectly right. Thus, Hard Support Vector Machines only exist if the training data is linearly separable, whereas Soft Support Vector Machines always exist.

Illustrating the concept

In the 2-dimensional classification problem depicted below,

both the black and the green hyperplanes classify all instances correctly, but only the green one is (approximately) a Hard Support Vector Machine, because it maximizes the distance to the nearest points (dotted lines). Note that for a true Hard Support Vector Machine, there will generally be several nearest points with identical distance (because if there were just one, the hyperplane could very likely be moved in such a way that the distance to that point increases while the distances to other points might decrease, leading to a higher minimal distance). In two-dimensional space, there will likely be two such points. On the other hand, the black hyperplane is not a Hard Support Vector Machine, and indeed its minimal margin could be slightly improved by moving it to the bottom left, such that the margin to the nearest triangle decreases but the margin to the nearest circle increases.

Given that Hard Support Vector Machines have an extremely discontinuous "caring function" (i.e. only care about the distance to a couple of points and are entirely indifferent to all others), it is easy to construct examples where they look silly:

You get the idea. The difference between Hard and Soft Support Vector Machines is somewhat analogous to the difference between committing to a hypothesis class ahead of time and utilizing Structural Risk Minimization, which has a weighting function over different hypothesis classes. In the above example, a Soft Support Vector Machine would probably bite the bullet and classify the one blue point incorrectly, in exchange for a much larger margin to the nearest correctly classified points.

Distance between a point and a hyperplane

In order to implement either form of Support Vector Machines, we need to know how to compute the distance between a point and a hyperplane. Let's assume we're in $\mathbb{R}^d$. In the homogeneous case (where hyperplanes have to pass through the origin), a hyperplane can be parametrized by a single vector, i.e. for $w \in \mathbb{R}^d$, the predictor $h_w$ is determined by the rule $h_w(x) = 2\,[\langle w, x \rangle \geq 0] - 1$ (where $[P]$ for a boolean statement $P$ is 1 iff the statement is true and 0 otherwise), so that $h_w$ outputs $+1$ or $-1$ depending on which side of the hyperplane $x$ lies on. Let's write $H_w$ for the actual hyperplane that corresponds to $w$ and $h_w$, i.e. the set of points $\{x \in \mathbb{R}^d \mid \langle w, x \rangle = 0\}$.
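To make the notation concrete, here is a tiny sketch of such a homogeneous predictor (the particular $w$ and the test points are made up):

```python
import numpy as np

w = np.array([1.0, -2.0])     # parametrizes a homogeneous hyperplane H_w in R^2

def h(x):
    """Predict +1 or -1 depending on which side of H_w the point x lies on."""
    return 1 if np.dot(w, x) >= 0 else -1

print(h(np.array([3.0, 1.0])))    #  1, since <w, x> = 1 >= 0
print(h(np.array([1.0, 3.0])))    # -1, since <w, x> = -5 < 0
# Points on H_w itself satisfy <w, x> = 0, e.g. x = (2, 1).
```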

Now, what is the distance between a point $x$ and $H_w$? It turns out that if $\|w\| = 1$ (which does not restrict the set of possible hyperplanes), then it is the simplest answer one could hope for, namely $|\langle w, x \rangle|$. Let's prove this claim.

In general metric spaces, the distance between a point $p$ and a set of points $S$ is defined as $d(p, S) = \inf\{d(p, q) \mid q \in S\}$. If $S$ is a linear subspace of a larger vector space with an inner product, then this definition is still valid (inner-product spaces are metric spaces), but the distance also equals the length of the unique vector which connects $S$ and $p$ and is orthogonal to $S$ – this is a basic result from linear algebra. The vector orthogonal to the hyperplane $H_w$ is simply $w$ (see post IV), which means that the shortest distance between $x$ and $H_w$ is achieved by going from $x$ to $H_w$ in the direction of $w$. Furthermore, the vectors which are orthogonal to $w$ are exactly those on $H_w$.

This means that if we apply orthogonal decomposition to write $x$ as a multiple of $w$ plus a vector orthogonal to $w$, then the length of the former will be precisely the distance between $x$ and $H_w$.

Now if you recall, the orthogonal decomposition has the form $x = \alpha w + (x - \alpha w)$ with $\alpha = \frac{\langle w, x \rangle}{\|w\|^2}$ – which is just $\langle w, x \rangle$ in our case since $\|w\| = 1$ – and hence the distance between $x$ and $H_w$ is $\|\alpha w\| = |\alpha| \cdot \|w\| = |\langle w, x \rangle|$.
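Here is a minimal numerical check of this formula, comparing it against a brute-force search over points on the hyperplane (the vectors are made up; numpy is assumed):

```python
import numpy as np

w = np.array([3.0, 4.0])
w = w / np.linalg.norm(w)                  # normalize so that ||w|| = 1
x = np.array([2.0, -1.0])

dist_formula = abs(np.dot(w, x))           # the claimed distance |<w, x>|

# Brute force: in R^2, H_w = {q : <w, q> = 0} is spanned by (-w[1], w[0]),
# so we minimize ||x - t * direction|| over a fine grid of t.
direction = np.array([-w[1], w[0]])
ts = np.linspace(-10, 10, 200_001)
dist_brute = np.min(np.linalg.norm(x - ts[:, None] * direction, axis=1))

assert np.isclose(dist_formula, dist_brute, atol=1e-4)
```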

Learning Hard Support Vector Machines

As usual, let $S = ((x_1, y_1), \dots, (x_m, y_m)) \in (\mathbb{R}^d \times \{-1, +1\})^m$ be our training sequence. Notice that, for Hard Support Vector Machines, we wish to find a hyperplane that is optimal in some sense (it maximizes the distance to the nearest point) under the constraint that it classifies every point correctly. Furthermore, everything is linear. These three concepts together sound a lot like linear programming (again, see post IV), and indeed we can describe the problem of finding a Hard Support Vector Machine as an optimization problem very much in that spirit.

First attempt:

$$\max_{w \in \mathbb{R}^d} \; \min_{i \in [m]} \; y_i \langle w, x_i \rangle \qquad \text{subject to} \qquad \|w\| = 1 \ \text{ and } \ y_i \langle w, x_i \rangle > 0 \ \text{ for all } i \in [m],$$

where $[m] := \{1, \dots, m\}$. (The condition $y_i \langle w, x_i \rangle > 0$ works because it holds iff $h_w$ classifies $x_i$ correctly according to $y_i$.) Unfortunately, despite the fact that we are looking for a hyperplane, this objective function is non-linear and thus we don't have a linear program.

To remedy this, we will now perform an incredible magic trick. Recall that we assumed that $\|w\| = 1$, and in that case, the distance between a point $x$ and the hyperplane $H_w$ is $|\langle w, x \rangle|$. Now if $\|w\| \neq 1$, then a look at the proof we did shows that the general term for the distance is $\frac{|\langle w, x \rangle|}{\|w\|}$. Thus, if we measure the distance in terms of $|\langle w, x \rangle|$ while letting $\|w\|$ be something else, say smaller than 1, then our measured distance (i.e. $|\langle w, x \rangle|$) will be smaller than the real one (it would have to be divided by a small term, namely $\|w\|$, to grow into the real distance). It is as if we're zooming out of the image but keep measuring distances with the same ruler, without normalizing it.
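A small numerical illustration of this effect (the vectors are made up; numpy is assumed): scaling $w$ down shrinks the measured distance $|\langle w, x \rangle|$ by the same factor, while dividing by $\|w\|$ always recovers the true distance.

```python
import numpy as np

w = np.array([3.0, 4.0]) / 5.0            # unit normal vector, ||w|| = 1
x = np.array([2.0, -1.0])

true_dist = abs(np.dot(w, x))             # the real distance, since ||w|| = 1

w_small = 0.1 * w                         # same hyperplane, but a shorter normal vector
measured = abs(np.dot(w_small, x))        # the "measured" distance with the same ruler

assert np.isclose(measured, 0.1 * true_dist)                        # shrunk by the factor 0.1
assert np.isclose(measured / np.linalg.norm(w_small), true_dist)    # dividing by ||w|| recovers it
```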

Thus, instead of demanding that the minimal distance be as large as possible, we can simply demand that it must be at least 1 (which also fixes the second bug in our first attempt, where the constraints have a $>$ sign rather than a $\geq$ sign), and we maximize how far we zoom out. Thereby magic happens and our new program looks like so:

$$\min_{w \in \mathbb{R}^d} \; \|w\|^2 \qquad \text{subject to} \qquad y_i \langle w, x_i \rangle \geq 1 \ \text{ for all } i \in [m],$$

which is a convex quadratic program – linear constraints with a convex quadratic objective – and can, therefore, be solved efficiently. And we minimize $\|w\|^2$ instead of just $\|w\|$ because it's simpler to work with; both have the same minimizer.
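To make the program concrete, here is a minimal sketch that solves it on a toy data set; the data and the choice of cvxpy as solver are mine, not from the post or the book:

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable training sequence in R^2 with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],
              [-2.0, -1.0], [-1.0, -3.0], [-3.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
objective = cp.Minimize(cp.sum_squares(w))        # ||w||^2
constraints = [cp.multiply(y, X @ w) >= 1]        # y_i <w, x_i> >= 1 for all i
cp.Problem(objective, constraints).solve()

w_hat = w.value
print("w =", w_hat)
print("distance to the nearest point =", 1 / np.linalg.norm(w_hat))
```

The constraints force the measured distance $y_i \langle w, x_i \rangle$ of every point to be at least 1, so the true distance of the nearest point comes out to $1 / \|w\|$.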

Learning Soft Support Vector Machines

Instead of demanding that all points be classified correctly, we can penalize incorrectly classified points based on how wrong the classification is. This can be done by changing the program to

$$\min_{w \in \mathbb{R}^d, \; \xi \in \mathbb{R}^m_{\geq 0}} \; \left( \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \xi_i \right) \qquad \text{subject to} \qquad y_i \langle w, x_i \rangle \geq 1 - \xi_i \ \text{ for all } i \in [m].$$

This reads "instead of classifying each point correctly with measured distance at least 1, we're allowing you to be off by $\xi_i$, but try to keep the $\xi_i$ as small as possible". We don't allow negative $\xi_i$, i.e. we don't reward the classifier for getting points correct with a particularly large margin; rather, everything above 1 is equally fine (it will have $\xi_i = 0$ and thus won't add to the term we're trying to minimize). Furthermore, it's not clear how important zooming further out (i.e. increasing the margin) is vs. having small $\xi_i$, so one introduces a parameter to control the tradeoff. That parameter is $\lambda$. It doesn't matter whether one puts it in front of $\|w\|^2$ or in front of $\frac{1}{m} \sum_{i=1}^m \xi_i$, except that it looks prettier this way. Finally, I don't know why we're minimizing the arithmetic mean of the $\xi_i$ rather than their sum, but given that we have $\lambda$, it doesn't matter, so I decided to roll with the rule from the book.

Widening Applicability

This chapter is where we look into applying linear learners to non-linear problems. Let's begin with the basic idea.

The basic idea

The basic idea is this: rather than solving non-linear problems using non-linear methods, one can map them into a space where they become linear problems and then apply linear methods. The book motivates this as follows: consider the classification problem depicted below in $\mathbb{R}$:

where black indicates a positive label and red a negative one. This training sequence is not linearly separable, but if we map it into $\mathbb{R}^2$ by the rule $x \mapsto (x, x^2)$, then the data looks like this:

And thus our data set has become linearly separable (via an inhomogeneous hyperplane). The black line indicates this hyperplane and the three strokes indicate that it classifies points above as positive.

In general, given any domain set $\mathcal{X}$, we can choose an arbitrary embedding $\psi : \mathcal{X} \to \mathbb{R}^n$, or perhaps $\psi : \mathcal{X} \to \mathcal{B}_n(R)$ (the set of elements in $\mathbb{R}^n$ with norm at most $R$), and construct a linear classifier $f$ on $\mathbb{R}^n$, which will define a classifier on $\mathcal{X}$ by the rule $h = f \circ \psi$. This same construction can also be used to recover more information than just the label.
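Here is a minimal sketch of this construction on the 1-D example above, reusing the Hard-SVM program from earlier on the embedded data (the specific numbers, the extra constant coordinate – which buys us an inhomogeneous hyperplane, as discussed below – and the use of cvxpy are my choices):

```python
import numpy as np
import cvxpy as cp

# 1-D toy problem: a point is labeled positive iff its distance to the origin is >= 2.
x = np.array([-4.0, -3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0, 4.0])
y = np.where(np.abs(x) >= 2, 1, -1)

# Embed into R^3: a constant coordinate (so a homogeneous hyperplane in R^3 acts
# like an inhomogeneous one on the parabola coordinates) plus (x, x^2).
psi = np.column_stack([np.ones_like(x), x, x ** 2])

# The same Hard-SVM program as before, now run on the embedded data.
w = cp.Variable(3)
cp.Problem(cp.Minimize(cp.sum_squares(w)),
           [cp.multiply(y, psi @ w) >= 1]).solve()

# The learned classifier is h = f composed with psi.
h = lambda t: np.sign(np.column_stack([np.ones_like(t), t, t ** 2]) @ w.value)
print(h(np.array([-2.5, 0.0, 2.5])))   # expected: [ 1. -1.  1.]
```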

Wait but what?

Reading a description of the above gave me this familiar feeling that I don't quite get what is happening and need to rework it bottom-up. This section is me trying to describe that process.

In the above example, the new data set is indeed linearly separable. However, if $\psi$ is allowed to be an arbitrary function, it can actually do much more work than just lifting the points onto a parabola. For example, the $x$-dimension isn't even used in the case above. Clearly, what really matters in that problem is the distance to the origin: every point is classified positively iff that distance is at least 2. Thus, we could just have $\psi$ be given by $\psi(x) = |x|$, leading to the transformed 1-dimensional problem

which is separable by an even easier hyperplane. Or why not directly choose $\psi(x) = |x| - 2$? Then the data set looks like this:

and the linear classifier can just be the sign function.

(In the homogeneous case, the only two possible 1-dimensional classifiers based on hyperplanes are equivalent to the sign function and its negation, so the earlier choice $\psi(x) = |x|$ does not actually lead to a separable instance, and neither does the original one which lifted the points onto the parabola. However, this is just a technicality: we are in fact allowing inhomogeneous hyperplanes; we're just pretending otherwise because inhomogeneous hyperplanes can be learned through homogeneous methods, which is actually a special case of what this chapter is about. More on this in a bit.)

What we're actually doing here is simply declaring that our classifier be of the form

$$h = f \circ \psi,$$

where $f$ is linear and $\psi$ is an arbitrary function. Putting it like this, it's obvious that $f$ doesn't have to do anything: composing an arbitrary function with a linear function is itself an arbitrary function; no new options are gained by this construction.

This does not imply much about the usefulness of the approach, but I think it helps to understand what is really going on. Similarly, another thing that seems important to realize is that, for learning to be successful, $\psi$ needs to preserve meaningful properties of the data. If we just let $\psi$ assign a random point in $\mathbb{R}^n$ (for some large $n$) to each new input point it receives, then the resulting problem in $\mathbb{R}^n$ will almost certainly be linearly separable (because the space is so high-dimensional), but the resulting classifier will not have learned how to generalize beyond the training data.

Despite those two caveats, this approach is quite useful. For one, as mentioned (and as you might recall, although I reduced the explanation to a single sentence), it can be used to learn inhomogeneous hyperplanes, by setting

$$\psi : \mathbb{R}^d \to \mathbb{R}^{d+1}, \qquad \psi(x) = (1, x_1, \dots, x_d),$$

where the weight vector $w' \in \mathbb{R}^{d+1}$ and its homogeneous hyperplane found for the transformed problem will correspond to the inhomogeneous hyperplane $\{x \in \mathbb{R}^d \mid \langle w, x \rangle + b = 0\}$ for the original problem, where $b = w'_1$ and $w = (w'_2, \dots, w'_{d+1})$.
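A tiny sketch of this correspondence (the numbers are made up; only numpy is assumed):

```python
import numpy as np

# Suppose the homogeneous method, run on the embedded data psi(x) = (1, x_1, ..., x_d),
# returned this weight vector w' in R^{d+1} (made-up values for illustration).
w_prime = np.array([-2.5, 0.0, 1.0])
b, w = w_prime[0], w_prime[1:]          # bias and weights of the inhomogeneous hyperplane

x = np.array([1.5, 3.0])                # a point in the original space R^d
psi_x = np.concatenate(([1.0], x))      # its embedding in R^{d+1}

# The homogeneous predictor on psi(x) agrees with the inhomogeneous predictor on x.
assert np.sign(np.dot(w_prime, psi_x)) == np.sign(np.dot(w, x) + b)
```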

But there are more impressive examples, too. Let's look at one of them.

Example: polynomial regression

Consider a nonlinear regression problem, i.e. one with some training sequence $S = ((x_1, y_1), \dots, (x_m, y_m)) \in (\mathbb{R} \times \mathbb{R})^m$, and suppose we want to fit a polynomial function of reasonable degree. Thus, we choose our hypothesis class to be the set of all (one-dimensional) polynomial functions of degree at most $n$. Rather than studying polynomials, one can now simply transform this problem into a linear one by setting

$$\psi : \mathbb{R} \to \mathbb{R}^{n+1}, \qquad \psi(x) = (1, x, x^2, \dots, x^n).$$

Then we train some linear predictor $f$ on the transformed training sequence $\psi(S) = ((\psi(x_1), y_1), \dots, (\psi(x_m), y_m))$, which (because it is linear) will be fully defined by some weight vector $w \in \mathbb{R}^{n+1}$, i.e. $f(z) = \langle w, z \rangle$. And thereby we obtain our polynomial predictor $h = f \circ \psi$; it is defined by the rule $h(x) = \langle w, \psi(x) \rangle = w_1 + w_2 x + \dots + w_{n+1} x^n$. We have

$$L_S(h) = \frac{1}{m} \sum_{i=1}^m \big( h(x_i) - y_i \big)^2 = \frac{1}{m} \sum_{i=1}^m \big( \langle w, \psi(x_i) \rangle - y_i \big)^2 = L_{\psi(S)}(f),$$

where $L$ denotes the empirical squared loss function. These equations show that the empirical errors of both predictors are the same, which means that if $f$ is optimal for the transformed problem $\psi(S)$, then $h$ is optimal for the original problem.

Thus, this technique can learn polynomial predictors. This particular trick is only slightly more general than that: we can learn any family of functions of the form $x \mapsto \sum_{i=1}^n w_i g_i(x)$, where the $g_i$ are fixed functions, by setting $\psi(x) = (g_1(x), \dots, g_n(x))$ and learning the weights via linear regression (post IV).
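Here is a minimal sketch of this reduction for polynomial regression (the degree, the synthetic data, and the use of plain least squares for the linear-regression step are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of the cubic 0.5*x^3 - 2*x + 1 on [-3, 3].
x = np.linspace(-3, 3, 40)
y = 0.5 * x**3 - 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

n = 3                                                  # maximal degree
psi = np.column_stack([x**k for k in range(n + 1)])    # psi(x) = (1, x, ..., x^n)

# Linear least-squares regression on the embedded data gives the weight vector w ...
w, *_ = np.linalg.lstsq(psi, y, rcond=None)

# ... which defines the polynomial predictor h(x) = <w, psi(x)>.
h = lambda t: np.column_stack([t**k for k in range(n + 1)]) @ w
print(np.round(w, 2))    # should come out close to (1, -2, 0, 0.5)
```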

Further Musings

In complexity theory, one says that problem $A$ can be reduced to problem $B$ iff, given a general solver for problem $B$, one can solve any instance of problem $A$. Usually this is done by "translating" the problem instance of $A$ into one of $B$ and then looking at the solution there. If this can be done, it shows that $A$ is at most as hard as $B$.

In that sense, we have just proved that the problem of learning a polynomial function of degree at most $n$ with minimal empirical error can be reduced to the problem of learning a linear function in $(n+1)$-dimensional space with minimal empirical error. Put differently, we have shown that learning such a polynomial function in 1-dimensional space is at most as hard as learning a linear function in $(n+1)$-dimensional space.

So does this prove that there must exist a simple algorithm that learns polynomial predictors? Well –

– the answer to this question is trivial because we have just constructed such an algorithm. It's just that this algorithm happens to make use of $\psi$. Now, is that a fundamentally meaningful statement? In simple reduction arguments, one tends to get the feeling that both problem solvers kinda do the same thing anyway. So if one understood linear predictors and 1-dimensional polynomial predictors to a point where all theory about them felt trivial, would it still feel like embedding the training data via $\psi$ and then learning it with a linear predictor is different from learning it directly – or would it feel as if the algorithm just ends up doing the same thing anyway? If it's the latter, then $\psi$ is just a hack that is useful for confused people, but not fundamentally necessary.

Moving away from that line of thought – given that we are all confused about Machine Learning (in the sense that we haven't "solved" the field yet), how general is this method? Are there any other Machine Learning problems where one can just split the function to approximate into two parts in a way that will somehow make learning it more feasible?

And for a final thought: it seems non-obvious that one should only use training data to learn $f$ (in the composition $h = f \circ \psi$). The book mentions that $\psi$ is a way to encode prior knowledge, and that's fine, but the same criticism of discontinuity also applies here: it's probably not realistic to say that we have perfect confidence in $\psi$ and zero confidence in any other mapping. Furthermore, if $\psi$ can be improved, then (because training data has diminishing returns) it seems doubtful that it is optimal to use all training data on $f$. At least, there ought to be some amount of data for which it stops being optimal. In practical terms, one might choose a family of possible embeddings and apply meta-learning to choose between them.


The next (?) post will cover kernels, which are a sophisticated tool to learn high-dimensional problems. They were originally meant for this post, which instead ended up reaching full length without ever getting there.

Comments

the distance between a point $p$ and a set of points $S$ is defined as $\inf\{d(p,q) \mid q \in S\}$

Is this supposed to be min instead of inf (or am I misunderstanding the notation)?

It's supposed to be inf (the infimum). Which is the same as the minimum whenever the minimum exists, but sometimes it doesn't exist.

Suppose $S$ is the open interval $(0, 1)$, i.e. $S = \{x \in \mathbb{R} \mid 0 < x < 1\}$, and the point $p$ is 3. Then the set $\{d(p, q) \mid q \in S\}$ doesn't have a smallest element. Something like $d(3, 0.999) = 2.001$ is pretty close, but you can always find a pair that's even closer. So the distance is defined as the largest lower bound on the set $\{d(p, q) \mid q \in S\}$, which is the infimum, in this case 2.