Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Mentioned in

VC Theory Overview

2Alexander Gietelink Oldenziel

1Joar Skalse

New Comment

2 comments, sorted by Click to highlight new comments since: Today at 8:12 AM

From reading your post it seems that classical VC theory gives vacuous bounds for NN learning behaviour. Correct me if I'm wrong. You say that the PAC formalism can be improved to be more realistic and suggest more non-vacuous bounds may be proved. Do you have a reference where non-vacuous bounds are proved?

The bounds are not exactly vacuous -- in fact, they are (in a sense) tight. However, they concern a somewhat adversarial setting, where the data distribution may be selected arbitrarily (including by making it maximally opposed to the inductive bias of the learning algorithm). This means that the bounds end up being much larger than what you would typically observe in practice, if you give typical problems to a learning algorithm whose inductive bias is attuned to the structure of "typical" problems.

If you move from this adversarial setting to a more probabilistic setting, where you assume a fixed distribution over or , then you may be able to prove tighter probabilistic bounds. However, I do not have any references of places where this actually has been done (and as far as I know, it has not been done before).

In this post, I will give a brief overview of VC theory and computational learning theory. The reason for this is that I will later write a post about singular learning theory, in which I will refer back to this post. However, this post is self-contained, and may be interesting for anyone interested in theoretical machine learning.

What is Computational Learning Theory?In short, computational learning theory (CLT) is what happens when you combine complexity theory with machine learning. Just like complexity theory is concerned with understanding what kinds of problems are hard to compute, or easy to compute, CLT is concerned with understanding what kinds of things are easy to learn, or hard to learn. This means that CLT seeks to identify classes of problems that you can learn with a small amount of data, and classes of problems that no algorithm can learn unless it gets a large amount of data. This also means that CLT produces generalisation bounds for different kinds of learning algorithms, and other such results.

Probably Approximately Correct LearningIn order to study what kinds of problems are hard to learn, and easy to learn, we first need a mathematical formalism that captures what it means to "learn a problem". The most popular model in CLT is given by the probably approximately correct (PAC) learning framework. This is a model of supervised learning for classification. It uses the following components:

We have an instance space X, which contains all the data that we might observe. For example, X may be a vector space, or it may be the set of all bit strings, etc. If we are doing image classification, then X may be the set of all RGB images of a certain size, and so on.

We have a set of concepts C, where every element of C is a function c:X→{0,1}. This is the set of functions that we might be trying to learn. For example, if X is a vector space, then C could be the set of all linear classifiers on X. Alternatively, if X is {0,1}n, then C could be the set of all Boolean functions over n variables with circuit complexity ≤m. And so on.

We assume that we have a (potentially unknown) distribution D over X, and that there is an unknown function c:X→{0,1} that we are trying to learn. To do this, our learning algorithm will observe a number of data points (x,c(x)), where each x is sampled iid from D. Based on this dataset, it will guess a function h:X→{0,1}. We want this function to be similar to c with high probability.

Specifically, given a distribution D over X, a concept c∈C, and a hypothesis h:X→{0,1}, we say that the error of h is P(h(x)≠c(x)), where x is sampled from D.

A learning algorithm L is an algorithm that takes as input a dataset {(x,c(x))} of points, and returns a function h:X→{0,1}.

We say that L is a PAC-learning algorithm for C if there exists a polynomial p, such that for all ϵ and δ, and all distributions D over X and all concepts c∈C, if L is given access to at least p(1/ϵ,1/δ) samples from D, together with their labels according to c, then L will with probability at least 1−δ learn a function whose error is at most ϵ. Moreover, if there exists a PAC-learning algorithm for C, then we say that C is PAC-learnable.

In other words, PAC-learning is a mathematical formalism that describes supervised learning of (binary) classification algorithms. C is a set of classification functions, that we may be trying to learn. We want to know if there exists a supervised learning algorithm L, such that if L is given access to a small (polynomial) amount of training data, then it will with high probability learn a function with low error. We want this bound to hold for all c∈C, and all D over X, so that we can be sure that this bound holds even if we have no prior knowledge of c or D.

Note that, while this is required to hold for all D, the error of h is defined relative to the same distribution which the training data is drawn from. In other words, we assume that there is no distributional shift. This means that, while D is unknown, it is the case that if a particular data point has a low probability under the training distribution, then it must also have a low probability under the test distribution, and vice versa.

Note also that we are not concerned with the question of whether or not h=c. As far as this framework is concerned, it could be that L never proposes the right hypothesis, even in the limit of infinite data. As long as h has a low error, this is enough.

Ways To Generalise the PAC FormalismThere are some parts of the PAC-learning framework that seem somewhat strong, so before moving on, let me briefly make some comments on how this framework could be generalised.

First of all, we require that for all ϵ and for all δ, if L is given access to at least p(1/ϵ,1/δ) samples from D, together with their labels according to c, then L will with probability at least 1−δ learn a function whose error is at most ϵ. What if we only require that this holds for

someδ (smaller than 0.5) andsomeϵ (smaller than 0.5)?This generalisation makes no difference. First of all, if we have an algorithm L that with probability at least (say) 0.6 learns a function with error at most ϵ, then we can get an algorithm that with probability at least 1-δ learn a function whose error is at most ϵ, by simply running L multiple times and creating an ensemble of the functions that it learns. Similarly, if we have an algorithm that only is a PAC-learning algorithm for some ϵ, then this can similarly be used to construct a full PAC-learning algorithm. However, this is a bit more difficult to show, because simply running the algorithm multiple times is not guaranteed to work. The proof instead relies on creating multiple artificial data distributions, and then training the algorithm on these different distributions (for an overview, see the theory behind AdaBoost and similar algorithms).

Another thing is that this formalism requires L to work for

anydistribution D over X. Are there cases where there is a PAC-learning algorithm for some D, but not for all D? Here, the answer is actually yes. Trivially, we may assume that D only gives support to a single element of X; in that case, a single data point is always enough to learn a hypothesis h with zero error. Less trivially, there are concept classes which are PAC-learnable when D is uniform, but not in general. You could therefore create a version of the PAC-learning framework where you only require that everything works for one particular distribution. However, this generalisation ends up not being very insightful.A third thing is that the PAC formalism assumes that we are doing binary classification. This can be straightforwardly extended to multi-label classification, since classification with n labels is equivalent to learning n binary classifiers. Extending the formalism to deal with

regressionis also possible, but a bit more involved.Finally, the PAC formalism assumes that the true labeling function

cis contained inC, and it assumes that there is no label noise. You can lift these assumptions, without changing the big picture very much. I will briefly discuss this at the end of this post.Other Formalisms in CLTBefore we move on, let me briefly mention some of the other formalisms that are studied in CLT. While the PAC-learning framework probably is the most common one, there are also other frameworks. One option is to assume that there is label noise, which gets especially interesting (mathematically speaking) if the label noise is adversarial. Another option is to assume that the learning algorithm isn't just doing supervised learning, but that it also is able to do active learning. In this setting, we assume that L has access to a subroutine that it can query at unit cost, and which for an input x returns c(x). Interestingly, there are concept classes which are efficiently learnable with active learning, but not with supervised learning. You can also assume that the data isn't drawn i.i.d., but that it is instead correlated in some way. For example, it could come from a time series, or it could be selected intelligently (either helpfully or adversarially). For an overview of these things, see some standard textbook.

Vapnik-Chernovenkis DimensionYou may be surprised to learn that we have necessary and sufficient conditions that characterise when a concept class C is PAC-learnable. These results rely on the Vapnik-Chernovenkis (VC) dimension of C, which is a measure of the expressivity of C. It is defined like this:

Given a finite set S of points in X, we say that C "shatters" S if, for every subset S′ of S, there is a function c∈C such that, for x∈S, we have that c(x)=1 iff x∈S′. In other words,

Cis able to express every Boolean function overS. We then say that the VC dimension of C is the size of the largest set that it can shatter. If it can shatter some arbitrarily large sets, then we say that the VC dimension of C is infinite.For example, if C is the set of all linear classifiers in an

n-dimensional vector space, then the VC dimension of C is n+1. Alternatively, if C is the set of all Boolean functions overnvariables, then the VC dimension of C is 2n. IfCis the set of all computable functions on {0,1}⋆, then the VC dimension of C is infinite. IfCis the set of all classifiers for convex sets in Rn, then the VC dimension of C is infinite.If C has a finite cardinality, then the VC-dimension of C is at most log2|C|.

Putting it TogetherWe say that L is a consistent learner for C if, for every dataset that is consistent with some c∈C, L will output a hypothesis h∈C such that h fits all training data.

We then have that C is PAC-learnable if and only if C has a finite VC dimension. Moreover, in that case,

anyconsistent learner is a PAC-learning algorithm for C.If L is a consistent learner for C, where C has VC-dimension d, and L is given access to at least k((1/ϵ)log(1/δ)+(d/ϵ)log(1/ϵ)) data points, then for any c∈C and any D over

X,Lwill with probability at least 1−δ learn a function with error at most ϵ. Herekis a universal constant.Moreover, for any learning algorithm

LforC, ifChas VC-dimensiond, then we can find a c∈C and a D over X such that if L is given access to less than (d−1)/(32∗ϵ) data points, then the probability that it will learn a function h with error greater than ϵ, is at least δ. We can also find a c∈C and aDoverXfor which we need at least (1/4ϵ)log(1/4δ) data points.Thus, the VC-dimension of

Cgives us both an upper and a lower bound on how much data we need in order to probably learn an approximately correct function. Moreover, these bounds are nearly equal, and are attained by any consistent learning algorithm.What About Computational Cost?The version of PAC-learning that I have given above is only concerned with how much training data

Lneeds, and does not consider the computational cost ofL. This raises the question; are there concept classesCwhere we only need a polynomial amount of data to learnc, but where the cost of fitting a hypothesis to the training data is exponential (or worse)? The answer to this question is yes, given certain widely held assumptions about complexity theory and cryptography. Intuitively, there are cases whereCcan have a finite VC-dimension, but where fitting a hypothesis to the training data is NP-complete (for example). In this setting, I don't think there are neat necessary and sufficient conditions yet. I won't talk more about this here, but if you are interested, it may be worthwhile to read a textbook on CLT.What About Neural Networks?The VC dimension of a neural network depends on its architecture, parameter count, and activation function. However, unless you pick a very strange activation function, you will find that the VC dimension indeed is finite. Moreover, a neural network is of course also a consistent learner (if you train for long enough). This means that you can apply all the math above, and get both an upper and a lower bound on the amount of data that you need to ensure that a neural network will generalise well.

If you do this, you will find that the bound is very, very large -- much larger than the amount of data that in practice is enough to get good generalisation. This is because neural networks have a very large VC dimension, so the above bounds say that a lot of data should be required. Moreover, the VC dimension grows as you add more parameters, so the more parameters you have, the more data is required for the above bounds to apply (whereas in practice, adding more parameters typically improves performance, rather than decrease it).

Some people take this to mean that VC theory somehow doesn't apply to neural networks. This is not correct; the above results are mathematical theorems that apply to all learning machines, including neural networks. However, it is important to note that these bounds are

worst-casebounds -- they say that there existsa functionanda data distributionfor which neural networks require a large amount of data. And this is true; you can artificially construct a labeling function that a neural network can express, but which it takes a lot of data for it to learn.From this, we can immediately see where the PAC-learning formalism must be relaxed, in order to provide bounds that are more informative in practice. The PAC-learning framework demands that

Lprobably will learn an approximately correct hypothesisfor all c in C. In reality, there are many functions that a neural network in principle could express, but which will never come up in reality. The right way to augment the formalism would therefore be to assume that we have some prior probability distributionPoverC, and thatcis sampled fromP. We then wantLto probably learn an approximately correct hypothesis, where the probability is given relative to bothPandD,or something along those lines. We could perhaps also relax the data requirement, to allowpto be polynomial in 1/P(c), in addition to 1/δ and 1/ϵ.(Note that this is also why we get the result that any consistent learner attains both the upper and the lower bound; if

candDcan be selected arbitrarily, including adversarially, then the inductive bias of the learning algorithm doesn't matter.)Note that the VC theory framework still applies somewhat to this setting; we could assume that

cwith high probability is drawn from some subsetC'ofC, whereC'has a much lower VC-dimension thanC, and thatLwill attempt to find a hypothesis inC'that fits the training data. However, spelling all of this out in detail is still somewhat involved.What Does This Have To Do With Alignment?Understanding generalisation and inductive bias is important for alignment. VC theory does not provide that (at least not when it comes to neural networks), but it is an important part of the story of theoretical ML as it stands right now. I wrote this because I think VC theory is interesting, and because I will want to reference this post in an upcoming post on singular learning theory.

Addendum: Agnostic PAC LearningThe definition of PAC-learning makes two assumptions that are somewhat unsatisfying; first of all, it assumes that the true label function c is contained in C, and secondly, it assumes that there is no label noise in the training data (or test data). Both of these assumptions can be lifted, without changing the overall picture very much.

To do this, we will allow the data to be sampled from an arbitrary distribution D over X×{0,1} (so that each data point is a tuple, where the first element is interpreted as the input, and the second element as the label). This covers the case where there is no function c∈C which fits the data distribution perfectly. It also allows for label noise (and the noise isn't even required to be e.g. i.i.d.). Given such a distribution D, and a function h:X→{0,1}, we say that the error of h is errD(h)=PX,Y∼D(h(X)≠Y).

Now, we say that L is an "agnostic" PAC-learning algorithm with hypothesis class C if there exists a polynomial function p such that for every ϵ, every δ, and every distribution D over X×{0,1}, if L is given p(1/ϵ,1/δ) i.i.d. samples from D, then with probability at least 1−δ, L will learn a hypothesis h∈C such that errD(h)≤minh⋆∈CerrD(h⋆)+ϵ. Moreover, we say that C is agnostically PAC-learnable if there exists an agnostic PAC-learning algorithm for C.

In other words, in this setting, we allow label noise, and we allow data distributions which cannot be perfectly modeled by C. We are concerned with whether or not it is possible to learn a hypothesis h which is approximately optimal

among all functions inC, with high probability, using a reasonable amount of training data, while making very minimal assumptions about the underlying data distribution.Say that an

empirical risk minimiseris any learning algorithm which, for any dataset {(x,y)}, returns a function h∈C which minimises the empirical error on that dataset.We then have that C is agnostically PAC-learnable if and only if C has a finite VC-dimension. Moreover, in that case, any empirical risk minimiser is an agnostic PAC-learning algorithm for C. Thus, agnostic PAC-learning is characterised by the same conditions as PAC-learning.

We have that if L is an empirical risk minimiser for C, where C has VC-dimension d, then for any distribution D over X×{0,1}, if L is given access to at least u((d/ϵ2)+(1/ϵ2)log(1/δ)) samples from D, then L will with probability at least 1−δ learn a function h where errD(h)≤minh⋆∈CerrD(h⋆)+ϵ. Moreover, any agnostic PAC-learning algorithm for C needs at least l((d/ϵ2)+(1/ϵ2)log(1/δ)) samples. Here u and l are two constants.