Leslie Valiant discusses the the "probably approximately correct," or PAC, model of machine learning.


Wow, this is quite interesting. What are your thoughts?

You made a number of important contributions to parallel computing in the 1980s. Can you tell me about your more recent work in that arena?

The root problem with parallel machines has not been that any one is inherently bad, but more that it is difficult to make sense of them as a totality. The most striking fact about sequential machines is that they are all the same—they emulate the von Neumann abstraction. This sameness has been vital to the growth of computing since it has enabled us to write portable software for this abstraction and make machines to emulate it. Thus the von Neumann model has served as the vital "bridging model" between the software and hardware.

I have been long interested in the question of whether a similarly effective bridging model exists for parallel machines and what that might be. Parallelism has now arrived with multiple cores in our laptops. My view is that the irresistible driving forces that have brought this about have been the imperatives of physics. It follows, I think, that we have to look to the driving physical limitations to see what is most fundamental about parallel computing. Computation, communication, synchronization, and memory all need various resources, which in turn are limited by the physical arrangements of the devices. The sought-after bridging model has to account for these costs. We have been spoiled over the last 60-odd years by having an abstraction as simple as the von Neumann model suffice. Life will become a little more complicated now. The absence of an accepted bridging model is evident even at the theoretical level of algorithms.

My most recent work addresses these issues via a particular proposed bridging model, called the Multi-BSP, which tries to capture the physical limitations of machines as simply as possible. It uses barrier synchronization in a hierarchical manner and idealizes a machine as a set of numerical parameters that specify a point in the space of possible machines. Mention of the many architectural details of current machines that are not inevitably suggested by physics is deliberately omitted.

Let's talk about your most recent work in computational neuroscience. Can you explain the "neuroidal model" you developed in your book Circuits of the Mind?

The neuroidal model is designed to describe how basic computational tasks are carried out in the brain at the level of neurons. We do not know what changes the brain undergoes when it memorizes a new word or makes a new association. We need some language for describing the alternative algorithms that a network of neurons may be implementing. Memorizing a new word is easy enough for a computer, but it is still mysterious how the brain does it. Devising algorithms that do not obviously contradict biology is not easy, because biology appears very constrained. Each neuron is connected to only a small fraction of the others, and its influence on each of those is believed to be weak on the average.

You also developed a formal system called robust logics that seeks to reconcile what you've described as "the apparent logical nature of reasoning and the statistical nature of learning."

Robust logic is aimed at tasks which we do not understand how to do on any computational device, let alone the brain. The tasks in question are those of reasoning using information that is learned from experience and therefore uncertain. Humans manage to make effective decisions even in an uncertain world and even with only partial knowledge. This skill is close to the heart of intelligence, I believe, and understanding it better raises some fundamental scientific questions.

It's interesting how terms lose their original relevance. Nowadays it would seem silly to name a subfield of machine learning "probably approximately correct", since that describes all of machine learning.

From the article:

Successive contributions by Kearns, Schapire, and Freund not only resolved this question in the very unexpected positive direction, but yielded the boosting technique, perhaps the most powerful general learning method now known.

Valiant is talking specifically about the AdaBoost algorithm. I'm not sure boosting is the most powerful general learning method now known - the support vector machine seems more powerful - but it certainly has the best ratio of power:simplicity. Anyone with a passing interest in machine learning should implement AdaBoost and play around with it a bit.

Valiant is not talking specifically about AdaBoost, although AdaBoost was the first of these algorithms and is well known due to wide proliferation. See this which succinctly describes the differences of some of the different boosters out there. In particular the linked paper by Philip Long at Google is really nice for showing the limitations of boosters and also understanding the fact that boosters are really nothing more than a specialized gradient descent if you recast them in the right way.

I'm not sure boosting is the most powerful general learning method now known - the support vector machine seems more powerful

This is not what is meant here. SVM is a classification rule itself, whereas boosting is a metarule that operates on classification rules themselves and attempts to make coherent use of multiple decision rules each with different degrees of confidence and error. It makes no sense to compare the usefulness of SVMs to the usefulness of boosting, boosting operates on SVMs. To boot, generalized kernel learning methods, sparse dictionary coding, bag-of-words, and Reproducing Kernel Hilbert Space methods all have many cases where they are vastly superior to SVM. For that matter, even simpler methods like Fisher Linear Discriminant can outperform SVM in a lot of practical cases. And SVM lacks much extension to fully unsupervised learning.

I think Valiant, whose office sits down the hall from my adviser's and who I have frequent conversations with, is on the money with this stuff.

It makes no sense to compare the usefulness of SVMs to the usefulness of boosting If an SVM outperforms a boosted-whatever, then it does make sense to compare them.

boosting operates on SVMs Except that in practice no one uses SVMs as the base learners for boosting (as far as I know). I don't think it would work very well, since basic SVMs are linear models, and adding multiple linear models is useless. Boosting is usually done with decision trees or decision stumps.

bag-of-words That is a feature representation, and it has little to do with the learning method. You could encode a text as bag-of-words, and train an SVM on these features.

Reproducing Kernel Hilbert Space methods Kernel SVM ''is'' a RKHS method, in fact, it is basically the prototypical one.

bag-of-words That is a feature representation, and it has little to do with the learning method. You could encode a text as bag-of-words, and train an SVM on these features.

Yes, sure, but the most generic way is just to look at a historgram distance between word occurrences. I guess that would generically fall under k-means or similar methods, but that's what I was referring to by citing bag-of-words as a method on its own. Of course you can mix and match and cascade all of these to produce different methods.

I think PAC learning refers specifically to algorithms with rigorous bounds on their probability of error. There are theorems saying that Bayesian updaters get this automatically, but as far as I know they assume that (1) nature's distribution coincides with your subjective distribution and (2) you are implementing inference exactly. Since both of these are pretty dubious, I don't think most statistical machine learning is considered PAC. The examples that would come to mind in my view are things like boosting, multi-armed bandit, and sparse recovery, but I'm not really an expert in this area.

ETA: There are also still tons of people adopting the symbolic approach to AI, which is another reason why I'd be hesitant to label all of machine learning as PAC (even if the symbolic approach seems likely to be a dead end, I'd rather not pretend that it's not still in use).

I agree that in practice PAC designates a specific subset of ideas and papers, my point is just that "probably approximately correct" doesn't help to distinguish that subset (words like boosting and multi-armed bandit do). The VC-theory is totally based on PAC-style results (though it would be better to say PAC is based on VC-style results) and the MDL/MML people have similar generalization theorems as well.