Previously in the series: The laws of large numbers and Basics of Bayesian learning.
As a very basic sketch, in order to specify an ML algorithm one needs five pieces of data.
In most theoretical analyses of models, one conceptualizes the learning algorithm as gradient descent. Some more sophisticated pictures (such as the tensor programs series) match realistic stochastic gradient descent more carefully by assuming a discrete process with finite learning steps rather than continuous gradient descent. Learning algorithms used in industry tend to include more sophisticated control over the gradients via techniques like learning-rate decay, momentum, Adam, etc.
All of these algorithms have in common the property of being sequential and local (i.e., there is some step-to-step learning process that ends when it converges to an approximate local minimum). However, when working theoretically, a learning algorithm doesn't have to be local or sequential.
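To make "sequential and local" concrete, here is a minimal sketch of this shared step-to-step loop, in plain NumPy with a made-up quadratic loss; the momentum update is just one representative of the family above, and all names here are illustrative:

```python
import numpy as np

def grad(w):
    # Toy quadratic loss L(w) = ||w||^2 / 2, whose gradient is just w.
    return w

w = np.array([1.0, -2.0])         # current weights
velocity = np.zeros_like(w)       # momentum buffer
lr, beta = 0.1, 0.9               # learning rate and momentum coefficient

for step in range(1000):
    g = grad(w)                   # "local": uses only the current point
    velocity = beta * velocity + g
    w = w - lr * velocity         # "sequential": one small step at a time
    if np.linalg.norm(g) < 1e-8:  # stop near an approximate local minimum
        break
```

Plain gradient descent is the special case $\beta = 0$; Adam additionally rescales each coordinate by a running estimate of the gradient's second moment. The point is only that every variant is an iterated, local update rule.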
Bayesian learning is one such (non-local and non-sequential) learning algorithm. This algorithm converts the learning problem to a Bayesian inference problem. Here the dictionary from ML to statistics is as follows:
Finally, the main new aspect of the Bayesian model is that "component 5", i.e., the "learning algorithm/optimizer" in the list of ML system components above, is set to be "Bayesian inference" (instead of one of the local algorithms used in conventional learning). Here recall that Bayesian inference returns the stochastic (rather than deterministic) posterior distribution on weights, $p(w \mid D)$, given by conditionalizing on the observed data $D$.
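As a minimal sketch of what "set the optimizer to Bayesian inference" means, consider a toy one-parameter model $y = wx + \text{noise}$ with a Gaussian prior on $w$; the specific model, noise scale, and grid below are all illustrative choices, not part of the general paradigm:

```python
import numpy as np

# Toy model: y = w * x + Gaussian noise, with a single scalar weight w.
xs = np.array([0.0, 1.0, 2.0])    # observed inputs
ys = np.array([0.1, 0.9, 2.1])    # observed outputs (the data D)
sigma = 0.3                       # assumed noise scale

w_grid = np.linspace(-3, 3, 2001)
prior = np.exp(-w_grid**2 / 2)    # unnormalized Gaussian prior p(w)

# Log-likelihood log p(D | w) at each grid point.
residuals = ys[None, :] - w_grid[:, None] * xs[None, :]
log_lik = -0.5 * np.sum(residuals**2, axis=1) / sigma**2

# Posterior p(w | D) ∝ p(D | w) p(w), normalized on the grid.
posterior = prior * np.exp(log_lik - log_lik.max())
posterior /= posterior.sum() * (w_grid[1] - w_grid[0])
```

Note that the output of "learning" here is the whole density `posterior`, not a single optimized weight, and no step-to-step dynamics were involved.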
Some observations.
All the algorithms introduced above serve to learn (either deterministically or stochastically) a weight $w$ given some data $D$. In the Bayesian inference context, the stochastic learning follows the posterior probability distribution
$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)} \propto p(D \mid w)\, p(w).$$
However, no one (whether in ML learning or in Bayesian inference) is really interested in learning the parameter $w$ itself: it lives in some abstracted space of latents or weights. What we are really interested in is prediction: namely, given a set of observations $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, together with a new (and in general, previously unobserved) input value $x$, we want to extract a (stochastic or deterministic) predicted value $y$.
The reason why it's generally enough to focus on inference is that in both Bayesian and machine learning, learning leads to prediction. Namely, given a (deterministic or sampled) latent parameter $w$, we automatically get a predicted value by setting $y = f_w(x)$. Here note that the randomness of $y$ can come from two sources: both the function $f_w$ and the latent $w$ can be stochastic in general.
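Continuing the toy grid example from above, here is a hedged sketch of this prediction step, with both sources of randomness visible (the stand-in posterior below is a placeholder for whatever $p(w \mid D)$ one actually computed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a grid posterior p(w | D) computed as in the sketch above.
w_grid = np.linspace(-3, 3, 2001)
posterior = np.exp(-(w_grid - 1.0)**2 / (2 * 0.02))
posterior /= posterior.sum()
sigma = 0.3                       # assumed noise scale of the model

x_new = 1.5
# Randomness source 1: the latent w is sampled from the posterior.
w_samples = rng.choice(w_grid, size=10_000, p=posterior)
# Randomness source 2: the prediction map itself is stochastic (noise).
y_samples = w_samples * x_new + sigma * rng.normal(size=w_samples.shape)
```

The histogram of `y_samples` approximates the posterior predictive $p(y \mid x, D)$; both the spread of the posterior on $w$ and the noise in the map $w \mapsto y$ contribute to its width.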
Thus most learning paradigms function via the following pipeline:
Model + data → Learned posterior on latents → Prediction $y$.
While most Bayesian and machine learning contexts tend to view prediction as an afterthought, in the following section we will focus on decoupling prediction from the rest of the learning paradigm.
The key idea that leads to the field-theoretic paradigm on learning (though it is generally not introduced in this way) is cutting inference out of the prediction problem. This is easier to do in the Bayesian learning setting, though it is also entirely possible in other ML settings[2]. For today's post at least, we will focus on the Bayesian learning context; note that in theoretical analyses, the Bayesian paradigm is often easier to work with, as it corresponds to a "thermostatic" rather than a more general "thermodynamic" picture of learning.
Recall the pipeline I mentioned in the previous section:
Model + data → Learned posterior on latents → Prediction $y$.
We will enter the “physics” picture by cutting out the middle man and instead considering the shorter pipeline:
Model + data → Prediction $y$.
In the Bayesian paradigm, prediction can be conceptualized without ever discussing latents. Namely, going back to the bayesics, after abstracting everything away, a choice of model + prior implies a joint probability distribution on data: $p(x_1, y_1, \ldots, x_n, y_n)$. Now $n$ is just another variable here, and so we can throw in "for free" an extra pair of datapoints:
$$p(x_1, y_1, \ldots, x_n, y_n, x, y).$$
The Bayesian prediction can now be rewritten as follows:
$$p(y \mid x, D) = p(y \mid x_1, y_1, \ldots, x_n, y_n, x).$$
Here, out of the variables $(x_1, y_1, \ldots, x_n, y_n, x, y)$, we conditionalize on $(x_1, y_1, \ldots, x_n, y_n, x)$, namely all the variables except $y$, and for our prediction we draw $y$ from the resulting posterior distribution.
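To see how prediction can run entirely on the joint distribution of the data, here is an illustrative sketch in which we simply *assume* the joint distribution of the outputs $(y_1, \ldots, y_n, y)$ given the inputs is a multivariate Gaussian with some covariance kernel (the RBF kernel below is an arbitrary stand-in); then prediction is literally just Gaussian conditioning:

```python
import numpy as np

def kernel(a, b, scale=1.0):
    # Illustrative stand-in covariance between outputs at inputs a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / scale**2)

xs = np.array([0.0, 1.0, 2.0])    # observed inputs x_1, ..., x_n
ys = np.array([0.1, 0.9, 2.1])    # observed outputs y_1, ..., y_n
x_new = np.array([1.5])           # the new input x

# Joint covariance of (y_1, ..., y_n, y): no weights w in sight.
K_dd = kernel(xs, xs) + 1e-6 * np.eye(len(xs))
K_dn = kernel(xs, x_new)
K_nn = kernel(x_new, x_new)

# Condition the joint Gaussian on everything except y.
tmp = np.linalg.solve(K_dd, K_dn)
mean = tmp.T @ ys                 # E[y | x, D]
var = K_nn - K_dn.T @ tmp         # Var[y | x, D]
```

Everything model-specific has been compressed into the kernel; the cumulant analysis discussed below is what justifies a (nearly) Gaussian ansatz like this for wide networks.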
Now while the latent parameters have been flushed out of these expressions, they're still there, just, well, latent. The key idea in the "physics approach" to machine learning is that the prediction problem is more physical than the inference problem (at least in many cases). The specifics of the model, and the process of converting an abstracted-out weight $w$ into a nice prediction function $f_w$, matter for our analysis, to be sure. But they matter as a back-end "sausage making" process. Physicists love taking such complex processes and replacing the sausagey specifics with summary analyses. In other words, the typical physicist move is to start with a complex system, then observe that most of the components of the system don't matter for the result at a given level of granularity, and that what matters is some extracted-out, averaged or "massaged" values that are mathematically much nicer to analyze. The art here is to extract the correct summary variables and to (as carefully as possible) track the relationship between different aspects of precision and scale.
Ultimately, we've seen that our prediction problem reduces to a conditional probability problem $p(y \mid x, D)$, conditionalizing on the observed data and the new input. In today's paradigm (and for most of the rest of our discussion of the "field-theoretic view of ML"), we will assume that the size $n$ of the dataset $D$ is very small compared to the network width, perhaps only $n = O(1)$. Thus the problem of conditionalizing on $2n+1$ variables is taken to be "easy", and all we need to do, really, is to find the probability distribution on tuples of $n+1$ input-output pairs. Since in this question the "predicted" input-output pair $(x, y)$ plays the same role as the "known" pairs $(x_i, y_i)$, we can drop the distinction between them and consider (without loss of generality) only the problem of finding the probability $p(x_1, y_1, \ldots, x_n, y_n)$ of some set of input-output data $D$.
Now we can't keep our sausages wrapped forever: at some point we have to take a peek inside. And when we do, we notice that this probability distribution on the data is induced directly from the prior on weights:
$$p(D) = p(x_1, y_1, \ldots, x_n, y_n) = \int p(D \mid w)\, p(w)\, dw.$$
In other words, what we want to do is average the likelihood $p(D \mid w)$ over the prior $p(w)$, marginalizing out the latent parameter $w$ entirely.
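As a minimal sketch of this marginalization, again in the toy one-parameter model from above (all specific names and scales being illustrative), one can estimate both $p(D)$ and the prediction density $p(y \mid x, D)$ by averaging over *prior* samples, never constructing a posterior on $w$:

```python
import numpy as np

rng = np.random.default_rng(0)

xs = np.array([0.0, 1.0, 2.0])    # observed inputs
ys = np.array([0.1, 0.9, 2.1])    # observed outputs
sigma = 0.3                       # assumed noise scale

def lik(w, x, y):
    # p(y | x, w) for the toy model y = w * x + Gaussian noise.
    return np.exp(-0.5 * (y - w * x)**2 / sigma**2) / (sigma * np.sqrt(2 * np.pi))

# p(D) = E_{w ~ p(w)} [ p(D | w) ], estimated with samples from the prior.
w = rng.normal(size=100_000)
p_D_given_w = np.prod(lik(w[:, None], xs[None, :], ys[None, :]), axis=1)
p_D = p_D_given_w.mean()

# The same average gives the joint density with one extra pair thrown in,
# and hence the prediction p(y | x, D) = p(D, x, y) / p(D, x).
x_new, y_grid = 1.5, np.linspace(-1.0, 4.0, 501)
joint = (p_D_given_w[:, None] * lik(w[:, None], x_new, y_grid[None, :])).mean(axis=0)
p_y_given_xD = joint / p_D
```

This is exactly the short pipeline: model + data in, prediction out, with the weights integrated out behind the scenes.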
The key idea now is to use an analysis similar to the discussion in the "laws of large numbers" post to see that, for large width, we can extract all the information we need for a good approximation[3] of this probability distribution by observing how the first few cumulants (equivalently, moments) of $D$ transform from layer to layer of our neural net.
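As a sketch of what "watching the first few cumulants transform from layer to layer" can look like in practice, here is an illustrative experiment (widths, depth, the tanh nonlinearity, and the $1/\sqrt{\text{width}}$ scaling are all arbitrary choices on my part): sample many random wide MLPs on a fixed input and track the empirical second and fourth moments of the preactivations at each layer.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, n_nets = 256, 4, 2000

x = rng.normal(size=width)        # a fixed input with O(1) entries
m2 = np.zeros(depth)              # second moments of preactivations, per layer
m4 = np.zeros(depth)              # fourth moments of preactivations, per layer

for _ in range(n_nets):
    h = x
    for layer in range(depth):
        # Standard 1/sqrt(width) scaling keeps preactivations O(1).
        W = rng.normal(size=(width, width)) / np.sqrt(width)
        z = W @ h                 # preactivations at this layer
        m2[layer] += np.mean(z**2) / n_nets
        m4[layer] += np.mean(z**4) / n_nets
        h = np.tanh(z)

for layer in range(depth):
    # For an exactly Gaussian preactivation, m4 = 3 * m2**2; the excess
    # kurtosis below measures the leading finite-width deviation (it should
    # shrink as width grows, up to Monte Carlo noise).
    print(f"layer {layer}: var={m2[layer]:.3f}  "
          f"excess kurtosis={m4[layer] / m2[layer]**2 - 3:+.4f}")
```

At large width the low cumulants are the whole story, which is exactly the law-of-large-numbers style simplification alluded to above.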
I’m going to save a more detailed analysis here for future posts (some of which will be joint with Lauren Greenspan). But before concluding, let me just state a few of the key points that you will see explained if you read our future posts.
Small print: more generally, if it's not reasonable to assume that a maximizing $w$ is unique, one should take the uniform distribution on the set of maximizing $w$'s.
And worked out, both in PDLT and, for a larger class of learning algorithms, in the tensor programs series.
Notice that an assumed choice of scale is inherent here.