Previously in the series: The laws of large numbers and Basics of Bayesian learning.
As a very basic sketch, in order to specify an ML algorithm one needs five pieces of data.
In most theoretical analyses of models, one conceptualizes the learning algorithm as gradient descent. Some more sophisticated pictures (such as the tensor programs series) match realistic stochastic gradient descent more carefully by assuming a discrete process with finite learning steps rather than continuous gradient descent. Learning algorithms used in industry tend to include more sophisticated control over the gradients via techniques like learning-rate decay, momentum, Adam, etc.
All of these algorithms have in common the property of being sequential and local (i.e., there is some step-to-step learning process that ends when it converges to an approximate local minimum). However, when working theoretically, a learning algorithm doesn't have to be local or sequential.
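To make "sequential and local" concrete, here is a minimal sketch of this shared step-to-step loop, in plain NumPy with a made-up quadratic loss; the momentum update is just one representative of the family above, and all names here are illustrative:

```python
import numpy as np

def grad(w):
    # Toy quadratic loss L(w) = ||w||^2 / 2, whose gradient is just w.
    return w

w = np.array([1.0, -2.0])         # current weights
velocity = np.zeros_like(w)       # momentum buffer
lr, beta = 0.1, 0.9               # learning rate and momentum coefficient

for step in range(1000):
    g = grad(w)                   # "local": uses only the current point
    velocity = beta * velocity + g
    w = w - lr * velocity         # "sequential": one small step at a time
    if np.linalg.norm(g) < 1e-8:  # stop near an approximate local minimum
        break
```

Plain gradient descent is the special case $\beta = 0$; Adam additionally rescales each coordinate by a running estimate of the gradient's second moment. The point is only that every variant is an iterated, local update rule.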
Bayesian learning is one such (non-local and non-sequential) learning algorithm. This algorithm converts the learning problem to a Bayesian inference problem. Here the dictionary from ML to statistics is as follows:
Finally, the main new aspect of the Bayesian model is that "component 5", i.e., the "learning algorithm/optimizer" in the list of ML system components above, is set to be "Bayesian inference" (instead of one of the local algorithms used in conventional learning). Here recall that Bayesian inference returns the stochastic (rather than deterministic) posterior distribution on weights, $p(w \mid D)$, given by conditionalizing on the observed data $D$.
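As a minimal sketch of what "set the optimizer to Bayesian inference" means, consider a toy one-parameter model $y = wx + \text{noise}$ with a Gaussian prior on $w$; the specific model, noise scale, and grid below are all illustrative choices, not part of the general paradigm:

```python
import numpy as np

# Toy model: y = w * x + Gaussian noise, with a single scalar weight w.
xs = np.array([0.0, 1.0, 2.0])    # observed inputs
ys = np.array([0.1, 0.9, 2.1])    # observed outputs (the data D)
sigma = 0.3                       # assumed noise scale

w_grid = np.linspace(-3, 3, 2001)
prior = np.exp(-w_grid**2 / 2)    # unnormalized Gaussian prior p(w)

# Log-likelihood log p(D | w) at each grid point.
residuals = ys[None, :] - w_grid[:, None] * xs[None, :]
log_lik = -0.5 * np.sum(residuals**2, axis=1) / sigma**2

# Posterior p(w | D) ∝ p(D | w) p(w), normalized on the grid.
posterior = prior * np.exp(log_lik - log_lik.max())
posterior /= posterior.sum() * (w_grid[1] - w_grid[0])
```

Note that the output of "learning" here is the whole density `posterior`, not a single optimized weight, and no step-to-step dynamics were involved.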
Some observations.
All the algorithms introduced above serve to learn (either deterministically or stochastically) a weight $w$ given some data $D$. In the Bayesian inference context, the stochastic learning follows the posterior probability distribution
$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)} \propto p(D \mid w)\, p(w).$$
However, no one (whether in ML learning or in Bayesian inference) is really interested in learning the parameter $w$ itself: it lives in some abstracted space of latents or weights. What we are really interested in is prediction: namely, given a set of observations $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, together with a new (and in general, previously unobserved) input value $x$, we want to extract a (stochastic or deterministic) predicted value $y$.
The reason why it's generally enough to focus on inference is that in both Bayesian and machine learning, learning leads to prediction. Namely, given a (deterministic or sampled) latent parameter $w$, we automatically get a predicted value by setting $y = f_w(x)$. Here note that the randomness of $y$ can come from two sources: both the function $f_w$ and the latent $w$ can be stochastic in general.
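Continuing the toy grid example from above, here is a hedged sketch of this prediction step, with both sources of randomness visible (the stand-in posterior below is a placeholder for whatever $p(w \mid D)$ one actually computed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a grid posterior p(w | D) computed as in the sketch above.
w_grid = np.linspace(-3, 3, 2001)
posterior = np.exp(-(w_grid - 1.0)**2 / (2 * 0.02))
posterior /= posterior.sum()
sigma = 0.3                       # assumed noise scale of the model

x_new = 1.5
# Randomness source 1: the latent w is sampled from the posterior.
w_samples = rng.choice(w_grid, size=10_000, p=posterior)
# Randomness source 2: the prediction map itself is stochastic (noise).
y_samples = w_samples * x_new + sigma * rng.normal(size=w_samples.shape)
```

The histogram of `y_samples` approximates the posterior predictive $p(y \mid x, D)$; both the spread of the posterior on $w$ and the noise in the map $w \mapsto y$ contribute to its width.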
Thus most learning paradigms function via the following pipeline:
Model + data → Learned posterior on latents → Prediction $y$.
While most Bayesian and machine learning contexts tend to view prediction as an afterthought, in the following section we will focus on decoupling prediction from the rest of the learning paradigm.
The key idea that leads to the field-theoretic paradigm on learning (though it is generally not introduced in this way) is cutting inference out of the prediction problem. This is easier to do in the Bayesian learning setting, though it is also entirely possible in other ML settings[2]. For today's post at least, we will focus on the Bayesian learning context; note that in theoretical analyses, the Bayesian paradigm is often easier to work with, as it corresponds to a "thermostatic" rather than a more general "thermodynamic" picture of learning.
Recall the pipeline I mentioned in the previous section:
Model + data → Learned posterior on latents → Prediction $y$.
We will enter the “physics” picture by cutting out the middle man and instead considering the shorter pipeline:
Model + data → Prediction $y$.
In the Bayesian paradigm, prediction can be conceptualized without ever discussing latents. Namely, going back to the bayesics, after abstracting everything away, a choice of model + prior implies a joint probability distribution on data: $p(x_1, y_1, \ldots, x_n, y_n)$. Now $n$ is just another variable here, and so we can throw in "for free" an extra pair of datapoints:
$$p(x_1, y_1, \ldots, x_n, y_n, x, y).$$
The Bayesian prediction can now be rewritten as follows:
$$p(y \mid x, D) = p(y \mid x_1, y_1, \ldots, x_n, y_n, x).$$
Here, out of the variables $(x_1, y_1, \ldots, x_n, y_n, x, y)$, we conditionalize on $(x_1, y_1, \ldots, x_n, y_n, x)$, namely all the variables except $y$, and for our prediction we draw $y$ from the resulting posterior distribution.
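To see how prediction can run entirely on the joint distribution of the data, here is an illustrative sketch in which we simply *assume* the joint distribution of the outputs $(y_1, \ldots, y_n, y)$ given the inputs is a multivariate Gaussian with some covariance kernel (the RBF kernel below is an arbitrary stand-in); then prediction is literally just Gaussian conditioning:

```python
import numpy as np

def kernel(a, b, scale=1.0):
    # Illustrative stand-in covariance between outputs at inputs a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / scale**2)

xs = np.array([0.0, 1.0, 2.0])    # observed inputs x_1, ..., x_n
ys = np.array([0.1, 0.9, 2.1])    # observed outputs y_1, ..., y_n
x_new = np.array([1.5])           # the new input x

# Joint covariance of (y_1, ..., y_n, y): no weights w in sight.
K_dd = kernel(xs, xs) + 1e-6 * np.eye(len(xs))
K_dn = kernel(xs, x_new)
K_nn = kernel(x_new, x_new)

# Condition the joint Gaussian on everything except y.
tmp = np.linalg.solve(K_dd, K_dn)
mean = tmp.T @ ys                 # E[y | x, D]
var = K_nn - K_dn.T @ tmp         # Var[y | x, D]
```

Everything model-specific has been compressed into the kernel; the cumulant analysis discussed below is what justifies a (nearly) Gaussian ansatz like this for wide networks.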
Now while the latent parameters have been flushed out of these expressions, they're still there, just, well, latent. The key idea in the "physics approach" to machine learning is that the prediction problem is more physical than the inference problem (at least in many cases). The specifics of the model, and the process of converting an abstracted-out weight $w$ into a nice prediction function $f_w$, matter for our analysis, to be sure. But they matter as a back-end "sausage making" process. Physicists love taking such complex processes and replacing the sausagey specifics with summary analyses. In other words, the typical physicist move is to start with a complex system, then observe that most of the components of the system don't matter for the result at a given level of granularity, and that what matters is some extracted-out, averaged or "massaged" values that are mathematically much nicer to analyze. The art here is to extract the correct summary variables and to (as carefully as possible) track the relationship between different aspects of precision and scale.
Ultimately, we've seen that our prediction problem reduces to a conditional probability problem $p(y \mid x, D)$, conditionalizing on the observed data and the new input. In today's paradigm (and for most of the rest of our discussion of the "field-theoretic view of ML"), we will assume that the size $n$ of the dataset $D$ is very small compared to the network width, perhaps only $n = O(1)$. Thus the problem of conditionalizing on $2n+1$ variables is taken to be "easy", and all we need to do, really, is to find the probability distribution on tuples of $n+1$ input-output pairs. Since in this question the "predicted" input-output pair $(x, y)$ plays the same role as the "known" pairs $(x_i, y_i)$, we can drop the distinction between them and consider (without loss of generality) only the problem of finding the probability $p(x_1, y_1, \ldots, x_n, y_n)$ of some set of input-output data $D$.
Now we can't keep our sausages wrapped forever: at some point we have to take a peek inside. And when we do, we notice that this probability distribution on the data is induced directly from the prior on weights:
$$p(D) = p(x_1, y_1, \ldots, x_n, y_n) = \int p(D \mid w)\, p(w)\, dw.$$
In other words, what we want to do is average the likelihood $p(D \mid w)$ over the prior $p(w)$, marginalizing out the latent parameter $w$ entirely.
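As a minimal sketch of this marginalization, again in the toy one-parameter model from above (all specific names and scales being illustrative), one can estimate both $p(D)$ and the prediction density $p(y \mid x, D)$ by averaging over *prior* samples, never constructing a posterior on $w$:

```python
import numpy as np

rng = np.random.default_rng(0)

xs = np.array([0.0, 1.0, 2.0])    # observed inputs
ys = np.array([0.1, 0.9, 2.1])    # observed outputs
sigma = 0.3                       # assumed noise scale

def lik(w, x, y):
    # p(y | x, w) for the toy model y = w * x + Gaussian noise.
    return np.exp(-0.5 * (y - w * x)**2 / sigma**2) / (sigma * np.sqrt(2 * np.pi))

# p(D) = E_{w ~ p(w)} [ p(D | w) ], estimated with samples from the prior.
w = rng.normal(size=100_000)
p_D_given_w = np.prod(lik(w[:, None], xs[None, :], ys[None, :]), axis=1)
p_D = p_D_given_w.mean()

# The same average gives the joint density with one extra pair thrown in,
# and hence the prediction p(y | x, D) = p(D, x, y) / p(D, x).
x_new, y_grid = 1.5, np.linspace(-1.0, 4.0, 501)
joint = (p_D_given_w[:, None] * lik(w[:, None], x_new, y_grid[None, :])).mean(axis=0)
p_y_given_xD = joint / p_D
```

This is exactly the short pipeline: model + data in, prediction out, with the weights integrated out behind the scenes.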
The key idea now is to use an analysis similar to the discussion in the "laws of large numbers" post to see that, for large width, we can extract all the information we need for a good approximation[3] of this probability distribution by observing how the first few cumulants (equivalently, moments) of $D$ transform from layer to layer of our neural net.
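As a sketch of what "watching the first few cumulants transform from layer to layer" can look like in practice, here is an illustrative experiment (widths, depth, the tanh nonlinearity, and the $1/\sqrt{\text{width}}$ scaling are all arbitrary choices on my part): sample many random wide MLPs on a fixed input and track the empirical second and fourth moments of the preactivations at each layer.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, n_nets = 256, 4, 2000

x = rng.normal(size=width)        # a fixed input with O(1) entries
m2 = np.zeros(depth)              # second moments of preactivations, per layer
m4 = np.zeros(depth)              # fourth moments of preactivations, per layer

for _ in range(n_nets):
    h = x
    for layer in range(depth):
        # Standard 1/sqrt(width) scaling keeps preactivations O(1).
        W = rng.normal(size=(width, width)) / np.sqrt(width)
        z = W @ h                 # preactivations at this layer
        m2[layer] += np.mean(z**2) / n_nets
        m4[layer] += np.mean(z**4) / n_nets
        h = np.tanh(z)

for layer in range(depth):
    # For an exactly Gaussian preactivation, m4 = 3 * m2**2; the excess
    # kurtosis below measures the leading finite-width deviation (it should
    # shrink as width grows, up to Monte Carlo noise).
    print(f"layer {layer}: var={m2[layer]:.3f}  "
          f"excess kurtosis={m4[layer] / m2[layer]**2 - 3:+.4f}")
```

At large width the low cumulants are the whole story, which is exactly the law-of-large-numbers style simplification alluded to above.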
I’m going to save a more detailed analysis here for future posts (some of which will be joint with Lauren Greenspan). But before concluding, let me just state a few of the key points that you will see explained if you read our future posts.
Small print: more generally, if it's not reasonable to assume that a maximizing $w$ is unique, one should take the uniform distribution on the set of maximizing $w$'s.
And worked out, both in PDLT and, for a larger class of learning algorithms, in the tensor programs series.
Notice that an assumed choice of scale is inherent here.