TL;DR: This post derives an upper bound on the prediction error of Bayesian learning on neural networks. Unlike the bound from vanilla Singular Learning Theory (SLT), this bound also holds for out-of-distribution generalization, not just for in-distribution generalization. Along the way, it shows some connections between SLT and Algorithmic Information Theory (AIT).
Written at Goodfire AI.
Singular Learning Theory (SLT) describes Bayesian learning on neural networks. But it currently has some limitations. One of these limitations is that it assumes model training data are drawn independently and identically (IID) from some distribution, making it difficult to use SLT to describe out-of-distribution (OOD) generalization. If we train a model to classify pictures of animals taken outdoors, it's difficult to use vanilla SLT to figure out whether this model will generalize to correctly classify pictures of animals taken indoors. Technically, the main SLT theorems also assume that the model has seen an infinite number of training data points, though in practice they turn out to hold pretty well for finite data.
Another theory about Bayesian learning is Algorithmic Information Theory (AIT), which describes Solomonoff induction (SI) [1,2]. Solomonoff induction is Bayesian learning over an (infinite) ensemble of programs. It is, unfortunately, uncomputable. But it does not require an IID sampling assumption, does not rely on asymptotically infinite data, and does describe OOD generalization. AIT is also older, more developed, and more established than SLT. A lot of agent foundations work builds on the AIT framework.
So if we could derive an AIT-style description of Bayesian learning on neural networks, that description might tell us things about out-of-distribution generalization.
Here, we first derive an error bound for a computationally bounded variant of Solomonoff induction, and then derive a bound of the same style for Bayesian learning on neural networks.
The first part is mainly supposed to build intuition for the second, and help people see how the AIT picture and Bayesian learning on neural networks connect.[1]
I also hope that this post might help readers who are used to thinking about probability from a more E. T. Jaynes-esque, native-Bayesian point of view grok Singular Learning Theory, which is usually described in a pretty stat-mech-esque, second-language-Bayesian manner.
We consider a prediction setting with inputs $x$ (finite binary strings) and labels $y \in \{0,1\}$. We assume that the labels were generated by a probabilistic program $p$[2], i.e., $y_i$ is set by $p(y_i \mid x_i)$. We are given $n$ such input-label pairs $D_n = (x_1, y_1), \dots, (x_n, y_n)$.[3]
Our goal is to construct an inductor that predicts whether $y = 1$ given $x$, using Bayesian updating on past input-label pairs that it has already seen. To do this, we will just slightly modify the standard Solomonoff setup for probabilistic programs.
Let $U$ be a plain[4] monotone[5] universal Turing machine (UTM) with a binary alphabet, one-way read-only input and output tapes, and a two-way work tape. We define $P_\ell$, the set of all programs of length $\ell$ bits, and the set of augmented programs $\{r q : q \in P_\ell\}$, where $r$ is some 'read-in' program that first copies the input $x$ from the input tape to the first cells of the work tape, then sets the read head to the first bit of the program string $q$ and finally resets the UTM's internal state to its starting configuration.
Our inductor then makes its predictions using the ensemble of augmented programs:

$$P(y_{n+1} \mid x_{n+1}, D_n) \;=\; \sum_{q \in P_\ell} w_n(q)\, q(y_{n+1} \mid x_{n+1}),$$

where $q(y = 1 \mid x)$ is the output bit string of our UTM running the augmented program $rq$ with $x$ on the input tape, interpreted as the probability that $y = 1$[6], and $w_n(q)$ is the probability we assign to program $q$ after updating on the past prediction performance of $q$ on the input-label pairs $D_n$. For our starting prior, we take the uniform distribution over our set of programs, $w_0(q) = 2^{-\ell}$.
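To make this concrete, here is a tiny runnable sketch (in Python) of this kind of inductor. It keeps the Bayesian-updating logic above, but swaps the set of UTM programs for a handful of hand-written predictor functions, so everything in it (the example 'programs', the clipping constant) is illustrative rather than part of the actual construction.

```python
import numpy as np

# Toy stand-in for the inductor above. Instead of enumerating all 2^l programs
# on a UTM, we hand it a small list of "programs": functions mapping an input
# bit-string x to a probability that y = 1.

def make_inductor(programs):
    log_w = np.zeros(len(programs))            # uniform prior, in log space

    def predict(x):
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        return float(np.dot(w, [q(x) for q in programs]))

    def update(x, y):
        for k, q in enumerate(programs):
            p1 = np.clip(q(x), 1e-12, 1 - 1e-12)
            log_w[k] += np.log(p1 if y == 1 else 1.0 - p1)

    return predict, update

# Example "programs": parity of the input bits, constant 0.5, first bit.
programs = [
    lambda x: float(sum(x) % 2),
    lambda x: 0.5,
    lambda x: float(x[0]),
]
predict, update = make_inductor(programs)
for x, y in [([1, 0, 1], 0), ([1, 1, 1], 1), ([0, 1, 0], 1)]:
    print(predict(x))
    update(x, y)
```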
Note that this setup does not make any assumptions about how the inputs are sampled. This is in contrast to the Singular Learning Theory (SLT) setup, which needs to assume that the inputs are sampled IID from some underlying distribution. This assumption makes it difficult to use SLT to say anything about the generalization behavior of a trained network under distributional shifts, such as switching from the training distribution to a very different deployment distribution. We do not have that problem here.
So long as the program $p$ that generated the data has a minimum length less than or equal to $\ell$ on our UTM $U$, i.e., $K_U(p) \leq \ell$, the conventional error bound for Solomonoff induction still applies in this setting. That is, the total expected prediction error of our inductor will be bounded by

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right] \;\leq\; K_U(p) + \sum_{i=1}^{n} H\!\big(p(\cdot \mid x_i)\big) \tag{1.1}$$

for all $n$. In words, the inductor's total expected prediction error over the first $n$ data points it sees, scored in bits of cross-entropy,

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right],$$

is bounded to stay below the Kolmogorov complexity of the data-generating process $p$ on $U$, $K_U(p)$, meaning the minimum description length of the program $p$ on the machine $U$, plus the Shannon entropy of $p$ on each data point, summed:[7]

$$\sum_{i=1}^{n} H\!\big(p(\cdot \mid x_i)\big) = \sum_{i=1}^{n} \mathbb{E}_{y_i \sim p(\cdot \mid x_i)}\!\left[-\log_2 p(y_i \mid x_i)\right].$$
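A quick special case may help for intuition: if $p$ is a deterministic program, the entropy terms vanish (see footnote [7]), and the bound reduces to

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right] \;\leq\; K_U(p) \quad \text{for every } n,$$

so the inductor's cumulative prediction error never exceeds the description length of the data-generating program, no matter how many data points it sees or how those data points are chosen.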
The proof works analogously to the derivation of the bound in the conventional SI setup found in most textbooks. The only differences are that our programs are augmented with the read-in program $r$ that feeds them the inputs $x_i$, that our prior is uniform over the fixed-length program set $P_\ell$ rather than the universal prior over programs of all lengths, and that we only score predictions of the labels $y_i$, not of the inputs.
None of these changes affect the structure of the proof much. We rearrange the expression for the total cross-entropy to make its dependence on the prior manifest, then use the existing result that the implementations of program $p$ must have a combined prior of at least $2^{-K_U(p)}$ in our induction at the start.
Proof:
First, we rearrange our expression for the total cross-entropy:

$$\begin{aligned}\sum_{i=1}^{n} -\log_2 P(y_i \mid x_i, D_{i-1}) &= -\log_2 \prod_{i=1}^{n} P(y_i \mid x_i, D_{i-1})\\ &= -\log_2 \sum_{q \in P_\ell} w_0(q) \prod_{i=1}^{n} q(y_i \mid x_i).\end{aligned}$$

In the second equality, we used the assumption that our inductor adjusts its probabilities according to incoming evidence using Bayesian updating. According to result 3.8.1 in this book, the semimeasure of programs on a monotone, plain UTM matching the outputs of a program $p$ must be at least $2^{-K_U(p)}$. Inserting this bound yields:

$$\sum_{i=1}^{n} -\log_2 P(y_i \mid x_i, D_{i-1}) \;\leq\; K_U(p) - \sum_{i=1}^{n} \log_2 p(y_i \mid x_i).$$

Taking the expectation over the labels $y_i \sim p(\cdot \mid x_i)$ then gives the bound in Equation (1.1).
Now, we're going to modify the setup from the previous section, bounding our induction to make it computable. Instead of running programs $rq$, where $q$ is drawn from the set of all bit strings of length $\ell$, we restrict ourselves to programs of length $\ell$ that require a maximum of $s$ cells of space and a maximum of $t$ time steps. Specifically, we construct a new bounded UTM $U_{s,t}$ from our UTM $U$. $U_{s,t}$ only has $s$ cells of work tape available. If the head moves to the last cell of the work tape, $U_{s,t}$ immediately halts. Further, $U_{s,t}$ always halts itself after $t$ execution steps if it has not halted yet. This bounded induction will no longer be optimal in the sense of having an SI-style bound on its total error that depends only on the entropy and Kolmogorov complexity of the data-generating process $p$, because $p$ might not be computable in time $t$ and space $s$.
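As an illustration of the kind of hard caps this describes, here is a minimal Python sketch of a capped machine. It is a cartoon: a plain transition-table machine rather than the monotone UTM with separate input/output tapes defined above, and the `flipper` example program is made up.

```python
def run_capped(transitions, tape_init, max_space, max_steps):
    """Toy illustration of the space/time caps described above.

    `transitions` maps (state, symbol) -> (new_state, symbol_to_write, head_move).
    """
    tape = list(tape_init) + [0] * (max_space - len(tape_init))
    head, state = 0, "start"
    for _ in range(max_steps):              # time cap: force-halt after max_steps
        if state == "halt" or (state, tape[head]) not in transitions:
            break
        state, write, move = transitions[(state, tape[head])]
        tape[head] = write
        head = max(head + move, 0)
        if head >= max_space - 1:           # space cap: halt on reaching the last cell
            state = "halt"
    return state, tape

# Example: a machine that flips bits until it reads a 0, run with tight caps.
flipper = {("start", 1): ("start", 0, +1), ("start", 0): ("halt", 0, 0)}
print(run_capped(flipper, [1, 1, 1, 0], max_space=8, max_steps=100))
```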
Our bounded induction still has a weaker upper bound on its total expected prediction error: If there is some program $q^*$ that is an "efficient predictor" of $p$, in the sense that $q^*$ approximates the probability assignments of $p$ up to some error we deem 'acceptable', our induction will predict the data as well as this efficient predictor, up to an offset equal to the efficient predictor's description length.
Specifically, for any program $q^*$ included in the set of programs our bounded induction runs over, we have the bound:[8]

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right] \;\leq\; K_{U_{s,t}}(q^*) + \sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 q^*(y_i \mid x_i)\right] \tag{1.2}$$

for all $n$.
Proof
Proceeds completely analogously to the one for Equation (1.1), except that we substitute in $q^*$ instead of $p$. So we won't write it out again.
Because

$$\mathbb{E}_{y_i \sim p(\cdot \mid x_i)}\!\left[-\log_2 q^*(y_i \mid x_i)\right] = H\!\big(p(\cdot \mid x_i)\big) + D_{\mathrm{KL}}\!\big(p(\cdot \mid x_i)\,\|\, q^*(\cdot \mid x_i)\big),$$

the only difference between this bound and the conventional bound for a computationally unbounded induction is the additional term $\sum_{i=1}^{n} D_{\mathrm{KL}}\!\big(p(\cdot \mid x_i)\,\|\, q^*(\cdot \mid x_i)\big)$, the KL divergence between the data-generating process $p$ and the program $q^*$, summed over all input-label pairs. In other words: for any program $q^*$ with minimum description length $K_{U_{s,t}}(q^*)$ on $U_{s,t}$ that is computable in time $t$ and space $s$, our induction will have its total prediction error bounded by the prediction error of $q^*$ plus the minimum description length of $q^*$ on $U_{s,t}$.
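The decomposition used here, expected code length under $q^*$ equals the entropy of $p$ plus $D_{\mathrm{KL}}(p \| q^*)$, is easy to check numerically for a single Bernoulli data point; the snippet below is just that check, with made-up probabilities.

```python
import numpy as np

def H(p):                   # entropy of Bernoulli(p), in bits
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def KL(p, q):               # KL divergence between Bernoulli(p) and Bernoulli(q), in bits
    return p * np.log2(p / q) + (1 - p) * np.log2((1 - p) / (1 - q))

def cross_entropy(p, q):    # E_{y ~ Bernoulli(p)}[-log2 q(y)]
    return -(p * np.log2(q) + (1 - p) * np.log2(1 - q))

p, q = 0.8, 0.6             # true label probability vs. the predictor's guess
print(cross_entropy(p, q), H(p) + KL(p, q))   # the two numbers agree
```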
Why would we suppose that a program that gets low prediction error exists? Well, for some problems, an efficient predictor probably doesn't exist. But empirically, we can do pretty well on many prediction problems we encounter in the real world using an amount of compute that's practically feasible to obtain. Why that is, I don't know. Perhaps it is somehow related to why a lot of the world seems to abstract well.
Given any program $q$ that can run on a plain, monotone, bounded UTM $U^1_{s,t}$ in space $s$ and time $t$, we can run it on some other plain, monotone, bounded UTM $U^2_{s',t'}$ in space $s' = c_s s$ and time $t' = c_t t$.[9] Thus, the Kolmogorov complexity of $q$ on $U^2_{s',t'}$ is bounded by

$$K_{U^2_{s',t'}}(q) \;\leq\; K_{U^1_{s,t}}(q) + K_{U^2}(U^1),$$

provided $U^2$ has sufficient computational resources available, because we can just simulate $U^1_{s,t}$ on $U^2_{s',t'}$ and then run $q$ on the simulated $U^1_{s,t}$.
Therefore, our total prediction error for bounded induction on any other plain monotone UTM $U^2$ bounded to space $c_s s$ and time $c_t t$, where $c_s, c_t$ are the constant prefactors for simulating $U^1$ on $U^2$, will still be bounded by

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right] \;\leq\; K_{U^1_{s,t}}(q^*) + K_{U^2}(U^1) + \sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 q^*(y_i \mid x_i)\right] \tag{1.3}$$

for all $n$. So, our choice of UTM matters somewhat, since we might need more space and execution time on a different UTM to implement an equally efficient predictor. However, converting to a different UTM only increases the required execution time from $t$ to $c_t t$ and the required space from $s$ to $c_s s$. So, provided $U^2$ has sufficient space $c_s s$ and time $c_t t$, the additional prediction error we get is just $K_{U^2}(U^1)$, which is typically small for most pairs of UTMs we might usually consider using.
Now, we will derive a similar style of bound for an inductor using Bayesian learning on neural networks.
In this setting, our inputs are finite-dimensional vectors of real numbers, $x \in \mathbb{R}^{d_{\mathrm{in}}}$. The labels are still Boolean, $y \in \{0, 1\}$. The inductor now makes its predictions using functions $f$, which take vectors $x$ as input and output probabilities for the data labels $y$. The functions are all part of some space of functions $\mathcal{F}$. The functions are parametrized by a parameter-function map $W: \mathbb{R}^{d} \to \mathcal{F}$ with $f_\theta = W(\theta)$. We call $d$ the number of neural network parameters. We refer to the probabilities the function $f_\theta$ outputs for input vector $x$ as $f_\theta(y \mid x)$. As in the previous sections, we assume that the data labels were generated from the inputs by some probabilistic program $p$. Note that we do not make any assumptions about the inputs being drawn IID.
The inductor starts with a uniform prior over some finite hypervolume $V$ in parameter space: $\phi_0(\theta) = \frac{1}{V}$. The inductor makes its predictions as

$$P(y_{n+1} \mid x_{n+1}, D_n) \;=\; \int \mathrm{d}\theta\; \phi_n(\theta)\, f_\theta(y_{n+1} \mid x_{n+1}),$$

where $\phi_n(\theta)$ is the probability density the inductor places on parameter configuration $\theta$ after updating on the $n$ data-label pairs $D_n$.
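For intuition, here is a crude runnable cartoon of this parameter-space inductor: the integral over $\theta$ is approximated with samples drawn from the uniform prior and reweighted by the likelihood of the data seen so far. The two-parameter logistic 'network' and all variable names are mine, not part of the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(theta, x):                      # probability the "network" assigns to y = 1
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

prior_box = 5.0                             # prior: uniform over [-5, 5]^2
thetas = rng.uniform(-prior_box, prior_box, size=(20_000, 2))
log_w = np.zeros(len(thetas))               # log posterior weights, start uniform

def predict(x):
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return float(w @ f_theta(thetas, x))

def update(x, y):
    p1 = np.clip(f_theta(thetas, x), 1e-12, 1 - 1e-12)
    log_w[:] += np.log(p1 if y == 1 else 1 - p1)

for x, y in [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), 0)]:
    print(predict(x))
    update(x, y)
```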
The total expected prediction error of this induction measured in bits of cross-entropy is upper bounded by

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right] \;\leq\; \log_2\!\frac{V}{V_\epsilon(\theta^*)} + n\,\epsilon + \sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 f_{\theta^*}(y_i \mid x_i)\right] \tag{2.1}$$

for all $n$, all parameter configurations $\theta^*$, and all $\epsilon > 0$. Here, $\log_2\frac{V}{V_\epsilon(\theta^*)}$ is the logarithm of the ratio between the volume of the whole prior, $V$, and the volume $V_\epsilon(\theta^*)$ taken up by the set of parameter configurations $\theta$ that are mapped to functions $f_\theta$ which match the log-probabilities of $f_{\theta^*}$ up to tolerance $\epsilon$:

$$V_\epsilon(\theta^*) \;=\; \operatorname{Vol}\!\left\{\theta \;:\; \big|\log_2 f_\theta(y \mid x) - \log_2 f_{\theta^*}(y \mid x)\big| \leq \epsilon \;\;\text{for all } x, y \right\}.$$

In other words, if there is a parameter configuration $\theta^*$ that is mapped to a function $f_{\theta^*}$ that is an efficient predictor of the data, in the sense that $f_{\theta^*}$ approximates the probability assignments of $p$ up to some error, the inductor will predict the data as well as this efficient predictor $f_{\theta^*}$, plus a tolerance $\epsilon$ per data point that we can freely choose, up to an offset equal to the cost in bits of specifying the predictor in the parameter space to precision $\epsilon$. Intuitively, the reason we don't just always choose $\epsilon = 0$ in this bound is that our parameters are real numbers, so describing the location of $f_{\theta^*}$ in parameter space exactly could sometimes take infinitely many bits.
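A brute-force way to get a feel for $\log_2 \frac{V}{V_\epsilon(\theta^*)}$ is to sample parameters from the prior and count how many land in the matching set, with a finite probe set of inputs standing in for 'all inputs'. The sketch below does this for a toy two-parameter model; the model, the probe set, and the sample sizes are all illustrative choices, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

def log2_probs(theta, X):
    """log2 of the probabilities a toy logistic model assigns to y = 0 and y = 1."""
    p1 = np.clip(1.0 / (1.0 + np.exp(-(X @ theta))), 1e-9, 1 - 1e-9)
    return np.stack([np.log2(1 - p1), np.log2(p1)], axis=-1)

def log2_volume_ratio(theta_star, X, eps, prior_box=5.0, n_samples=20_000):
    """Monte-Carlo estimate of log2(V / V_eps(theta_star)) over a probe input set X."""
    ref = log2_probs(theta_star, X)
    thetas = rng.uniform(-prior_box, prior_box, size=(n_samples, 2))
    hits = sum(np.max(np.abs(log2_probs(th, X) - ref)) <= eps for th in thetas)
    frac = max(hits, 1) / n_samples          # fraction of prior volume inside the set
    return -np.log2(frac)

X = rng.normal(size=(32, 2))                 # probe inputs standing in for "all inputs"
print(log2_volume_ratio(np.array([1.0, -0.5]), X, eps=0.5))
```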
The proof works analogously to the proof for the error bound in Equation (1.1), except that sums over programs are replaced by integrals over parameters, and the semimeasure of programs in the prior matching the outputs of the efficient predictor on all possible input bit strings is replaced by the volume $V_\epsilon(\theta^*)$ of the set of parameter configurations matching the log-probabilities output by function $f_{\theta^*}$ to precision $\epsilon$.
Proof
As before, we start by rearranging our expression for the total cross-entropy:

$$\begin{aligned}\sum_{i=1}^{n} -\log_2 P(y_i \mid x_i, D_{i-1}) &= -\log_2 \prod_{i=1}^{n} P(y_i \mid x_i, D_{i-1})\\ &= -\log_2 \int \mathrm{d}\theta\; \phi_0(\theta) \prod_{i=1}^{n} f_\theta(y_i \mid x_i).\end{aligned}$$

In the second equality, we used the assumption that our inductor adjusts its probabilities over parameter configurations in response to incoming evidence using Bayesian updating.
Now, we restrict the integral to points in $V_\epsilon(\theta^*)$, then use the bound we have for points in this set:

$$\begin{aligned}-\log_2 \int \mathrm{d}\theta\; \phi_0(\theta) \prod_{i=1}^{n} f_\theta(y_i \mid x_i) &\leq -\log_2 \int_{V_\epsilon(\theta^*)} \frac{\mathrm{d}\theta}{V} \prod_{i=1}^{n} 2^{-\epsilon} f_{\theta^*}(y_i \mid x_i)\\ &= \log_2\!\frac{V}{V_\epsilon(\theta^*)} + n\,\epsilon - \sum_{i=1}^{n} \log_2 f_{\theta^*}(y_i \mid x_i).\end{aligned}$$

Taking the expectation over the labels then yields Equation (2.1).
$V_\epsilon(\theta^*)$ is a pretty similar quantity to the $\epsilon$-ball that defines the learning coefficient in Singular Learning Theory. There, the prediction error in-distribution includes the logarithm of a term of the form[10]

$$V^{\mathrm{SLT}}_\epsilon(\theta^*) \;=\; \operatorname{Vol}\!\left\{\theta \;:\; \mathbb{E}_{x \sim q(x)}\!\left[D_{\mathrm{KL}}\!\big(f_{\theta^*}(\cdot \mid x)\,\|\, f_\theta(\cdot \mid x)\big)\right] \leq \epsilon \right\},$$

where $q(x)$ is the 'training distribution' over inputs $x$.
Comparing $V_\epsilon(\theta^*)$ to $V^{\mathrm{SLT}}_\epsilon(\theta^*)$: in our set, the log-probabilities of $f_\theta$ have to match those of $f_{\theta^*}$ to within $\epsilon$ on every possible input $x$; in the SLT set, only a KL divergence to $f_{\theta^*}$ averaged over the training distribution $q(x)$ has to stay below $\epsilon$.
These differences are the price of describing prediction error in general instead of 'in-distribution'. In SLT, we assume that our data are drawn IID from some known distribution $q(x)$. In that case, average error over that distribution is enough to bound the inductor's expected average error on future data. Without this assumption of IID sampling, we need to classify functions based on how closely their outputs align with our reference $f_{\theta^*}$ on all possible inputs $x$. Or at least, all inputs we think the inductor might possibly encounter.
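To spell out the difference between the two membership criteria, here are the two $\epsilon$-set tests written side by side as toy Python predicates over arrays of log-probabilities on a finite probe set of inputs (the array layout and all names here are mine, purely for illustration).

```python
import numpy as np

# `logp_ref[j, y]` and `logp_theta[j, y]` hold log2-probabilities that the
# reference predictor and a candidate assign to label y on probe input j;
# `q_x[j]` is a training-distribution weight over the probe inputs.

def in_ood_set(logp_theta, logp_ref, eps):
    """Sup-norm criterion: log-probs must match within eps on EVERY input."""
    return bool(np.all(np.abs(logp_theta - logp_ref) <= eps))

def in_slt_set(logp_theta, logp_ref, q_x, eps):
    """SLT-style criterion: KL to the reference, AVERAGED over q(x), stays below eps."""
    p_ref = 2.0 ** logp_ref
    kl_per_input = np.sum(p_ref * (logp_ref - logp_theta), axis=1)
    return bool(np.sum(q_x * kl_per_input) <= eps)
```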
In SLT, $\epsilon$ is also set to $0$ in the expression for the prediction error instead of being left free as in Equation (2.1), because the theory is primarily concerned with the infinite data limit $n \to \infty$, where describing the solution to exact precision is 'worth it' and yields a tighter bound. To put that slightly more quantitatively: one term for the prediction error in SLT scales as $\lambda \log_2 n$, while $\log_2\frac{V}{V_\epsilon(\theta^*)}$ can be shown to only scale as $\lambda \log_2\frac{1}{\epsilon}$[11], where $\lambda$ is a scalar number called the learning coefficient. So, in the vanilla SLT setup, letting $\epsilon$ go to zero as $\frac{1}{n}$ for $n \to \infty$ yields a tighter bound for the prediction error than setting it to some finite non-zero value.
Section TL;DR: What kind of neural network architectures have a prior equivalent to that of a bounded universal Turing machine? How can we estimate the description length $\log_2\frac{V}{V_\epsilon(\theta^*)}$ in practice?
So, we can sort of think about Bayesian learning on neural networks and compute-bounded Solomonoff induction using a similar set of mathematical tools. But how similar are these induction processes really? That is, how much does a uniform prior over parameter configurations on some neural network architecture resemble a uniform prior over programs on some compute-bounded UTM?
Another way of asking that question is whether we can get something like Equation (1.3) for switching between a bounded UTM and a neural network, and vice versa. So, can we get something like:
Conjecture 1: For any parameter vector $\theta^*$ parametrizing some function $f_{\theta^*}$, the prediction error of an inductor using bounded UTM $U_{s,t}$ is bounded by

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right] \;\leq\; \log_2\!\frac{V}{V_\epsilon(\theta^*)} + K_{U_{s,t}}(W_\epsilon) + n\,\epsilon + \sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 f_{\theta^*}(y_i \mid x_i)\right],$$

where $K_{U_{s,t}}(W_\epsilon)$ is the description length of the neural network architecture $W$ on UTM $U_{s,t}$ to precision $\epsilon$.[12] Provided $U_{s,t}$ has enough time and memory to simulate $W$ to that precision, of course.
Conjecture 2: For any program $q$ on bounded UTM $U_{s,t}$, the prediction error of an inductor using neural network architecture $W$ is bounded by

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right] \;\leq\; K_{U_{s,t}}(q) + \log_2\!\frac{V}{V_\epsilon(U_{s,t})} + n\,\epsilon + \sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 q(y_i \mid x_i)\right],$$

where $\log_2\frac{V}{V_\epsilon(U_{s,t})}$ is the description length in bits of bounded UTM $U_{s,t}$ on neural network architecture $W$ to precision $\epsilon$.
If both of these conjectures were true, doing Bayesian learning using neural networks instead of an ensemble of programs would be almost no more of a change than doing Bayesian learning in one programming language instead of another.[13]
My guess is that something vaguely like Conjecture 1 is true. With sufficient compute, we can use a UTM to simulate any neural network architecture to some arbitrary desired floating point precision. But I think Conjecture 2 is generally false, because not every neural network architecture can simulate a bounded UTM. Remember, our definition of 'neural network' here was basically just 'a space of functions $\mathcal{F}$ parametrized by some parameter-function map $W$'. A family of polynomials would satisfy that definition. And I doubt a uniform prior over polynomials acts much like a uniform prior over compute-bounded programs. But maybe Conjecture 2 still holds for some more restricted set of neural network architectures? Like, say, architectures capable of simulating a bounded UTM. RNNs and transformers run in chain-of-thought mode can do this, for example.[14] I originally thought that this condition would be sufficient. But proving the conjecture for transformers in chain-of-thought mode turned out to be trickier than I thought.[15] Maybe you just need to be a bit cleverer about how to do the UTM simulation than I was so far. Or maybe, being able to simulate a bounded UTM actually isn't a sufficient requirement for Conjecture 2, and the architecture needs to have some additional property. In that case, it'd be interesting to find out what that property is.
For example, does $\log_2\frac{V}{V_\epsilon(\theta^*)}$ behave like its in-distribution counterpart in vanilla SLT in the limit $\epsilon \to 0$? That is, for small enough $\epsilon$ and some reasonable conditions on the NN architecture[16], do we often get something like

$$\log_2\!\frac{V}{V_\epsilon(\theta^*)} \;\approx\; \tilde{\lambda}(\theta^*)\, \log_2\!\frac{1}{\epsilon} + O(1),$$

where $\tilde{\lambda}(\theta^*)$ would then be some number analogous to the learning coefficient $\lambda$ in SLT? Can we use stochastic gradient Langevin dynamics (SGLD) sampling to estimate $\tilde{\lambda}$ on neural networks in real life, the same way people use it to determine the learning coefficient? If a solution can generalize out-of-distribution, it can certainly generalize in-distribution, but the reverse is generally not true. So, presumably the in-distribution $\lambda$ would tend to be smaller than $\tilde{\lambda}$. But how much smaller does it tend to be in practice? In other words, how much does OOD generalization error usually differ from in-distribution generalization error?
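For reference, here is a minimal sketch of the SGLD-based estimator people use for the (in-distribution) local learning coefficient, on a deliberately tiny one-parameter model; whether an analogous localized estimator can be built for $\tilde{\lambda}$ is exactly the open question above. The step size, localization strength, and the inverse temperature $\beta = 1/\log n$ are common but arbitrary choices, and the model is a toy of my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def nll(theta, x, y):
    """Average negative log-likelihood (nats) of a toy model y ~ Bernoulli(sigmoid(theta*x))."""
    p = 1.0 / (1.0 + np.exp(-theta * x))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def grad_nll(theta, x, y):
    p = 1.0 / (1.0 + np.exp(-theta * x))
    return -np.mean((y - p) * x)

def sgld_lambda_hat(theta_star, x, y, n_steps=5000, step=1e-4, beta=None, gamma=100.0):
    """Estimate n*beta*(E_posterior[L] - L(theta_star)), the usual lambda-hat, via SGLD
    localized around theta_star with a quadratic restraining term of strength gamma."""
    n = len(x)
    beta = beta if beta is not None else 1.0 / np.log(n)
    theta, losses = theta_star, []
    for _ in range(n_steps):
        g = n * beta * grad_nll(theta, x, y) + gamma * (theta - theta_star)
        theta = theta - 0.5 * step * g + np.sqrt(step) * rng.normal()
        losses.append(nll(theta, x, y))
    return n * beta * (np.mean(losses) - nll(theta_star, x, y))

# Toy data generated at theta = 2.0, then estimate around that same point:
x = rng.normal(size=1000)
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-2.0 * x))).astype(float)
print(sgld_lambda_hat(2.0, x, y))
```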
Thanks to Aram Ebtekar for providing many useful citations. Thanks to Lee Sharkey for helping me come up with a title for this. Thanks to Cole Wyeth for spotting an unnecessary condition in my original definition of the UTMs. And thanks to Aram Ebtekar as well as Mathias Dellago, Alexander Gietelink Oldenziel, Daniel Murfet, David Quarel, Kaarel Hänni, John Wentworth, Dmitry Vaintrob, Jake Mendel, Linda Linsefors, and Scott Aaronson for various discussions and ideas that were causally upstream of my writing this.
I expect that basically nothing in the first part is actually novel, though I haven't managed to track down citations for every result in it in the exact form given here. If you think you know your AIT and don't need to read this part, you might be right. But I'd advise some caution before skipping it. I wrote it partially because many AIT-knowledgeable people I talked to about this topic turned out not to know these things.[17]
As in a program that outputs probabilities as strings in some kind of alphabet.
We can also let $n$ tend to infinity though; it shouldn't affect the proofs.
So, not prefix-free. Lots of standard AIT results assume a prefix-free UTM. This post does not. AIT-savvy skim readers beware. If you compare the theorems here to results phrased for the setting of prefix-free programs, you might become pretty confused.
See e.g. the start of section three here for more on the definition of monotone Turing machines.
As in the standard setup for Solomonoff induction with probabilistic programs, the programs can keep running and printing more output digits indefinitely, representing the output probabilities with ever higher precision.
So, if $p$ is a deterministic program, this second term will vanish.
See also, for example, theorem 1 of this paper, which shows almost the same thing in an RL setting.
See the proof here for example.
For the special case of a uniform prior, discrete outputs $y$, and our loss function being cross-entropy error measured in bits.
SLT convention usually considers prediction error scored in nats, and thus the expressions use natural logarithms; I'm translating to bits here.
For some to-be-determined definition of 'precision' that makes the conjecture work. Basically, we'd want some measure of the floating‑point precision the UTM uses to approximate the real-number-valued parameters and network inputs.
Still only "almost", because we still have those extra terms. Those don't show up when converting between UTMs.
The results in those links come with some annoying caveats, so if you want to jump on this you might want to use different constructions to do the UTM emulation.
Basically, making unused NN parameters not interfere with the computation no matter what values they're set to is kind of tough. As a consequence, I keep ending up with prefactors in front of the description-length term.
e.g., analyticity.
Not judging. I also thought myself somewhat AIT-knowledgeable, but did not realize some of what it had to say about degeneracy until fairly recently. I spent years being very confused about how learning works because of this.