Distributional Shifts · Singular Learning Theory · Solomonoff induction · AI

From SLT to AIT: NN generalisation out-of-distribution

by Lucius Bushnaq
4th Sep 2025

TL;DR: This post derives an upper bound on the prediction error of Bayesian learning on neural networks. Unlike the bound from vanilla Singular Learning Theory (SLT), this bound also holds for out-of-distribution generalization, not just for in-distribution generalization. Along the way, it shows some connections between SLT and Algorithmic Information Theory (AIT).

Written at Goodfire AI.

Introduction

Singular Learning Theory (SLT) describes Bayesian learning on neural networks. But it currently has some limitations. One of these limitations is that it assumes model training data are drawn independently and identically (IID) from some distribution, making it difficult to use SLT to describe out-of-distribution (OOD) generalization. If we train a model to classify pictures of animals taken outdoors, it's difficult to use vanilla SLT to figure out whether this model will generalize to correctly classify pictures of animals taken indoors. Technically, the main SLT theorems also assume that the model has seen an infinite number of training data points, though in practice they turn out to hold pretty well for finite data.

Another theory about Bayesian learning is Algorithmic Information Theory (AIT), which describes Solomonoff induction (SI) [1,2]. Solomonoff induction is Bayesian learning over an (infinite) ensemble of programs. It is, unfortunately, uncomputable. But it does not require an IID sampling assumption, does not rely on asymptotically infinite data, and does describe OOD generalization. AIT is also older, more developed, and more established than SLT. A lot of agent foundations work builds on the AIT framework.

So if we could derive an AIT-style description of Bayesian learning on neural networks, that description might tell us things about out-of-distribution generalization.

Here, we

  1. Derive error bounds for a computationally bounded Solomonoff induction (SI).
    1. First, we import Solomonoff induction to the learning-theoretic setting of data coming in input-label pairs (x,y), with the inductor trying to predict the labels y conditional on the inputs x, and show that the standard error bound for SI still holds there.
    2. Then, we define a bounded, computable form of induction over programs. We show that it still satisfies a weaker SI-style error bound. We also show that this bounded induction is still somewhat invariant to our choice of universal Turing machine.
  2. Derive a bound on the expected prediction error of Bayesian learning on neural networks.
    1. Specifically, we derive an error bound for Bayesian learning over parametrized functions, starting from a uniform prior over real-valued parameters. In contrast to SLT-style error bounds, we do not need to assume that our data are drawn IID. Hence the bound will also hold under distributional shifts.
    2. We show how this error bound relates to the notion of ϵ-volumes that define the learning coefficient in Singular Learning Theory.

The first part is mainly supposed to build intuition for the second, and help people see how the AIT picture and Bayesian learning on neural networks connect.[1]

I also hope that this post might help readers who are used to thinking about probability from a more E. T. Jaynes-esque/native-Bayesian point of view to grok Singular Learning Theory a little better, since it is usually described in a pretty stat-mech-esque/second-language-Bayesian manner.

Prediction error bounds for a computationally bounded Solomonoff induction

Claim 1: We can import Solomonoff induction into the learning-theoretic setting

We consider a prediction setting with inputs xn∈{0,1}∗ (finite binary strings) and labels yn∈{0,1}. We assume that the labels were generated by a probabilistic program μ [2], i.e., Pμ(yn∣xn) is set by μ(xn). We are given N∈N such input-label pairs (x1,y1),…,(xN,yN).[3]

Our goal is to construct an inductor that predicts whether yn is 1 given xn, using Bayesian updating on past input-label pairs (x1,y1),…,(xn−1,yn−1) that it has already seen. To do this, we will just slightly modify the standard Solomonoff setup for probabilistic programs.

Let M1 be a plain[4] monotone[5] universal Turing machine (UTM) with a binary alphabet, one-way read-only input and output tapes, and a two-way work tape. We define the set of all programs p of length T∈N bits, and the set of augmented programs pn:=(pread,xn,p), where pread is some 'read-in' program that first copies xn from the input tape to the first Lin cells of the work tape, then sets the read head to the first bit of the program string p and finally resets the UTM's internal state to its starting configuration.

Our inductor then makes its predictions using the ensemble of augmented programs: 

$$P_{M_1}(y_n \mid x_n, D_{<n}) = \sum_p P_{M_1}(y_n \mid p_n)\, P(p \mid D_{<n}),$$

where PM1(yn=1∣pn)=M1(pn)=M1(pread,xn,p) is the output bit string of our UTM on input string pn interpreted as the probability that yn=1[6], PM1(yn=0∣pn)=1−PM1(yn=1∣pn), and P(p∣D<n) is the probability we assign to program p after updating on the past prediction performance of pn−1,…,p1 on the input-label pairs (xn−1,yn−1),…,(x1,y1). For our starting prior P(p∣D<1)=P(p), we take the uniform distribution P(p)=2^{−T} over our set of programs p.

Note that this setup does not make any assumptions about how the inputs xn are sampled. This is in contrast to the Singular Learning Theory (SLT) setup, which needs to assume that the inputs xn are sampled IID from some underlying distribution. This assumption makes it difficult to use SLT to say anything about the generalization behavior of a trained network under distributional shifts, such as switching from the training distribution to a very different deployment distribution. We do not have that problem here.
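
To make the moving parts concrete, here is a small Python caricature of this inductor. It is purely illustrative and not part of any proof below: the `run_program` callable standing in for M1(pread,xn,p) is a placeholder assumption, and a real implementation would execute actual UTM program strings rather than whatever toy ensemble you plug in.

```python
import itertools
from typing import Callable, Dict, Tuple

# Placeholder for M1(p_read, x_n, p): given a program bitstring p and an input x_n,
# return the probability that program assigns to y_n = 1. In the real setup this is
# the output of the monotone UTM run on the augmented program (p_read, x_n, p).
RunProgram = Callable[[Tuple[int, ...], Tuple[int, ...]], float]


class MixtureInductor:
    """Bayesian mixture over all length-T program strings with a uniform prior."""

    def __init__(self, T: int, run_program: RunProgram):
        self.run_program = run_program
        programs = list(itertools.product((0, 1), repeat=T))
        # Uniform prior P(p) = 2^{-T} over the 2^T program strings of length T.
        self.posterior: Dict[Tuple[int, ...], float] = {p: 2.0 ** (-T) for p in programs}

    def predict(self, x: Tuple[int, ...]) -> float:
        """P_{M1}(y_n = 1 | x_n, D_<n) = sum_p P_{M1}(y_n = 1 | p_n) P(p | D_<n)."""
        return sum(P_p * self.run_program(p, x) for p, P_p in self.posterior.items())

    def update(self, x: Tuple[int, ...], y: int) -> None:
        """Bayesian update of P(p | D_<n) on one observed input-label pair (x, y)."""
        for p in self.posterior:
            p_y1 = self.run_program(p, x)
            self.posterior[p] *= p_y1 if y == 1 else 1.0 - p_y1
        total = sum(self.posterior.values())
        for p in self.posterior:
            self.posterior[p] /= total
```

The point of the sketch is only to show where the uniform 2^{−T} prior and the Bayesian update enter; running it over all length-T programs is of course exponentially expensive in T.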

Claim

So long as the program that generated the data μ has a minimum length less than or equal to T on our UTM M1, C(μ,M1)≤T, the conventional error bound for Solomonoff induction still applies in this setting, i.e., the total expected prediction error of our inductor will be bounded by

$$\sum_{n=1}^N H\big(P_\mu(\cdot \mid x_n),\, P_{M_1}(\cdot \mid x_n, D_{<n})\big) \;\le\; C(\mu, M_1) + \sum_{n=1}^N H\big(P_\mu(\cdot \mid x_n)\big), \tag{1.1}$$

 for all N∈N>0. In words, the inductor's total expected prediction error over the first N data it sees, scored in bits of cross-entropy 

$$\sum_{n=1}^N H\big(P_\mu(\cdot \mid x_n),\, P_{M_1}(\cdot \mid x_n, D_{<n})\big) := -\sum_{n=1}^N \sum_{y_n} P_\mu(y_n \mid x_n)\, \log_2 P_{M_1}(y_n \mid x_n, D_{<n})$$

 is bounded to stay below the Kolmogorov complexity of the data-generating process μ on M1, C(μ,M1), meaning the minimum description length of the program μ on the machine M1, plus the Shannon entropy H(Pμ(⋅∣xn)) of μ on each data point xn summed:[7]

$$\sum_{n=1}^N H\big(P_\mu(\cdot \mid x_n)\big) := -\sum_{n=1}^N \sum_{y_n} P_\mu(y_n \mid x_n)\, \log_2 P_\mu(y_n \mid x_n).$$

High-level proof summary

The proof works analogously to the derivation of the bound in the conventional SI setup found in most textbooks. The only differences are that

  1. We have not taken program length T to infinity as in the universal prior. We just assume that it is set to some number larger than C(μ,M1).
  2. M1 is a plain UTM rather than a prefix-free UTM, and thus our starting prior over programs is just uniform, rather than falling off exponentially with program length. The degeneracy of program implementations ensures that simpler programs are exponentially preferred; we don't need to insert some term favoring simpler programs in the prior by hand.
  3. Our programs pn=(pread,xn,p) start with a prefix (pread,xn), where xn changes between data points.

None of these changes affect the structure of the proof much. We rearrange the expression for ∑nH(Pμ(⋅∣xn),PM1(⋅∣xn,D<n)) to make its dependence on the prior P(p) manifest, then use the existing result that, at the start of our induction, the implementations of program μ must have a combined prior ≥2^{−C(μ,M1)}.

Proof:

First, we rearrange our expression for the total cross-entropy: 

$$
\begin{aligned}
\sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, P_{M_1}(\cdot\mid x_n, D_{<n})\big)
&= -\sum_{n=1}^N \sum_{y_n} P_\mu(y_n\mid x_n)\,\log_2\!\big(P_{M_1}(y_n\mid x_n, D_{<n})\big)\\
&= -\sum_{y_N,\dots,y_1} P_\mu(y_N\mid x_N)\cdots P_\mu(y_1\mid x_1)\,\log_2\!\Big(\sum_p P_{M_1}(y_N\mid p_N)\,P(p\mid D_{<N})\Big)\\
&\quad -\sum_{y_N,\dots,y_1} P_\mu(y_N\mid x_N)\cdots P_\mu(y_1\mid x_1)\sum_{n=1}^{N-1}\log_2\!\big(P_{M_1}(y_n\mid x_n, D_{<n})\big)\\
&= -\sum_{y_N,\dots,y_1} P_\mu(y_N\mid x_N)\cdots P_\mu(y_1\mid x_1)\,\log_2\!\left(\sum_p \frac{\prod_{n=1}^{N} P_{M_1}(y_n\mid p_n)}{\prod_{n=1}^{N-1} P_{M_1}(y_n\mid x_n, D_{<n})}\,P(p)\right)\\
&\quad -\sum_{y_N,\dots,y_1} P_\mu(y_N\mid x_N)\cdots P_\mu(y_1\mid x_1)\sum_{n=1}^{N-1}\log_2\!\big(P_{M_1}(y_n\mid x_n, D_{<n})\big)\\
&= -\sum_{y_N,\dots,y_1} P_\mu(y_N\mid x_N)\cdots P_\mu(y_1\mid x_1)\,\log_2\!\Big(\sum_p P_{M_1}(y_N\mid p_N)\cdots P_{M_1}(y_1\mid p_1)\,P(p)\Big)
\end{aligned}
$$

In the third equality, we used the assumption that our inductor adjusts its probabilities according to incoming evidence using Bayesian updating, rewriting P(p∣D<N) in terms of the prior P(p). According to result 3.8.1 in this book, the semimeasure for programs on a monotone, plain UTM matching the outputs of a program μ must be ≥2^{−C(μ,M1)}. Inserting this bound yields:

$$
\begin{aligned}
\sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, P_{M_1}(\cdot\mid x_n, D_{<n})\big)
&= -\sum_{y_N,\dots,y_1} P_\mu(y_N\mid x_N)\cdots P_\mu(y_1\mid x_1)\,\log_2\!\Big(\sum_p P_{M_1}(y_N\mid p_N)\cdots P_{M_1}(y_1\mid p_1)\,P(p)\Big)\\
&\le -\sum_{y_N,\dots,y_1} P_\mu(y_N\mid x_N)\cdots P_\mu(y_1\mid x_1)\,\log_2\!\Big(2^{-C(\mu,M_1)}\,P_\mu(y_N\mid x_N)\cdots P_\mu(y_1\mid x_1)\Big)\\
&= C(\mu, M_1) + \sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n)\big).
\end{aligned}
$$

Claim 2: A bounded induction still efficiently predicts efficiently predictable data

Setup

Now, we're going to modify the setup from the previous section, bounding our induction to make it computable. Instead of running programs (pread,xn,p), where p is drawn from the set of all bit strings of length T, we restrict ourselves to programs of length T that require a maximum of s space and a maximum of t time steps. Specifically, we construct a new bounded UTM M2 from our UTM M1. M2 only has s+1 cells of work tape available. If the head moves to the last cell of the work tape, M2 immediately halts. Further, M2 always halts itself after t execution steps if it has not halted yet. This bounded induction will no longer be optimal in the sense of having an SI-style bound on its total error that depends only on the entropy and Kolmogorov complexity of the data-generating process μ, because μ might not be computable in time t and space s.
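
To illustrate what the resource bounds on M2 amount to operationally, here is a hypothetical Python wrapper that cuts off a step-by-step machine simulation once it exceeds t steps or strays beyond an s-cell work region. The `step` interface is an assumption made for this sketch, not the actual UTM construction used above.

```python
from typing import Callable, Optional, Tuple

# One transition of the simulated machine: (state, work-head position) ->
# (new state, new work-head position, halted?). A stand-in interface for M1's
# transition function, assumed here purely for illustration.
StepFn = Callable[[object, int], Tuple[object, int, bool]]


def run_bounded(step: StepFn, init_state: object, s: int, t: int) -> Optional[object]:
    """Emulate M2's resource limits: halt after at most t steps, and abort if the
    work head ever leaves the first s cells of the work tape. Returns the final
    state on a normal halt, or None if a resource bound was hit first."""
    state, head = init_state, 0
    for _ in range(t):                 # time bound: M2 force-halts after t steps
        if not 0 <= head < s:          # space bound: head left the s-cell region
            return None
        state, head, halted = step(state, head)
        if halted:
            return state
    return None                        # ran out of time without halting
```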

Claim

Our bounded induction still has a weaker upper bound on its total expected prediction error: If there is some program μ∗ that is an "efficient predictor" of μ, in the sense that μ∗ approximates the probability assignments of μ up to some error we deem 'acceptable', our induction will predict the data as well as this efficient predictor, up to an offset equal to the efficient predictor's description length.

Specifically, for any program μ∗ included in the set of programs our bounded induction runs over, we have the bound:[8]

$$\sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, P_{M_2}(\cdot\mid x_n, D_{<n})\big) \;\le\; C(\mu^*, M_2) + \sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, P_{\mu^*}(\cdot\mid x_n)\big), \tag{1.2}$$

 for all N∈N>0.

Proof

Proceeds completely analogously to the one for Equation (1.1), except that we substitute in Pμ∗(yn∣xn) instead of Pμ(yn∣xn). So we won't write it out again.

 

Because 

$$\sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, P_{\mu^*}(\cdot\mid x_n)\big) = \sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n)\big) + \sum_{n=1}^N D_{\mathrm{KL}}\big(P_\mu(\cdot\mid x_n),\, P_{\mu^*}(\cdot\mid x_n)\big),$$

the only difference between this bound and the conventional bound for a computationally unbounded induction is the additional term ∑_{n=1}^N DKL(Pμ(⋅∣xn),Pμ∗(⋅∣xn)), the KL divergence between the data-generating process μ and the program μ∗, summed over all input-label pairs. In other words: For any program μ∗ with minimum description length ≤T on M2 that is computable in time t and space s, our induction will have its total prediction error bounded by the prediction error of μ∗ plus the minimum description length of μ∗ on M2.
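
As a quick numerical sanity check of that decomposition for Bernoulli label distributions (just the standard cross-entropy = entropy + KL identity, in bits):

```python
import math

def cross_entropy_bits(p: float, q: float) -> float:
    """H(P, Q) in bits for Bernoulli(p) scored against Bernoulli(q)."""
    return -(p * math.log2(q) + (1 - p) * math.log2(1 - q))

def entropy_bits(p: float) -> float:
    """H(P) in bits for Bernoulli(p)."""
    return cross_entropy_bits(p, p)

def kl_bits(p: float, q: float) -> float:
    """D_KL(P, Q) in bits between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

p_mu, p_mu_star = 0.8, 0.7   # example values for P_mu(y=1|x) and P_mu*(y=1|x)
assert abs(cross_entropy_bits(p_mu, p_mu_star)
           - (entropy_bits(p_mu) + kl_bits(p_mu, p_mu_star))) < 1e-12
```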

Why would we suppose that a program μ∗ that gets low prediction error exists? Well, for some problems, an efficient predictor probably doesn't exist. But empirically, we can do pretty well on many prediction problems we encounter in the real world using an amount of compute that's practically feasible to obtain. Why that is, I don't know. Perhaps it is somehow related to why a lot of the world seems to abstract well.

Claim 3: The bounded induction is still somewhat invariant under our choice of UTM

Given any program μ∗ that can run on a plain, monotone, bounded UTM M2 in space s and time t, we can run it on some other plain, monotone, bounded UTM M3 in space O(s) and time O(t·log2(t)).[9] Thus, the Kolmogorov complexity C(μ∗,M3) of μ∗ on M3 is bounded by

$$C(\mu^*, M_3) \le C(\mu^*, M_2) + C(M_2, M_3)$$

 provided M3 has sufficient computational resources available, because we can just simulate M2 on M3 and then run μ∗ on the simulated M2.

Therefore, our total prediction error for bounded induction on any other plain monotone UTM M3 bounded to s′ ≥ c_s·s space and t′ ≥ c_t·t·log2(t) time, where c_s, c_t ≥ 1 are the constant prefactors for simulating M2 on M3, will still be bounded by

$$\sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, P_{M_3}(\cdot\mid x_n, D_{<n})\big) \;\le\; C(\mu^*, M_2) + C(M_2, M_3) + \sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, P_{\mu^*}(\cdot\mid x_n)\big), \tag{1.3}$$

 for all N∈N>0. So, our choice of UTM matters somewhat, since we might need more space and execution time on a different UTM to implement an equally efficient predictor. However, converting to a different UTM only increases the required execution time from t to O(tlog2(t)) and the required space from s to O(s). So, provided M3 has sufficient space s′ and time t′, the additional prediction error we get is just C(M2,M3), which is typically small for most M2,M3 we might usually consider using.

Prediction error bound for Bayesian learning on neural networks

Claim 4: We can obtain a similar generalization bound for Bayesian learning on neural networks 

Now, we will derive a similar style of bound for an inductor using Bayesian learning on neural networks.

Setup

In this setting, our inputs x are finite-dimensional vectors of real numbers x∈X⊆R^{d_in}. The labels are still Boolean, y∈{0,1}. The inductor now makes its predictions using functions f(w), which take vectors xn∈X as input and output probabilities for the data labels yn. The functions are all part of some space of functions f∈F. The functions are parametrized by a parameter-function map M:R^D→F with D∈N. We call D the number of neural network parameters. We refer to the probabilities the function f(w) outputs for input vector xn as f(yn∣xn,w). As in the previous sections, we assume that the data labels yn were generated from the inputs xn by some probabilistic program μ. Note that we do not make any assumptions about the inputs xn being drawn IID.

The inductor starts with a uniform prior over some finite hypervolume W in parameter space: W=(−L/2,L/2)^D⊂R^D. The inductor makes its predictions as

$$P_f(y_n \mid x_n, D_{<n}) = \int \mathrm{d}w\, f(y_n \mid x_n, w)\, P(w \mid D_{<n}),$$

where P(w∣D<n) is the probability density the inductor places on parameter configuration w after updating on the data-label pairs (x1,y1),…,(xn−1,yn−1).
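
As a cartoon of what this inductor does, here is a Monte Carlo sketch in Python: the integral over W is approximated by uniform prior samples, reweighted by their likelihood on the data seen so far. The logistic toy model standing in for f(y∣x,w), and all the constants, are assumptions made only to keep the snippet self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_prob(y: int, x: np.ndarray, w: np.ndarray) -> float:
    """Toy parametrized predictor: f(y=1 | x, w) = sigmoid(w . x). A hypothetical
    stand-in for the parameter-function map, chosen only for illustration."""
    p1 = 1.0 / (1.0 + np.exp(-float(w @ x)))
    return p1 if y == 1 else 1.0 - p1

D, L, n_samples = 3, 10.0, 20_000
w_samples = rng.uniform(-L / 2, L / 2, size=(n_samples, D))  # uniform prior on W
log_weights = np.zeros(n_samples)  # log-likelihood of D_<n under each sample

def predict(x: np.ndarray) -> float:
    """P_f(y=1 | x, D_<n), approximated by a posterior-weighted average over samples."""
    w_post = np.exp(log_weights - log_weights.max())
    w_post /= w_post.sum()
    return float(np.sum(w_post * np.array([f_prob(1, x, w) for w in w_samples])))

def update(x: np.ndarray, y: int) -> None:
    """Bayesian update: add log f(y | x, w) to each sample's log-weight."""
    for i, w in enumerate(w_samples):
        log_weights[i] += np.log(f_prob(y, x, w) + 1e-300)

# Example usage: observe two pairs, then predict the label of a new input.
update(np.array([1.0, 0.0, 0.0]), 1)
update(np.array([0.0, 1.0, 0.0]), 0)
print(predict(np.array([1.0, 1.0, 0.0])))
```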

Claim

The total expected prediction error of this induction measured in bits of cross-entropy is upper bounded by 

$$\sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, P_f(\cdot\mid x_n, D_{<n})\big) \;\le\; C(w^*, \epsilon, f) + N\epsilon + \sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, f(\cdot\mid x_n, w^*)\big), \tag{2.1}$$

for all N∈N>0, all parameter configurations w∗∈W, and all ϵ∈R≥0. Here, C(w∗,ϵ,f) is the logarithm of the ratio between the volume of the whole prior V_W:=∫_W dw=L^D, and the volume taken up by the set Wϵ,w∗ of parameter configurations that are mapped to functions which match the log-probabilities of f(yn∣xn,w∗) up to tolerance ϵ:

$$C(w^*, \epsilon, f) := -\log_2\!\left(\frac{\int_{W_{\epsilon,w^*}} \mathrm{d}w}{V_W}\right), \qquad W_{\epsilon,w^*} := \Big\{\, w \in W \;\Big|\; \big|\log_2 f(y \mid x, w) - \log_2 f(y \mid x, w^*)\big| \le \epsilon \;\;\forall\, (x,y) \in X \times \{0,1\} \,\Big\}$$

In other words, if there is a parameter configuration w∗∈(−L/2,L/2)^D that is mapped to a function f(w∗) that is an efficient predictor of the data, in the sense that f(w∗) approximates the probability assignments of μ up to some error, the inductor will predict the data as well as this efficient predictor f(w∗), plus a tolerance ϵ∈R≥0 we can freely choose, up to an offset equal to the cost in bits of specifying the predictor f(w∗) in the parameter space to ϵ precision. Intuitively, the reason we don't just always choose ϵ=0 in this bound is that our parameters are real numbers, so describing the location of f(w∗) in parameter space exactly could sometimes take infinitely many bits.
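
One crude way to get a feel for C(w∗,ϵ,f) is rejection sampling: draw parameters uniformly from W and count the fraction that stay within ϵ of f(w∗)'s log-probabilities. The sketch below leans on two simplifying assumptions: the supremum over all of X×{0,1} is replaced by a maximum over a small finite set of probe inputs, and the same toy logistic model as above stands in for the parametrized function f(w).

```python
import numpy as np

rng = np.random.default_rng(1)

def log2_f(y: int, x: np.ndarray, w: np.ndarray) -> float:
    """log2 f(y | x, w) for the toy logistic model (hypothetical stand-in)."""
    p1 = 1.0 / (1.0 + np.exp(-float(w @ x)))
    return float(np.log2(p1 if y == 1 else 1.0 - p1))

def estimate_C(w_star: np.ndarray, eps: float, L: float,
               probe_xs: np.ndarray, n_samples: int = 20_000) -> float:
    """Estimate C(w*, eps, f) = -log2( vol(W_{eps,w*}) / V_W ) by rejection sampling,
    with the sup over (x, y) replaced by a max over probe_xs and y in {0, 1}."""
    D = w_star.shape[0]
    w_samples = rng.uniform(-L / 2, L / 2, size=(n_samples, D))
    hits = 0
    for w in w_samples:
        max_dev = max(abs(log2_f(y, x, w) - log2_f(y, x, w_star))
                      for x in probe_xs for y in (0, 1))
        hits += max_dev <= eps
    if hits == 0:
        return float("inf")  # volume too small to resolve with this many samples
    return -np.log2(hits / n_samples)

w_star = np.array([0.5, -0.2, 0.1])
probe_xs = rng.normal(size=(16, 3))  # stand-in for the inputs we care about
print(estimate_C(w_star, eps=1.0, L=10.0, probe_xs=probe_xs))
```

Shrinking probe_xs to just the inputs we think the inductor might actually encounter enlarges the surviving volume and shrinks the estimate, which is the same move as restricting X in the definition of Wϵ,w∗.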

High-level proof summary

The proof works analogously to the proof for the error bound in Equation (1.1), except that sums over programs are replaced by integrals over parameters, and the semimeasure of programs in the prior matching the outputs of efficient predictor μ∗ on all possible input bit strings x∈{0,1}∗ is replaced by the volume of Wϵ,w∗, the set of parameter configurations matching the log-probabilities output by function f(w∗) to ϵ precision.

Proof

As before, we start by rearranging our expression for the total cross-entropy: 

$$
\begin{aligned}
\sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, P_f(\cdot\mid x_n, D_{<n})\big)
&= -\sum_{y_N,\dots,y_1} P_\mu(y_N\mid x_N)\cdots P_\mu(y_1\mid x_1)\sum_{n=1}^N \log_2\!\Big(\int_W \mathrm{d}w\, f(y_n\mid x_n, w)\, P(w\mid D_{<n})\Big)\\
&= -\sum_{y_N,\dots,y_1} P_\mu(y_N\mid x_N)\cdots P_\mu(y_1\mid x_1)\,\log_2\!\Big(\frac{1}{V_W}\int_W \mathrm{d}w \prod_{n=1}^N f(y_n\mid x_n, w)\Big)
\end{aligned}
$$

In the second equality, we used the assumption that our inductor adjusts its probabilities over parameter configurations in response to incoming evidence using Bayesian updating.

Now, we restrict the integral ∫_W dw to points in Wϵ,w∗, then use the bound |log2f(y∣x,w)−log2f(y∣x,w∗)|≤ϵ we have for points in this set:

$$
\begin{aligned}
\sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, P_f(\cdot\mid x_n, D_{<n})\big)
&= -\sum_{y_N,\dots,y_1} P_\mu(y_N\mid x_N)\cdots P_\mu(y_1\mid x_1)\,\log_2\!\Big(\frac{1}{V_W}\int_W \mathrm{d}w \prod_{n=1}^N f(y_n\mid x_n, w)\Big)\\
&\le -\sum_{y_N,\dots,y_1} P_\mu(y_N\mid x_N)\cdots P_\mu(y_1\mid x_1)\,\log_2\!\Big(\frac{1}{V_W}\int_{W_{\epsilon,w^*}} \mathrm{d}w \prod_{n=1}^N f(y_n\mid x_n, w)\Big)\\
&\le -\sum_{y_N,\dots,y_1} P_\mu(y_N\mid x_N)\cdots P_\mu(y_1\mid x_1)\,\log_2\!\Big(\frac{1}{V_W}\int_{W_{\epsilon,w^*}} \mathrm{d}w\; 2^{\sum_{n=1}^N \log_2(f(y_n\mid x_n, w^*)) - N\epsilon}\Big)\\
&= -\sum_{y_N,\dots,y_1} P_\mu(y_N\mid x_N)\cdots P_\mu(y_1\mid x_1)\,\log_2\!\left(\frac{\int_{W_{\epsilon,w^*}} \mathrm{d}w}{V_W}\; 2^{\sum_{n=1}^N \log_2(f(y_n\mid x_n, w^*))}\, 2^{-N\epsilon}\right)\\
&= -\log_2\!\left(\frac{\int_{W_{\epsilon,w^*}} \mathrm{d}w}{V_W}\right) + \sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, f(\cdot\mid x_n, w^*)\big) + N\epsilon\\
&= C(w^*, \epsilon, f) + \sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, f(\cdot\mid x_n, w^*)\big) + N\epsilon
\end{aligned}
$$

 

 

Comments

  1. The bound holds for any w∗∈W and any ϵ∈R≥0, though the tightest bound will of course come from the w∗ and ϵ that balance the tradeoff between predictive accuracy and description length best.
  2. The larger N is, the more predictive accuracy matters relative to description length, and the smaller we might want to set ϵ if we want to obtain the tightest bound.
  3. If we have a more specific question about generalization error, we can get a tighter bound by restricting Wϵ,w∗ to range over a more limited set of possible input/output pairs. For example, if we're confident an image classifier will never receive input images above a certain brightness, we can exclude all images above that brightness level from the set X in the definition of Wϵ,w∗, which might make C(w∗,ϵ,f) smaller and the upper bound tighter.
  4. This derivation assumes a uniform prior over parameters for simplicity, but it goes through analogously for other priors such as Gaussians.

Relating the volume to SLT quantities

C(w∗,ϵ,f) is a pretty similar quantity to the ϵ-ball that defines the learning coefficient in Singular Learning Theory. There, the prediction error in-distribution includes the logarithm of a term of the form[10] 

$$V(\epsilon) := \frac{\int_{W_\epsilon} \mathrm{d}w}{V_W}, \qquad W_\epsilon := \{\, w \in W \mid L(w) < \epsilon \,\}, \qquad L(w) := \int \mathrm{d}x\, q(x) \sum_y D_{\mathrm{KL}}\big(P_\mu(y \mid x),\, f(y \mid x, w)\big)$$

 where q(x) is the 'training distribution' over inputs x.

Comparing −log2V(ϵ) to C(w∗,ϵ,f):

  1. The uniform distribution over all inputs x over the whole domain X plays the same role for the definition of C(w∗,ϵ,f) that the ‘training distribution’ does for V(ϵ).
  2. In C(w∗,ϵ,f), the outputs f(y∣x,w∗) of the reference function f(w∗) play the role that the 'data labels' Pμ(y∣x) do in V(ϵ).
  3. C(w∗,ϵ,f) cares about the supremum of divergences over the distribution, V(ϵ) cares about the average. V(ϵ) is defined as the integral over the set of parameter configurations that diverge from the data labels by less than ϵ on average over the training data, as scored by KL-divergence, whereas C(w∗,ϵ,f) is defined as the volume of the set of points that diverge from f(w∗) by epsilon or less on all data points, as scored by log‑probability difference.

These differences are the price of describing prediction error in general instead of 'in-distribution'. In SLT, we assume that our data xn are drawn IID from some known distribution q(x). In that case, average error over that distribution is enough to bound the inductor's expected average error on future data. Without this assumption of IID sampling, we need to classify functions f(w) based on how closely their outputs align with our reference f(w∗) on all possible inputs x. Or at least, all inputs x we think the inductor might possibly encounter.

In SLT, ϵ is also set to ϵ∝1/N in the expression for the prediction error instead of being left free as in Equation (2.1), because the theory is primarily concerned with the infinite data limit N→∞, where describing the solution to exact precision is 'worth it' and yields a tighter bound. To put that slightly more quantitatively: One term for the prediction error in SLT scales as Nϵ, while −log2V(ϵ) can be shown to only scale as −λlog2ϵ+o(log2ϵ)[11], where λ is a scalar number called the learning coefficient. So, in the vanilla SLT setup, letting ϵ go to zero for N→∞ as ϵ∝1/N yields a tighter bound for the prediction error than setting it to some finite non-zero value.
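
As a rough illustration of that tradeoff (keeping only the ϵ-dependent terms Nϵ−λlog2ϵ and ignoring the o(log2ϵ) corrections), minimizing over ϵ gives

$$\frac{\mathrm{d}}{\mathrm{d}\epsilon}\Big(N\epsilon - \lambda \log_2 \epsilon\Big) = N - \frac{\lambda}{\epsilon \ln 2} = 0 \quad\Longrightarrow\quad \epsilon^* = \frac{\lambda}{N \ln 2} \propto \frac{1}{N},$$

at which point these terms contribute λlog2N+O(λ) bits in total, rather than growing linearly with N as they would for any fixed non-zero ϵ.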

Open problems and questions

Section TL;DR: What kind of neural network architectures have a prior equivalent to that of a bounded universal Turing machine? How can we estimate the description length C(w∗,ϵ,f) in practice?

How do the priors actually relate to each other?

So, we can sort of think about Bayesian learning on neural networks and compute-bounded Solomonoff induction using a similar set of mathematical tools. But how similar are these induction processes really? That is, how much does a uniform prior over parameter configurations on some neural network architecture resemble a uniform prior over programs on some compute-bounded UTM?

Another way of asking that question is whether we can get something like Equation (1.3) for switching between a bounded UTM and a neural network, and vice versa. So, can we get something like:

Conjecture 1 

For any parameter vector w∗ parametrizing some function f(w∗)∈F, the prediction error of an inductor using bounded UTM M2 is bounded by 

$$\sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, P_{M_2}(\cdot\mid x_n, D_{<n})\big) \;\le\; C(w^*, \epsilon, f) + C(f, M_2) + N(\epsilon + \tilde\epsilon) + \sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, f(\cdot\mid x_n, w^*)\big),$$

where C(f,M2) is the description length of the neural network architecture on UTM M2 to precision ~ϵ.[12] Provided M2 has enough time and memory to simulate f(w∗) on M, of course.

Conjecture 2 (Likely false for arbitrary NN architectures) 

For any program μ∗ on bounded UTM M2, the prediction error of an inductor using neural network architecture M is bounded by 

$$\sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, P_f(\cdot\mid x_n, D_{<n})\big) \;\le\; C(\mu^*, M_2) + C(M_2, \epsilon, f) + N\epsilon + \sum_{n=1}^N H\big(P_\mu(\cdot\mid x_n),\, P_{\mu^*}(\cdot\mid x_n)\big),$$

where C(M2,ϵ,f) is the description length in bits of bounded UTM M2 on neural network architecture f to precision ϵ.

 

If both of these conjectures were true, doing Bayesian learning using neural networks instead of an ensemble of programs would be almost no more of a change than doing Bayesian learning in one programming language instead of another.[13]

My guess is that something vaguely like Conjecture 1 is true. With sufficient compute, we can use a UTM to simulate any neural network architecture to some arbitrary desired floating point precision. But I think Conjecture 2 is generally false, because not every neural network architecture can simulate a bounded UTM. Remember, our definition of 'neural network' here was basically just 'a space of functions F parametrized by some parameter-function map M:R^D→F'. A family of polynomials would satisfy that definition. And I doubt a uniform prior over polynomials acts much like a uniform prior over compute-bounded programs. But maybe Conjecture 2 still holds for some more restricted set of neural network architectures? Like, say, architectures capable of simulating a bounded UTM. RNNs and transformers run in chain-of-thought mode can do this, for example.[14] I originally thought that this condition would be sufficient. But proving the conjecture for transformers in chain-of-thought mode turned out to be trickier than I thought.[15] Maybe you just need to be a bit cleverer about how to do the UTM simulation than I was so far. Or maybe, being able to simulate a bounded UTM actually isn't a sufficient requirement for Conjecture 2, and the architecture needs to have some additional property. In that case, it'd be interesting to find out what that property is.

What does C(w∗,ϵ,f) look like in practice?

For example, does C(w∗,ϵ,f) behave like V(ϵ) in vanilla SLT in the limit ϵ→0? That is, for small enough ϵ and some reasonable conditions on the NN architecture[16] do we often get something like 

$$C(w^*, \epsilon, f) \propto -\lambda' \log_2 \epsilon,$$

where λ′ would then be some number analogous to the learning coefficient λ in SLT? Can we use stochastic gradient Langevin dynamics (SGLD) sampling to estimate C(w∗,ϵ,f) on neural networks in real life, the same way people use it to determine the learning coefficient? If a solution can generalize out-of-distribution, it can certainly generalize in-distribution, but the reverse is generally not true. So, presumably the volume of Wϵ,w∗ would tend to be smaller than that of Wϵ, making C(w∗,ϵ,f) larger than −log2V(ϵ). But how much larger does it tend to be in practice? In other words, how much does OOD generalization error usually differ from in-distribution generalization error?
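
For readers who haven't seen it, this is roughly the shape of the SGLD-based estimator people use for the learning coefficient, sketched in plain NumPy. Everything here is a toy stand-in: the quadratic loss, the hyperparameters, and the localizing term are illustrative choices, and whether anything like this estimator can be adapted to C(w∗,ϵ,f) is precisely the open question.

```python
import numpy as np

rng = np.random.default_rng(2)

# Deliberately degenerate toy loss standing in for the average negative
# log-likelihood L_N(w) on the training data: one stiff, one soft, one flat direction.
scales = np.array([1.0, 0.1, 0.0])

def loss(w: np.ndarray) -> float:
    return float(np.sum(scales * w ** 2))

def grad_loss(w: np.ndarray) -> np.ndarray:
    return 2.0 * scales * w

def sgld_chain(w_star: np.ndarray, n_data: int, beta: float, gamma: float,
               step: float, n_steps: int) -> np.ndarray:
    """SGLD localized around w*: the potential is n*beta*L(w) + gamma*||w - w*||^2,
    and each update adds Gaussian noise of variance `step`. Returns the loss trace
    (ignoring burn-in for this sketch)."""
    w = w_star.copy()
    losses = np.empty(n_steps)
    for i in range(n_steps):
        drift = n_data * beta * grad_loss(w) + 2.0 * gamma * (w - w_star)
        w = w - 0.5 * step * drift + np.sqrt(step) * rng.normal(size=w.shape)
        losses[i] = loss(w)
    return losses

n_data, beta = 1000, 1.0 / np.log(1000)
w_star = np.zeros(3)
trace = sgld_chain(w_star, n_data, beta, gamma=1.0, step=1e-4, n_steps=5000)
# The usual estimator shape: lambda_hat = n * beta * (E[L(w)] - L(w*)).
print(n_data * beta * (trace.mean() - loss(w_star)))
```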

Acknowledgments

Thanks to Aram Ebtekar for providing many useful citations. Thanks to Lee Sharkey for helping me come up with a title for this. Thanks to Cole Wyeth for spotting an unnecessary condition in my original definition of the UTMs. And thanks to Aram Ebtekar as well as Mathias Dellago, Alexander Gietelink Oldenziel, Daniel Murfet, David Quarel, Kaarel Hänni, John Wentworth, Dmitry Vaintrob, Jake Mendel, Linda Linsefors, and Scott Aaronson for various discussions and ideas that were causally upstream of my writing this.

  1. ^

    I expect that basically nothing in the first part is actually novel, though I haven't managed to track down citations for every result in it in the exact form given here. If you think you know your AIT and don't need to read this part, you might be right. But I'd advise some caution before skipping it. I wrote it partially because many AIT-knowledgeable people I talked to about this topic turned out not to know these things.[17]

  2. ^

    As in a program that outputs probabilities as strings in some kind of alphabet.

  3. ^

    We can also let N tend to infinity though; it shouldn't affect the proofs.

  4. ^

    So, not prefix-free. Lots of standard AIT results assume a prefix-free UTM. This post does not. AIT-savvy skim readers beware. If you compare the theorems here to results phrased for the setting of prefix-free programs, you might become pretty confused.

  5. ^

    See e.g. the start of section three here for more on the definition of monotone Turing machines.

  6. ^

    As in the standard setup for Solomonoff induction with probabilistic programs, M1(pn) can keep running and printing more output digits indefinitely, representing the output probabilities with ever higher precision.

  7. ^

    So, if μ is a deterministic program, this second term will vanish.

  8. ^

    See also, for example, theorem 1 of this paper, which shows almost the same thing in an RL setting.

  9. ^

    See the proof here for example.

  10. ^

    For the special case of uniform prior, discrete outputs y, and our loss function being cross-entropy error measured in bits.

  11. ^

    SLT convention usually considers prediction error scored in nats, and thus the expressions there use natural logarithms; I'm translating to bits here.

  12. ^

    For some to-be-determined definition of 'precision' that makes the conjecture work. Basically, we'd want some measure of the floating‑point precision the UTM uses to approximate the real-number-valued parameters and network inputs.

  13. ^

    Still only "almost", because we still have those extra ϵ terms. Those don't show up when converting between UTMs.

  14. ^

    The results in those links come with some annoying caveats, so if you want to jump on this you might want to use different constructions to do the UTM emulation.

  15. ^

    Basically, making unused NN parameters not interfere with the computation no matter what values they're set to is kind of tough. As a consequence, I keep ending up with prefactors >1.0 in front of the C(μ∗,M2) term.

  16. ^

    e.g., analyticity.

  17. ^

    Not judging. I also thought myself somewhat AIT-knowledgeable, but did not realize some of what it had to say about degeneracy until fairly recently. I spent years being very confused about how learning works because of this.