TL;DR: This post derives an upper bound on the prediction error of Bayesian learning on neural networks. Unlike the bound from vanilla Singular Learning Theory (SLT), this bound also holds for out-of-distribution generalization, not just for in-distribution generalization. Along the way, it shows some connections between SLT and Algorithmic Information Theory (AIT).
Written at Goodfire AI.
Singular Learning Theory (SLT) describes Bayesian learning on neural networks. But it currently has some limitations. One of these limitations is that it assumes model training data are drawn independently and identically (IID) from some distribution, making it difficult to use SLT to describe out-of-distribution (OOD) generalization. If we train a model to classify pictures of animals taken outdoors, it's difficult to use vanilla SLT to figure out whether this model will generalize to correctly classify pictures of animals taken indoors. Technically, the main SLT theorems also assume that the model has seen an infinite number of training data points, though in practice they turn out to hold pretty well for finite data.
Another theory about Bayesian learning is Algorithmic Information Theory (AIT), which describes Solomonoff induction (SI) [1,2]. Solomonoff induction is Bayesian learning over an (infinite) ensemble of programs. It is, unfortunately, uncomputable. But it does not require an IID sampling assumption, does not rely on asymptotically infinite data, and does describe OOD generalization. AIT is also older, more developed, and more established than SLT. A lot of agent foundations work builds on the AIT framework.
So if we could derive an AIT-style description of Bayesian learning on neural networks, that description might tell us things about out-of-distribution generalization.
Here, we first derive an error bound for a computationally bounded variant of Solomonoff induction, and then derive a bound of the same style for Bayesian learning on neural networks.
The first part is mainly supposed to build intuition for the second, and help people see how the AIT picture and Bayesian learning on neural networks connect.[1]
I also hope that this post might help readers who are used to thinking about probability from a more E. T. Jaynes-esque, native-Bayesian point of view grok Singular Learning Theory, which is usually described in a pretty stat-mech-esque, second-language-Bayesian manner.
We consider a prediction setting with inputs $x$ (finite binary strings) and labels $y \in \{0,1\}$. We assume that the labels were generated by a probabilistic program $p$[2], i.e., $y_i$ is set by $p(y_i \mid x_i)$. We are given $n$ such input-label pairs $D_n = (x_1, y_1), \dots, (x_n, y_n)$.[3]
Our goal is to construct an inductor that predicts whether $y = 1$ given $x$, using Bayesian updating on past input-label pairs that it has already seen. To do this, we will just slightly modify the standard Solomonoff setup for probabilistic programs.
Let $U$ be a plain[4] monotone[5] universal Turing machine (UTM) with a binary alphabet, one-way read-only input and output tapes, and a two-way work tape. We define $P_\ell$, the set of all programs of length $\ell$ bits, and the set of augmented programs $\{r q : q \in P_\ell\}$, where $r$ is some 'read-in' program that first copies the input $x$ from the input tape to the first cells of the work tape, then sets the read head to the first bit of the program string $q$ and finally resets the UTM's internal state to its starting configuration.
Our inductor then makes its predictions using the ensemble of augmented programs:

$$P(y_{n+1} \mid x_{n+1}, D_n) \;=\; \sum_{q \in P_\ell} w_n(q)\, q(y_{n+1} \mid x_{n+1}),$$

where $q(y = 1 \mid x)$ is the output bit string of our UTM running the augmented program $rq$ with $x$ on the input tape, interpreted as the probability that $y = 1$[6], and $w_n(q)$ is the probability we assign to program $q$ after updating on the past prediction performance of $q$ on the input-label pairs $D_n$. For our starting prior, we take the uniform distribution over our set of programs, $w_0(q) = 2^{-\ell}$.
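To make this concrete, here is a tiny runnable sketch (in Python) of this kind of inductor. It keeps the Bayesian-updating logic above, but swaps the set of UTM programs for a handful of hand-written predictor functions, so everything in it (the example 'programs', the clipping constant) is illustrative rather than part of the actual construction.

```python
import numpy as np

# Toy stand-in for the inductor above. Instead of enumerating all 2^l programs
# on a UTM, we hand it a small list of "programs": functions mapping an input
# bit-string x to a probability that y = 1.

def make_inductor(programs):
    log_w = np.zeros(len(programs))            # uniform prior, in log space

    def predict(x):
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        return float(np.dot(w, [q(x) for q in programs]))

    def update(x, y):
        for k, q in enumerate(programs):
            p1 = np.clip(q(x), 1e-12, 1 - 1e-12)
            log_w[k] += np.log(p1 if y == 1 else 1.0 - p1)

    return predict, update

# Example "programs": parity of the input bits, constant 0.5, first bit.
programs = [
    lambda x: float(sum(x) % 2),
    lambda x: 0.5,
    lambda x: float(x[0]),
]
predict, update = make_inductor(programs)
for x, y in [([1, 0, 1], 0), ([1, 1, 1], 1), ([0, 1, 0], 1)]:
    print(predict(x))
    update(x, y)
```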
Note that this setup does not make any assumptions about how the inputs are sampled. This is in contrast to the Singular Learning Theory (SLT) setup, which needs to assume that the inputs are sampled IID from some underlying distribution. This assumption makes it difficult to use SLT to say anything about the generalization behavior of a trained network under distributional shifts, such as switching from the training distribution to a very different deployment distribution. We do not have that problem here.
So long as the program $p$ that generated the data has a minimum length less than or equal to $\ell$ on our UTM $U$, i.e., $K_U(p) \leq \ell$, the conventional error bound for Solomonoff induction still applies in this setting. That is, the total expected prediction error of our inductor will be bounded by

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right] \;\leq\; K_U(p) + \sum_{i=1}^{n} H\!\big(p(\cdot \mid x_i)\big) \tag{1.1}$$

for all $n$. In words, the inductor's total expected prediction error over the first $n$ data points it sees, scored in bits of cross-entropy,

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right],$$

is bounded to stay below the Kolmogorov complexity of the data-generating process $p$ on $U$, $K_U(p)$, meaning the minimum description length of the program $p$ on the machine $U$, plus the Shannon entropy of $p$ on each data point, summed:[7]

$$\sum_{i=1}^{n} H\!\big(p(\cdot \mid x_i)\big) = \sum_{i=1}^{n} \mathbb{E}_{y_i \sim p(\cdot \mid x_i)}\!\left[-\log_2 p(y_i \mid x_i)\right].$$
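A quick special case may help for intuition: if $p$ is a deterministic program, the entropy terms vanish (see footnote [7]), and the bound reduces to

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right] \;\leq\; K_U(p) \quad \text{for every } n,$$

so the inductor's cumulative prediction error never exceeds the description length of the data-generating program, no matter how many data points it sees or how those data points are chosen.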
The proof works analogously to the derivation of the bound in the conventional SI setup found in most textbooks. The only differences are that our programs are augmented with the read-in program $r$ that feeds them the inputs $x_i$, that our prior is uniform over the fixed-length program set $P_\ell$ rather than the universal prior over programs of all lengths, and that we only score predictions of the labels $y_i$, not of the inputs.
None of these changes affect the structure of the proof much. We rearrange the expression for the total cross-entropy to make its dependence on the prior manifest, then use the existing result that the implementations of program $p$ must have a combined prior of at least $2^{-K_U(p)}$ in our induction at the start.
Proof:
First, we rearrange our expression for the total cross-entropy:

$$\begin{aligned}\sum_{i=1}^{n} -\log_2 P(y_i \mid x_i, D_{i-1}) &= -\log_2 \prod_{i=1}^{n} P(y_i \mid x_i, D_{i-1})\\ &= -\log_2 \sum_{q \in P_\ell} w_0(q) \prod_{i=1}^{n} q(y_i \mid x_i).\end{aligned}$$

In the second equality, we used the assumption that our inductor adjusts its probabilities according to incoming evidence using Bayesian updating. According to result 3.8.1 in this book, the semimeasure of programs on a monotone, plain UTM matching the outputs of a program $p$ must be at least $2^{-K_U(p)}$. Inserting this bound yields:

$$\sum_{i=1}^{n} -\log_2 P(y_i \mid x_i, D_{i-1}) \;\leq\; K_U(p) - \sum_{i=1}^{n} \log_2 p(y_i \mid x_i).$$

Taking the expectation over the labels $y_i \sim p(\cdot \mid x_i)$ then gives the bound in Equation (1.1).
Now, we're going to modify the setup from the previous section, bounding our induction to make it computable. Instead of running programs $rq$, where $q$ is drawn from the set of all bit strings of length $\ell$, we restrict ourselves to programs of length $\ell$ that require a maximum of $s$ cells of space and a maximum of $t$ time steps. Specifically, we construct a new bounded UTM $U_{s,t}$ from our UTM $U$. $U_{s,t}$ only has $s$ cells of work tape available. If the head moves to the last cell of the work tape, $U_{s,t}$ immediately halts. Further, $U_{s,t}$ always halts itself after $t$ execution steps if it has not halted yet. This bounded induction will no longer be optimal in the sense of having an SI-style bound on its total error that depends only on the entropy and Kolmogorov complexity of the data-generating process $p$, because $p$ might not be computable in time $t$ and space $s$.
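As an illustration of the kind of hard caps this describes, here is a minimal Python sketch of a capped machine. It is a cartoon: a plain transition-table machine rather than the monotone UTM with separate input/output tapes defined above, and the `flipper` example program is made up.

```python
def run_capped(transitions, tape_init, max_space, max_steps):
    """Toy illustration of the space/time caps described above.

    `transitions` maps (state, symbol) -> (new_state, symbol_to_write, head_move).
    """
    tape = list(tape_init) + [0] * (max_space - len(tape_init))
    head, state = 0, "start"
    for _ in range(max_steps):              # time cap: force-halt after max_steps
        if state == "halt" or (state, tape[head]) not in transitions:
            break
        state, write, move = transitions[(state, tape[head])]
        tape[head] = write
        head = max(head + move, 0)
        if head >= max_space - 1:           # space cap: halt on reaching the last cell
            state = "halt"
    return state, tape

# Example: a machine that flips bits until it reads a 0, run with tight caps.
flipper = {("start", 1): ("start", 0, +1), ("start", 0): ("halt", 0, 0)}
print(run_capped(flipper, [1, 1, 1, 0], max_space=8, max_steps=100))
```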
Our bounded induction still has a weaker upper bound on its total expected prediction error: If there is some program $q^*$ that is an "efficient predictor" of $p$, in the sense that $q^*$ approximates the probability assignments of $p$ up to some error we deem 'acceptable', our induction will predict the data as well as this efficient predictor, up to an offset equal to the efficient predictor's description length.
Specifically, for any program $q^*$ included in the set of programs our bounded induction runs over, we have the bound:[8]

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right] \;\leq\; K_{U_{s,t}}(q^*) + \sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 q^*(y_i \mid x_i)\right] \tag{1.2}$$

for all $n$.
Proof
Proceeds completely analogously to the one for Equation (1.1), except that we substitute in $q^*$ instead of $p$. So we won't write it out again.
Because

$$\mathbb{E}_{y_i \sim p(\cdot \mid x_i)}\!\left[-\log_2 q^*(y_i \mid x_i)\right] = H\!\big(p(\cdot \mid x_i)\big) + D_{\mathrm{KL}}\!\big(p(\cdot \mid x_i)\,\|\, q^*(\cdot \mid x_i)\big),$$

the only difference between this bound and the conventional bound for a computationally unbounded induction is the additional term $\sum_{i=1}^{n} D_{\mathrm{KL}}\!\big(p(\cdot \mid x_i)\,\|\, q^*(\cdot \mid x_i)\big)$, the KL divergence between the data-generating process $p$ and the program $q^*$, summed over all input-label pairs. In other words: for any program $q^*$ with minimum description length $K_{U_{s,t}}(q^*)$ on $U_{s,t}$ that is computable in time $t$ and space $s$, our induction will have its total prediction error bounded by the prediction error of $q^*$ plus the minimum description length of $q^*$ on $U_{s,t}$.
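The decomposition used here, expected code length under $q^*$ equals the entropy of $p$ plus $D_{\mathrm{KL}}(p \| q^*)$, is easy to check numerically for a single Bernoulli data point; the snippet below is just that check, with made-up probabilities.

```python
import numpy as np

def H(p):                   # entropy of Bernoulli(p), in bits
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def KL(p, q):               # KL divergence between Bernoulli(p) and Bernoulli(q), in bits
    return p * np.log2(p / q) + (1 - p) * np.log2((1 - p) / (1 - q))

def cross_entropy(p, q):    # E_{y ~ Bernoulli(p)}[-log2 q(y)]
    return -(p * np.log2(q) + (1 - p) * np.log2(1 - q))

p, q = 0.8, 0.6             # true label probability vs. the predictor's guess
print(cross_entropy(p, q), H(p) + KL(p, q))   # the two numbers agree
```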
Why would we suppose that a program that gets low prediction error exists? Well, for some problems, an efficient predictor probably doesn't exist. But empirically, we can do pretty well on many prediction problems we encounter in the real world using an amount of compute that's practically feasible to obtain. Why that is, I don't know. Perhaps it is somehow related to why a lot of the world seems to abstract well.
Given any program $q$ that can run on a plain, monotone, bounded UTM $U^1_{s,t}$ in space $s$ and time $t$, we can run it on some other plain, monotone, bounded UTM $U^2_{s',t'}$ in space $s' = c_s s$ and time $t' = c_t t$.[9] Thus, the Kolmogorov complexity of $q$ on $U^2_{s',t'}$ is bounded by

$$K_{U^2_{s',t'}}(q) \;\leq\; K_{U^1_{s,t}}(q) + K_{U^2}(U^1),$$

provided $U^2$ has sufficient computational resources available, because we can just simulate $U^1_{s,t}$ on $U^2_{s',t'}$ and then run $q$ on the simulated $U^1_{s,t}$.
Therefore, our total prediction error for bounded induction on any other plain monotone UTM $U^2$ bounded to space $c_s s$ and time $c_t t$, where $c_s, c_t$ are the constant prefactors for simulating $U^1$ on $U^2$, will still be bounded by

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right] \;\leq\; K_{U^1_{s,t}}(q^*) + K_{U^2}(U^1) + \sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 q^*(y_i \mid x_i)\right] \tag{1.3}$$

for all $n$. So, our choice of UTM matters somewhat, since we might need more space and execution time on a different UTM to implement an equally efficient predictor. However, converting to a different UTM only increases the required execution time from $t$ to $c_t t$ and the required space from $s$ to $c_s s$. So, provided $U^2$ has sufficient space $c_s s$ and time $c_t t$, the additional prediction error we get is just $K_{U^2}(U^1)$, which is typically small for most pairs of UTMs we might usually consider using.
Now, we will derive a similar style of bound for an inductor using Bayesian learning on neural networks.
In this setting, our inputs are finite-dimensional vectors of real numbers, $x \in \mathbb{R}^{d_{\mathrm{in}}}$. The labels are still Boolean, $y \in \{0, 1\}$. The inductor now makes its predictions using functions $f$, which take vectors $x$ as input and output probabilities for the data labels $y$. The functions are all part of some space of functions $\mathcal{F}$. The functions are parametrized by a parameter-function map $W: \mathbb{R}^{d} \to \mathcal{F}$ with $f_\theta = W(\theta)$. We call $d$ the number of neural network parameters. We refer to the probabilities the function $f_\theta$ outputs for input vector $x$ as $f_\theta(y \mid x)$. As in the previous sections, we assume that the data labels were generated from the inputs by some probabilistic program $p$. Note that we do not make any assumptions about the inputs being drawn IID.
The inductor starts with a uniform prior over some finite hypervolume $V$ in parameter space: $\phi_0(\theta) = \frac{1}{V}$. The inductor makes its predictions as

$$P(y_{n+1} \mid x_{n+1}, D_n) \;=\; \int \mathrm{d}\theta\; \phi_n(\theta)\, f_\theta(y_{n+1} \mid x_{n+1}),$$

where $\phi_n(\theta)$ is the probability density the inductor places on parameter configuration $\theta$ after updating on the $n$ data-label pairs $D_n$.
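For intuition, here is a crude runnable cartoon of this parameter-space inductor: the integral over $\theta$ is approximated with samples drawn from the uniform prior and reweighted by the likelihood of the data seen so far. The two-parameter logistic 'network' and all variable names are mine, not part of the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(theta, x):                      # probability the "network" assigns to y = 1
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

prior_box = 5.0                             # prior: uniform over [-5, 5]^2
thetas = rng.uniform(-prior_box, prior_box, size=(20_000, 2))
log_w = np.zeros(len(thetas))               # log posterior weights, start uniform

def predict(x):
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return float(w @ f_theta(thetas, x))

def update(x, y):
    p1 = np.clip(f_theta(thetas, x), 1e-12, 1 - 1e-12)
    log_w[:] += np.log(p1 if y == 1 else 1 - p1)

for x, y in [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), 0)]:
    print(predict(x))
    update(x, y)
```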
The total expected prediction error of this induction measured in bits of cross-entropy is upper bounded by

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right] \;\leq\; \log_2\!\frac{V}{V_\epsilon(\theta^*)} + n\,\epsilon + \sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 f_{\theta^*}(y_i \mid x_i)\right] \tag{2.1}$$

for all $n$, all parameter configurations $\theta^*$, and all $\epsilon > 0$. Here, $\log_2\frac{V}{V_\epsilon(\theta^*)}$ is the logarithm of the ratio between the volume of the whole prior, $V$, and the volume $V_\epsilon(\theta^*)$ taken up by the set of parameter configurations $\theta$ that are mapped to functions $f_\theta$ which match the log-probabilities of $f_{\theta^*}$ up to tolerance $\epsilon$:

$$V_\epsilon(\theta^*) \;=\; \operatorname{Vol}\!\left\{\theta \;:\; \big|\log_2 f_\theta(y \mid x) - \log_2 f_{\theta^*}(y \mid x)\big| \leq \epsilon \;\;\text{for all } x, y \right\}.$$

In other words, if there is a parameter configuration $\theta^*$ that is mapped to a function $f_{\theta^*}$ that is an efficient predictor of the data, in the sense that $f_{\theta^*}$ approximates the probability assignments of $p$ up to some error, the inductor will predict the data as well as this efficient predictor $f_{\theta^*}$, plus a tolerance $\epsilon$ per data point that we can freely choose, up to an offset equal to the cost in bits of specifying the predictor in the parameter space to precision $\epsilon$. Intuitively, the reason we don't just always choose $\epsilon = 0$ in this bound is that our parameters are real numbers, so describing the location of $f_{\theta^*}$ in parameter space exactly could sometimes take infinitely many bits.
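A brute-force way to get a feel for $\log_2 \frac{V}{V_\epsilon(\theta^*)}$ is to sample parameters from the prior and count how many land in the matching set, with a finite probe set of inputs standing in for 'all inputs'. The sketch below does this for a toy two-parameter model; the model, the probe set, and the sample sizes are all illustrative choices, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

def log2_probs(theta, X):
    """log2 of the probabilities a toy logistic model assigns to y = 0 and y = 1."""
    p1 = np.clip(1.0 / (1.0 + np.exp(-(X @ theta))), 1e-9, 1 - 1e-9)
    return np.stack([np.log2(1 - p1), np.log2(p1)], axis=-1)

def log2_volume_ratio(theta_star, X, eps, prior_box=5.0, n_samples=20_000):
    """Monte-Carlo estimate of log2(V / V_eps(theta_star)) over a probe input set X."""
    ref = log2_probs(theta_star, X)
    thetas = rng.uniform(-prior_box, prior_box, size=(n_samples, 2))
    hits = sum(np.max(np.abs(log2_probs(th, X) - ref)) <= eps for th in thetas)
    frac = max(hits, 1) / n_samples          # fraction of prior volume inside the set
    return -np.log2(frac)

X = rng.normal(size=(32, 2))                 # probe inputs standing in for "all inputs"
print(log2_volume_ratio(np.array([1.0, -0.5]), X, eps=0.5))
```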
The proof works analogously to the proof for the error bound in Equation (1.1), except that sums over programs are replaced by integrals over parameters, and the semimeasure of programs in the prior matching the outputs of the efficient predictor on all possible input bit strings is replaced by the volume $V_\epsilon(\theta^*)$ of the set of parameter configurations matching the log-probabilities output by function $f_{\theta^*}$ to precision $\epsilon$.
Proof
As before, we start by rearranging our expression for the total cross-entropy:

$$\begin{aligned}\sum_{i=1}^{n} -\log_2 P(y_i \mid x_i, D_{i-1}) &= -\log_2 \prod_{i=1}^{n} P(y_i \mid x_i, D_{i-1})\\ &= -\log_2 \int \mathrm{d}\theta\; \phi_0(\theta) \prod_{i=1}^{n} f_\theta(y_i \mid x_i).\end{aligned}$$

In the second equality, we used the assumption that our inductor adjusts its probabilities over parameter configurations in response to incoming evidence using Bayesian updating.
Now, we restrict the integral to points in $V_\epsilon(\theta^*)$, then use the bound we have for points in this set:

$$\begin{aligned}-\log_2 \int \mathrm{d}\theta\; \phi_0(\theta) \prod_{i=1}^{n} f_\theta(y_i \mid x_i) &\leq -\log_2 \int_{V_\epsilon(\theta^*)} \frac{\mathrm{d}\theta}{V} \prod_{i=1}^{n} 2^{-\epsilon} f_{\theta^*}(y_i \mid x_i)\\ &= \log_2\!\frac{V}{V_\epsilon(\theta^*)} + n\,\epsilon - \sum_{i=1}^{n} \log_2 f_{\theta^*}(y_i \mid x_i).\end{aligned}$$

Taking the expectation over the labels then yields Equation (2.1).
$V_\epsilon(\theta^*)$ is a pretty similar quantity to the $\epsilon$-ball that defines the learning coefficient in Singular Learning Theory. There, the prediction error in-distribution includes the logarithm of a term of the form[10]

$$V^{\mathrm{SLT}}_\epsilon(\theta^*) \;=\; \operatorname{Vol}\!\left\{\theta \;:\; \mathbb{E}_{x \sim q(x)}\!\left[D_{\mathrm{KL}}\!\big(f_{\theta^*}(\cdot \mid x)\,\|\, f_\theta(\cdot \mid x)\big)\right] \leq \epsilon \right\},$$

where $q(x)$ is the 'training distribution' over inputs $x$.
Comparing $V_\epsilon(\theta^*)$ to $V^{\mathrm{SLT}}_\epsilon(\theta^*)$: in our set, the log-probabilities of $f_\theta$ have to match those of $f_{\theta^*}$ to within $\epsilon$ on every possible input $x$; in the SLT set, only a KL divergence to $f_{\theta^*}$ averaged over the training distribution $q(x)$ has to stay below $\epsilon$.
These differences are the price of describing prediction error in general instead of 'in-distribution'. In SLT, we assume that our data are drawn IID from some known distribution $q(x)$. In that case, average error over that distribution is enough to bound the inductor's expected average error on future data. Without this assumption of IID sampling, we need to classify functions based on how closely their outputs align with our reference $f_{\theta^*}$ on all possible inputs $x$. Or at least, all inputs we think the inductor might possibly encounter.
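To spell out the difference between the two membership criteria, here are the two $\epsilon$-set tests written side by side as toy Python predicates over arrays of log-probabilities on a finite probe set of inputs (the array layout and all names here are mine, purely for illustration).

```python
import numpy as np

# `logp_ref[j, y]` and `logp_theta[j, y]` hold log2-probabilities that the
# reference predictor and a candidate assign to label y on probe input j;
# `q_x[j]` is a training-distribution weight over the probe inputs.

def in_ood_set(logp_theta, logp_ref, eps):
    """Sup-norm criterion: log-probs must match within eps on EVERY input."""
    return bool(np.all(np.abs(logp_theta - logp_ref) <= eps))

def in_slt_set(logp_theta, logp_ref, q_x, eps):
    """SLT-style criterion: KL to the reference, AVERAGED over q(x), stays below eps."""
    p_ref = 2.0 ** logp_ref
    kl_per_input = np.sum(p_ref * (logp_ref - logp_theta), axis=1)
    return bool(np.sum(q_x * kl_per_input) <= eps)
```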
In SLT, $\epsilon$ is also set to $0$ in the expression for the prediction error instead of being left free as in Equation (2.1), because the theory is primarily concerned with the infinite data limit $n \to \infty$, where describing the solution to exact precision is 'worth it' and yields a tighter bound. To put that slightly more quantitatively: one term for the prediction error in SLT scales as $\lambda \log_2 n$, while $\log_2\frac{V}{V_\epsilon(\theta^*)}$ can be shown to only scale as $\lambda \log_2\frac{1}{\epsilon}$[11], where $\lambda$ is a scalar number called the learning coefficient. So, in the vanilla SLT setup, letting $\epsilon$ go to zero as $\frac{1}{n}$ for $n \to \infty$ yields a tighter bound for the prediction error than setting it to some finite non-zero value.
Section TL;DR: What kind of neural network architectures have a prior equivalent to that of a bounded universal Turing machine? How can we estimate the description length $\log_2\frac{V}{V_\epsilon(\theta^*)}$ in practice?
So, we can sort of think about Bayesian learning on neural networks and compute-bounded Solomonoff induction using a similar set of mathematical tools. But how similar are these induction processes really? That is, how much does a uniform prior over parameter configurations on some neural network architecture resemble a uniform prior over programs on some compute-bounded UTM?
Another way of asking that question is whether we can get something like Equation (1.3) for switching between a bounded UTM and a neural network, and vice versa. So, can we get something like:
Conjecture 1: For any parameter vector $\theta^*$ parametrizing some function $f_{\theta^*}$, the prediction error of an inductor using bounded UTM $U_{s,t}$ is bounded by

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right] \;\leq\; \log_2\!\frac{V}{V_\epsilon(\theta^*)} + K_{U_{s,t}}(W_\epsilon) + n\,\epsilon + \sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 f_{\theta^*}(y_i \mid x_i)\right],$$

where $K_{U_{s,t}}(W_\epsilon)$ is the description length of the neural network architecture $W$ on UTM $U_{s,t}$ to precision $\epsilon$.[12] Provided $U_{s,t}$ has enough time and memory to simulate $W$ to that precision, of course.
Conjecture 2: For any program $q$ on bounded UTM $U_{s,t}$, the prediction error of an inductor using neural network architecture $W$ is bounded by

$$\sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 P(y_i \mid x_i, D_{i-1})\right] \;\leq\; K_{U_{s,t}}(q) + \log_2\!\frac{V}{V_\epsilon(U_{s,t})} + n\,\epsilon + \sum_{i=1}^{n} \mathbb{E}_{p}\!\left[-\log_2 q(y_i \mid x_i)\right],$$

where $\log_2\frac{V}{V_\epsilon(U_{s,t})}$ is the description length in bits of bounded UTM $U_{s,t}$ on neural network architecture $W$ to precision $\epsilon$.
If both of these conjectures were true, doing Bayesian learning using neural networks instead of an ensemble of programs would be almost no more of a change than doing Bayesian learning in one programming language instead of another.[13]
My guess is that something vaguely like Conjecture 1 is true. With sufficient compute, we can use a UTM to simulate any neural network architecture to some arbitrary desired floating point precision. But I think Conjecture 2 is generally false, because not every neural network architecture can simulate a bounded UTM. Remember, our definition of 'neural network' here was basically just 'a space of functions $\mathcal{F}$ parametrized by some parameter-function map $W$'. A family of polynomials would satisfy that definition. And I doubt a uniform prior over polynomials acts much like a uniform prior over compute-bounded programs. But maybe Conjecture 2 still holds for some more restricted set of neural network architectures? Like, say, architectures capable of simulating a bounded UTM. RNNs and transformers run in chain-of-thought mode can do this, for example.[14] I originally thought that this condition would be sufficient. But proving the conjecture for transformers in chain-of-thought mode turned out to be trickier than I thought.[15] Maybe you just need to be a bit cleverer about how to do the UTM simulation than I was so far. Or maybe, being able to simulate a bounded UTM actually isn't a sufficient requirement for Conjecture 2, and the architecture needs to have some additional property. In that case, it'd be interesting to find out what that property is.
For example, does $\log_2\frac{V}{V_\epsilon(\theta^*)}$ behave like its in-distribution counterpart in vanilla SLT in the limit $\epsilon \to 0$? That is, for small enough $\epsilon$ and some reasonable conditions on the NN architecture[16], do we often get something like

$$\log_2\!\frac{V}{V_\epsilon(\theta^*)} \;\approx\; \tilde{\lambda}(\theta^*)\, \log_2\!\frac{1}{\epsilon} + O(1),$$

where $\tilde{\lambda}(\theta^*)$ would then be some number analogous to the learning coefficient $\lambda$ in SLT? Can we use stochastic gradient Langevin dynamics (SGLD) sampling to estimate $\tilde{\lambda}$ on neural networks in real life, the same way people use it to determine the learning coefficient? If a solution can generalize out-of-distribution, it can certainly generalize in-distribution, but the reverse is generally not true. So, presumably the in-distribution $\lambda$ would tend to be smaller than $\tilde{\lambda}$. But how much smaller does it tend to be in practice? In other words, how much does OOD generalization error usually differ from in-distribution generalization error?
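For reference, here is a minimal sketch of the SGLD-based estimator people use for the (in-distribution) local learning coefficient, on a deliberately tiny one-parameter model; whether an analogous localized estimator can be built for $\tilde{\lambda}$ is exactly the open question above. The step size, localization strength, and the inverse temperature $\beta = 1/\log n$ are common but arbitrary choices, and the model is a toy of my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def nll(theta, x, y):
    """Average negative log-likelihood (nats) of a toy model y ~ Bernoulli(sigmoid(theta*x))."""
    p = 1.0 / (1.0 + np.exp(-theta * x))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def grad_nll(theta, x, y):
    p = 1.0 / (1.0 + np.exp(-theta * x))
    return -np.mean((y - p) * x)

def sgld_lambda_hat(theta_star, x, y, n_steps=5000, step=1e-4, beta=None, gamma=100.0):
    """Estimate n*beta*(E_posterior[L] - L(theta_star)), the usual lambda-hat, via SGLD
    localized around theta_star with a quadratic restraining term of strength gamma."""
    n = len(x)
    beta = beta if beta is not None else 1.0 / np.log(n)
    theta, losses = theta_star, []
    for _ in range(n_steps):
        g = n * beta * grad_nll(theta, x, y) + gamma * (theta - theta_star)
        theta = theta - 0.5 * step * g + np.sqrt(step) * rng.normal()
        losses.append(nll(theta, x, y))
    return n * beta * (np.mean(losses) - nll(theta_star, x, y))

# Toy data generated at theta = 2.0, then estimate around that same point:
x = rng.normal(size=1000)
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-2.0 * x))).astype(float)
print(sgld_lambda_hat(2.0, x, y))
```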
Thanks to Aram Ebtekar for providing many useful citations. Thanks to Lee Sharkey for helping me come up with a title for this. Thanks to Cole Wyeth for spotting an unnecessary condition in my original definition of the UTMs. And thanks to Aram Ebtekar as well as Mathias Dellago, Alexander Gietelink Oldenziel, Daniel Murfet, David Quarel, Kaarel Hänni, John Wentworth, Dmitry Vaintrob, Jake Mendel, Linda Linsefors, and Scott Aaronson for various discussions and ideas that were causally upstream of my writing this.
I expect that basically nothing in the first part is actually novel, though I haven't managed to track down citations for every result in it in the exact form given here. If you think you know your AIT and don't need to read this part, you might be right. But I'd advise some caution before skipping it. I wrote it partially because many AIT-knowledgeable people I talked to about this topic turned out not to know these things.[17]
As in a program that outputs probabilities as strings in some kind of alphabet.
We can also let $n$ tend to infinity though; it shouldn't affect the proofs.
So, not prefix-free. Lots of standard AIT results assume a prefix-free UTM. This post does not. AIT-savvy skim readers beware. If you compare the theorems here to results phrased for the setting of prefix-free programs, you might become pretty confused.
See e.g. the start of section three here for more on the definition of monotone Turing machines.
As in the standard setup for Solomonoff induction with probabilistic programs, the programs can keep running and printing more output digits indefinitely, representing the output probabilities with ever higher precision.
So, if $p$ is a deterministic program, this second term will vanish.
See also, for example, theorem 1 of this paper, which shows almost the same thing in an RL setting.
See the proof here for example.
For the special case of a uniform prior, discrete outputs $y$, and our loss function being cross-entropy error measured in bits.
SLT convention usually considers prediction error scored in nats, and thus the expressions use natural logarithms; I'm translating to bits here.
For some to-be-determined definition of 'precision' that makes the conjecture work. Basically, we'd want some measure of the floating‑point precision the UTM uses to approximate the real-number-valued parameters and network inputs.
Still only "almost", because we still have those extra terms. Those don't show up when converting between UTMs.
The results in those links come with some annoying caveats, so if you want to jump on this you might want to use different constructions to do the UTM emulation.
Basically, making unused NN parameters not interfere with the computation no matter what values they're set to is kind of tough. As a consequence, I keep ending up with prefactors in front of the description-length term.
e.g., analyticity.
Not judging. I also thought myself somewhat AIT-knowledgeable, but did not realize some of what it had to say about degeneracy until fairly recently. I spent years being very confused about how learning works because of this.