Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

TLDR: The simplicity bias in Bayesian statistics is not just a bias towards short description length. 

 

The folklore relating the simplicity bias in Bayesian statistics to description length is incomplete: while it is true that the fewer parameters you use the better, the true complexity measure which appears in the mathematical theory of Bayesian statistics (that is, singular learning theory) is more exotic. The content of this complexity measure remains quite mysterious, but in this note we point out that in a particular setting it includes a bias towards runtime error-correction. This suggests caution when reasoning about the role of inductive biases in neural network training.

Acknowledgements. Thanks to Jesse Hoogland, Liam Carroll, Rumi Salazar and Simon Pepin Lehalleur for comments.

1. Background

1.1 Relevance to Deep Learning

Consider the problem of solving an ordinary differential equation. A constructive proof involves actually writing down a solution, or an algorithm that in finite time will produce a solution. The Picard-Lindelöf theorem proves that a solution to a broad class of initial value problems exists, but the proof is not constructive: it sets up a contraction mapping on a complete metric space and appeals to the Banach fixed point theorem. 

While the Picard-Lindelöf theorem uniquely characterises the solution as the fixed point of a contraction mapping, and gives an iterative process for approximating the solution, it does not construct the solution. However a construction is not necessary for many of the applications of Picard-Lindelöf (in differential geometry, topology and many parts of analysis). This mode of reasoning about mathematical objects, where it suffices to have characterised[1] them by (universal) properties, is pervasive in modern mathematics (in the above example, the characterising property is the differential equation, or its associated contraction mapping). However this may seem quite alien to a computer scientist or programmer, who for historical reasons tend to think that there is only one mode of reasoning about mathematical objects, and that is centred on the study of a construction.

In an era where programs are increasingly the product of gradient descent rather than human construction, this attitude is untenable. We may have to accept a mode of reasoning about learned programs, based on understanding the nature of the problems to which they are a solution and the iterative processes that produce them. To understand the implicit algorithms learned by neural networks, it may be necessary from this perspective to understand 

  • the computational structures latent in the data distribution, and
  • the inductive biases of neural network training.

We do not currently have a good understanding of these matters. If we understood these inductive biases better, it could conceivably help us in the context of AI alignment to answer questions like "how likely is deceptive alignment", "how likely is consequentialism", and "what goals are instrumentally convergent"?

This note is about the inductive biases of the Bayesian learning process (conditioned on more samples, the posterior increasingly localises around true parameters). Since Bayesian statistics is both fundamental and theoretically tractable, this seems potentially useful for understanding the inductive biases of neural network training. However it is worth noting that the relation between these is not understood at present.

1.2 Singular Learning Theory

The asymptotic expansion of the Bayesian free energy, or "free energy formula'', proven by Watanabe in Singular Learning Theory (SLT) introduces the learning coefficient  as a measure of complexity that balances model accuracy in the process of model selection. 

In models that are regular or minimally singular the learning coefficient is easy to understand: it is the effective number of parameters in the model, in a suitable sense. More precisely if a parameter is irrelevant, in the sense that any variation does not change the prediction, this leads to a decrease of  in the learning coefficient and thus  agrees with an effective parameter count in minimally singular models.[2] Therefore, in these cases, the tradeoff between model accuracy and complexity embodied in the minimisation of the free energy is familiar.

However, in more degenerate models such as neural networks, this connection between the learning coefficient and parameter counting breaks down. In particular, the learning coefficient depends on the data distribution. This is not an obstacle to theoretically deriving or empirically estimating the learning coefficient, but it does mean that we may lack good intuitions for what this (positive, rational) number is measuring.

1.3 Minimum Description Length

There is a circle of ideas containing Occam's razor, Kolmogorov complexity and the Minimum Description Length (MDL) which strongly informs the intuitions in the machine learning community about the meaning of "simplicity" and its role in determining the inductive biases of neural network training. However it is important to note that the mathematical hypotheses required for the attractive coherence among these ideas do not apply to neural networks.

A good reference for the classical treatment of the MDL is

For the relation of the MDL to Bayesian model selection and the asymptotic expansion of the free energy (in the minimally singular case) see

There has been some attempt in recent years to address the fact that the classical treatment of MDL is not applicable to singular models like neural networks:

It seems that using SLT one could give a generally correct treatment of MDL. However, until such results are established, one should not presume any fundamental theoretical connection between the simplicity bias of Bayesian statistics of neural networks and any simple notion of "description length". 

2. Turing Machines With Errors

In this section we give an example where the complexity measure in the free energy asymptotic for singular statistical models (the learning coefficient) is sensitive to algorithmic structure that goes beyond the number of effective parameters.

The example is derived from a literature that has attempted to bridge semantics of logic with statistical learning theory, based on ground-breaking work of Ehrhard-Regnier on differential linear logic. We have tried to make the presentation here self-contained, but the reader can find more information in the following references:

We consider the problem of predicting the outputs of a computable generating process by finding codes for a Universal Turing Machine (UTM). We have in mind a standard kind of UTM  with a description tape (where we write the code which specifies the machine to be simulated), a state tape (where the state of the simulated machine is written) and a work tape (containing the state of the tape of the simulated machine). Some input sequence  is written to the work tape, a code  is written to the description tape, an initial state is written to the state tape, and then the UTM proceeds to simulate the Turing Machine  with code  until it halts with the output  on the work tape, if it halts.

We can consider this process as actually instantiated in a machine or, more abstractly, an automata. The role of error in such processes is very interesting and leads ultimately to the problem of run-time error correction in modern computing: every step of communication or processing in a computer involves some probability of error and, although small, the large number of steps and size of the messages involved means that some error correction may be necessary for meaningful computation to take place.

We suppose that some computable process in the environment is outputting strings  according to some true distribution , with each symbol having some small error from the output  of a TM . The problem of statistical inference is to figure out which TM it is, from a given set of input-output pairs . Of course there will be (infinitely) many TMs consistent with the set of samples , so we expect that the result of statistical inference will be a probability distribution over codes. This is what we call the Bayesian posterior . We might bias this towards shorter machines by putting a prior  over the space of models (codes).

This is an old problem, but we consider it from a new angle. Rather than considering a discrete space of codes with a special prior, we extend our set of allowed models of the generating process beyond TMs (as represented by their codes) to include codes with error channels. In just the same way that each "organ'' of the automata[3] in von Neumann's work may have some probability of error, we allow each symbol  of the code  of  to have some probability of error when it is read by the UTM. The specification of the allowed distribution of errors for each symbol in  defines a point in the space  of codes-with-error-channels, which we take as the parameter space of our new extended statistical model.

Given one of these codes-with-error-channels  and an input sequence  the contents of the work tape of  after some given number of steps depends on the errors encountered during the execution of the UTM; that is, it is a probability distribution . This is our model. Those parameters  which lead to final work tapes close to the true output , in the sense that the KL divergence between the two probability distributions is small, are more highly preferred by the Bayesian posterior , which is now defined on the extended space of models .

2.1 Error as a Probe of Structure

Of course, if the generating process  is computable then its distribution of outputs  can be realised by a model  where  has no errors (for example, the code of ). So what is there to be gained by considering this larger space of models?

The distribution over outputs that results from a given distribution of possible errors in reading a symbol  of the code  reflects the way that the symbol is used in the computation. While this is clear informally (to understand how something works, try perturbing one of its components and see what happens) this can be given a formal interpretation in the context of differential linear logic, where there is a connection between this sensitivity analysis and the Taylor series expansion of a proof. The simplest example is that if the introduction of errors in  does not affect the output distribution at all then it is reasonable to conclude that  is irrelevant to the computation. This already suggests a natural connection between the local learning coefficient  and the program length of  whereby the posterior  tends to concentrate around shorter programs which predict the dataset  to the same degree of accuracy. We will turn to more interesting examples soon.

The upshot is that given a TM  with code  the way that the probability distribution  varies when we vary  near  (noting that ) encodes a lot of information about the structure of the algorithm  when executed on the input , via the effects of perturbing each one of the bits in the code . This variation in turn is reflected in the local geometry of the function

near , of which the local learning coefficient  is a scalar invariant. The conclusion is that the algebraic geometry of the function germ  should be expected to reflect the internal structure of the TM  to some degree. At the moment it remains unclear how strong this connection between geometry of  and internal structure of  actually is.

2.2 The Example

We fix an input  for which the true output is . For simplicity we assume this is a single symbol and that we judge models  on their predictions for the contents of this single tape square. The contribution of the data sample  to the KL divergence  can be shown to be equivalent[4] to  where

which is a polynomial in the coefficients specifying the probability distributions that make up . Here  is the Kronecker delta which is  if  and zero otherwise.

Let us now consider the local behaviour of  (equivalently ) at a point of parameter space which represents a TM  with code . We fix a single symbol of this code and a perturbation of this code in some direction . That is, we consider a parameter  which agrees with  except that in the th position we replace the symbol  by the distribution . For concreteness let us take the original symbol of the code to be  and take  to be the formal linear combination of symbols  so that for small  is a probability distribution representing a small chance of reading a  instead of a . Then we can expand  in powers of 

where each  is a function of  only (since we assume there is no uncertainty in the other positions of ). Taking  we obtain  which we may assume is zero; that is, we assume that  gives the correct answer  for the fixed input . Thus  and we have

We are interested in the order of this polynomial, that is the least  such that .

To go further we have to think in more detail about how the UTM  computes the probability  that is, how the uncertainty in the code  is propagated to uncertainty in the output. Since we run the simulated machine for some fixed number of steps, the UTM itself is run for some number of steps, and thus makes use of exactly  samples from the th symbol of the code for some  that is independent of . The possible trajectories of  are thus dictated by the sequences of possible results  and to each we assign a probability

In our example, the distribution in  at the th position is

so that we need only consider sequences where samples  appear, and

where  is the number of times  is sampled in  and .

One can compute that when you run the UTM  as above the resulting probability distribution over outputs is

where the sum is over all executions of  making use of a sequence of samples  that, on input , leads to the output symbol  equal to . Here  is defined by running the UTM and intervening in any step where the UTM goes to read from the th symbol of the code, so that on the th read for  the UTM sees the symbol .

Combining the above equations

which is zero since when  we have  and  so the first factor in every summand vanishes. Thus . For 

For 


Recall that  is  where  is the number of 's appearing in the sequence . Thus  is the number of s sampled in the sequence , that is, the number of errors. The derivative is nonzero only when  which for  is only possible (given that  and ) if . The only "paths'' or sequences  that contribute to the -fold derivative are ones which contain  errors. Substituting in the above yields

where we have used that

We can use this formula to compute the coefficient 
 

Definition. We say that the Turing Machine  with code  is robust to  errors on input  in position  if, for any execution path  of the UTM  initialised with  on the description tape and  on the work tape involving  errors, .


Lemma. If  is robust to  errors on input  in position  then  for . That is, the order of  in  is strictly greater than .

Proof. We set . If  is robust to  errors in the stated sense, then it is clear from the above that  when  and  since it is never true that  when  is such that . In the case where  and 

Hence for any  and any  we have . Thus for 

since every summand involves a term  with 
 

Example. Since any TM is robust to zero errors, we always have . If  is robust to one error on input  then , so the polynomial  is at least cubic in .

If  is robust to  errors on every input  then  has order  in , from which we infer that  has order  in  locally at . If  are the only allowed symbols in the code at this position then the local learning coefficient satisfies .

2.3 Summary

The example illustrates one simple way in which the structure of a TM  with code  (in this case, the capability to recover from read errors in its specification during run-time) influences the geometry of the function germ . We expect there are many further forms of degeneracy, which for example involve sophisticated interactions between bits of the code.

In a minimally singular model every used parameter costs  (in the sense that it increases  by that amount) and so our intuition might suggest that in the current setting every used bit in the code should cost . It is true that every used bit costs at most  but we have just seen that not all bits are created equal: a bit which is "error-corrected" in the sense that when the UTM executes  it is robust to  errors in that bit, only costs . The principle of minimisation of free energy therefore suggests that, all else being equal, the Bayesian posterior will prefer codes where the bits are error-corrected in this way. That is, two codes of equal length and matching the outputs of the generating process equally well, may nonetheless have neighbourhoods assigned different probability by the posterior, if one of them has error-correction and the other does not. Thus "simplicity'' (in the sense of ) includes robustness to errors.

We note that it is straightforward to provide a TM with run-time error-correction at the cost of execution time, by running the program multiple times and applying a majority vote to determine the answer. More sophisticated schemes are possible, and there is a literature that has worked out various ways of doing this.

3. Conclusion

The relation between description length and simplicity biases in Bayesian statistics is well-known, but is a phenomena that is confined to regular models and this class of models does not include neural networks. We do not yet possess conceptually simple general intuitions about the inductive biases of Bayesian statistics. In this note we have exhibited one case in which the simplicity bias is more exotic. 

3.1 Questions and Open Problems

The analysis in this note is preliminary. The set of people working on ideas like this is small, and if you have relevant background in mathematics or ML you could probably figure out something useful. Notes:

  • It is not known whether the inductive bias of neural network training contains a preference for run-time error-correction. The phenomenon of "backup heads" observed in transformers seems like a good candidate. Can you think of others?
  • It seems that in deep RL, policies which take effective control of the environment might be able to achieve a kind of run-time error-correction which allows them to simplify their policies and thus minimise the free energy. This might lead to a connection between simplicity in Bayesian statistics and the emergence of goal-directedness.
  • Simon Pepin Lehalleur (henceforth SPLH) asks "It would be remarkable if some related fact or downstream consequence of this observation had not been observed somewhere in the literature on error-correction and information theory?" There is an extensive literature on error-correction in naturally occurring computational systems. Interesting observations can be found in "Noisy dynamical systems evolve error correcting codes and modularity" by T. McCourt et al, "Resource savings from fault-tolerant circuit design" by A. K. Tan and I. L. Chuang, "Biological error correction codes generate fault-tolerant neural networks" by A. Zlokapa et al. I have only a superficial familiarity with the literature, but it seems to me one could make a career out of bringing modern mathematical techniques to bear on this field.
  • SPLH suggests the modest open problem of proving this is a consequence of SLT :) 
  1. ^

    Simon Pepin Lehalleur says: "defined implicitly" is another common way to get at the same idea. Statistics is in some sense all about implicit definitions ("Statistics is the inverse problem to probability").

  2. ^

    In the case where the KL divergence or loss function is locally, after a change of variables, a sum of squares  in  then since changing each of the  for  increases the loss, we refer to these parameters as "relevant" whereas changing each of the  for  does not change the loss so we refer to these parameters as "irrelevant". More precisely, there is a finite range within which these parameters can be varied without changing the loss.

  3. ^

    By an "organ" von Neumann means a fundamental unit, from which more complicated computations can be built.

  4. ^

    In the sense that there exist  such that with  we have . In particular the local learning coefficients of  and  agree.

New Comment
5 comments, sorted by Click to highlight new comments since:

[minor]

I'm a bit confused or I think you need some additional caveats in the intro. I would have said that Bayesian statistics with typical models is well understood as bias toward short description length but there are important caveats in the neural network case, at least conceptually.

(That said, minimum description length is off by constant factor, but this doesn't seem to be what you're getting at.)

You say:

TLDR: The simplicity bias in Bayesian statistics is not just a bias towards short description length.

And

while it is true that the fewer parameters you use the better, the true complexity measure which appears in the mathematical theory of Bayesian statistics (that is, singular learning theory) is more exotic

But later you say:

The relation between description length and simplicity biases in Bayesian statistics is well-known, but is a phenomena that is confined to regular models and this class of models does not include neural networks.

Maybe I can clarify a few points here:

  • A statistical model is regular if it is identifiable and the Fisher information matrix is everywhere nondegenerate. Statistical models where the prediction involves feeding samples from the input distribution through neural networks are not regular.
  • Regular models are the ones for which there is a link between low description length and low free energy (i.e. the class of models which the Bayesian posterior tends to prefer are those that are assigned lower description length, at the same level of accuracy).
  • It's not really accurate to describe regular models as "typical", especially not on LW where we are generally speaking about neural networks when we think of machine learning.
  • It's true that the example presented in this post is, potentially, not typical (it's not a neural network nor is it a standard kind of statistical model). So it's unclear to what extent this observation generalises. However, it does illustrate the general point that it is a mistake to presume that intuitions based on regular models hold for general statistical models.
  • A pervasive failure mode in modern ML is to take intuitions developed for regular models, and assume they hold "with some caveats" for neural networks. We have at this point many examples where this leads one badly astray, and in my opinion the intuition I see widely shared here on LW about neural network inductive biases and description length falls into this bucket.
  • I don't claim to know the content of those inductive biases, but my guess is that it is much more interesting and complex than "something like description length".
  • It is not known whether the inductive bias of neural network training contains a preference for run-time error-correction. The phenomenon of "backup heads" observed in transformers seems like a good candidate. Can you think of others?

I've heard thirdhand (?) of a transformer whose sublayers  will dampen their outputs when  is added to that sublayer's input. IE there might be a "target" amount of  to have in the residual stream after that sublayer, and the sublayer itself somehow responds to ensure that happens? 

If there was some abnormality and there was already a bunch of  present, then the sublayer "error corrects" by shrinking its output.

I'm surprised to see an application of the Banach fixed-point theorem as an example of something that's too implicit from the perspective of a computer scientist. After all, real quantities can only be represented in a computer as a sequence of approximations — and that's exactly what the theorem provides.

I would have expected you to use, say, the Brouwer fixed-point theorem instead, because Brouwer fixed points can't be computed to arbitrary precision in general.

(I come from a mathematical background, fwiw.)

This comment of mine is a bit cheeky, since there are plenty of theoretical computer scientists who think about characterising terms as fixed points, and logic programming is a whole discipline that is about characterising the problem rather than constructing a solution, but broadly speaking I think it is true among less theoretically-minded folks that "program" means "thing constructed step by step from atomic pieces".