# 52

Yuling Yao argues that Bayesian models are guaranteed to overfit. He summarizes his point as follows:

I have a different view. Bayesian model does overfit.

Moreover, Bayes is guaranteed to overfit, regardless of the model (being correct or wrong) or the prior ( “strong” or uninformative).

Moreover, Bayes is guaranteed to overfit on every realization of training data, not just in expectation.

Moreover, Bayes is guaranteed to overfit on every single point of the training data, not just in the summation.

He uses the following definition of "overfitting": a model "overfits" some data if its out-of-sample log loss exceeds its within-sample log loss. Interpreted in a different way, this is equivalent to saying that the model assigns higher probability to a data point after updating on it than before. Andrew Gelman makes the point that any proper fitting procedure whatsoever has this property, and alternative methods "overfit" more than ideal Bayesian methods.

I think the proper way to interpret the results is not that Bayesian methods are guaranteed to overfit but that the definition of "overfitting" used by Yuling Yao, while intuitively plausible at first glance, is actually poor. Still, proving the fact that Bayesian methods indeed must "overfit" in his sense is an interesting exercise. I tried understanding his derivation of this and gave up - I present an original derivation of the same fact below that I hope is clearer.

## Derivation

Suppose we have a model parametrized by parameters and the probability of seeing some data according to our model is . Now, suppose we draw independent samples . Denote this whole data vector by , and denote the data vector with the th sample omitted by . Under Bayesian inference, the within-sample probability of observing the value in the next sample we draw is

On the other hand, Bayes says that

Plugging in gives

or

We can decompose the expectation of the squared probability on the right hand side using the definition of variance as follows:

where I've used the fact that

to get rid of the expectation. The variance term on the right hand side is nonnegative by definition as it's a variance, and it's strictly positive as long as there's any uncertainty in our beliefs about after seeing the data that would influence our probability estimate of observing next. This will be the case in almost all nondegenerate situations, and if so, we obtain the strict inequality

## What does this mean?

The finding is intuitively obvious, but poses some challenges to formally defining the notion of overfitting. This is essentially because the ideal amount of fitting for a model to do on some data is nonzero, and overfitting should be "exceeding" this level of ideal fitting. In practice, though, it's difficult to know what is the "appropriate" amount of fitting for a model to be doing. Bayesian inference is ideal if the true model is within the class of models under consideration, but it might fail in unexpected ways if it's not, which is almost always the case in practice.

I think the lesson to draw from this is that overfitting is a relative concept and claiming that a particular method "overfits" the data doesn't make too much sense without a point of reference in mind. If people have alternative ways of trying to construct an absolute notion of overfitting with the above argument taken into account, I'd be interested to hear them.

## New to LessWrong?

New Comment

My first-level intuition says that if you had some sort of knob you could turn to adjust the amount of "fitting" while holding everything else constant, then "overfitting" would be when turning the knob higher makes the out-of-sample loss go up.

My more-detailed model--which I haven't thought super long about--looks like this:

In an idealized example where you had "perfect" training data that correctly labeled every example your classifier might encounter, you would want the classifier to learn a rule that puts literally 100% of the positive examples on one side of the boundary and literally 100% of the negative examples on the other side, because anything else would be inaccurate.  You'd want the classifier to get as complex as necessary to achieve that.

Some reasons you might want the classifier to stop before that extreme include:

1. You may have errors in your training data.  You want the classifier to learn the "natural" boundary instead of finding a way to reproduce the errors.
2. Your data set may be missing dimensions.  Maybe the "true" rule for classifying bleggs and rubes involves their color, but you can only view them through a black-and-white camera.  Even if none of your training points are mislabeled, the best rule you could learn might misclassify some of them because you don't have access to all the data that you'd need for a correct classification.
3. Rather than all possible data points, you may have a subset that is less-than-perfectly-representative.  Drawing a line half-way between the known positive examples and known negative examples would misclassify some of the points in between that aren't in your training set, because by bad luck there was some region of thingspace where your training set included positive examples that were very close to the true boundary but no equally-close negative examples (or vice versa).

The reason that less-than-maximum fitting might help with these is that we have an Occamian prior saying that the "true" (or best) classifying rule ought to be simple, and so instead of simply taking the best possible fit of the training data, we want to skew our result towards our priors.

Through this lens, "overfitting" could be described as giving too much weight to your training data relative to your priors.

I wonder if your more-detailed model could be included in a derivation like that in the post above. The post assumes that every observation the model has (the previous y values) is correct. Your idea of mislabelling's or sub-perfect observations might be includable as some rule that says y's have an X% chance of just being wrong.

We can imagine two similar models.  a "zoomed in" model consists of two parts, first a model for the real-world, and second a model of observation errors.  A "zoomed out" model that just combines the real world and the observation errors and tries to fit to that data. If  sees errors then the model is tweaked to predict errors. Equivalently in the maths, but importantly different in spirit is model  which when it encounters an obvious error does not update the world model, but might update the model of the observation errors.

My feeling is that some of this "overfitting" discussion might be fed by people intuitively wanting the model to do , but actually building or studying the much simpler . When  tries to include observation errors into the same map it uses to describe the world we cry "overfitting".

A tangential question: Does the overfitting issue from Bayesian statistics have an analog in Bayesian epistemology, i.e. when we only deal with propositional subjective degrees of belief, not with random variables and models?

A tangential question: Does the overfitting issue from Bayesian statistics have an analog in Bayesian epistemology, i.e. when we only deal with propositional subjective degrees of belief, not with random variables and models?

I think the problem is the same in both cases. Roughly speaking, there is some "appropriate amount" of belief updating to try to fit your experiences, and this appropriate amount is described by Bayes' rule under ideal conditions where

• it's computationally feasible to perform the full Bayesian update, and
• the correct model is within the class of models you're performing the update over.

If either of these is not true, then in general you don't know which update is good. If your class of models is particularly bad, it can be preferable to stick to an ignorance prior and perform no update at all.

Asymptotically, all update rules within the tempered Bayes paradigm (Bayes but likelihoods are raised to an exponent that's not in general equal to 1) in a stationary environment (i.i.d. samples and such) converge to MLE, where you have guarantees of eventually landing in a part of your model space which has minimal KL divergence with the true data generating process. However, this is an asymptotic guarantee, so it doesn't necessarily tell us what we should be doing when our sample is finite. Moreover, this guarantee is no longer valid if the data-generating process is not stationary, e.g. if you're drawing one long string of correlated samples from a distribution instead of many independent samples.

Using Bayes' rule at least gets the right credence ratios between the different models you're considering, but it's not clear if this is optimal from the point of view of e.g. an agent trying to maximize expected utility in an environment.

I think in practice the way people deal with these problems is to use a "lazily evaluated" version of the Bayesian paradigm. They start with an initial class of models , and perform usual Bayes until they notice that none of the models in seem to fit the data very well. They then search for an expanded class of models which can still fit the data well while trying to balance between the increased dimensionality of the models in and their better fit with data, and if a decent match is found, they keep using from that point on, etc.

You may well be right with all this! But I meant something else here, namely "Bayesian epistemology" in the sense of Bayesianism in philosophy, without any statistics, i.e. without notions of (a class of) models, sampling, or a true data generating process. It seems Bayesian statisticians and epistemologists often have problems understanding each other.

I admit this is a bit derailing the discussion, since I lack the statistical background and can't comment on your thoughts of this post. So feel free to ignore what follows.

To clarify, epistemologists usually identify their type of Bayesianism with the two epistemic rationality assumptions:

1. Probabilism (synchronic norm): At any point in time, all beliefs of an agent should satisfy the axioms of probability theory, where a probability function describes the degrees of belief (at some point in time) of an agent, and the objects of belief are propositions which may be combined with propositional logical connectives like negation, conjunction etc. (Similar to sets/"events" in standard statistics.)

2. Conditionalization (diachronic norm): If an agent has some degree of belief in some proposition H, and they observe a new piece of evidence E (where E is also a proposition), they should update H according to to the principle of conditionalization where and describe the beliefs at the points in time immediately before and after "learning" that E.

(Conditionalization implies that , so this norm is only plausible when the evidence is learned with certainty, like direct experience. It also implies "regularity", namely that . Note also that if Bayes' theorem is applied to to right hand side of the principle of conditionalization, we get something that is usually called Bayes' rule or simply Bayesian updating.)

Now probabilism is relatively uncontroversial, at least as an idealization which doesn't take cognitive/computational constraints into account. But conditionalization, the updating rule, is more controversial for various reasons, and some epistemologists reject an updating rule outright (Radical probabilism) or they replace conditionalization with some other updating rule. Others accept it but add at rule for cases where evidence isn't learned with certainty.

Relevant for the topic of overfitting is that conditionalization could be interpreted as "fitting" H and E in the wrong way (not sure though whether this has any connection to "over"fitting.) Recall that conditionalization implies regularity, that , i.e. that the conditional probabilities always stay fixed and only changes.

But that sounds perhaps too dogmatic. For example, assume that at first you think E and H are strongly negatively dependent, i.e. . And you believe in H, that is, being high, and disbelieve in E, being low. Which means you expect E would strongly disconfirm H, i.e. is low.

If you now indeed observe E (), that would be strong evidence against H, i.e. . And conditionalization concurs. But it seems the observation of E would be also be some evidence against E and H being as strongly negatively dependent as you assumed before, i.e. it would be evidence against being as low as you previously thought, which would mean you should increase your conditional probability here, i.e. . But conditionalization would forbid that, it assumes you can't update the conditional probabilities here, just the marginal probability of H.

To make this more clear, there is an analogy with "outright beliefs": Assume for a moment that beliefs don't come in degrees (e.g. as in belief revision theory or epistemic logic), and that you either "believe" a proposition p, or you "disbelieve" p (which just means you "believe" not-p), or you neither believe nor disbelieve p (you are agnostic about p). Then your beliefs can be described simply as a set of propositions (the proportions you believe in). Then a plausible rationality constraint would be that , your beliefs, should be logically consistent. (This would be the analogue to the norm probabilism which says that your gradual beliefs should be probabilistically coherent.)

Now assume your belief set contains initially exactly these three propositions, mirroring those in the probabilistic example above:

={[If E then not-H], [H], [not-E]}

(For the gradual belief case, [If E then not-H] corresponds to " is low", [H] corresponds to " is high", [not-E] corresponds to " is low")

Then assume again that you observe/learn E. So you replace [not-E] with [E]:

={[If E then not-H], [H], [E]}

But now your belief set is logically inconsistent! Any two entail the negation of the third. We need an updating rule which handles such cases, beyond just dropping the old "disbelief" in E and adding the new belief in E.

The first option would be to abandon [H] and keep [If E then not-H] (and [E]). This would correspond to the effect of conditionalization in the gradual belief case.

The second option would be to abandon the belief [If E then not-H] and keep [H] (and [E]).

Both of the above options would remove the inconsistentcy, but they are arbitrarily biased against either [If E then not-H] or [H].

A third option would be not to add [E] in the first place and be content with removing just [not-E].

A fourth option would be to remove both [If E then not-H] and [H] while only keeping [E].

Now it seems clear that neither 1 nor 2 are appropriate updating rules, since they are arbitrarily biased against one of the old beliefs. But conditionalization corresponds to 1. So it seems this is evidence that conditionalization is unacceptable.

Note that in the probabilistic case, an equivalent of 4 does seem quite plausible. Instead of strongly lowering the credence , we can instead do a medium lowering of both and . The latter is equivalent to increasing a medium amount. Which is incompatible with conditionalization, since it requires that is constant during updating.

Anyway, this is my case for conditionalization (and therefore Bayes' rule) producing "the wrong fit" between E and H. Though I'm unsure whether this can be interpreted as overfitting somehow.