Are Bayesian methods guaranteed to overfit?

Ege Erdil

17th Jun 2023

This is a linkpost for https://www.yulingyao.com/blog/2023/overfit/

Yuling Yao argues that Bayesian models are guaranteed to overfit. He summarizes his point as follows:

I have a different view. Bayesian model does overfit.

Moreover, Bayes is guaranteed to overfit, regardless of the model (being correct or wrong) or the prior ( “strong” or uninformative).

Moreover, Bayes is guaranteed to overfit on every realization of training data, not just in expectation.

Moreover, Bayes is guaranteed to overfit on every single point of the training data, not just in the summation.

He uses the following definition of "overfitting": a model "overfits" some data if its out-of-sample log loss exceeds its within-sample log loss. Interpreted in a different way, this is equivalent to saying that the model assigns higher probability to a data point after updating on it than before. Andrew Gelman makes the point that any proper fitting procedure whatsoever has this property, and alternative methods "overfit" more than ideal Bayesian methods.

I think the proper way to interpret the results is not that Bayesian methods are guaranteed to overfit but that the definition of "overfitting" used by Yuling Yao, while intuitively plausible at first glance, is actually poor. Still, proving that Bayesian methods must indeed "overfit" in his sense is an interesting exercise. I tried to understand his derivation and gave up, so I present an original derivation of the same fact below that I hope is clearer.

Derivation

Suppose we have a model with parameters $θ$, and the probability of seeing a single data point $y_{i}$ according to our model is $P (y_{i} | θ)$ . Now, suppose we draw $n$ samples $y_{1}, y_{2}, \dots, y_{n}$ that are independent conditional on $θ$ . Denote this whole data vector by $y$ , and denote the data vector with the $i$ th sample omitted by $y_{- i}$ . Under Bayesian inference, the within-sample probability of observing the value $y_{i}$ in the next sample we draw is

$P (y_{n + 1} = y_{i} | y) = \int_{θ} P (θ | y) P (y_{i} | θ) d θ$

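On the other hand, Bayes' rule, together with the fact that the samples are independent given $θ$ (so that $P (y_{i} | y_{- i}, θ) = P (y_{i} | θ)$ ), says that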

$P (θ | y) = P (θ | y_{- i}, y_{i}) = \frac{P (θ | y_{- i}) P (y_{i} | y_{- i}, θ)}{P (y_{i} | y_{- i})} = \frac{P (θ | y_{- i}) P (y_{i} | θ)}{P (y_{i} | y_{- i})}$

Plugging in gives

$P (y_{n + 1} = y_{i} | y) = \int_{θ} \frac{P (θ | y_{- i}) P (y_{i} | θ)^{2}}{P (y_{i} | y_{- i})} d θ$

$P (y_{n + 1} = y_{i} | y) P (y_{i} | y_{- i}) = E_{θ \sim P (θ | y_{- i})} [P (y_{i} | θ)^{2}]$
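The identity above can be checked exactly on a toy model. The following is a minimal sketch, not part of the original argument: it assumes a Bernoulli likelihood and a prior supported on three hypothetical values of $θ$, and uses exact rational arithmetic so that both sides can be compared with `==`. The function names are my own.

```python
from fractions import Fraction

def likelihood(y, theta):
    """P(y | theta) for a single Bernoulli observation y in {0, 1}."""
    return theta if y == 1 else 1 - theta

def posterior(prior, data):
    """Update a discrete prior {theta: weight} on a list of observations."""
    weights = dict(prior)
    for y in data:
        weights = {t: w * likelihood(y, t) for t, w in weights.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

def predictive(belief, y):
    """P(next sample = y) under a discrete belief over theta."""
    return sum(w * likelihood(y, t) for t, w in belief.items())

def check_identity(prior, data, i):
    """Return (within-sample, out-of-sample, E[P(y_i | theta)^2]) for point i."""
    y_i = data[i]
    held_out = data[:i] + data[i + 1:]
    loo_belief = posterior(prior, held_out)
    within = predictive(posterior(prior, data), y_i)
    out = predictive(loo_belief, y_i)
    second_moment = sum(
        w * likelihood(y_i, t) ** 2 for t, w in loo_belief.items()
    )
    return within, out, second_moment

# Assumed toy setup: uniform prior over three candidate values of theta.
thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = {t: Fraction(1, 3) for t in thetas}
within, out, second_moment = check_identity(prior, [1, 1, 0, 1], 0)
print(within * out == second_moment)
```

Because the arithmetic is exact, the comparison tests the identity itself rather than agreement up to floating-point error.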

We can decompose the expectation of the squared probability on the right hand side using the definition of variance as follows:

$P (y_{n + 1} = y_{i} | y) P (y_{i} | y_{- i}) = \left( E_{θ \sim P (θ | y_{- i})} [P (y_{i} | θ)] \right)^{2} + \mathrm{var}_{θ \sim P (θ | y_{- i})} (P (y_{i} | θ)) = P (y_{i} | y_{- i})^{2} + \mathrm{var}_{θ \sim P (θ | y_{- i})} (P (y_{i} | θ))$

where I've used the fact that

$E_{θ \sim P (θ | y_{- i})} [P (y_{i} | θ)] = \int_{θ} P (θ | y_{- i}) P (y_{i} | θ) d θ = P (y_{i} | y_{- i})$

to get rid of the expectation. The variance term on the right-hand side is nonnegative, and it's strictly positive as long as there's any uncertainty in our beliefs about $θ$ after seeing the data $y_{- i}$ that would influence our probability estimate of observing $y_{i}$ next. This will be the case in almost all nondegenerate situations, and if so, dividing both sides by $P (y_{i} | y_{- i})$ gives the strict inequality

$P (y_{n + 1} = y_{i} | y) > P (y_{i} | y_{- i})$
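To see the inequality hold on every single data point, here is a small sketch using a conjugate model where both predictive probabilities have closed forms. It assumes a Bernoulli likelihood with a Beta$(a, b)$ prior, for which the posterior predictive probability of a 1 is $(a + \text{ones}) / (a + b + n)$ . The function names, the choice of prior and the example data are illustrative assumptions, and $a$ and $b$ are restricted to positive integers so that exact fractions can be used.

```python
from fractions import Fraction

def beta_bernoulli_predictive(a, b, data, y_new):
    """P(y_new | data) for a Bernoulli model with a Beta(a, b) prior.

    Assumes a and b are positive integers and data is a list of 0/1 values.
    """
    ones = sum(data)
    p_one = Fraction(a + ones, a + b + len(data))
    return p_one if y_new == 1 else 1 - p_one

def overfit_gaps(a, b, data):
    """For each point, return (within-sample, leave-one-out) probabilities."""
    gaps = []
    for i, y_i in enumerate(data):
        held_out = data[:i] + data[i + 1:]
        within = beta_bernoulli_predictive(a, b, data, y_i)
        out = beta_bernoulli_predictive(a, b, held_out, y_i)
        gaps.append((within, out))
    return gaps

# Uniform Beta(1, 1) prior and a short made-up data set.
for within, out in overfit_gaps(1, 1, [1, 0, 1, 1, 0]):
    print(within, out, within > out)
```

Every row shows the within-sample probability exceeding the leave-one-out probability, matching the claim that the "overfitting" occurs on each point and not just in aggregate.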

What does this mean?

The finding is intuitively obvious, but poses some challenges to formally defining the notion of overfitting. This is essentially because the ideal amount of fitting for a model to do on some data is nonzero, and overfitting should mean "exceeding" this ideal level of fitting. In practice, though, it's difficult to know what the "appropriate" amount of fitting for a model is. Bayesian inference is ideal if the true model is within the class of models under consideration, but it might fail in unexpected ways if it's not, which is almost always the case in practice.

I think the lesson to draw from this is that overfitting is a relative concept and claiming that a particular method "overfits" the data doesn't make too much sense without a point of reference in mind. If people have alternative ways of trying to construct an absolute notion of overfitting with the above argument taken into account, I'd be interested to hear them.
