We've now seen three different methods for model comparison:

Exact calculation of posterior probabilities of each model via P[data|model]

Laplace approximation

Bayesian Information Criterion (BIC)

Based on the derivations, we can make some predictions:

Laplace is a second-order approximation of the exact integral around the maximum-likelihood estimate. It should work well when the integral is dominated by one very pointy peak. In English, it should work well when we have enough data to get a precise estimate of the unobserved parameters.

BIC is an approximation of Laplace as the number of data points N→∞. It ignores terms which don't scale with N (but do scale with number of parameters k), so it will do best when N is large and k is small.
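For reference, the two approximations can be written side by side on the log-evidence scale. This is the standard textbook form rather than a copy of the earlier derivations. The notation is assumed here: θ* is the expansion point (the maximum-likelihood estimate in the text), H is the Hessian of the negative log-integrand at θ*, and k is the number of free parameters.

```latex
\begin{align*}
\ln P[\text{data}\mid\text{model}]
  &= \ln \int P[\text{data}\mid\theta]\, P[\theta]\, d\theta
  && \text{(exact)} \\
  &\approx \ln P[\text{data}\mid\theta^*] + \ln P[\theta^*]
     + \frac{k}{2}\ln 2\pi - \frac{1}{2}\ln\det H
  && \text{(Laplace)} \\
  &\approx \ln P[\text{data}\mid\theta^*] - \frac{k}{2}\ln N
  && \text{(BIC)}
\end{align*}
```

Since H grows roughly in proportion to N, ln det H = k ln N + ln det(H/N). BIC keeps only the k ln N piece and drops ln P[θ*], (k/2) ln 2π and the ln det(H/N) term, none of which grow with N.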

Let's test these predictions out with some dice-rolling simulations.

The graphs below show the difference in bits of evidence (i.e. ln P[model|data] or its approximations) between a biased model and an unbiased model, as a function of the number of data points in each simulation, for each of the three methods. A positive difference indicates that the method favors the biased model; a negative difference indicates that the method favors the unbiased model.

First up, let's compare a simulation with a 60-40 biased coin to a simulation with an unbiased coin. Hopefully, our methods will favor the biased model in the biased coin simulation and the unbiased model in the unbiased coin simulation.
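A minimal sketch of how the three quantities could be computed for the coin. It is not the code behind the plots. It assumes a uniform prior on the heads probability and one free parameter for the biased model, and every function name is made up for illustration. Values are computed in nats and divided by ln 2 to get bits.

```python
import math
import random


def max_log_likelihood(heads, tails):
    """Log-likelihood of the flips at the maximum-likelihood heads frequency."""
    n = heads + tails
    # Skip zero counts: the 0 * ln(0) terms are taken as 0.
    return sum(c * math.log(c / n) for c in (heads, tails) if c > 0)


def biased_log_evidence(heads, tails):
    """Return (exact, laplace, bic) estimates of ln P[data|biased model], in nats.

    Assumes a uniform prior on the heads probability (an assumption of this
    sketch) and at least one flip. Laplace is None until both outcomes have
    been seen, because the peak then sits on the boundary.
    """
    n = heads + tails
    # Exact: the integral of p^heads * (1 - p)^tails over [0, 1] is a Beta function.
    exact = math.lgamma(heads + 1) + math.lgamma(tails + 1) - math.lgamma(n + 2)
    log_lik = max_log_likelihood(heads, tails)
    # BIC: one free parameter, so the penalty is (1/2) ln N.
    bic = log_lik - 0.5 * math.log(n)
    laplace = None
    if heads > 0 and tails > 0:
        # Second derivative of the negative log-likelihood at p = heads / n.
        curvature = n**3 / (heads * tails)
        laplace = log_lik + 0.5 * math.log(2 * math.pi) - 0.5 * math.log(curvature)
    return exact, laplace, bic


def evidence_difference_bits(heads, tails):
    """Biased-minus-unbiased log-evidence for each method, converted to bits."""
    unbiased = (heads + tails) * math.log(0.5)  # the fair coin has no free parameters
    names = ("exact", "laplace", "bic")
    values = biased_log_evidence(heads, tails)
    return {
        name: None if value is None else (value - unbiased) / math.log(2)
        for name, value in zip(names, values)
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so reruns match
    for label, p_heads in (("biased 60-40", 0.6), ("unbiased", 0.5)):
        heads = sum(rng.random() < p_heads for _ in range(1000))
        print(label, evidence_difference_bits(heads, 1000 - heads))
```

Because the unbiased model has no parameters to integrate over, all three methods give it the same evidence, so any disagreement between the lines comes from the biased model's term.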

Here's what's going on in those plots:

All three methods agree very well. We can't see the exact method's line at all, because it's under the Laplace line. BIC does have a visible bias - it's usually a little below exact & Laplace - but the difference is small.

All three methods generally assign higher evidence to the biased model (line above zero) in the biased coin simulation, and higher evidence to the unbiased model (line below zero) in the unbiased simulation.

As the number of data points N increases, the evidence in favor of the biased model grows roughly linearly in the biased simulation. But in the unbiased simulation, the evidence in favor of the unbiased model grows much more slowly - logarithmically, in theory.
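The linear-versus-logarithmic growth can be read off the Laplace form for the coin. This is a sketch, assuming a uniform prior on the heads probability. The symbols are introduced here: p̂ is the observed fraction of heads and Δ is the log-evidence difference (biased minus unbiased) in nats.

```latex
\begin{align*}
\Delta &\approx N\, D_{\mathrm{KL}}\!\left(\hat p \,\middle\|\, \tfrac12\right)
          - \tfrac12 \ln N + \tfrac12 \ln\!\left(2\pi\, \hat p\,(1-\hat p)\right), \\
D_{\mathrm{KL}}\!\left(\hat p \,\middle\|\, \tfrac12\right)
       &= \hat p \ln(2\hat p) + (1-\hat p)\ln\!\left(2(1-\hat p)\right).
\end{align*}
```

With p̂ near 0.6 the divergence is about 0.020 nats (0.029 bits) per flip, so the first term dominates and Δ grows linearly. With p̂ near 0.5 the first term stays of order one, because the fluctuations of p̂ around 1/2 shrink like 1/√N, leaving Δ ≈ -(1/2) ln N plus a constant.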

Because the BIC is a large-N approximation, ignoring terms which scale with k but not N, we'd expect BIC to perform worse as we crank up the number of parameters k. Let's try that: here's another pair of simulations with a 100-sided die (the biased die has 1/200 weight on half the faces and 3/200 on the other half).

This time, the BIC has a very large error - hundreds of bits of evidence in favor of an unbiased model, regardless of whether the die is biased or not. That said, after the first few data points, the BIC's error mostly stays constant; recall that the terms ignored by the BIC are all roughly constant with respect to N. Meanwhile, the Laplace approximation agrees wonderfully with the exact calculation. (However, note that the Laplace approximation is absent in the leftmost part of each plot - for these models, it isn't well-defined until we've seen at least one of each outcome.)
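The size and constancy of that gap can be checked with a short sketch. It is not the simulation behind the plots. It assumes a flat Dirichlet prior over the face probabilities and k - 1 = 99 free parameters. It also feeds in idealized counts that match the biased die's probabilities exactly, rather than random rolls. The function name is hypothetical.

```python
import math


def die_log_evidence(counts):
    """Return (exact, laplace, bic) estimates of ln P[data|biased die], in nats.

    Assumes a flat Dirichlet prior and that every face has been seen at
    least once (otherwise the Laplace term is undefined).
    """
    k = len(counts)
    n = sum(counts)
    log_prior = math.lgamma(k)  # the flat density on the simplex is (k - 1)!
    # Exact Dirichlet-multinomial evidence: (k-1)! * prod(c_i!) / (n + k - 1)!
    exact = log_prior + sum(math.lgamma(c + 1) for c in counts) - math.lgamma(n + k)
    log_lik = sum(c * math.log(c / n) for c in counts)
    # ln det of the Hessian of the negative log-likelihood at the MLE,
    # taken over the k - 1 free coordinates: (k - 1) ln n - sum_i ln(c_i / n).
    log_det_hessian = (k - 1) * math.log(n) - sum(math.log(c / n) for c in counts)
    laplace = (log_lik + log_prior
               + 0.5 * (k - 1) * math.log(2 * math.pi)
               - 0.5 * log_det_hessian)
    bic = log_lik - 0.5 * (k - 1) * math.log(n)
    return exact, laplace, bic


if __name__ == "__main__":
    for per_face in (1, 10, 100):
        # 50 faces at weight 1/200 and 50 faces at weight 3/200.
        counts = [per_face] * 50 + [3 * per_face] * 50
        exact, laplace, bic = die_log_evidence(counts)
        print(
            f"N={sum(counts):6d}",
            f"Laplace - BIC = {(laplace - bic) / math.log(2):7.1f} bits",
            f"exact - Laplace = {(exact - laplace) / math.log(2):6.2f} bits",
        )
```

Under these assumptions the Laplace-minus-BIC gap depends only on k and the observed frequencies, not on N, which is why the BIC line sits a fixed distance below the other two once the frequencies settle.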

Finally, notice that the exact calculation itself gives pretty reasonable probabilities in general, and in particular for small N. When the number of data points is small, it's always pretty close to zero, i.e. roughly indifferent between the models. In the high-k simulations, the exact solution gave reliably correct answers after a few hundred data points, and was roughly indifferent before that. Compare that to the BIC, which gave a very confident wrong answer in the biased case and only worked its way back to the correct answer after around 3000 data points. The moral of this story is: precise Bayesian calculations are more important when N is smaller and k is larger. We'll come back to that theme later.

Next post will add cross-validation into the picture, reusing the simulations above.
