From Laplace to BIC

by johnswentworth
19th Jul 2019

Previous:
Laplace Approximation
Next:
Bayesian Model Testing Comparisons
Comments
Bucky:

In removing the O(1) terms I think we're removing all of the widths of the peak in the various dimensions. So in the case where the widths are radically different between the models this would mean that N would need to be even larger for BIC to be a useful approximation.

The widths issue might come up, for example, when an additional parameter is added which splits the data into 2 populations with drastically different population sizes - the small population is likely to have a wider peak.

Is that right?

johnswentworth:

That is exactly correct.


The previous post outlined Laplace approximation, one of the most common tools used to approximate hairy probability integrals. In this post, we'll use Laplace approximation to derive the Bayesian Information Criterion (BIC), a popular complexity penalty method for comparing models with more free parameters to models with fewer free parameters.

The BIC is pretty simple:

  • Start with the maximum log likelihood $\ln P[data|\theta_{max}]$.
  • Subtract $\frac{k}{2}\ln N$ to penalize for complexity, where $N$ is the number of data points and $k$ is the number of free parameters (i.e. the dimension of $\theta$).

Thus: $\ln P[data|\theta_{max}] - \frac{k}{2}\ln N$. Using this magic number, we compare any two models we like.
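The two-step recipe can be sketched in a few lines of Python. This is a minimal illustration with made-up Gaussian data, using the post's sign convention; note the textbook BIC is usually written $k\ln N - 2\ln P[data|\theta_{max}]$, i.e. $-2$ times this score, with lower being better.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical example: score a Gaussian model (k = 2 free parameters,
# the mean and the standard deviation) on N made-up data points.
rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=500)

# Maximum-likelihood estimates for a Gaussian: the sample mean and the
# biased (1/N) standard deviation.
mu_hat, sigma_hat = data.mean(), data.std()

N, k = len(data), 2
max_log_likelihood = norm.logpdf(data, loc=mu_hat, scale=sigma_hat).sum()
score = max_log_likelihood - (k / 2) * np.log(N)  # the post's BIC score
```

Computing the same score for a competing model on the same data, and keeping whichever is higher, is the whole comparison procedure.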

Let's derive that.

BIC Derivation

As usual, we'll start from P[data|model]. (Caution: don't forget that what we really care about is P[model|data]; we can jump to P[data|model] only as long as our priors are close enough to be swamped by the evidence.) This time, we'll assume that we have N independent data points xi, all with the same unobserved parameters - e.g. N die rolls with the same unobserved biases. In that case, we have

$$P[data|model] = \int_\theta P[data|\theta] \, dP[\theta] = \int_\theta \prod_{i=1}^N P[x_i|\theta] \, p[\theta] \, d\theta$$

Next, apply Laplace approximation and take the log.

$$\ln P[data|model] \approx \sum_i \ln P[x_i|\theta_{max}] + \ln p[\theta_{max}] + \frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln\det(H)$$

where the Hessian matrix H is given by

$$H = \frac{d^2}{d\theta^2} \ln P[data|\theta] \Big|_{\theta_{max}} = \sum_i \frac{d^2}{d\theta^2} \ln P[x_i|\theta] \Big|_{\theta_{max}}$$

Now for the main trick: how does each term scale as the number of data points N increases?

  • The max log likelihood $\sum_i \ln P[x_i|\theta_{max}]$ is a sum over data points, so it should scale roughly proportionally to $N$.
  • The log prior density $\ln p[\theta_{max}]$ and the $\frac{k}{2}\ln(2\pi)$ term are constant with respect to $N$.
  • $H$ is another sum over data points, so it should also scale roughly proportionally to $N$.

Let's go ahead and write $H$ as $N \cdot (\frac{1}{N}H)$, to pull out the $N$-dependence. Then, if we can remember how determinants scale:

$$\ln\det\left(N \cdot \tfrac{1}{N}H\right) = \ln\det\left(\tfrac{1}{N}H\right) + k\ln N$$
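That scaling identity is easy to check numerically. In this quick sketch, the matrix A is just an arbitrary positive-definite stand-in for $\frac{1}{N}H$:

```python
import numpy as np

# Check ln det(N * A) = ln det(A) + k ln N for a k x k matrix A
# (an arbitrary positive-definite stand-in for (1/N) H).
rng = np.random.default_rng(1)
k, N = 4, 1000
M = rng.normal(size=(k, k))
A = M @ M.T + np.eye(k)          # positive definite by construction

# slogdet returns (sign, ln|det|); sign is +1 for positive-definite input
lhs = np.linalg.slogdet(N * A)[1]
rhs = np.linalg.slogdet(A)[1] + k * np.log(N)
```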

so we can re-write our Laplace approximation as

$$\ln P[data|model] \approx \sum_i \ln P[x_i|\theta_{max}] + \ln p[\theta_{max}] + \frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln\det\left(\tfrac{1}{N}H\right) - \frac{k}{2}\ln N = \ln P[data|\theta_{max}] - \frac{k}{2}\ln N + O(1)$$

where O(1) contains all the terms which are roughly constant with respect to N. The first two terms are the BIC.

In other words, the BIC is just the Laplace approximation, but ignoring all the terms which don't scale up as the number of data points increases.
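We can watch that $O(1)$ gap concretely in a case where the exact evidence has a closed form. For a biased coin with a uniform prior on its bias, the exact log evidence is $\ln B(h+1, t+1)$ for $h$ heads and $t$ tails, so this hypothetical check can compare it to the BIC as $N$ grows:

```python
import numpy as np
from scipy.special import betaln

# Biased coin, uniform prior on the bias theta. Exact log evidence is
# ln B(h+1, t+1); compare to the BIC ln P[data|theta_max] - (k/2) ln N
# with k = 1 free parameter.
theta_true = 0.3
gaps = []
for N in (10, 100, 10_000):
    h = round(theta_true * N)          # heads count, deterministic for the demo
    t = N - h                          # tails count
    theta_hat = h / N                  # maximum-likelihood estimate of the bias
    log_l_max = h * np.log(theta_hat) + t * np.log(1 - theta_hat)
    bic = log_l_max - 0.5 * np.log(N)  # k = 1
    exact = betaln(h + 1, t + 1)       # ln of integral of theta^h (1-theta)^t
    gaps.append(bic - exact)
```

The gap `bic - exact` settles toward a constant as $N$ grows; that constant is exactly the $O(1)$ term the BIC discards.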

When Does BIC Work?

What conditions need to hold for BIC to work? Let's go back through the derivation and list out the key assumptions behind our approximations.

  • First, in order to jump from P[model|data] to P[data|model], our models should have roughly similar prior probabilities P[model] - i.e. within a few orders of magnitude.
  • Second, in order for any point approximation to work, the posterior parameter distribution needs to be pointy and unimodal - most of the posterior probability mass must be near θmax. In other words, we need enough data to get a precise estimate of the unobserved parameters.
  • Third, we must have $N$ large enough that $\frac{k}{2}\ln N$ (the smallest term we're keeping) is much larger than the $O(1)$ terms we're throwing away.

That last condition is the big one. BIC is a large-$N$ approximation, so $N$ needs to be large for it to work. How large? That depends on how big $\ln\det(\frac{1}{N}H)$ is - $N$ needs to be exponentially large in that quantity. We'll see an example in the next post.

Next post will talk more about relative advantages of BIC, Laplace, and exact calculation for comparing models. We'll see a concrete example of when BIC works/fails.
