At least within Bayesian probability, there is a single unique unambiguously-correct answer to "how should we penalize for model complexity?": calculate the probability of each model, given the data. This is hard to compute in general, which is why there's a whole slew of other numbers which approximate it in various ways.
Here's how it works. Want to know whether model 1 or model 2 is more consistent with the data? Then compute $P[M_1|D]$ and $P[M_2|D]$. Using Bayes' rule:

$$P[M_i|D] = \frac{1}{Z} P[D|M_i] P[M_i]$$

where $Z$ is the normalizer. If we're just comparing two models, then we can get rid of that annoying $Z$ by computing odds for the two models:

$$\frac{P[M_1|D]}{P[M_2|D]} = \frac{P[M_1]}{P[M_2]} \frac{P[D|M_1]}{P[D|M_2]}$$

In English: posterior relative odds of the two models is equal to prior odds times the ratio of likelihoods. That likelihood ratio is the Bayes factor: it directly describes the update in the relative odds of the two models, due to the data. Calculating the Bayes factor - i.e. $P[D|M_i]$ for each model - is the main challenge of Bayesian model comparison.
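As a quick numeric sketch of that update rule (the likelihood values here are made-up placeholders, not from any real example):

```python
# Sketch: updating relative odds of two models via a Bayes factor.
# The numbers below are illustrative assumptions only.
prior_odds = 1.0        # P[M1] / P[M2]: indifferent between the models a priori
likelihood_m1 = 0.01    # P[D | M1], hypothetical
likelihood_m2 = 0.03    # P[D | M2], hypothetical

bayes_factor = likelihood_m1 / likelihood_m2   # ratio of likelihoods
posterior_odds = prior_odds * bayes_factor     # posterior odds P[M1|D] / P[M2|D]

print(posterior_odds)
```

Note that the prior odds and the Bayes factor factor apart cleanly: the data's entire contribution is the likelihood ratio.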
20 coin flips yield 16 heads and 4 tails. Is the coin biased?
Here we have two models:
- Model 1: coin unbiased
- Model 2: coin has some unknown probability $\theta$ of coming up heads (we'll use a uniform prior on $\theta$ for simplicity)
The second model has one free parameter (the bias) which we can use to fit the data, but it’s more complex and prone to over-fitting. When we integrate over that free parameter, it will fit the data poorly over most of the parameter space - thus the "penalty" associated with free parameters in general.
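We can see that penalty numerically. A minimal sketch, using the coin example: the biased-coin model's likelihood peaks at its best-fit bias ($\theta = 16/20 = 0.8$), but averaging over the uniform prior pulls the marginal likelihood well below that peak.

```python
from math import comb

# 16 heads in 20 flips
n, h = 20, 16

def likelihood(theta):
    # Binomial likelihood of the data given bias theta
    return comb(n, h) * theta**h * (1 - theta)**(n - h)

# Likelihood at the maximum-likelihood bias: the best the free parameter can do
best_fit = likelihood(h / n)

# Average over the uniform prior on theta via a crude midpoint Riemann sum
# (a numeric sketch of the marginal likelihood integral)
steps = 10_000
marginal = sum(likelihood((i + 0.5) / steps) for i in range(steps)) / steps

print(best_fit, marginal)  # the prior-averaged likelihood sits far below the peak
```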
In this example, the integral is exactly tractable (it's a Dirichlet-multinomial model), and we get:

$$P[D|M_1] = \binom{20}{16} \left(\frac{1}{2}\right)^{20} \approx 0.0046$$

$$P[D|M_2] = \int_0^1 \binom{20}{16} \theta^{16} (1-\theta)^{4} \ d\theta = \frac{1}{21} \approx 0.048$$
So the Bayes factor is (.048)/(.0046) ~ 10, in favor of a biased coin. In practice, I'd say unbiased coins are at least 10x more likely than biased coins in day-to-day life a priori, so we might still think the coin is unbiased. But if we were genuinely unsure to start with, then this would be pretty decent evidence in favor.
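The whole calculation fits in a few lines of Python (using the standard beta-function identity $\int_0^1 \theta^h (1-\theta)^{n-h} d\theta = \frac{h!(n-h)!}{(n+1)!}$ for the second model's integral):

```python
from math import comb, factorial

# The coin example: 16 heads, 4 tails in 20 flips
n, h = 20, 16

# Model 1: fair coin. P[D|M1] = C(20,16) * (1/2)^20
p_d_m1 = comb(n, h) * 0.5**n

# Model 2: uniform prior on the bias theta.
# P[D|M2] = C(20,16) * integral of theta^16 (1-theta)^4
#         = C(20,16) * h!(n-h)!/(n+1)!  =  1/(n+1)
p_d_m2 = comb(n, h) * factorial(h) * factorial(n - h) / factorial(n + 1)

bayes_factor = p_d_m2 / p_d_m1
print(p_d_m1, p_d_m2, bayes_factor)  # roughly 0.0046, 0.048, 10
```

Running this reproduces the ~10x Bayes factor in favor of the biased-coin model.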