What does “test-set performance” represent in Bayesian machine learning? In Bayesian ML:

  • we have some data D
  • we assume a model M (this includes any assumptions we make about the prior densities)
  • and we compute the posterior predictive density p(y_new | D, M)

I have seen people argue that we need a test set to compare two models M_1 and M_2, since we do not know what "the one true model" is. I don't fully understand how "evaluating performance" on "out-of-sample" data helps us compare two models; isn't this what the model evidence p(D | M) is for?
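
For concreteness, by these quantities I mean the usual posterior predictive density and model evidence (with theta the model parameters and y_new a future observation):

$$p(y_{\text{new}} \mid D, M) = \int p(y_{\text{new}} \mid \theta, M)\, p(\theta \mid D, M)\, d\theta$$

$$p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta, \qquad \frac{p(M_1 \mid D)}{p(M_2 \mid D)} = \frac{p(D \mid M_1)}{p(D \mid M_2)} \cdot \frac{p(M_1)}{p(M_2)}$$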


3 Answers

[anonymous] (3y)

👍 Could you please elaborate on how it relates to the Bayesian interpretation of test-set performance?

Radford Neal, Aug 09, 2021

See my comment at https://www.lesswrong.com/posts/Mw9if9wbfTBawynGs/?commentId=uEtuotdC6oD2H2FPw on a closely related question.

Briefly, if instead of using performance on a test set to judge future performance on real data, or to compare two models, you use a formal Bayesian approach that looks only at the training data, the quality of the answers from this formal Bayesian approach may depend very crucially on getting the Bayesian model specification (including priors for parameters) almost exactly right (in the sense of expressing your true prior knowledge of the problem). And getting it that close to exactly right may be beyond your ability.

And in any case, we all know that there is a non-negligible chance that your program to do the Bayesian computations simply has a bug.  So seeing how well you do on a held-out test set before launching your system to Jupiter is a good idea.
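
In Bayesian terms, "how well you do on a held-out test set" is often summarized as the log score of the posterior predictive density on those points. A minimal sketch, assuming you already have posterior draws of the parameters for some normal-likelihood model (the function and variable names here are placeholders, not anyone's actual code):

    import numpy as np
    from scipy import stats
    from scipy.special import logsumexp

    def heldout_log_score(y_test, mu_draws, sigma_draws):
        """Mean log posterior-predictive density over held-out points, for a model
        whose likelihood is normal, given posterior draws (mu_draws, sigma_draws)."""
        # log p(y_i | D) is approximated by log mean_s p(y_i | theta_s):
        # average the likelihood over the S posterior draws for each test point.
        log_lik = stats.norm.logpdf(y_test[None, :],
                                    loc=mu_draws[:, None],
                                    scale=sigma_draws[:, None])   # shape (S, N)
        S = log_lik.shape[0]
        return np.mean(logsumexp(log_lik, axis=0) - np.log(S))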

[anonymous] (3y)

Thanks for your reply, Professor Neal. Why does it make sense to use test-set performance to judge performance on an arbitrary/future dataset? Does the test set have some other interpretation that I am missing? If we wanted to judge future performance on real data, or compare two models by their future performance on real data, shouldn't we just calculate the most likely performance on an arbitrary dataset?

Radford Neal (3y)
I'm not sure what you're asking here.  The test set should of course be drawn from the same distribution as the future cases you actually care about.  In practice, it can sometimes be hard to ensure that.  But judging by performance on an arbitrary data set isn't an option, since performance in the future does depend on what data shows up in the future (for a classification problem, on both the inputs, and of course on the class labels).  I think I'm missing what you're getting at....

Wild Speculation:

I am 70% confident that if we were smarter we would not need a test set.

If you had some data for which you (magically) knew the true likelihood and prior, you would still have some uncertainty from the noise in the likelihood and some from the parameters of the model; accounting for the parameter uncertainty changes the form of the posterior predictive, for example from a normal to a t-distribution.
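
A minimal sketch of that point, using the standard conjugate result for a normal model with unknown mean and variance under a noninformative prior (the data here are simulated, purely for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    y = rng.normal(loc=2.0, scale=1.5, size=20)   # "observed" data
    n, ybar, s = len(y), y.mean(), y.std(ddof=1)

    # If we pretended the fitted parameters were exactly right, the predictive
    # density for a new observation would just be a normal.
    plug_in = stats.norm(loc=ybar, scale=s)

    # Integrating over the posterior uncertainty in (mu, sigma) under the standard
    # noninformative prior gives a Student-t predictive instead: same centre,
    # heavier tails, reflecting the extra parameter uncertainty.
    predictive = stats.t(df=n - 1, loc=ybar, scale=s * np.sqrt(1 + 1 / n))

    print(plug_in.ppf(0.975), predictive.ppf(0.975))  # the t interval is wider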

In the real world we assume a likelihood and guess a prior, and even with simple models such as y ~ ax + b we will usually model the residual errors as a normal distribution and thus lose some of the uncertainty, so our residual errors differ in and out of sample.

Practical Reason

Also, a model with more* parameters will always have less residual errors (unless you screw up the prior), and thus the in-sample predictions will seem better.

Modern Bayesians have found two main ways to address this issue:

  1. WAIC: uses information theory to see how well the posterior predictive distribution captures the generative process, and penalizes for the effective number of parameters.
  2. PSIS-LOO: a very fast approximation to LOO-CV where, for each observation y_i, you factor y_i's contribution out of the posterior to get an out-of-sample posterior predictive estimate for y_i.
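
As a rough sketch of the quantity WAIC estimates (this is just the standard formula applied to an (S, N) array of pointwise log-likelihood draws; in practice one would usually reach for a library implementation such as ArviZ's waic/loo rather than rolling one's own):

    import numpy as np
    from scipy.special import logsumexp

    def waic(log_lik):
        """WAIC on the deviance scale (lower is better) from an (S, N) array of
        pointwise log-likelihoods: S posterior draws, N observations."""
        S = log_lik.shape[0]
        # log pointwise predictive density: log of the posterior-mean likelihood per point
        lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(S))
        # effective number of parameters: per-point variance of the log-likelihood over draws
        p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
        return -2.0 * (lppd - p_waic)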

Bayesian models, just like frequentist models, are vulnerable to overfitting if they have many parameters and weak priors.

 

*Some models have parameters that constrain other parameters, so what I mean is "effective" parameters according to the WAIC or PSIS-LOO estimate; parameters with strong priors are very constrained and count as much less than 1.

"... a model with more* parameters will always have less residual errors (unless you screw up the prior) and thus the in sample predictions will seem better"

Not always (unless you're sweeping all exceptions under "unless you screw up the prior"). With more parameters, the prior probability for the region of the parameter space that fits the data well may be smaller, so the posterior may be mostly outside this region. Note that "less residual errors" isn't a very clear concept in Bayesian terms - there's a posterior distribution of residual error...

[anonymous] (3y)

It seems to me like there are two distinct issues: estimating the error of a model on future data, and model comparison.

It would be useful to know the most likely value of the error on future data before we actually use the model; but is this what test-set error represents - the most likely value of the error on future data?

Why do we use techniques like WAIC and PSIS-LOO when we can (and should?) simply use p(D | M), i.e. Bayes factors, Ockham factors, model evidence, etc.? These techniques seem to work well for overfitting (see image below). O...
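
For concreteness, a minimal sketch of the kind of model-evidence comparison meant here, using a toy coin-flip setup (the models, prior, and data are hypothetical, chosen only because the marginal likelihoods have closed forms):

    import numpy as np
    from scipy import stats
    from scipy.special import betaln, comb

    # Hypothetical data: k heads in n flips.
    n, k = 30, 22

    # M1: a fair coin, theta fixed at 0.5 -> the evidence is just the binomial pmf.
    log_ev_m1 = stats.binom.logpmf(k, n, 0.5)

    # M2: theta ~ Beta(1, 1) -> the evidence is the beta-binomial marginal likelihood,
    # i.e. the binomial likelihood integrated against the prior.
    log_ev_m2 = np.log(comb(n, k)) + betaln(k + 1, n - k + 1) - betaln(1, 1)

    # The Bayes factor compares the two models directly on the training data,
    # with no test set involved.
    print("log Bayes factor (M1 vs M2):", log_ev_m1 - log_ev_m2)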