## LessWrong

Jan Christian Refsgaard

Data Scientist

Question about Test-sets and Bayesian machine learning

Wild Speculation:

I am 70% confident that if we were smarter then we would not need it.

If you have some data for which you (magically) know the likelihood and prior, then you would have some uncertainty from the residual noise in the model and some from the parameters. This would change the form of the posterior, for example from a normal to a t-distribution, to account for the extra uncertainty.

In the real world we assume a likelihood and guess a prior, and even with simple models such as y ~ ax + b we usually model the residual errors as a normal distribution and thus lose some of the uncertainty, so our residual errors differ in and out of sample.
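The in-sample versus out-of-sample gap can be seen in a small simulation. This is only a sketch under assumed settings: the true line, the noise level, the sample size of 10 and the helper names (`fit_line`, `mse`, `simulate`) are all made up for illustration, and ordinary least squares stands in for a Bayesian fit with a flat prior.

```python
import random

def fit_line(xs, ys):
    """Closed-form least squares fit of y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def mse(xs, ys, a, b):
    """Mean squared residual of the fitted line on (xs, ys)."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def simulate(rng, n=10, a=2.0, b=1.0, sigma=1.0):
    """Hypothetical generative process: y = a*x + b + Normal(0, sigma) noise."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ys = [a * x + b + rng.gauss(0, sigma) for x in xs]
    return xs, ys

rng = random.Random(0)
reps = 2000
in_sample = out_sample = 0.0
for _ in range(reps):
    x_train, y_train = simulate(rng)   # data used for fitting
    x_test, y_test = simulate(rng)     # fresh data from the same process
    a_hat, b_hat = fit_line(x_train, y_train)
    in_sample += mse(x_train, y_train, a_hat, b_hat) / reps
    out_sample += mse(x_test, y_test, a_hat, b_hat) / reps

print(f"mean in-sample MSE:     {in_sample:.3f}")
print(f"mean out-of-sample MSE: {out_sample:.3f}")
```

Averaged over many repetitions, the in-sample error comes out below the true noise variance and the out-of-sample error above it, which is the gap a test set is meant to expose.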

Practical Reason

Also, a model with more* parameters will always have smaller in-sample residual errors (unless you screw up the prior), and thus its in-sample predictions will seem better.

Modern Bayesians have found two ways to solve this issue:

1. WAIC: uses information theory to see how well the posterior predictive distribution captures the generative process, and penalizes for the effective number of parameters.
2. PSIS-LOO: does a very fast version of LOO-CV where for each $y_i$ you factor out that $y_i$'s contribution to the posterior to get an out-of-sample posterior predictive estimate for $y_i$.

Bayesian models, just like frequentist models, are vulnerable to overfitting if they have many parameters and weak priors.

*Some models have parameters which constrain other parameters, so what I mean is "effective" parameters according to the WAIC or PSIS-LOO estimate. Parameters with strong priors are very constrained and count as much less than 1.
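Both estimators can be computed from nothing more than the pointwise log-likelihood of each observation under each posterior draw. The sketch below assumes a toy model (normal likelihood with known unit variance and a flat prior on the mean, so the posterior can be sampled directly) and simulated data. The LOO part uses raw importance sampling without the Pareto smoothing step, so it is plain IS-LOO rather than PSIS-LOO.

```python
import math
import random

rng = random.Random(1)

# Hypothetical data: 50 draws from Normal(0, 1).
n = 50
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
ybar = sum(y) / n

# Assumed model: y_i ~ Normal(mu, 1) with a flat prior on mu,
# so the posterior is mu ~ Normal(ybar, 1 / sqrt(n)).
S = 4000
mu_draws = [rng.gauss(ybar, 1.0 / math.sqrt(n)) for _ in range(S)]

def log_lik(yi, mu):
    """Log density of one observation under Normal(mu, 1)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (yi - mu) ** 2

lppd = p_waic = elpd_loo = 0.0
for yi in y:
    ll = [log_lik(yi, mu) for mu in mu_draws]
    mean_ll = sum(ll) / S
    # Log pointwise predictive density: log of the posterior-averaged likelihood.
    lppd += math.log(sum(math.exp(v) for v in ll) / S)
    # WAIC penalty: posterior variance of the log-likelihood of y_i.
    p_waic += sum((v - mean_ll) ** 2 for v in ll) / (S - 1)
    # Raw importance-sampling LOO: reweight draws by 1 / p(y_i | mu),
    # which removes y_i's contribution to the posterior.
    elpd_loo += -math.log(sum(math.exp(-v) for v in ll) / S)

elpd_waic = lppd - p_waic

print(f"effective parameters (p_waic): {p_waic:.2f}")
print(f"elpd_waic: {elpd_waic:.2f}")
print(f"elpd_loo:  {elpd_loo:.2f}")
```

With one free parameter and a flat prior, `p_waic` should land near 1. A strong prior on `mu` would shrink the posterior variance of the log-likelihood and push it below 1, which is the sense in which constrained parameters count as less than one.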

Do Bayesians like Bayesian model Averaging?

Good points, but can't you still solve the discrete problem with a single model and a stick-breaking prior on the number of mints?

Do Bayesians like Bayesian model Averaging?

If there are 3 competing models, then ideally you can make a larger model where each submodel is realized by specific parameter combinations.

If M2 is simply M1 with an extra parameter b2, then you should have a strong prior on b2 being zero in M2. If M3 is M1 with one parameter transformed, then you should have a parameter that interpolates between the untransformed and transformed versions, so you can learn, for example, that an interpolation of 40-90% describes the data better.

If it's impossible to translate between models like this then you can do model averaging, but it's a sign that you do not understand your data.

Do Bayesians like Bayesian model Averaging?

You are correct, we have to assume a model, just like we have to assume a prior. And strictly speaking the model is wrong and the prior is wrong :). But we can calculate how well the posterior predictive describes the data to get a feel for how bad our model is :)

Do Bayesians like Bayesian model Averaging?

I am a little confused by what x is in your statement, and by why you think we can't compute the likelihood or posterior predictive. In most real problems we can't compute the posterior, but we can draw from it and thus approximate it via MCMC.

Do Bayesians like Bayesian model Averaging?

I agree with Radford Neal: model averaging and Bayes factors are very sensitive to the prior specification of the models. If you absolutely have to do model averaging, methods such as PSIS-LOO or WAIC that focus on the predictive distribution are much better. If you had two identical models where one simply had a 10 times broader uniform prior, then their posterior predictive distributions would be identical but their Bayes factor would be 1/10, so a model average (assuming a uniform prior on p(M_i)) would favor the narrow prior by a factor of 10, whereas the predictive approach would correctly conclude that they describe the data equally well and thus that the models should be weighted equally.
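The factor of 10 can be checked numerically. This is a minimal sketch with made-up data and an assumed model (normal likelihood with unit variance, uniform prior on the mean), using a simple midpoint-rule integral over the prior range. The helper names are hypothetical.

```python
import math

# Hypothetical data; assumed model: y_i ~ Normal(mu, 1).
y = [0.8, 1.3, 0.2, 1.9, 1.1, 0.6, 1.4, 0.9]

def likelihood(mu):
    """Joint likelihood of all observations for a given mu."""
    return math.exp(sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - mu) ** 2
                        for yi in y))

def summarize(half_width, steps=20_000):
    """Evidence and posterior mean under mu ~ Uniform(-half_width, half_width)."""
    h = 2 * half_width / steps
    grid = [-half_width + (k + 0.5) * h for k in range(steps)]
    lik = [likelihood(mu) for mu in grid]
    # Evidence: integral of likelihood times the prior density 1 / (2 * half_width).
    evidence = sum(lik) * h / (2 * half_width)
    # Posterior mean: the prior density is constant, so it cancels.
    post_mean = sum(m * l for m, l in zip(grid, lik)) / sum(lik)
    return evidence, post_mean

evidence_narrow, mean_narrow = summarize(half_width=5.0)
evidence_wide, mean_wide = summarize(half_width=50.0)
bayes_factor = evidence_narrow / evidence_wide

print(f"posterior mean, narrow prior: {mean_narrow:.4f}")
print(f"posterior mean, wide prior:   {mean_wide:.4f}")
print(f"Bayes factor narrow vs wide:  {bayes_factor:.3f}")
```

Both priors cover essentially all of the likelihood mass, so the posteriors (and hence the posterior predictive distributions) agree, while the evidence differs only by the ratio of the prior widths.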

Finally, model averaging is usually conceptually wrong, and the problem can be solved by making a larger model that encompasses all potential models, such as a hierarchical model that partially pools between the group-level and subject-level models. Gelman's 8 schools data is a good example: there are 8 schools and 2 simple models, one with 1 parameter (all schools are the same) and one with 8 (every school is a special snowflake), and then there is the hierarchical model with 9 parameters, one for each school and one for how much to pool the estimates towards the group mean. Gelman's radon dataset is also good for learning about hierarchical models.
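How the pooling parameter spans the two simple models can be sketched with the eight schools estimates and standard errors as tabulated in Gelman et al.'s Bayesian Data Analysis. As a simplifying assumption, the between-school standard deviation `tau` is fixed by hand here and the group mean is set to the precision-weighted average; the full hierarchical model would instead put priors on both and learn them from the data.

```python
# Eight schools: estimated treatment effects and their standard errors.
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

def pooled_mean(y, sigma):
    """Precision-weighted mean: the 'all schools are the same' estimate."""
    w = [1 / s ** 2 for s in sigma]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def partial_pool(y, sigma, mu, tau):
    """Conditional posterior mean of each school effect given mu and tau,
    assuming theta_j ~ Normal(mu, tau) and y_j ~ Normal(theta_j, sigma_j)."""
    estimates = []
    for yi, si in zip(y, sigma):
        prec_data = 1 / si ** 2    # how much the school's own data says
        prec_group = 1 / tau ** 2  # how strongly the group pulls
        estimates.append((prec_data * yi + prec_group * mu)
                         / (prec_data + prec_group))
    return estimates

mu = pooled_mean(y, sigma)
complete = partial_pool(y, sigma, mu, tau=0.01)  # tau near 0: one shared effect
partial = partial_pool(y, sigma, mu, tau=5.0)    # hand-picked middle value
separate = partial_pool(y, sigma, mu, tau=1e6)   # huge tau: eight separate effects

print("group mean:      ", round(mu, 2))
print("complete pooling:", [round(t, 2) for t in complete])
print("partial pooling: ", [round(t, 2) for t in partial])
print("no pooling:      ", [round(t, 2) for t in separate])
```

The one-parameter and eight-parameter models appear as the two limits of `tau`, which is why the hierarchical model can replace an average over them.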

Jaynesian interpretation - How does “estimating probabilities” make sense?

I am one of those people with a half-baked epistemology and understanding of probability theory, and I am looking forward to reading Jaynes. And I agree there are a lot of ad hocisms in probability theory, which means everything is wrong in the logical sense, as some of the assumptions are broken. But a solid modern Bayesian approach has far fewer ad hocisms and also teaches you to build advanced models in less than 400 pages.

HMC is a sampling approach to solving the posterior which in practice is superior to analytical methods, because it actually accounts for correlations in predictors and other things which are usually assumed away.

WAIC is information theory on distributions, which allows you to say that model A is better than model B because the extra parameters in B are fitting noise: basically minimum description length on steroids for out-of-sample uncertainty.

Also, I studied biology, which is the worst: I can perform experiments and thus do not have to think about causality, and I do not expect my model to account for half of the signal even if it's 'correct'.

Jaynesian interpretation - How does “estimating probabilities” make sense?

I think the above is accurate.

I disagree with the last part, but it has two sources of confusion:

1. Frequentist vs Bayesian is in principle about priors but in practice about point estimates vs distributions.
   1. Good frequentists use distributions and bad Bayesians use point estimates such as Bayes factors. A good review of this is https://link.springer.com/article/10.3758/s13423-016-1221-4
2. But the leap from theta to the probability of heads is, I think, an intuitive leap that happens to be correct but is unjustified.

Philosophically, then, the posterior predictive is actually frequentist. Allow me to explain:
Frequentists are people who estimate a parameter, then draw fake samples from that point estimate and summarize them in confidence intervals. To justify this they imagine parallel worlds and whatnot.

Bayesians are people who assume a prior distribution from which the parameter is drawn. They thus have both prior and likelihood uncertainty, which together give posterior uncertainty: the uncertainty of the parameters in their model. When Bayesians want to use their model to make predictions, they integrate the model parameters out and thus get a predictive distribution of new data given data*. Because this is a distribution over data, like the frequentist's sampling function, we can draw from it multiple times to compute summary statistics, much like the frequentists do, and calculate things such as a "Bayesian p-value", which describes how likely the model is to have generated our data. Here the goal is for the p-value to be high, because that suggests that the model describes the data well.

*In the real world they do not integrate out theta; they draw it 10,000 times and use those samples as a stand-in distribution, because the math is too hard for complex models.
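For the coin example the draw-instead-of-integrate recipe can be sketched in a few lines. The observed flips, the uniform Beta(1, 1) prior and the choice of test statistic (number of switches between consecutive flips) are all assumptions made for illustration.

```python
import random

rng = random.Random(42)

# Hypothetical observed sequence of coin flips (1 = heads).
observed = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
heads = sum(observed)
tails = len(observed) - heads

# Assumed model: Bernoulli likelihood with a uniform Beta(1, 1) prior on theta,
# so the posterior is Beta(1 + heads, 1 + tails).
draws = 10_000
theta = [rng.betavariate(1 + heads, 1 + tails) for _ in range(draws)]

# Posterior predictive for one new flip: simulate a flip per theta draw
# instead of integrating theta out analytically.
new_flips = [1 if rng.random() < t else 0 for t in theta]
p_heads = sum(new_flips) / draws
exact = (heads + 1) / (len(observed) + 2)  # analytic answer for this simple model

def switches(flips):
    """Test statistic: number of changes between consecutive flips."""
    return sum(a != b for a, b in zip(flips, flips[1:]))

# Bayesian p-value: how often a replicated dataset is at least as
# "switchy" as the observed one.
obs_stat = switches(observed)
at_least = 0
for t in theta:
    replicated = [1 if rng.random() < t else 0 for _ in observed]
    at_least += switches(replicated) >= obs_stat
bayes_p = at_least / draws

print(f"P(next flip is heads), sampled:  {p_heads:.3f}")
print(f"P(next flip is heads), analytic: {exact:.3f}")
print(f"Bayesian p-value for switches:   {bayes_p:.3f}")
```

The sampled predictive probability is a statement about the next flip, not about theta, and it matches the analytic integral only because this model is simple enough to have one.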

Jaynesian interpretation - How does “estimating probabilities” make sense?

Regarding reading Jaynes, my understanding is that it's good for intuition but bad for applied statistics, because it does not teach you modern Bayesian stuff such as WAIC and HMC, so you should first do one of the applied books. I also think Jaynes has nothing about causality.

Jaynesian interpretation - How does “estimating probabilities” make sense?

Given (1) your model and (2) the magical absence of uncertainty in theta, then it's theta. The posterior predictive allows us to jump from inference about parameters to inference about new data: it's a distribution over y (coin flip outcomes), not over theta (which describes the frequency).