In this post, I want to give a from-first-principles account of Bayesian and frequentist approaches to statistical analysis. Bayesianism and frequentism have been discussed here often, although LessWrong posts on these topics are often focused on Bayesians and frequentists as they relate to probability and rationality, or their use in AI. Here, I am going to focus on statistical analyses that are intended for human readers. This is a somewhat experimental attempt to cobble together a sensible framework from the often confusing and mismatched pieces I have encountered when trying to learn about this topic myself, and to keep things grounded in what statistical analysis is actually trying to accomplish rather than in abstract philosophical positions. I will assume the reader is familiar with the mathematics of probability and Bayes' rule.
To start at the beginning: As statisticians, we are interested in helping a reader understand and make inferences using data. We will do so by performing some computations on the data, and then presenting the results to the reader who will, hopefully, be able to use those results to draw conclusions faster and more accurately than they would if trying to parse the data on their own. The main thing we need to decide is what should be computed.
To put this in mathematical terms, let $d$ be the data we have observed. Let $h$ be a hypothesis about how the data was generated, and let $X$ be a random variable over possible values we could have observed. I will assume we also have a model of $P(X|h)$, meaning the probability distribution over those possible values given that a particular hypothesis, $h$, is true. As a concrete example, imagine a friend asks us to help them decide whether a coin is fair (meaning it has a 50/50 chance of landing on heads if flipped), and they tell us they observed 13 heads after flipping the coin 20 times. Then $X$ would be a random variable with a probability distribution over the integers 0 to 20, $h$ will be a real value between 0 and 1 representing the probability of the coin landing on heads when flipped, and $d = 13$. $P(X|h)$ will be a binomial distribution with $n = 20$ and $p = h$. Our task is to use $d$ and our model of $P(X|h)$ to give the audience some key bits of information that will help them decide if the coin is fair.
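To make this concrete, here is a minimal sketch of the binomial model in Python (the function name `binom_pmf` is just illustrative):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k | h = p): binomial probability of k heads in n flips."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of the observed d = 13 heads in 20 flips under a few hypotheses:
for h in (0.5, 0.65, 0.8):
    print(f"P(X = 13 | h = {h}) = {binom_pmf(13, 20, h):.4f}")
```

Under $h = 0.5$ this comes out to about 0.074, so 13 heads is, on its own, not wildly surprising for a fair coin.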
If you are willing to view the reader as a Bayesian reasoner, we can formalize this a bit more. In that case, we know the reader ought to do inference by applying Bayes' rule to combine $d$ and $P(X|h)$ with their own prior. However, we assume the reader might not be able to effectively execute this process themselves (if they could, we could just hand them $d$ and $P(X|h)$ and go home). Unfortunately, we do not know the reader's prior, so we cannot simply do that computation for them either. Instead, we can view our task as summarizing the values of $P(X=d|h)$ for different $h$s in a way that allows the reader to more easily integrate those results with their own prior beliefs.
Regardless of whether you want to take that perspective or not, we should acknowledge that this is a subjective task. After all, the most effective approach will depend on the particular psychology of the reader. However, we can still propose general strategies that we think are likely to be successful. Bayesian and frequentist statisticians offer two approaches.
The most common frequentist approach is null hypothesis testing, where the focus is on how $d$ relates to a single hypothesis, $h_0$. $h_0$ is a null hypothesis that readers might be interested in refuting. For example, in drug trials $h_0$ will usually be "the drug has no effect". In our coin flipping example, $h_0$ would be "the coin is fair", i.e., $h_0 = 0.5$. In particular, frequentists report $P(X \text{ is no less extreme than } d \mid h_0)$, known as a p-value. What it means to be "less extreme" is defined by the statistician, and typically captures some notion of how far $d$ is from what we would expect to observe under $h_0$. Low p-values are said to show the data is incompatible with $h_0$.
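For the coin example, here is a sketch of this computation under one common choice of "extremeness" (distance of the head count from its expectation under the null; other choices exist):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p_value(d, n, p0):
    """P(X no less extreme than d | h0), where 'no less extreme' is taken to
    mean 'at least as far from the expected count n*p0 as d is'."""
    expected = n * p0
    return sum(binom_pmf(k, n, p0) for k in range(n + 1)
               if abs(k - expected) >= abs(d - expected))

print(two_sided_p_value(13, 20, 0.5))  # ~0.26
```

With a p-value around 0.26, 13 heads in 20 flips would not conventionally be declared incompatible with a fair coin.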
The main pro of this approach is that we do not need to bring in any additional assumptions beyond our knowledge of $P(X|h_0)$. In some cases, depending on how "less extreme" is defined, we might be able to use even milder assumptions (e.g., if the definition only depends on the mean of the data, we only need to make assumptions about how the mean is distributed, not how all of $X$ is distributed). The primary downside is that we are given no information about any hypothesis other than $h_0$, which leads to deficiencies when we try to do inference (there is also some issue with the subjectivity of defining what "less extreme" means, but I won't focus on that here). In particular:
1. We are not given any information about hypotheses close to $h_0$, even though we are often interested in refuting those hypotheses as well. For example, when testing a drug we want to ensure it has a non-trivial effect, not just a non-zero effect. In our coin flipping example, if we are only told the data is incompatible with $h_0 = 0.5$, we don't technically know if the data is incompatible with $h = 0.50001$, which we might consider to be effectively unbiased, because we have not been given any information that rules out the possibility that $P(X|h=0.5)$ and $P(X|h=0.50001)$ are dramatically different distributions. As a result, p-values do not, on their own, give you any information about the significance of the results.
2. Knowing that $d$ is not compatible with $h_0$ also does not necessarily mean $d$ is more compatible with any other hypothesis. This is problematic because if the data does not fit $h_0$, but it fits every other hypothesis even worse, we might still want to avoid rejecting $h_0$. These kinds of situations can occur if the data is unlikely under any hypothesis. For example, if there was a 95% chance our friend would give up after a single flip, the observation of 13 heads in 20 flips is "incompatible" with every hypothesis about how biased the coin is, because no matter what the bias of the coin is, there is less than a 5% chance of seeing more than 1 head. This leads to cases like the voltmeter story. As a result, a p-value does not necessarily tell you if the null hypothesis is better or worse than other hypotheses.
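Both issues can be illustrated numerically for the coin example (a rough sketch; the 0.05 factor models the 95%-give-up stopping rule described above):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Issue (1): the data barely distinguishes h0 = 0.5 from h = 0.50001, so a
# p-value against h0 alone says nothing about "effectively fair" coins.
print(binom_pmf(13, 20, 0.5))      # ~0.0739
print(binom_pmf(13, 20, 0.50001))  # ~0.0739, essentially identical

# Issue (2): with a 95% chance the friend gives up after one flip, the
# probability of even seeing a 20-flip batch with 13 heads is capped at
# 5% of the binomial pmf, for EVERY bias h.
best_case = max(0.05 * binom_pmf(13, 20, h / 1000) for h in range(1, 1000))
print(best_case)  # well below 0.05: 'unlikely' under every hypothesis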
Frequentists have little choice but to argue that, in practice, these issues are not common. In other words, that a low p-value usually does imply the result has at least some significance, and that low p-values are usually not caused by the data just being generally unlikely. After all, if this were not the case, these statistics would have very little use. Frequentists can make a reasonable argument for this. The probability distributions that are used in practice are usually smooth enough that, if the data is not compatible with $h_0$, it is also likely to be incompatible with hypotheses very, very close to $h_0$. Likewise, the odds of the data being generally unlikely are, by definition, low and, for many distributions commonly used for $P(X|h)$, if the data is not compatible with $h_0$ there will be some other hypothesis that the data is more compatible with. Of course, one could still object to the fact that these possibilities are being hand-waved rather than accounted for in the math, and (1) and (2) are known to occur in practical settings (a classic example is that stopping rules can trigger the second issue).
Frequentists might also argue readers are smart and well-informed enough to account for (1) and (2) themselves (although this requires that readers are informed about the details of how the p-values were computed). A particular appeal of this line of thought is that handling these issues requires making some additional assumptions, and one can argue it is better that readers be left to make those assumptions for themselves so they can bring in their own particular beliefs and prior knowledge rather than having the statistician make assumptions on the reader’s behalf. This is, however, putting a non-trivial burden on the reader and the extent to which even well-trained readers do this is debatable.
Bayesians usually summarize the posterior $P(H \mid X=d)$, which is computed using a particular prior, $P(H)$. Here $H$ is a random variable over possible hypotheses with a distribution that is chosen by the statistician. Using $P(H)$ and $P(X|h)$ we can apply Bayes' rule to compute $P(H \mid X=d)$. Bayesians then typically provide some kind of summary of $P(H \mid X=d)$, such as a credible interval like $P(0.4 < H < 0.6 \mid X=d)$. Bayesians can also compute a likelihood ratio like $P(X=d \mid H \approx h_0) \,/\, P(X=d \mid H \not\approx h_0)$, where $h_0$ is again a hypothesis we might want to reject, although this still requires a prior of sorts, since we will need to integrate over "$H$ is close to $h_0$" and "$H$ is not close to $h_0$" to compute it.
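A sketch of this computation for the coin example, using a grid approximation; the uniform prior over [0, 1] is an arbitrary illustrative choice by the statistician:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Grid approximation of the posterior P(H | X = 13) under a uniform prior.
grid = [i / 1000 for i in range(1001)]
likelihood = [binom_pmf(13, 20, h) for h in grid]
prior = [1.0] * len(grid)
unnorm = [pr * lk for pr, lk in zip(prior, likelihood)]
z = sum(unnorm)
posterior = [u / z for u in unnorm]

# One summary of the posterior: how much mass sits near fairness?
p_near_fair = sum(p for h, p in zip(grid, posterior) if 0.4 < h < 0.6)
print(f"P(0.4 < H < 0.6 | X = 13) ~= {p_near_fair:.3f}")
```

Unlike a p-value, this kind of summary directly quantifies belief over a range of hypotheses rather than compatibility with a single null.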
One thing we should emphasize is that the prior used here is not going to be the reader's prior. Therefore Bayesians can't claim the $P(H \mid X=d)$ they compute is the $P(H \mid X=d)$ the reader ought to believe if they are doing proper Bayesian reasoning. However, Bayesians can still make the case that knowing the results computed under their prior will be informative for the reader.
One point in favor of this is that this method helps avoid the concerns mentioned for the frequentist approach. Bayesians consider $P(X=d|h)$ for a range of $h$s, which means Bayesians at least take some account of hypotheses other than $h_0$ and therefore have some room to make claims about statistical significance. Bayesians also handle the second issue by reporting the ratio of two probabilities. If the data is unlikely under any hypothesis, that general unlikeliness will appear in both the numerator and denominator, and thus cancel out.
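The cancellation argument can be checked directly: if the same constant factor (e.g., the probability of the experiment even producing 20 flips) multiplies the likelihood of every hypothesis, it drops out of any likelihood ratio. A minimal sketch:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

c = 0.05  # a shared 'general unlikeliness' factor (e.g., a stopping rule)

# Likelihood ratio between h = 0.65 and h = 0.5, with and without c:
plain = binom_pmf(13, 20, 0.65) / binom_pmf(13, 20, 0.5)
scaled = (c * binom_pmf(13, 20, 0.65)) / (c * binom_pmf(13, 20, 0.5))
print(plain, scaled)  # identical: the shared factor cancels in the ratio
```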
One can still make a technical objection, not dissimilar to the ones discussed for frequentists, against the use of a prior. In particular, there is no mathematical guarantee that the results computed using the chosen prior would not have been dramatically different under a prior only slightly different from the one chosen. This is particularly problematic since the prior must be chosen somewhat arbitrarily by the statistician.
Bayesians could offer a similar response to the one frequentists offer, i.e., that in practice results are unlikely to change dramatically given a small change in the prior, and that readers can judge for themselves how and when this might affect the reported results. Bayesians can also robustify their analysis by additionally reporting how their results would change if the prior changed, a method known as Robust Bayesian analysis.
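A crude sketch of this kind of sensitivity analysis for the coin example: sweep over a family of Beta priors (chosen arbitrarily here for illustration) and report the range of a posterior conclusion:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def posterior_prob_near_fair(a, b, d=13, n=20):
    """P(0.4 < H < 0.6 | X = d) on a grid, under a Beta(a, b) prior."""
    grid = [i / 1000 for i in range(1, 1000)]
    unnorm = [h**(a - 1) * (1 - h)**(b - 1) * binom_pmf(d, n, h) for h in grid]
    z = sum(unnorm)
    return sum(u / z for h, u in zip(grid, unnorm) if 0.4 < h < 0.6)

# Sweep an (arbitrary) family of priors and report the range of the conclusion.
results = [posterior_prob_near_fair(a, b) for a in (1, 2, 5) for b in (1, 2, 5)]
print(f"conclusion ranges from {min(results):.3f} to {max(results):.3f}")
```

The spread of results across priors is itself informative: a narrow range means the conclusion is robust to the prior choice, while a wide range (as here, with only 20 flips) signals that the data alone does not settle the question.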
So, what method is preferable?
I am going to caveat this by saying that I don’t have a huge amount of practical experience with statistics so these are not very confident statements. However, if someone asked me what kind of statistical analysis I would find most useful, I would currently say:
- If $P(X|h)$ is a simple model (e.g., normal or categorical) I think I can more-or-less trust p-values to mean what they seem to mean. One caveat here is that the larger the sample size, the less a low p-value implies a large effect size, and I don't feel like I have a good intuition about how exactly the two are connected.
- For more complex models, I am pretty wary of p-values. Ultimately, p-values require the reader to extrapolate from what was reported to how the model would behave for hypotheses other than $h_0$, and once the models get complex I am not sure my intuition tracks that.
- I think my preferred approach would be a Bayesian result with sensitivity analysis. Maybe something following the structure of "If your prior over $H$ falls within <some reasonable range>, then $P(H \mid X=d)$ will have <some important property>", which is a statement I think I could understand and find useful even for complex models of $P(X|h)$.
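The sample-size caveat in the first bullet can be made concrete: the same observed proportion yields wildly different p-values as the sample size grows (a sketch using the normal approximation for simplicity):

```python
from math import sqrt, erfc

def approx_two_sided_p(heads, n, p0=0.5):
    """Two-sided p-value via the normal approximation (for illustration)."""
    z = (heads / n - p0) / sqrt(p0 * (1 - p0) / n)
    return erfc(abs(z) / sqrt(2))

# The same observed proportion (52% heads) at increasing sample sizes:
for n in (100, 1_000, 100_000):
    print(n, approx_two_sided_p(int(0.52 * n), n))
```

The effect size stays fixed at a 2% deviation from fairness, yet the p-value goes from clearly non-significant to vanishingly small as $n$ grows; a low p-value from a huge sample can correspond to a practically negligible effect.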
What about our assumption?
One could level a critique, again not dissimilar to the ones we have already presented, against both methods due to the assumption of an accurate model $P(X|h)$. In particular, we once again have no mathematical guarantee that our results will not change dramatically if our model is even slightly wrong. As far as I can tell, standard practice is to politely ignore that fact, although in principle one could do some kind of robustness analysis here too. Non-parametric methods offer a way out of this as well.
Can frequentists robustify their analysis?
One might ask if there is an analog of Bayesian sensitivity analysis for frequentists so that some kind of mathematical guarantee is provided in regards to (1) and (2). If there is, I don't know it. I suspect part of the difficulty with doing this is that it is easy for solutions to regress into being de facto Bayesian methods. For example, to provide results that cover significance levels we will need to analyze hypotheses other than $h_0$, but then to summarize that array of results we will likely need to do some kind of weighted average, which starts to sound a lot like integrating over a particular prior. Likewise, dividing by $P(X=d)$, as done by Bayesians, seems like the most obvious solution to (2).
Is this hammers and screwdrivers?
Occasionally I see people attempt to find a high ground on the debate about which method is preferable by saying “Asking which method is preferable is like asking whether using a hammer or screwdriver is preferable, in reality the answer depends on the context.” I don't think this analogy is accurate because these methods are, unlike hammers and screwdrivers, trying to solve the same problem. While it is never wrong to say one should account for context, it is still perfectly reasonable to have a general philosophical preference for one approach or the other.
What about p-hacking?
Both options are vulnerable to p-hacking. Indeed, no amount of clever math or more complex statistical analysis can save us from p-hacking because, if a friend tells you they flipped a coin 20 times and got 16 heads, there is no way for you to distinguish between the case where that result occurred because the coin is biased and the case where the friend flipped the coin dozens of times until they got a batch with 16 heads. The only solution is to better regulate how the experiment was run and reported.
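A quick simulation of this: a perfectly fair coin will eventually produce a batch of 16+ heads if enough batches are discarded (the function name and batch limit here are illustrative):

```python
import random

random.seed(0)  # arbitrary seed for reproducibility

def batches_until(target_heads, n_flips=20, p=0.5, max_batches=100_000):
    """Flip a FAIR coin in batches, discarding batches until one hits the target."""
    for batch in range(1, max_batches + 1):
        heads = sum(random.random() < p for _ in range(n_flips))
        if heads >= target_heads:
            return batch
    return None

# A fair coin produces 16+ heads in a batch of 20 about 0.6% of the time,
# so a few hundred discarded batches typically suffice.
print("batches needed:", batches_until(16))
```

The reported batch looks exactly like data from a biased coin; only knowledge of the discarded batches could reveal the difference, which is why regulating the reporting process matters.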
Hence the American Statistical Association's (ASA) statement on p-values states "A p-value, or statistical significance, does not measure the size of an effect or the importance of a result." ↩︎
Hence the American Statistical Association also states "By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis." ↩︎
Which is part of the reason the ASA also says “Proper inference requires full reporting and transparency.” ↩︎
There's a ton of work on general sensitivity analysis in the semi-parametric stats literature.
An alternative to p-value NHST that's still frequentist is to produce the point estimate and confidence interval for the effect size of interest. Assuming they are defined the same way (one-sided vs. two-sided, etc.), they will give the same conclusion if you choose to interpret them as a dichotomous 'significant or not' status. (I.e., the 95% CI will exclude the null in the same cases as where the p-value will be below 0.05.) But with the CIs you're getting that indication of the size of effect that you were craving.
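A sketch of this for the coin data, using the Wald interval for simplicity (note the p-value/CI duality holds exactly only when the test and interval are built from matched constructions):

```python
from math import sqrt

def wald_ci_95(heads, n):
    """Approximate 95% Wald confidence interval for a proportion (illustrative)."""
    phat = heads / n
    half = 1.96 * sqrt(phat * (1 - phat) / n)
    return phat - half, phat + half

lo, hi = wald_ci_95(13, 20)
print(f"point estimate = {13 / 20:.2f}, 95% CI ~ ({lo:.2f}, {hi:.2f})")
# 0.5 lies inside the interval, matching a non-significant p-value, and the
# interval's width directly conveys effect-size information.
```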