Computing p-values is what Mr. Frequentist is all about.
For once I'd like to see the bayesian/frequentist debate return to maximum likelihood vs maximum a posteriori. P-values absolutely are not the only aspect of frequentist statistics! Yes, they are one of the most prominent, so they're certainly fair game, but the way people talk about them it's like they're all that matters. People have problems with p-values that go beyond their being frequentist. To me the fact that they feature so prominently raises the question of how much certain commitments to "bayesianism" reflect actual usage of bayesian methods vs a kind of pop-science version of bayesianism.
Bayesian likelihood ratios
Is this meant to refer to a specific likelihood ratio method, or to suggest that likelihood ratios themselves are "bayesian"? Yes, "the likelihood principle" is a big source of criticism of p-values, but I don't see why likelihood ratios themselves would be bayesian. I think Andrew Gelman once said something to the effect that both bayesian and frequentist methods need a likelihood, and often that makes more of a difference than the prior. There's nothing strictly bayesian about "updating". I'm curious how often things that are identified as "bayesian" actually use Bayes' rule.
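To make the distinction concrete, here is Bayes' rule in odds form; the likelihood ratio is the only factor that depends on the data, and computing it requires no prior at all:

$$\underbrace{\frac{P(H_1 \mid D)}{P(H_0 \mid D)}}_{\text{posterior odds}} \;=\; \underbrace{\frac{P(H_1)}{P(H_0)}}_{\text{prior odds}} \;\times\; \underbrace{\frac{P(D \mid H_1)}{P(D \mid H_0)}}_{\text{likelihood ratio}}$$

The specifically bayesian move is supplying the prior odds; the ratio itself is available to anyone willing to write down a likelihood.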
The frequentist approach, even if flawed in certain respects, still serves as a valuable heuristic. It teaches us to be wary of overfitting to outcomes, to ask about the process behind the numbers, and to maintain a healthy skepticism when interpreting results. Its insistence on method over outcome protects us from the temptation to rationalize or cherry-pick. I'd rather a scientist work with p-values than with their intuition alone.
I think I largely agree with the spirit here. I definitely think p-values have issues, and in particular the way they have arguably contributed to publication bias is a highly reasonable criticism. That said, I think people like to make these "methodological" critiques more for philosophical than statistical reasons. In practice, we should expect that the application of any method is going to have issues and be highly imperfect. So I agree that it makes sense to have a practical, "all methods are flawed, some are useful" view of things.
To me the fact that they feature so prominently raises the question of how much certain commitments to "bayesianism" reflect actual usage of bayesian methods vs a kind of pop-science version of bayesianism.
This is a valid concern. I'm new here and just going through the sequences (though I have a mathematics background), but I have yet to see a good framing of the bayesian/frequentist debate as maximum likelihood vs maximum a posteriori. (I welcome referrals.)
I think people like to make these "methodological" critiques
Yes, there is a methodological critique of strict p-value calculations, but in the absence of informative priors p-values are a really good indicator for experiment design. I feel that in hyping up Bayesian updates people are missing that and not offering a replacement. The focus on methods is a strength when you are talking about methods.
I'm new here and just going through the sequences (though I have a mathematics background), but I have yet to see a good framing of the bayesian/frequentist debate as maximum likelihood vs maximum a posteriori. (I welcome referrals.)
I don't think I'm representative of lesswrong in my views above. In fact, in some sense I think I'm shadowboxing with lesswrong in some of my comments above, so sorry about any confusion that introduced.
I don't think I've ever seen maximum likelihood vs maximum a posteriori discussed on lesswrong, and I'm kind of just griping about it! I don't have references off the top of my head, but I recall this framing appearing in debates elsewhere (i.e. not on lesswrong), in more academic/stats settings. I can see if I can find examples. But in general it addresses an estimation perspective instead of hypothesis testing.
Yes, there is a methodological critique of strict p-value calculations, but in the absence of informative priors p-values are a really good indicator for experiment design. I feel that in hyping up Bayesian updates people are missing that and not offering a replacement. The focus on methods is a strength when you are talking about methods.
I think I'm in agreement with you here. My "methodological" was directed at what I view as a somewhat more typical lesswrong perspective, similar to what is expressed in the Eliezer quote. Sure, if we take some simple case we can address a more philosophical question about frequentism vs bayesianism, but in practical situations there are so many analytical choices you could make that there are always going to be issues. In an actual analysis you can always do things like look at multiple versions of an analysis and try to use them to refine your understanding of a phenomenon. If you fix the likelihood but allow the data to vary, then p-values are likely to be highly correlated with possible alternatives like Bayes factors. A lot of the critiques, I feel, are focused on making a clean philosophical argument while ignoring the inherent messiness that gets introduced whenever you want to infer things from reasonably complicated data or observations. I don't think swapping likelihood ratios for p-values would suddenly change things all that much; a lot of the core difficulties of inferring things from data would remain.
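As a toy illustration of the point about p-values tracking Bayes factors (my own example, not from the post): fix a binomial likelihood with n = 100 and a point null of a 60% cure rate, let the observed number of cures vary, and compute both the one-sided p-value and a Bayes factor against a uniform-prior alternative. As the p-value falls, the Bayes factor rises, so ranking datasets by either gives the same ordering:

```python
from scipy.stats import binom

n, theta0 = 100, 0.6  # sample size and point null (60% cure rate)

print(f"{'cures':>5} {'one-sided p-value':>18} {'Bayes factor (uniform alt / null)':>34}")
for r in range(60, 86, 5):
    p_value = binom.sf(r - 1, n, theta0)  # P(X >= r | theta0)
    marginal_alt = 1.0 / (n + 1)          # marginal probability of r under a uniform prior on theta
    bayes_factor = marginal_alt / binom.pmf(r, n, theta0)
    print(f"{r:>5} {p_value:>18.4f} {bayes_factor:>34.2f}")
```

The exact numbers depend on the prior chosen for the alternative, but the monotone relationship is the point.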
So, what difference does Mr. Frequentist see between the two experiments? In George's case we have no information except the final results. For Bessel on the other hand, once we understand the method that was used to determine the results, we know that at every intermediate step before the final result the cure rate was less than 70%.
I don't think that makes a difference. Assuming, as the scenario does, that different patients have independent responses to treatment, the Bayesian's sequence of outcomes and the frequentist's are different permutations of 70 heads and 30 tails, each of equal probability. Whatever they say about the efficacy of the treatment, they say the same thing.
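To spell the step out: if each patient independently has cure probability $\theta$, then any particular sequence containing 70 cures and 30 failures has probability

$$P(\text{sequence} \mid \theta) = \theta^{70}(1-\theta)^{30},$$

no matter the order in which the cures arrived, so every such permutation carries exactly the same evidence about $\theta$.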
It might be unlikely for the frequentist's sequence of outcomes to only hit his target p-value at the 100th patient, but this probability does not bear on the efficacy of the treatment, and the frequentist is ignoring it anyway.
In practice I would be more concerned that the motivation to get a desired result might corrupt the patient evaluations (assumed away in Jaynes's thought experiment).
Correct: in reality the world doesn't change if we reorder our results. The point is that for a frequentist it feels like it should. Because the method is flawed, it seems right for the result to be less right. This is a bad way of analyzing results, but not as bad a way of evaluating methodologies.
Your valid concern about corrupted results stems from the correlation between bad behavior and what a frequentist calls a bad methodology.
Bessel's methodology is not inherently bad either. If Bessel believed that the treatment would save lives and needed to keep going to prove it, wouldn't he behave the same way?
We need a Bayesian approach that can help evaluate methodologies with and without informative priors. This probably already exists in the literature, but we won't be able to overcome the use of p-values until it is common knowledge.
we know that at every intermediate step before the final result the cure rate was less than 70%.
Actually, why? Bessel was worried about whether the rate of cures was greater than 60%, and after the 99th patient there were either 69 cures out of 99 (~69.7%) or 70 cures out of 99 (~70.7%), so he could have stopped there and that would not have hurt his reputation.
IIRC, when we discussed this essay in the reading group, one member said that Eliezer here did a bad job describing the thought experiment, and that actually both experimenters precommitted to treat at least 100 patients.
and that actually both experimenters precommitted to treat at least 100 patients.
That would be an interesting wrinkle. I haven't read the original source. Supposing this, I would think Mr. Frequentist would still say Bessel is more likely to be fooled by unlikely data than George (in the positive direction only), but honestly only by a very small amount. One could call that the trade-off for a method that won't be fooled by unlikely negative data.
Actually, why?
I was treating the description of Bessel as having a distinct stopping condition of 70%; otherwise he would have stopped at 69.7% like you said. If he was running the tests one at a time, we know 70.7% at 99 patients didn't occur, because he stopped at 100.
I haven't read the original source.
It can be found (for example) here, pp 12-13. A higher quality but paywalled PDF is here.
The point Jaynes is making is that it does not matter what the stopping rule is. Once the data are obtained, the experimenter's state of mind is irrelevant to what the data imply about the phenomenon under study.
"Bayesian inference will not get us into this absurd situation, because it perceives automatically what common sense demands; that what is relevant for this inference is not the relative probabilities of imaginary data sets which were not observed, but the relative likelihoods of different parameter values, based on the one real data set which was observed; and this is the same for all the experimenters."
Isn't Bessel engaging in a kind of publication bias? After all, if he never gets the desired effect size out of a study, it never concludes, so presumably you won't hear about it.
You may have two different treatments A and B, and both have comparable effect sizes according to the literature, but you learn that all the published studies involving B were performed by Bessel who you know engages in publication bias. The published studies for A were conducted by George who, as is widely known, doesn't have this bias. So presumably, if you hear a study was conducted by Bessel, you should correct the reported effect size downwards when estimating the real (underlying) effect size. If you hear a study was conducted by George, you can assume no such publication bias exists, so you shouldn't correct the reported effect size downwards.
So, if A and B have the same overall reported effect size, you should assume that the effect size of B is lower than of A.
Now assume, unbeknownst to you, that Bessel didn't actually have to withhold any studies, as the effect sizes all happened to be above the desired threshold. Should you still correct the reported effect size downward? Answer: yes, of course, since you don't know that this is the case. The only thing you "know" is the published effect sizes and the fact that Bessel (the person) engages in biased reporting, which is evidence that the reported effect sizes overestimate the real effect size.
This is similar to how your subjective probability that you have won the lottery is very low before you have checked the results, even if, as luck would have it, you did indeed win the lottery.
Interesting. Presumably if Bessel never got the results he wanted, he could (assuming he's honest) continue until the negative data was enough to convince himself that he was wrong. Depending on his prior that might not happen; he could run out of money or motivation before he gave up and published a negative result. Avoiding this seems related to issues about publishing negative results and timely reporting of raw data.
With regard to the biased reporting, I'll just mention that we would have to adjust for known bias whether we were using Bayesian or frequentist methods.
Today I read the Sequences post Beautiful Probability for the first time. I will begin by quoting the passage that Eliezer opened with:
Let me introduce this issue by borrowing a complaint of the late great Bayesian Master, E. T. Jaynes (1990):
"Two medical researchers use the same treatment independently, in different hospitals. Neither would stoop to falsifying the data, but one had decided beforehand that because of finite resources he would stop after treating N=100 patients, however many cures were observed by then. The other had staked his reputation on the efficacy of the treatment, and decided he would not stop until he had data indicating a rate of cures definitely greater than 60%, however many patients that might require. But in fact, both stopped with exactly the same data: n = 100 [patients], r = 70 [cures]. Should we then draw different conclusions from their experiments?" (Presumably the two control groups also had equal results.)
This quote captured me completely—I found myself caught up thinking through the implications before I could even finish reading the rest of the post. It gave me an unexpected insight into the internal logic and motivations behind the frequentist approach to statistics. There is a genuine difference in the information conveyed by the two scenarios Jaynes describes.
To make the discussion more approachable, let's introduce names. Let's call the first researcher George: he stopped after treating 100 patients regardless of the results. The second researcher, whom we'll call Bessel, continued until the results supported his belief. This small act of naming helps us think about their mindsets and intentions. Similarly, I'm going to call "Frequentist" the method that assigns the two experiments different p-values, and "Bayesian" the method that assigns them the same likelihood ratio.
Computing p-values is what Mr. Frequentist is all about. This method is centered on pre-defined procedures that allow us to interpret data in a principled and repeatable way: "What is the chance that using the same method would give a result at least this extreme under the null hypothesis?" Necessarily this approach says a lot about the method being used. While this mindset may seem flawed, it reflects an important goal: measuring, a priori, how often a method will yield false positives.
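As a concrete sketch (my own numbers, assuming the null hypothesis is a cure rate of exactly 60%, the threshold in Jaynes's setup), this is the calculation Mr. Frequentist would do for George's fixed-sample design:

```python
from scipy.stats import binom

# One-sided p-value for George's fixed-N design: the probability of seeing
# 70 or more cures out of 100 patients if the true cure rate were only 60%.
p_value = binom.sf(69, 100, 0.6)   # P(X >= 70) with n = 100, theta = 0.6
print(f"p-value = {p_value:.4f}")  # comes out around 0.025
```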
When the rules are followed, the frequentist framework offers a safeguard against self-deception and overinterpretation. Even if it doesn’t capture all the nuances of belief and uncertainty, it helps maintain epistemic discipline—especially when evaluating experimental methods in advance.
So, what difference does Mr. Frequentist see between the two experiments? In George's case we have no information except the final results. For Bessel on the other hand, once we understand the method that was used to determine the results, we know that at every intermediate step before the final result the cure rate was less than 70%.
George’s result is evaluated at face value: 70 out of 100 patients were cured.
To the frequentist, however, Bessel’s result isn't just one experiment—it looks like a sequence of 99 experiments that failed to produce the result he wanted, followed by one that did. Even though he stopped at 100 patients just like George, the manner in which he arrived at the data changes its meaning.
It feels wrong to treat the final dataset from Bessel as equivalent to George's, because Bessel ignored the negative results of 99 experiments to get to one that was positive. If we assume an agnostic prior, George's fixed-sample method is less likely to be fooled than Bessel's optional stopping. The higher p-value is a way of punishing Bessel for bad experimental design.
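To make "less likely to be fooled" concrete, here is a small simulation sketch of my own (not from Jaynes or Eliezer). It assumes, purely for illustration, that the true cure rate is exactly 60%, that George reports a positive result when he sees 70 or more cures in his fixed 100 patients, and that Bessel stops and reports a positive result the first time his running cure rate reaches 70% after at least 20 patients, giving up after 1000 patients. The specific thresholds are made up; the point is only that optional stopping inflates the rate of false alarms:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_null = 0.6   # true cure rate: the treatment is in fact no better than 60%
n_sims = 5_000     # simulated experiments per design

# George: fixed N = 100, positive result if at least 70 cures.
george_positive_rate = np.mean(rng.binomial(100, theta_null, size=n_sims) >= 70)

# Bessel (illustrative stopping rule): treat patients one at a time and stop
# with a positive result as soon as the running cure rate reaches 70% after
# at least 20 patients; give up (negative result) after 1000 patients.
bessel_positives = 0
for _ in range(n_sims):
    cures = 0
    for n in range(1, 1001):
        cures += rng.random() < theta_null
        if n >= 20 and cures / n >= 0.7:
            bessel_positives += 1
            break

print(f"George's false-positive rate: {george_positive_rate:.3f}")       # around 0.02-0.03
print(f"Bessel's false-positive rate: {bessel_positives / n_sims:.3f}")  # substantially higher
```

Under these made-up assumptions George cries wolf only a few percent of the time while Bessel does so far more often, which is exactly the asymmetry the higher p-value is pricing in.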
Viewed this way, frequentism is a kind of procedural epistemology—a way of knowing that emerges not from belief, but from method. It’s a mindset that values long-term calibration, error control, and clarity about what an analysis can and cannot say.
George and Bessel may have produced identical datasets, but the paths they took to get there matter. With no view of a priori expectations besides the null hypothesis, frequentists are forced to fall back on analyzing methods rather than priors.
The frequentist approach, even if flawed in certain respects, still serves as a valuable heuristic. It teaches us to be wary of overfitting to outcomes, to ask about the process behind the numbers, and to maintain a healthy skepticism when interpreting results. Its insistence on method over outcome protects us from the temptation to rationalize or cherry-pick. I'd rather a scientist work with p-values than with their intuition alone.
The Bayesian perspective can still view Bessel's results as a series of 100 experiments. Because we assume things like time-invariance of the treatment and independence between patients, however, the final total at the 100th experiment screens off the results of the first 99: there is no information about the cure rate in the first 99 results that is not contained in the final total. So Mr. Bayesian assigns the two researchers' results equivalent likelihood ratios. Or, if he believes that both applied the treatment the same way, he assigns a joint likelihood ratio based on r = 140 cures and n = 200 patients.
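To see why the ratios come out the same, compare any two candidate cure rates $\theta_1$ and $\theta_0$ on the observed data $D$ (70 cures in 100 patients):

$$\frac{P(D \mid \theta_1)}{P(D \mid \theta_0)} = \frac{\theta_1^{70}\,(1-\theta_1)^{30}}{\theta_0^{70}\,(1-\theta_0)^{30}}.$$

Whatever factor Bessel's stopping rule contributes to the probability of the data, it depends only on the observed sequence and not on $\theta$, so it cancels from the ratio; George and Bessel end up with identical likelihood ratios.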
Bayesian likelihood ratios are a better way of reporting final results than p-values. But frequentist approaches have been a useful tool, especially when comparing experimental methods beforehand. Before we consign p-values permanently to the dustbin, we need to be able to replace their strength in evaluating experimental design.