Computing p-values is what Mr. Frequentist is all about.
For once I'd like to see the bayesian/frequentist debate return to maximum likelihood vs maximum a posteriori. P-values absolutely are not the only aspect of frequentist statistics! Yes, they are one of the most prominent, so they're certainly fair game, but the way people talk about them it's like they're all that matters. People have problems with p-values that go beyond their being frequentist. To me the fact that they feature so prominently raises the question of how much certain commitments to "bayesianism" reflect actual usage of bayesian methods vs a kind of pop-science version of bayesianism.
Bayesian likelihood ratios
Is this meant to refer to a specific likelihood ratio method, or to suggest that likelihood ratios themselves are "bayesian"? Yes, "the likelihood principle" is a big source of criticism of p-values, but I don't see why likelihood ratios themselves would be bayesian. I think Andrew Gelman once said something to the effect that both bayesian and frequentist methods need a likelihood, and often that makes more of a difference than the prior. There's nothing strictly bayesian about "updating". I'm curious how often things that are identified as "bayesian" actually use Bayes' rule.
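To make the distinction concrete, here is Bayes' rule in odds form; the likelihood ratio is the only factor that depends on the data, and computing it requires no prior at all:

$$\underbrace{\frac{P(H_1 \mid D)}{P(H_0 \mid D)}}_{\text{posterior odds}} \;=\; \underbrace{\frac{P(H_1)}{P(H_0)}}_{\text{prior odds}} \;\times\; \underbrace{\frac{P(D \mid H_1)}{P(D \mid H_0)}}_{\text{likelihood ratio}}$$

The specifically bayesian move is supplying the prior odds; the ratio itself is available to anyone willing to write down a likelihood.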
The frequentist approach, even if flawed in certain respects, still serves as a valuable heuristic. It teaches us to be wary of overfitting to outcomes, to ask about the process behind the numbers, and to maintain a healthy skepticism when interpreting results. Its insistence on method over outcome protects us from the temptation to rationalize or cherry-pick. I'd rather a scientist work with p-values than with their intuition alone.
I think I largely agree with the spirit here. I definitely think p-values have issues, and in particular the way they have arguably contributed to publication bias is a highly reasonable criticism. That said, I think people like to make these "methodological" critiques more for philosophical than statistical reasons. In practice, we should expect that the application of any method is going to have issues and be highly imperfect. So I agree that it makes sense to have a practical, "all methods are flawed, some are useful" view of things.
To me the fact that they feature so prominently raises the question of how much certain commitments to "bayesianism" reflect actual usage of bayesian methods vs a kind of pop-science version of bayesianism.
This is a valid concern. I'm new here and just going through the sequences (though I have a mathematics background), but I have yet to see a good framing of the bayesian/frequentist debate as maximum likelihood vs maximum a posteriori. (I welcome referrals.)
I think people like to make these "methodological" critiques
Yes, there is a methodological critique of strict p-value calculations, but in the absence of informative priors p-values are a really good indicator for experiment design. I feel that in hyping up Bayesian updates people are missing that and not offering a replacement. The focus on methods is a strength when you are talking about methods.
I'm new here and just going through the sequences (though I have a mathematics background), but I have yet to see a good framing of the bayesian/frequentist debate as maximum likelihood vs maximum a posteriori. (I welcome referrals.)
I don't think I'm representative of lesswrong in my views above. In fact, in some sense I think I'm shadowboxing with lesswrong in some of my comments above, so sorry about any confusion that introduced.
I don't think I've ever seen maximum likelihood vs maximum a posteriori discussed on lesswrong, and I'm kind of just griping about it! I don't have references off the top of my head, but I recall this framing appearing in debates elsewhere (i.e. not on lesswrong), in more academic/stats settings. I can see if I can find examples. But in general it addresses an estimation perspective instead of hypothesis testing.
Yes, there is a methodological critique of strict p-value calculations, but in the absence of informative priors p-values are a really good indicator for experiment design. I feel that in hyping up Bayesian updates people are missing that and not offering a replacement. The focus on methods is a strength when you are talking about methods.
I think I'm in agreement with you here. My "methodological" was directed at what I view as a somewhat more typical lesswrong perspective, similar to what is expressed in the Eliezer quote. Sure, if we take some simple case we can address a more philosophical question about frequentism vs bayesianism, but in practical situations there are so many analytical choices you could make that there are always going to be issues. In an actual analysis you can always do things like look at multiple versions of an analysis and try to use them to refine your understanding of a phenomenon. If you fix the likelihood but allow the data to vary, then p-values are likely to be highly correlated with possible alternatives like Bayes factors. A lot of the critiques, I feel, are focused on making a clean philosophical argument while ignoring the inherent messiness that gets introduced whenever you want to infer things from reasonably complicated data or observations. I don't think swapping likelihood ratios for p-values would suddenly change things all that much; a lot of the core difficulties of inferring things from data would remain.
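As a toy illustration of the point about p-values tracking Bayes factors (my own example, not from the post): fix a binomial likelihood with n = 100 and a point null of a 60% cure rate, let the observed number of cures vary, and compute both the one-sided p-value and a Bayes factor against a uniform-prior alternative. As the p-value falls, the Bayes factor rises, so ranking datasets by either gives the same ordering:

```python
from scipy.stats import binom

n, theta0 = 100, 0.6  # sample size and point null (60% cure rate)

print(f"{'cures':>5} {'one-sided p-value':>18} {'Bayes factor (uniform alt / null)':>34}")
for r in range(60, 86, 5):
    p_value = binom.sf(r - 1, n, theta0)  # P(X >= r | theta0)
    marginal_alt = 1.0 / (n + 1)          # marginal probability of r under a uniform prior on theta
    bayes_factor = marginal_alt / binom.pmf(r, n, theta0)
    print(f"{r:>5} {p_value:>18.4f} {bayes_factor:>34.2f}")
```

The exact numbers depend on the prior chosen for the alternative, but the monotone relationship is the point.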
So, what difference does Mr. Frequentist see between the two experiments? In George's case we have no information except the final results. For Bessel on the other hand, once we understand the method that was used to determine the results, we know that at every intermediate step before the final result the cure rate was less than 70%.
I don't think that makes a difference. Assuming, as the scenario does, that different patients have independent responses to treatment, the Bayesian's sequence of outcomes and the frequentist's are different permutations of 70 heads and 30 tails, each of equal probability. Whatever they say about the efficacy of the treatment, they say the same thing.
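To spell the step out: if each patient independently has cure probability $\theta$, then any particular sequence containing 70 cures and 30 failures has probability

$$P(\text{sequence} \mid \theta) = \theta^{70}(1-\theta)^{30},$$

no matter the order in which the cures arrived, so every such permutation carries exactly the same evidence about $\theta$.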
It might be unlikely for the frequentist's sequence of outcomes to only hit his target p-value at the 100th patient, but this probability does not bear on the efficacy of the treatment, and the frequentist is ignoring it anyway.
In practice I would be more concerned that the motivation to get a desired result might corrupt the patient evaluations (assumed away in Jaynes's thought experiment).
Correct: in reality the world doesn't change if we reorder our results. The point is that for a frequentist it feels like it should. Because the method is flawed, it seems right for the result to be less right. This is a bad way of analyzing results, but not as bad a way of evaluating methodologies.
Your valid concern about corrupted results stems from the correlation between bad behavior and what a frequentist calls a bad methodology.
Bessel's methodology is not inherently bad either. If Bessel believed that the treatment would save lives and needed to keep going to prove it, wouldn't he behave the same way?
We need a Bayesian approach that can help evaluate methodologies with and without informative priors. This probably already exists in the literature, but we won't be able to overcome the use of p-values until it is common knowledge.
we know that at every intermediate step before the final result the cure rate was less than 70%.
Actually, why? Bessel was worried about whether the rate of cures was greater than 60%, and after the 99th patient there were either 69 cures out of 99 (~69.7%) or 70 cures out of 99 (~70.7%), so he could have stopped there and that would not have hurt his reputation.
IIRC, when we discussed this essay in the reading group, one member said that Eliezer here did a bad job describing the thought experiment, and that actually both experimenters precommitted to treat at least 100 patients.
and that actually both experimenters precommitted to treat at least 100 patients.
That would be an interesting wrinkle. I haven't read the original source. Supposing this, I would think Mr. Frequentist would still say Bessel is more likely to be fooled by unlikely data than George (in the positive direction only), but honestly only by a very small amount. One could call that the trade-off for a method that won't be fooled by unlikely negative data.
Actually, why?
I was treating the description of Bessel as having a distinct stopping condition of 70%; otherwise he would have stopped at 69.7% like you said. If he was running the tests one at a time, we know 70.7% at 99 patients didn't occur, because he stopped at 100.
I haven't read the original source.
It can be found (for example) here, pp 12-13. A higher quality but paywalled PDF is here.
The point Jaynes is making is that it does not matter what the stopping rule is. Once the data are obtained, the experimenter's state of mind is irrelevant to what the data imply about the phenomenon under study.
"Bayesian inference will not get us into this absurd situation, because it perceives automatically what common sense demands; that what is relevant for this inference is not the relative probabilities of imaginary data sets which were not observed, but the relative likelihoods of different parameter values, based on the one real data set which was observed; and this is the same for all the experimenters."
Isn't Bessel engaging in a kind of publication bias? After all, if he never gets the desired effect size out of a study, it never concludes, so presumably you won't hear about it.
You may have two different treatments A and B, and both have comparable effect sizes according to the literature, but you learn that all the published studies involving B were performed by Bessel who you know engages in publication bias. The published studies for A were conducted by George who, as is widely known, doesn't have this bias. So presumably, if you hear a study was conducted by Bessel, you should correct the reported effect size downwards when estimating the real (underlying) effect size. If you hear a study was conducted by George, you can assume no such publication bias exists, so you shouldn't correct the reported effect size downwards.
So, if A and B have the same overall reported effect size, you should assume that the effect size of B is lower than of A.
Now assume, unbeknownst to you, that Bessel didn't actually have to withhold any studies, as the effect sizes all happened to be above the desired threshold. Should you still correct the reported effect size downward? Answer: yes, of course, since you don't know that this is the case. The only thing you "know" is the published effect sizes and the fact that Bessel (the person) engages in biased reporting, which is evidence that the reported effect sizes overestimate the real effect size.
This is similar to how your subjective probability that you have won the lottery is very low before you have checked the results, even if, as luck would have it, you did indeed win the lottery.
Interesting. Presumably if Bessel never got the results he wanted, he could (assuming he's honest) continue until the negative data was enough to convince himself that he was wrong. Depending on his prior that might not happen; he could run out of money or motivation before he gave up and published a negative result. Avoiding this seems related to issues about publishing negative results and timely reporting of raw data.
With regard to the biased reporting, I'll just mention that we would have to adjust for known bias whether we were using Bayesian or frequentist methods.
Today I read the Sequences post Beautiful Probability for the first time. I will begin by quoting the passage that Eliezer opened with:
Let me introduce this issue by borrowing a complaint of the late great Bayesian Master, E. T. Jaynes (1990):
"Two medical researchers use the same treatment independently, in different hospitals. Neither would stoop to falsifying the data, but one had decided beforehand that because of finite resources he would stop after treating N=100 patients, however many cures were observed by then. The other had staked his reputation on the efficacy of the treatment, and decided he would not stop until he had data indicating a rate of cures definitely greater than 60%, however many patients that might require. But in fact, both stopped with exactly the same data: n = 100 [patients], r = 70 [cures]. Should we then draw different conclusions from their experiments?" (Presumably the two control groups also had equal results.)
This quote captured me completely—I found myself caught up thinking through the implications before I could even finish reading the rest of the post. It gave me an unexpected insight into the internal logic and motivations behind the frequentist approach to statistics. There is a genuine difference in the information conveyed by the two scenarios Jaynes describes.
To make the discussion more approachable, let's introduce names. Let's call the first researcher George: he stopped after treating 100 patients regardless of the results. The second researcher, whom we'll call Bessel, continued until the results supported his belief. This small act of naming helps us think about their mindsets and intentions. Similarly, I'm going to call "Frequentist" the method that assigns the two experiments different p-values, and "Bayesian" the method that assigns them the same likelihood ratio.
Computing p-values is what Mr. Frequentist is all about. This method is centered on pre-defined procedures that allow us to interpret data in a principled and repeatable way: "What is the chance that using the same method would give a result at least this extreme under the null hypothesis?" Necessarily this approach says a lot about the method being used. While this mindset may seem flawed, it reflects an important goal: measuring, a priori, how often a method will yield false positives.
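As a concrete sketch (my own numbers, assuming the null hypothesis is a cure rate of exactly 60%, the threshold in Jaynes's setup), this is the calculation Mr. Frequentist would do for George's fixed-sample design:

```python
from scipy.stats import binom

# One-sided p-value for George's fixed-N design: the probability of seeing
# 70 or more cures out of 100 patients if the true cure rate were only 60%.
p_value = binom.sf(69, 100, 0.6)   # P(X >= 70) with n = 100, theta = 0.6
print(f"p-value = {p_value:.4f}")  # comes out around 0.025
```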
When the rules are followed, the frequentist framework offers a safeguard against self-deception and overinterpretation. Even if it doesn’t capture all the nuances of belief and uncertainty, it helps maintain epistemic discipline—especially when evaluating experimental methods in advance.
So, what difference does Mr. Frequentist see between the two experiments? In George's case we have no information except the final results. For Bessel on the other hand, once we understand the method that was used to determine the results, we know that at every intermediate step before the final result the cure rate was less than 70%.
George’s result is evaluated at face value: 70 out of 100 patients were cured.
To the frequentist, however, Bessel’s result isn't just one experiment—it looks like a sequence of 99 experiments that failed to produce the result he wanted, followed by one that did. Even though he stopped at 100 patients just like George, the manner in which he arrived at the data changes its meaning.
It feels wrong to treat the final dataset from Bessel as equivalent to George's, because Bessel ignored the negative results of 99 experiments to get to one that was positive. If we assume an agnostic prior, George's fixed-sample method is less likely to be fooled than Bessel's optional stopping. The higher p-value is a way of punishing Bessel for bad experimental design.
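To make "less likely to be fooled" concrete, here is a small simulation sketch of my own (not from Jaynes or Eliezer). It assumes, purely for illustration, that the true cure rate is exactly 60%, that George reports a positive result when he sees 70 or more cures in his fixed 100 patients, and that Bessel stops and reports a positive result the first time his running cure rate reaches 70% after at least 20 patients, giving up after 1000 patients. The specific thresholds are made up; the point is only that optional stopping inflates the rate of false alarms:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_null = 0.6   # true cure rate: the treatment is in fact no better than 60%
n_sims = 5_000     # simulated experiments per design

# George: fixed N = 100, positive result if at least 70 cures.
george_positive_rate = np.mean(rng.binomial(100, theta_null, size=n_sims) >= 70)

# Bessel (illustrative stopping rule): treat patients one at a time and stop
# with a positive result as soon as the running cure rate reaches 70% after
# at least 20 patients; give up (negative result) after 1000 patients.
bessel_positives = 0
for _ in range(n_sims):
    cures = 0
    for n in range(1, 1001):
        cures += rng.random() < theta_null
        if n >= 20 and cures / n >= 0.7:
            bessel_positives += 1
            break

print(f"George's false-positive rate: {george_positive_rate:.3f}")       # around 0.02-0.03
print(f"Bessel's false-positive rate: {bessel_positives / n_sims:.3f}")  # substantially higher
```

Under these made-up assumptions George cries wolf only a few percent of the time while Bessel does so far more often, which is exactly the asymmetry the higher p-value is pricing in.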
Viewed this way, frequentism is a kind of procedural epistemology—a way of knowing that emerges not from belief, but from method. It’s a mindset that values long-term calibration, error control, and clarity about what an analysis can and cannot say.
George and Bessel may have produced identical datasets, but the paths they took to get there matter. With no view of a priori expectations besides the null hypothesis, frequentists are forced to fall back on analyzing methods rather than priors.
The frequentist approach, even if flawed in certain respects, still serves as a valuable heuristic. It teaches us to be wary of overfitting to outcomes, to ask about the process behind the numbers, and to maintain a healthy skepticism when interpreting results. Its insistence on method over outcome protects us from the temptation to rationalize or cherry-pick. I'd rather a scientist work with p-values than with their intuition alone.
The Bayesian perspective can still view Bessel's results as a series of 100 experiments. Because we assume things like time-invariance of the treatment and independence between patients, however, the final total at the 100th experiment screens off the results of the first 99: there is no information about the cure rate in the first 99 results that is not contained in the final total. So Mr. Bayesian assigns the two researchers' results equivalent likelihood ratios. Or, if he believes that both applied the treatment the same way, he assigns a joint likelihood ratio based on r = 140 cures and n = 200 patients.
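To see why the ratios come out the same, compare any two candidate cure rates $\theta_1$ and $\theta_0$ on the observed data $D$ (70 cures in 100 patients):

$$\frac{P(D \mid \theta_1)}{P(D \mid \theta_0)} = \frac{\theta_1^{70}\,(1-\theta_1)^{30}}{\theta_0^{70}\,(1-\theta_0)^{30}}.$$

Whatever factor Bessel's stopping rule contributes to the probability of the data, it depends only on the observed sequence and not on $\theta$, so it cancels from the ratio; George and Bessel end up with identical likelihood ratios.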
Bayesian likelihood ratios are a better way of reporting final results than p-values. But frequentist approaches have been a useful tool, especially when comparing experimental methods beforehand. Before we consign p-values permanently to the dustbin, we need to be able to replace their strength in evaluating experimental design.