Andrew Gelman recently responded to a commenter on the Yudkowsky/Gelman diavlog; the commenter complained that Bayesian statistics were *too subjective* and *lacked rigor*. I shall explain why this is unbelievably ironic, but first, the comment itself:

However, the fundamental belief of the Bayesian interpretation, that all probabilities are subjective, is problematic -- for its lack of rigor... One of the features of frequentist statistics is the ease of testability. Consider a binomial variable, like the flip of a fair coin. I can calculate that the probability of getting seven heads in ten flips is 11.71875%... At some point a departure from the predicted value may appear, and frequentist statistics give objective confidence intervals that can precisely quantify the degree to which the coin departs from fairness...

Gelman's first response is "Bayesian probabilities don't have to be subjective." Not sure I can back him on that; probability is ignorance and ignorance is a state of mind (although indeed, some Bayesian probabilities can correspond very directly to observable frequencies in repeatable experiments).

My own response is that frequentist statistics are *far* more subjective than Bayesian likelihood ratios. Exhibit One is the notion of "statistical significance" (which is what the above comment is actually talking about, although "confidence intervals" have almost the same problem). Steven Goodman offers a nicely illustrated example: Suppose we have at hand a coin, which may be fair (the "null hypothesis") or perhaps biased in some direction. So lo and behold, I flip the coin six times, and I get the result TTTTTH. Is this result statistically significant, and if so, what is the p-value - that is, the probability of obtaining a result at least this extreme?

Well, that depends. Was I planning to *flip the coin six times,* and *count the number of tails?* Or was I planning to *flip the coin until it came up heads,* and *count the number of trials?* In the first case, the probability of getting "five tails or more" from a fair coin is 11%, while in the second case, the probability of a fair coin requiring "at least five tails before seeing one heads" is 3%.
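Both numbers are easy to check. A quick sketch, using only the standard library - stopping rule 1 is a binomial tail, stopping rule 2 is the probability that the first five flips are all tails:

```python
from math import comb

# Stopping rule 1: flip exactly 6 times, count the tails.
# p-value: probability of 5 or more tails in 6 flips of a fair coin.
p_binomial = sum(comb(6, k) for k in (5, 6)) / 2**6   # 7/64

# Stopping rule 2: flip until the first head, count the trials.
# p-value: probability a fair coin shows at least 5 tails before its
# first head, i.e. the first 5 flips all come up tails.
p_neg_binomial = (1 / 2) ** 5                          # 1/32

print(f"{p_binomial:.1%}, {p_neg_binomial:.1%}")       # 10.9%, 3.1%
```

Same coin, same data, same TTTTTH on the table - and the p-value moves from roughly 11% to roughly 3% depending on what the experimenter was planning.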

Whereas a Bayesian looks at the experimental result and says, "I can now calculate the likelihood ratio (evidential flow) between all hypotheses under consideration. Since your state of mind doesn't affect the coin in any way - doesn't change the probability of a fair coin or biased coin producing this exact data - there's no way your private, unobservable state of mind can affect my interpretation of your experimental results."
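To make that concrete, here is a minimal sketch comparing the fair-coin hypothesis against one illustrative alternative (a hypothetical 75%-tails coin - the 0.75 is my choice, not from the original). The likelihood of the *exact* sequence TTTTTH is the same under either stopping rule, so the likelihood ratio cannot depend on the experimenter's intentions:

```python
def seq_likelihood(seq, p_tails):
    """Probability of this exact flip sequence, given a per-flip tail probability."""
    like = 1.0
    for flip in seq:
        like *= p_tails if flip == "T" else (1 - p_tails)
    return like

data = "TTTTTH"
like_fair   = seq_likelihood(data, 0.5)    # (1/2)^6
like_biased = seq_likelihood(data, 0.75)   # 0.75^5 * 0.25

# Depends only on the observed sequence - not on whether the plan was
# "flip 6 times" or "flip until heads":
lr = like_biased / like_fair
print(lr)  # 3.796875
```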

If you're used to Bayesian methods, it may seem difficult to even imagine that the statistical interpretation of the evidence ought to depend on a factor - namely the experimenter's state of mind - which has no causal connection whatsoever to the experimental result. (Since Bayes says that evidence is about correlation, and no systematic correlation can appear without causal connection; evidence requires entanglement.) How can frequentists manage even in *principle* to make the evidence depend on the experimenter's state of mind?

It's a complicated story. Roughly, the trick is to make yourself artificially ignorant of the data - instead of knowing the *exact* experimental result, you pick a class of possible results which *includes* the actual experimental result, and then pretend that you were told *only* that the result was somewhere in this class. So if the actual result is TTTTTH, for example, you can pretend that this is part of the class {TTTTTH, TTTTTTH, TTTTTTTH, ...}, a class whose total probability is 3% (1/32). Or if I preferred to have this experimental result *not* be statistically significant with p < 0.05, I could just as well pretend that some helpful fellow told me only that the result was in the class {TTTTTH, HHHHHT, TTTTTTH, HHHHHHT, ...}, so that the total probability of the class would be 6%, n.s. (In frequentism this question is known as applying a "two-tailed test" or "one-tailed test".)
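The two class probabilities above can be checked directly - the one-tailed class is a geometric series summing to 1/32, and throwing in the mirror-image sequences doubles it past the 0.05 threshold:

```python
# One-tailed class {TTTTTH, TTTTTTH, TTTTTTTH, ...}:
# sum over n >= 5 of P(n tails, then a head) = sum of (1/2)^(n+1) = (1/2)^5.
one_tailed = sum(0.5 ** (n + 1) for n in range(5, 200))   # ~ 1/32 ~ 3.1%

# Adding the mirror-image sequences {HHHHHT, HHHHHHT, ...} doubles the
# class probability to ~ 1/16 ~ 6.2% - no longer "significant" at p < 0.05.
two_tailed = 2 * one_tailed
print(one_tailed, two_tailed)
```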

The arch-Bayesian E. T. Jaynes ruled out this sort of reasoning by telling us that a Bayesian ought only to condition on events that actually happened, not events that could have happened but didn't. (This is not to be confused with the dog that doesn't bark. In this case, the dog was *in fact* silent; the silence of the dog happened in the real world, not somewhere else. We are rather being told that a Bayesian should not have to worry about alternative possible worlds in which the dog *did* bark, while estimating the evidence to take from the *real* world in which the dog did *not* bark. A Bayesian only worries about the experimental result that was, in fact, obtained; not other experimental results which could have been obtained, but weren't.)

The process of throwing away the actual experimental result, and substituting a class of possible results which contains the actual one - that is, deliberately losing some of your information - introduces a dose of *real* subjectivity. Colin Begg reports on one medical trial where the data was variously analyzed as having a significance level - that is, probability of the "experimental procedure" producing an "equally extreme result" if the null hypothesis were true - of p=0.051, p=0.001, p=0.083, p=0.28, and p=0.62. Thanks, but I think I'll stick with the conditional probability of the *actual* experiment producing the *actual* data.

Frequentists are apparently afraid of the possibility that "subjectivity" - that thing they were accusing Bayesians of - could allow some unspecified terrifying abuse of the scientific process. Do I need to point out the general implications of being allowed to throw away your actual experimental results and substitute a class you made up? In general, if this sort of thing is allowed, I can flip a coin, get 37 heads and 63 tails, and decide that it's part of a *class* which includes all mixtures with at least 75 heads *plus* this exact particular sequence. As if I only had the output of a fixed computer program which was written in advance to look at the coinflips and compute a yes-or-no answer as to whether the data is in that class.
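For concreteness, here's how that gerrymandered class plays out numerically (taking the 37 heads and 63 tails to mean 100 flips): the class "at least 75 heads, plus my exact observed sequence" contains the actual result by construction, yet has vanishingly small probability under the null, so the data comes out wildly "significant":

```python
from math import comb

n = 100
# Probability of 75 or more heads in 100 fair flips (about 5 standard
# deviations out - roughly 3e-7):
p_extreme = sum(comb(n, k) for k in range(75, n + 1)) / 2**n
# Probability of the one exact observed 37-head sequence:
p_exact = 1 / 2**n
# The made-up class that happens to contain the actual result:
p_class = p_extreme + p_exact
print(p_class < 0.0001)  # True
```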

Meanwhile, Bayesians are accused of being "too subjective" because we might - gasp! - assign the wrong prior probability to something. First of all, it's obvious from a Bayesian perspective that science papers should be in the business of reporting likelihood ratios, not posterior probabilities - likelihoods add up across experiments, so to get the latest posterior you wouldn't *just* need a "subjective" prior, you'd also need all the *cumulative* evidence from other science papers. Now, this accumulation might be a *lot* more straightforward for a Bayesian than a frequentist, but it's not the sort of thing a typical science paper should have to do. Science papers should report the likelihood ratios for any popular hypotheses - but above all, make the actual raw data available, so the likelihoods can be computed for *any* hypothesis. (In modern times there is absolutely no excuse for not publishing the raw data, but that's another story.)
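"Likelihoods add up across experiments" means, precisely, that likelihood ratios multiply (log-likelihood ratios add). A toy sketch with two made-up coin experiments, testing a hypothetical 55%-heads coin against a fair one - multiplying the two reported ratios gives exactly the same answer as pooling the raw data:

```python
from math import log

def lr(heads, tails, p1, p0=0.5):
    """Likelihood ratio for a batch of flips: bias hypothesis p1 vs. null p0."""
    return (p1 ** heads * (1 - p1) ** tails) / (p0 ** heads * (1 - p0) ** tails)

# Two independent experiments, each reported only as a likelihood ratio
# (the counts 60/40 and 57/43 are made up for illustration):
lr1 = lr(60, 40, 0.55)
lr2 = lr(57, 43, 0.55)

# To accumulate the evidence, multiply the ratios (equivalently, add the logs).
combined = lr1 * lr2

# Pooling the raw data gives the same answer:
pooled = lr(60 + 57, 40 + 43, 0.55)
print(abs(log(combined) - log(pooled)) < 1e-9)  # True
```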

And Bayesian likelihoods *really are* objective - so long as you use the actual exact experimental data, rather than substituting something else.

Meanwhile, over in frequentist-land... what if you told everyone that you had done 127 trials because that was how much data you could afford to collect, but *really* you kept performing more trials until you got a p-value that you liked, and then stopped? Unless I've got a bug in my test program, a limit of up to 500 trials of a "fair coin" would, 30% of the time, arrive at some step where you could stop and reject the null hypothesis with p<0.05. Or 9% of the time with p<0.01. Of course this requires some degree of scientific dishonesty... or, perhaps, some minor confusion on the scientist's part... since if this is what you are *thinking,* you're supposed to use a *different* test of "statistical significance". But it's not like we can actually look inside their heads to find out what the experimenters were thinking. If we're worried about scientific dishonesty, surely we should worry about *that?* (A similar test program done the Bayesian way, set to stop as soon as finding likelihood ratios of 20/1 and 100/1 relative to an alternative hypothesis that the coin was 55% biased, produced false positives of 3.2% and 0.3% respectively. Unless there was a bug; I didn't spend that much time writing it.)
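Here is a sketch of the two test programs described (my reconstruction, not the author's code; a normal approximation stands in for an exact binomial test, so the exact percentages will differ a little):

```python
import random
from math import sqrt

random.seed(0)

def frequentist_fishing(max_flips=500, z_crit=1.96, start=10):
    """Flip a fair coin, testing after every flip; stop at the first
    'significant' two-sided z-score (normal approximation to the binomial)."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += random.random() < 0.5
        if n >= start and abs(heads - n / 2) / sqrt(n / 4) > z_crit:
            return True   # falsely "rejected" the fair-coin null
    return False

def bayesian_stopping(max_flips=500, threshold=20.0):
    """Flip a fair coin; stop when the likelihood ratio favoring a
    hypothetical 55%-heads coin (vs. fair) first reaches the threshold."""
    lr = 1.0
    for _ in range(max_flips):
        if random.random() < 0.5:
            lr *= 0.55 / 0.5   # heads
        else:
            lr *= 0.45 / 0.5   # tails
        if lr >= threshold:
            return True       # false positive for the biased-coin hypothesis
    return False

trials = 2000
freq_rate = sum(frequentist_fishing() for _ in range(trials)) / trials
bayes_rate = sum(bayesian_stopping() for _ in range(trials)) / trials
print(freq_rate, bayes_rate)  # roughly 0.25-0.35 vs. a few percent
```

Note the asymmetry: the optional-stopping rate for the significance test keeps climbing as you allow more flips, while the sequential likelihood-ratio test's false-positive rate is bounded by 1/threshold (here 1/20 = 5%) no matter how long you keep flipping.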

The *actual* subjectivity of standard frequentist methods, the ability to manipulate "statistical significance" by choosing different tests, is not a minor problem in science. There are ongoing scandals in medicine and neuroscience from lots of "statistically significant" results failing to replicate. I would point a finger, not just at publication bias, but at scientists armed with powerful statistics packages with lots of complicated tests to run on their data. Complication is really dangerous in science - unfortunately, the social rule instead seems to be that throwing around big fancy statistical equations is highly prestigious. (I suspect that some of the opposition to Bayesianism comes from the fact that Bayesianism is too simple.) The obvious fix is to (a) require raw data to be published; (b) require journals to *accept papers* before the *experiment* is performed, with the advance paper including a specification of what statistics were selected in advance to be run on the results; (c) raise the standard "significance" level to p<0.0001; and (d) junk all the damned overcomplicated status-seeking impressive nonsense of classical statistics and go to simple understandable Bayesian likelihoods.

Oh, and this frequentist business of "confidence intervals"? Just as subjective as "statistical significance". Let's say I've got a measuring device which returns the true value plus Gaussian noise. If I know you're about to collect 100 results, I can write a computer program such that, before the experiment is run, it's 90% probable that the true value will lie within the interval output by the program.

So I write one program, my friend writes another program, and my enemy writes a third program, all of which make this same guarantee. And in all three cases, the guarantee is true - the program's interval will indeed contain the true value at least 90% of the time, if the experiment returns the true value plus Gaussian noise.

So you run the experiment and feed in the data; and the "confidence intervals" returned are [0.9-1.5], [2.0-2.2], and ["Cheesecake"-"Cheddar"].

The problem may be made clearer by considering the third program, which works as follows: 95% of the time, it does standard frequentist statistics to return an interval which will contain the true value 95% of the time, and on the other 5% of the time, it returns the interval ["Cheesecake"-"Cheddar"]. It is left as an exercise to the reader to show that this program will output an interval containing the true value at least 90% of the time.
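The exercise can be done by simulation as well as by arithmetic: the third program's coverage is 0.95 × 0.95 = 0.9025 ≥ 90%, even though its intervals are occasionally cheese. A sketch (the true value, noise level, and sample size are my choices for illustration):

```python
import random
from math import sqrt

random.seed(1)

TRUE_VALUE, SIGMA, N = 10.0, 1.0, 100

def third_program(data):
    """95% of the time: a standard 95% confidence interval.
    The other 5% of the time: an interval of cheese."""
    if random.random() < 0.95:
        mean = sum(data) / len(data)
        half = 1.96 * SIGMA / sqrt(len(data))
        return (mean - half, mean + half)
    return ("Cheesecake", "Cheddar")

def covers(interval, value):
    lo, hi = interval
    return isinstance(lo, float) and lo <= value <= hi

trials = 5000
hits = 0
for _ in range(trials):
    data = [random.gauss(TRUE_VALUE, SIGMA) for _ in range(N)]
    hits += covers(third_program(data), TRUE_VALUE)

coverage = hits / trials
print(coverage)  # close to 0.95 * 0.95 = 0.9025
```

The guarantee "contains the true value at least 90% of the time" is satisfied, which is exactly the problem: the guarantee is about the procedure's long-run behavior, not about what this particular interval tells you about this particular true value.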

BTW, I'm pretty sure I recall reading that "90% confidence intervals" as published in journal papers, in those cases where a true value was later pinned down more precisely, did *not* contain the true value 90% of the time. So what's the point, even? Just show us the raw data and maybe give us a summary of some likelihoods.

Parapsychology, the control group for science, would seem to be a thriving field with "statistically significant" results aplenty. Oh, sure, the effect sizes are minor. Sure, the effect sizes get even smaller (though still "statistically significant") as they collect more data. Sure, if you find that people can telekinetically influence the future, a similar experimental protocol is likely to produce equally good results for telekinetically influencing the past. Of which I am less tempted to say, "How amazing! The power of the mind is not bound by time or causality!" and more inclined to say, "Bad statistics are time-symmetrical." But here's the thing: Parapsychologists are constantly protesting that they are playing by all the standard scientific rules, and yet their results are being ignored - that they are unfairly being held to higher standards than everyone else. I'm willing to believe that. It just means that the *standard* statistical methods of science are so weak and flawed as to permit a field of study to sustain itself in the complete absence of any subject matter. With two-thirds of medical studies in prestigious journals failing to replicate, getting rid of the entire actual subject matter would shrink the field by only 33%. We have to raise the bar high enough to exclude the results claimed by parapsychology under classical frequentist statistics, and then fairly and evenhandedly apply the *same* bar to the rest of science.

Michael Vassar has a theory that when an academic field encounters advanced statistical methods, it becomes really productive for ten years and then bogs down because the practitioners have learned how to game the rules.

For so long as we do not have infinite computing power, there may yet be a place in science for non-Bayesian statistics. The Netflix Prize was not won by using strictly purely Bayesian methods, updating proper priors to proper posteriors. In that acid test of statistical discernment, what worked *best* was a gigantic ad-hoc mixture of methods. It may be that if you want to get the most mileage out of your data, in this world where we do not have infinite computing power, you'll have to use some ad-hoc tools from the statistical toolbox - tools that throw away some of the data, that make themselves artificially ignorant, that take all sorts of steps that can't be justified in the general case and that are potentially subject to abuse and that will give wrong answers now and then.

But don't do that, and *then* turn around and tell me that - of all things! - Bayesian probability theory is *too subjective*. Probability theory is the *math* in which the results are *theorems* and every theorem is compatible with every other theorem and you never get different answers by calculating the same quantity in different ways. To resort to the ad-hoc variable-infested complications of frequentism while preaching your *objectivity*? I can only compare this with the politicians who go around preaching "Family values!" and then get caught soliciting sex in restrooms. So long as you deliver loud sermons and make a big fuss about painting yourself with the right labels, you get identified with that flag - no one bothers to look very hard at what you *do.* The case of frequentists calling Bayesians "too subjective" is worth dwelling on for that aspect alone - emphasizing how important it is to look at what's actually going on instead of just listening to the slogans, and how rare it is for anyone to even glance in that direction.

-- Black Belt Bayesian

I really like the idea of parapsychology as the control group for science; it deserves to be better known.

http://www.newscientist.com/article/mg20427375.700 'nuff said

The demon you're describing can't be exorcised simply by switching from frequentism to Bayesianism - it torments the Bayesians as well. It is more an issue of intellectual honesty than of statistical paradigm...

What is the source for the claim that two thirds of scientific studies that are published in prestigious medical journals fail to replicate?

If you want medical researchers to publish all the data they use in their experiments, you have to say goodbye to privacy. If you have enough data about the individual patients who entered a study on AIDS drugs, you will be able to data-mine whether your friend Joe was one of the patients in the study.

Are we willing to live in a world where all medical data about everyone is in the public domain? The trade-offs are enormous, but there isn't much broad discussion of the question in the mainstream.

Would you mind publishing your test code?

And they transpose the conditional! If a sample is likely given the hypothesis, it does not necessarily follow that the hypothesis is likely given the sample. This always struck me as the most egregious failure of naive significance testing.

Although this post is old, I really dislike the description of confidence intervals. An appropriate confidence interval is made out of the points with the "best" likelihoods: every point inside the interval has a higher likelihood than every point outside. Therefore, an output of '["Cheesecake"-"Cheddar"]' is a clear sign of bad science. A good confidence interval is just a summary of likelihood ratios. And for symmetric likelihood profiles (at least near the minimum), it gives a very good idea of the function in the interesting range.

And mos...

One aspect of the frequentist approach that I think should be mentioned is the compression of information in its results.

This is pernicious for specialists, but for non-specialists it's a boon. Rather than carting around precise numerical data for every proposed theory (numerical data that we can never remember, as the uncertainty over the 2/3 non-replication figure shows - it gets even worse if we have to remember whole distributions), you simply need to remember a binary result: significant/not significant.

(Things would be even simpler if we got...

One plausible counter-argument/limitation is that researchers need incentives to collect data. Overall I agree with you. Here is sociologist Jeremy Freese on this issue: http://www.jeremyfreese.com/docs/freese-reproducibility-webdraft.pdf and http://www.jeremyfreese.com/docs/Freese%20-%20OpenSourceSocialScience%20-%20062807.pdf

See also *How to Lie with Statistics*, an oldie but goodie:

http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728

Broken link there.

http://yudkowsky.net/rational/bayes - link error

Regarding frequentists' concerns about subjectivity in the Bayesian interpretation:

...and randomness instead of "true randomness". Mind the mind.

"In modern times there is absolutely no excuse for not publishing the raw data, but that's another story."

Nope. If, like many studies, you have data on real live humans, there are perfectly sane ethical and legal considerations which make publishing raw data a non-starter. Even publishing summaries is a problem; see the way the Wellcome Trust Case Control Consortium recently pulled aggregate results from their website when it became clear that individual disease status could be ascertained from them.

Do you think that that's worse than the way things are done currently?