Recently, a colleague was reviewing an article whose key justification rested on some statistics that seemed dodgy to him, so he came to me for advice. (I guess my boss, the resident statistician, was out of his office.) Now, I'm no expert in frequentist statistics. My formal schooling in frequentist statistics comes from my undergraduate chemical engineering curriculum -- I wouldn't rely on it for consulting. But I've been working for someone who is essentially a frequentist for a year and a half, so I've had some hands-on experience. My boss hired me on the strength of my experience with Bayesian statistics, which I taught myself in grad school, and one thing reading the Bayesian literature voraciously will equip you for is critiquing frequentist statistics. So I felt competent enough to take a look.^{1}

The article compared an old, trusted experimental method with the authors' new method; the authors sought to show that the new method gave the same results on average as the trusted method. They performed three replicates using the trusted method and three replicates using the new method; each replicate generated a real-valued data point. They did this in nine different conditions, and for each condition, they did a statistical hypothesis test. (I'm going to lean heavily on Wikipedia for explanations of the jargon terms I'm using, so this post is actually a lot longer than it appears on the page. If you don't feel like following along, the punch line is three paragraphs down, last sentence.)

The authors used what's called a Mann-Whitney U test, which, in simplified terms, aims to determine if two sets of data come from different distributions. The essential thing to know about this test is that it doesn't depend on the actual data except insofar as those data determine the ranks of the data points when the two data sets are combined. That is, it throws away most of the data, in the sense that data sets that generate the same ranking are equivalent under the test. The rationale for doing this is that it makes the test "non-parametric" -- you don't need to assume a particular form for the probability density when all you look at are the ranks.
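To make the "ranks only" point concrete, here is a minimal sketch in plain Python. The function name `u_statistic` and both data sets are made up for illustration; they are not the article's data. The two data sets differ wildly in magnitude but share the same ranking, so they produce the same U statistic.

```python
def u_statistic(xs, ys):
    """Mann-Whitney U for xs versus ys: the number of (x, y) pairs
    with x > y, counting ties as one half."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in xs
        for y in ys
    )

# Hypothetical data set A: trusted method vs. new method.
old_a, new_a = [1.0, 2.0, 4.0], [3.0, 5.0, 6.0]

# Hypothetical data set B: very different values, same combined ranking
# (trusted method holds ranks 1, 2, 4; new method holds ranks 3, 5, 6).
old_b, new_b = [0.01, 0.02, 950.0], [7.5, 1e6, 1e9]

u_a = u_statistic(new_a, old_a)
u_b = u_statistic(new_b, old_b)
print(u_a, u_b)
```

Because the statistic only counts which points exceed which, any two data sets with the same combined ranking are indistinguishable to the test.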

The output of a statistical hypothesis test is a p-value; one pre-establishes a threshold for statistical significance, and if the p-value is lower than the threshold, one draws a certain conclusion called "rejecting the null hypothesis". In the present case, the null hypothesis is that the old method and the new method produce data from the same distribution; the authors would like to see data that do not lead to rejection of the null hypothesis. They established the conventional threshold of 0.05, and for each of the nine conditions, they reported either "p > 0.05" or "p = 0.05"^{2}. Thus they did not reject the null hypothesis, and argued that the analysis supported their thesis.

Now even from a frequentist perspective, this is wacky. Hypothesis testing can reject a null hypothesis, but cannot confirm it, as discussed in the first paragraph of the Wikipedia article on null hypotheses. But this is not the real WTF, as they say. There are twenty ways to choose three objects out of six, so there are only twenty possible rankings and hence at most twenty possible p-values, and these can be computed even when the original data are not available, since they only depend on ranks. I put these facts together within a day of being presented with the analysis and quickly computed all twenty p-values. Here I only need discuss the most extreme case, where all three of the data points for the new method are to one side (either higher or lower) of the three data points for the trusted method. This case provides the most evidence against the notion that the two methods produce data from the same distribution, resulting in the smallest possible p-value^{3}: p = 0.05. In other words, *even before the data were collected it could have been known that this analysis would give the result the authors wanted*.^{4}
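The enumeration described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the calculation actually performed for the review: the function name `exact_p_values` is hypothetical, ties are assumed absent (the data are real-valued), and the p-value shown is the exact one-sided tail probability, which is the convention that reproduces the 0.05 figure. Under a two-sided convention the smallest attainable value would be twice that.

```python
from fractions import Fraction
from itertools import combinations

def exact_p_values(n_old=3, n_new=3):
    """Map each possible set of ranks for the new method to its exact
    one-sided p-value under the null (all arrangements equally likely)."""
    n = n_old + n_new
    arrangements = list(combinations(range(1, n + 1), n_new))

    def u_stat(new_ranks):
        # U = rank sum of the new method minus its smallest possible rank sum.
        return sum(new_ranks) - n_new * (n_new + 1) // 2

    us = [u_stat(a) for a in arrangements]
    total = len(arrangements)
    p_values = {}
    for arrangement, u in zip(arrangements, us):
        lower_tail = sum(1 for v in us if v <= u)
        upper_tail = sum(1 for v in us if v >= u)
        # One-sided p-value in whichever direction the data point.
        p_values[arrangement] = Fraction(min(lower_tail, upper_tail), total)
    return p_values

p_values = exact_p_values()
smallest = min(p_values.values())
print(len(p_values), "arrangements; smallest p-value =", float(smallest))
```

With three replicates per method, only the two fully separated arrangements, ranks (1, 2, 3) and (4, 5, 6), reach the minimum, so no data set can push the p-value below the 0.05 threshold.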

When I canvassed the Open Thread for interest in this article, Douglas Knight wrote: "If it's really frequentism that caused the problem, please spell this out." Frequentism *per se* is not the proximate cause of this problem, that being that the authors either never noticed that their analysis could not falsify their hypothesis, or they tried to pull a fast one. But it is a distal cause, in the sense that it forbids the Bayesian approach, and thus requires practitioners to become familiar with a grab-bag of unrelated methods for statistical inference^{5}, leaving plenty of room for confusion and malfeasance. Technologos's reply to Douglas Knight got it exactly right; I almost jokingly requested a spoiler warning.

^{1} I don't mind that it wouldn't be too hard to figure out who I am based on this paragraph. I just use a pseudonym to keep Google from indexing all my blog comments to my actual name.

^{2} It's rather odd to report a p-value that is exactly equal to the significance threshold, one of many suspicious things about this analysis (the rest of which I've left out as they are not directly germane).

^{3} For those anxious to check my math, I've omitted some blah blah blah about one- and two-sided tests and alternative hypotheses.

^{4} I quickly emailed the reviewer; it didn't make much difference, because when we initially talked about the analysis we had noticed enough other flaws that he had decided to recommend rejection. This was just the icing on the coffin.

^{5} ... none of which *actually address the question* **OF DIRECT INTEREST**! ... phew. Sorry.

This is going to sound silly, but...could someone explain frequentist statistics to me?

Here's my current understanding of how it works:

We've got some hypothesis H, whose truth or falsity we'd like to determine. So we go out and gather some evidence E. But now, instead of trying to quantify our degree of belief in H (given E) as a conditional probability estimate using Bayes' Theorem (which would require us to know P(H), P(E|H), and P(E|~H)), what we do is simply calculate P(E|~H) (techniques for doing this being of course the principal concern of statistics texts), and then place H into one of two bins depending on whether P(E|~H) is below some threshold number ("p-value") that somebody decided was "low": if P(E|~H) is below that number, we put H into the "accepted" bin (or, as they say, we reject the null hypothesis ~H); otherwise, we put H into the "not accepted" bin (that is, we fail to reject ~H).

Now, *if* that is a fair summary, then this big controversy between frequentists and Bayesians must mean that there is a sizable collection of people who think that the above procedure is a *better way of obtaining knowledge* than performing Bayesian...

Non-Bayesianism for Bayesians (based on a poor understanding of Andrew Gelman and Cosma Shalizi)

Lakatos (and Kuhn) are philosophers of science who studied science as scientists actually do it, as opposed to how scientists (at the time) claimed scientists do it. This is in contrast to taking the "scientific method" that we learned in grade school literally. Theories are not rejected at the first evidence that they have failed, they are patched, and so on.

Gelman and Shalizi's criticism of Bayesian rhetoric (as far as I can make out from their blog posts and the slides of Gelman's talk) is (explicitly) similar - what Bayesians do is different than what Bayesians say Bayesians do.

In particular, humans (as opposed to ideal, which is to say nonexistent, Bayesians) do not SIMPLY update on the evidence. There are other important steps in the process, such as checking whether, given the new data, your original model still looks reasonable. (This is "posterior predictive model checking"). This step looks a lot like computing a p-value, though Gelman recommends a graphical presentation, rather than condensing to a single number. In general, the notion of doing research o...

I like your point but not your example.

That isn't just a computational limitation. It's an outright bug. Something that assigns 0 to Z is just not even an approximation of a Bayesian. A sane agent with limited resources may, for example, assign a probability to "A,B,C and 'something else'". If it explicitly assigned an (arbitrarily close to) 0 to Z then it just fails at life.

I've always thought it would be nice to have a "Frequentist-to-Bayesian" guide. Sort of a "Here are some example problems, here's how you might go about it using frequentist methods, here's how you might go about it using Bayesian techniques." My introduction to statistics began with an AP course in high school (and I used this HyperStat source to help out), and of course they teach hypothesis testing and barely give a nod to Bayes' Theorem.

What you've described is the "statistical hypothesis testing" technique, and yes, you've got it right. The only reason it functions at all is that by and large, people who use it aren't stupid, and they know that they have to submit it to peer review to other people who aren't stupid. Nevertheless, a lot of crap gets through, just because the approach is so wrong-headed. ETA: Oops! I left an important detail out of this response.

There are other techniques for frequentist statistics, e.g., unbiased estimators, minimum mean squared error estimators, method of moments, robust estimators, confidence intervals, confidence distributions, maximum likelihood, profile likelihood, empirical likelihood, empirical Bayes, estimating equations, PAC learning, etc., etc., ad nauseam.

Without *acknowledging* a prior.

Some frequentist techniques are strictly incoherent from a Bayesian point of view. In that case there is no prior.

This seems to use "frequentist" to mean "as statistics are actually practiced." It is unreasonable to compare the implementation of A to the ideal form of B. In particular, the problem with the Mann-Whitney test seems to me to be that the authors looked up a recipe in a cookbook without understanding it, which they could have done just as easily with a Bayesian cookbook.

I'm not seeing why what you call "the real WTF" is evidence of a problem with frequentist statistics. The fact that the hypothesis test would have given a statistically insignificant p-value whatever the actual 6 data points were just indicates that whatever the population distributions, 6 data points are simply not enough to disconfirm the null hypothesis. In fact you can see this if you look at Mann & Whitney's original paper! (See the n=3 subtable in table I, p. 52.)

I can picture someone counterarguing that this is not immediately obvious from the details of the statistical test, but I would hope that any competent statistician, frequentist or not, would be sceptical of a nonparametric comparison of means for samples of size 3!

I agree that frequentist statistics are often poorly taught and understood, and that this holds however you like to do your statistics. Still, the main post feels to me like a sales pitch for Bayes brand chainsaws that's trying to scare me off Neyman-Pearson chainsaws by pointing out how often people using Neyman-Pearson chainsaws accidentally cut off a limb with them. (I am aware that I may be the only reader who feels this way about the post.)

Yes, but it is not sufficient evidence to reject the null hypothesis of incompetence at the 0.05 significance level. (I keed, I keed.)

Now that I understand the situation better, I thought I'd share this interesting coincidence.

Last Friday, I was talking to my friend's brother, a PhD candidate in electrical engineering (by all I could tell, a very intelligent man), who had to teach a statistics class to grad students in education. He told me what it was like to teach one of their 8-hour lessons, where he had to explain many of the things in the frequentist toolbox.

I told him about my interest in information theory and Bayesian statistics and asked if the course covers any of that. While he sh...

How difficult would it be to rewrite the Mann-Whitney U test to give a Bayesian likelihood ratio?

Nice analysis. But:

If this is important enough to use as the summary for your post, it's important enough for you to either quote it, or link to it, and thus save us, cumulatively, hours of time searching for it.

How does that follow? The authors didn't know what data they would get, right? So they didn't know the ranking that would show up. They would know in advance that result X would pass the test, but they wouldn't know that they would get result X.