Recently, a colleague was reviewing an article whose key justification rested on some statistics that seemed dodgy to him, so he came to me for advice. (I guess my boss, the resident statistician, was out of his office.) Now, I'm no expert in frequentist statistics. My formal schooling in frequentist statistics comes from my undergraduate chemical engineering curriculum -- I wouldn't rely on it for consulting. But I've been working for someone who is essentially a frequentist for a year and a half, so I've had some hands-on experience. My boss hired me on the strength of my experience with Bayesian statistics, which I taught myself in grad school, and one thing reading the Bayesian literature voraciously will equip you for is critiquing frequentist statistics. So I felt competent enough to take a look.1

The article compared an old, trusted experimental method with the authors' new method; the authors sought to show that the new method gave the same results on average as the trusted method. They performed three replicates using the trusted method and three replicates using the new method; each replicate generated a real-valued data point. They did this in nine different conditions, and for each condition, they did a statistical hypothesis test. (I'm going to lean heavily on Wikipedia for explanations of the jargon terms I'm using, so this post is actually a lot longer than it appears on the page. If you don't feel like following along, the punch line is three paragraphs down, last sentence.) 

The authors used what's called a Mann-Whitney U test, which, in simplified terms, aims to determine if two sets of data come from different distributions. The essential thing to know about this test is that it doesn't depend on the actual data except insofar as those data determine the ranks of the data points when the two data sets are combined. That is, it throws away most of the data, in the sense that data sets that generate the same ranking are equivalent under the test. The rationale for doing this is that it makes the test "non-parametric" -- you don't need to assume a particular form for the probability density when all you look at are the ranks.
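To make the rank-only property concrete, here is a minimal Python sketch. The six data values are invented for illustration (the article's actual data are not reproduced here), and `mann_whitney_u` is a hypothetical helper that simply counts pairwise wins; it is not a call into any statistics library.

```python
def mann_whitney_u(xs, ys):
    """U statistic for xs: number of (x, y) pairs with x > y; ties count 1/2."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in xs
        for y in ys
    )

# Hypothetical replicates, invented for this example.
old = [1.2, 3.4, 2.2]
new = [2.9, 5.1, 4.0]

# A strictly increasing transformation changes every value but no rank.
stretched_old = [10 ** x for x in old]
stretched_new = [10 ** x for x in new]

u_original = mann_whitney_u(new, old)
u_stretched = mann_whitney_u(stretched_new, stretched_old)
print(u_original, u_stretched)  # both 8.0
```

Both data sets produce the same U, so the test cannot tell them apart even though the raw values differ by orders of magnitude.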

The output of a statistical hypothesis test is a p-value; one pre-establishes a threshold for statistical significance, and if the p-value is lower than the threshold, one draws a certain conclusion called "rejecting the null hypothesis". In the present case, the null hypothesis is that the old method and the new method produce data from the same distribution; the authors would like to see data that do not lead to rejection of the null hypothesis. They established the conventional threshold of 0.05, and for each of the nine conditions, they reported either "p > 0.05" or "p = 0.05"2. Thus they did not reject the null hypothesis, and argued that the analysis supported their thesis.

Now even from a frequentist perspective, this is wacky. Hypothesis testing can reject a null hypothesis, but cannot confirm it, as discussed in the first paragraph of the Wikipedia article on null hypotheses. But this is not the real WTF, as they say. There are twenty ways to choose three objects out of six, so there are only twenty possible p-values, and these can be computed even when the original data are not available, since they only depend on ranks. I put these facts together within a day of being presented with the analysis and quickly computed all twenty p-values. Here I only need discuss the most extreme case, where all three of the data points for the new method are to one side (either higher or lower) of the three data points for the trusted method. This case provides the most evidence against the notion that the two methods produce data from the same distribution, resulting in the smallest possible p-value3: p = 0.05. In other words, even before the data were collected it could have been known that this analysis would give the result the authors wanted.4
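The enumeration can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the calculation as originally performed: it assumes no tied values, uses the exact permutation distribution of the new method's rank sum, and the helper names `one_sided_p` and `two_sided_p` are made up for this sketch.

```python
from fractions import Fraction
from itertools import combinations

ranks = range(1, 7)  # ranks 1..6 of the six pooled data points
n_new = 3            # three of them belong to the new method

# Under the null hypothesis, every choice of three ranks for the new
# method is equally likely, so we just list the rank sum of each choice.
rank_sums = [sum(c) for c in combinations(ranks, n_new)]
total = len(rank_sums)  # 20 arrangements

def one_sided_p(observed_sum):
    """P(rank sum >= observed_sum) under the null."""
    return Fraction(sum(s >= observed_sum for s in rank_sums), total)

def two_sided_p(observed_sum):
    """P(rank sum at least as far from its null mean as observed_sum)."""
    mean = Fraction(sum(rank_sums), total)
    deviation = abs(observed_sum - mean)
    return Fraction(sum(abs(s - mean) >= deviation for s in rank_sums), total)

smallest_one_sided = min(one_sided_p(s) for s in rank_sums)
smallest_two_sided = min(two_sided_p(s) for s in rank_sums)
print(total, smallest_one_sided, smallest_two_sided)  # 20 1/20 1/10
```

Under these assumptions the smallest attainable one-sided p-value is 1/20 = 0.05 and the smallest two-sided one is 1/10, so no arrangement of six untied points can give p < 0.05.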

When I canvassed the Open Thread for interest in this article, Douglas Knight wrote: "If it's really frequentism that caused the problem, please spell this out." Frequentism per se is not the proximate cause of this problem, that being that the authors either never noticed that their analysis could not falsify their hypothesis, or they tried to pull a fast one. But it is a distal cause, in the sense that it forbids the Bayesian approach, and thus requires practitioners to become familiar with a grab-bag of unrelated methods for statistical inference5, leaving plenty of room for confusion and malfeasance. Technologos's reply to Douglas Knight got it exactly right; I almost jokingly requested a spoiler warning.


1 I don't mind that it wouldn't be too hard to figure out who I am based on this paragraph. I just use a pseudonym to keep Google from indexing all my blog comments to my actual name.

2 It's rather odd to report a p-value that is exactly equal to the significance threshold, one of many suspicious things about this analysis (the rest of which I've left out as they are not directly germane).

3 For those anxious to check my math, I've omitted some blah blah blah about one- and two-sided tests and alternative hypotheses.

4 I quickly emailed the reviewer; it didn't make much difference, because when we initially talked about the analysis we had noticed enough other flaws that he had decided to recommend rejection. This was just the icing on the coffin.

5 ... none of which actually address the question OF DIRECT INTEREST! ... phew. Sorry.



This is going to sound silly, but...could someone explain frequentist statistics to me?

Here's my current understanding of how it works:

We've got some hypothesis H, whose truth or falsity we'd like to determine. So we go out and gather some evidence E. But now, instead of trying to quantify our degree of belief in H (given E) as a conditional probability estimate using Bayes' Theorem (which would require us to know P(H), P(E|H), and P(E|~H)), what we do is simply calculate P(E|~H) (techniques for doing this being of course the principal concern of statistics texts), and then place H into one of two bins depending on whether P(E|~H) is below some threshold number ("p-value") that somebody decided was "low": if P(E|~H) is below that number, we put H into the "accepted" bin (or, as they say, we reject the null hypothesis ~H); otherwise, we put H into the "not accepted" bin (that is, we fail to reject ~H).

Now, if that is a fair summary, then this big controversy between frequentists and Bayesians must mean that there is a sizable collection of people who think that the above procedure is a better way of obtaining knowledge than performing Bayesian...

Non-Bayesianism for Bayesians (based on a poor understanding of Andrew Gelman and Cosma Shalizi)

Lakatos (and Kuhn) are philosophers of science who studied science as scientists actually do it, as opposed to how scientists (at the time) claimed scientists do it. This is in contrast to taking the "scientific method" that we learned in grade school literally. Theories are not rejected at the first evidence that they have failed, they are patched, and so on.

Gelman and Shalizi's criticism of Bayesian rhetoric (as far as I can make out from their blog posts and the slides of Gelman's talk) is (explicitly) similar - what Bayesians do is different than what Bayesians say Bayesians do.

In particular, humans (as opposed to ideal, which is to say nonexistent, Bayesians) do not SIMPLY update on the evidence. There are other important steps in the process, such as checking whether, given the new data, your original model still looks reasonable. (This is "posterior predictive model checking"). This step looks a lot like computing a p-value, though Gelman recommends a graphical presentation, rather than condensing to a single number. In general, the notion of doing research o...

I like your point but not your example.

Suppose (because you're a computationally-limited Bayesian) that you only include in your model the N highest-probability hypotheses. That is, you include A, B, C, in your model, but you neglect Z - that is, you put zero probability on it. (We can make Z's pre-evidence probability arbitrarily small, to make this seem reasonable at the time.) When one, or even N balls turn out to be labeled Z, the model (due to the initial zero probability on Z) continues insisting that the balls came from one of the initially-specified hypotheses.

That isn't just a computational limitation. It's an outright bug. Something that assigns 0 to Z is just not even an approximation of a Bayesian. A sane agent with limited resources may, for example, assign a probability to "A,B,C and 'something else'". If it explicitly assigned an (arbitrarily close to) 0 to Z then it just fails at life.

Hi. I found the paper containing the example in question - it's "Bayesians sometimes cannot ignore even very implausible theories". I don't understand everything in the paper, but it seems like they've anticipated your objection and have another example which explicitly includes a "Something else" case.
Forgive my confusion, I'm a bad statistician, of any sort. How do you include 'something else' in your model? Don't you need to at least (for monte carlo techniques) be able to generate "forward" from parameters to simulated data? Or do you include Gelman's posterior predictive check in the model somehow, so that data that is sufficiently surprising causes a "misspecification alarm" to go off?
I'm not sure of the best way to simplify a model without doing insane things. I do know that if what you are doing amounts to overtly "putting zero probability on it" then what you are doing is a terminal mistake that makes the process distinctly non-bayesian. I get the impression that the mistakes that bayesians are trying to correct with their after-the-fact testing of the model are different from this one. If common 'bayesian statisticians' do in fact make mistakes that are of this order then consider me mistaken but also consider their claims to be 'bayesians' to be, more or less, lies.
If you choose a single model to work with, you are effectively putting zero probability on all other models (that are not contained in your chosen model as sub-models). Gelman's posterior predictive checks aren't motivated by this consideration (one of his non-mainstream-for-a-Bayesian stances is that model probabilities aren't useful). Nevertheless, the checks are directed at identifying ways in which the model fits the data poorly, with an eye to guiding further model elaboration, so they do address this issue in a sense. Philosophically this is true, but practically speaking, it's not. Setting certain posterior probabilities to zero can be a good approximation to a fully Bayesian analysis (e.g., this paper). In fact, if it's appropriate to use a small number of sigfigs in your results, this approximation can yield the exact same results far faster. I don't think it's fair to call the labeling of such an analysis as Bayesian a lie.
I follow this reasoning and it applies in many cases. The reason I do not consider it applicable to the example given is due to the explicit mentioning of "We can make Z's pre-evidence probability arbitrarily small, to make this seem reasonable at the time." That changes the meaning of the example significantly in my understanding. I claim that if Z is given enough consideration that 'arbitrarily small' is plugged in rather than mere exclusion from a model then it is just an error not an approximation. There are valid examples of bayes-in-practice that support the position John takes but I just don't consider this example a fair representation. Partly because the mistake is a bad way to handle urns and partly because explicitly plugging in bad priors for Z should make you explicitly expect bad posteriors for Z. Exclusion from the model itself is a different problem.
Good answer. I neglected to read up-thread with enough thoroughness.
Good answer. I got a bit confused because Z has two meanings: "ball labelled Z was observed" (data), and "ball came from urn Z" (hypothesis). John's model can assign zero probability to data that could possibly be observed, and that's the big no-no.
In the example provided it would be by having the labels "A, B, C and Zooblefuzz" where Zooblefuzz is clearly defined 'any other urn than A, B or C'.
for context: Gelman is a bayesian and Shalizi is an anti-bayesian.
If Pr(ball labelled Z | urn) = 0 for all urns under consideration then Pr(ball labelled Z) = 0 too, so the model tries to evaluate 0 / 0 and crashes.
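A toy Python sketch of that failure mode, and of the "something else" fix suggested a few comments up. All numbers are invented for illustration, and `posterior` is a hypothetical helper written for this example rather than any library's API.

```python
def posterior(priors, likelihoods):
    """Bayes update over a finite set of hypotheses (dicts keyed by name)."""
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    if evidence == 0:
        # Every hypothesis in the model says the observed data is impossible.
        raise ZeroDivisionError("observed data has probability 0 under the model")
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}

# Closed model: only urns A, B, C, none of which can produce a ball labelled Z.
priors_closed = {"A": 0.5, "B": 0.3, "C": 0.2}
lik_z_closed = {"A": 0.0, "B": 0.0, "C": 0.0}

try:
    posterior(priors_closed, lik_z_closed)
    crashed = False
except ZeroDivisionError:
    crashed = True

# Open model: a small prior on "other" (any urn that is not A, B or C),
# with an assumed, made-up chance of 0.1 that such an urn yields a Z ball.
priors_open = {"A": 0.5, "B": 0.3, "C": 0.19, "other": 0.01}
lik_z_open = {"A": 0.0, "B": 0.0, "C": 0.0, "other": 0.1}
post_open = posterior(priors_open, lik_z_open)
print(crashed, post_open["other"])
```

In the closed model the update is 0 / 0 and fails; in the open model a single Z ball moves all of the posterior mass onto "other". The hard part the thread is circling, choosing a likelihood for the catch-all, is simply assumed away here.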
Tangent: I was a huge fan of Proofs and Refutations, which is about mathematics; is there a book of Lakatos's on the philosophy of science you would recommend?
I liked Proofs and Refutations a lot too. However, I'm ashamed to admit I have no special knowledge of Lakatos. All I know about his philosophy of science stuff (which I believe is closely related) is from his Wikipedia page (and Feyerabend's). Gelman's slides made the analogy with Lakatos explicitly.

I've always thought it would be nice to have a "Frequentist-to-Bayesian" guide. Sort of a "Here's some example problems, here's how you might go about it doing frequentist methods, here's how you might go about it using Bayesian techniques." My introduction to statistics began with an AP course in high school (and I used this HyperStat source to help out), and of course they teach hypothesis testing and barely give a nod to Bayes' Theorem.

What you've described is the "statistical hypothesis testing" technique, and yes, you've got it right. The only reason it functions at all is that by and large, people who use it aren't stupid, and they know that they have to submit it to peer review to other people who aren't stupid. Nevertheless, a lot of crap gets through, just because the approach is so wrong-headed. ETA: Oops! I left an important detail out of this response.

There are other techniques for frequentist statistics, e.g., unbiased estimators, minimum mean squared error estimators, method of moments, robust estimators, confidence intervals, confidence distributions, maximum likelihood, profile likelihood, empirical likelihood, empirical Bayes, estimating equations, PAC learning, etc., etc., ad nauseam.

The central difficulty of Bayesian statistics is the problem of choosing a prior: where did it come from, how is it justified? How can Bayesians ever make objective scientific statements, if all of their methods require an apparently arbitrary choice for a prior? Frequentist statistics is the attempt to do probabilistic inference without using a prior. So, for example, the U-test Cyan linked to above makes a statement about whether two data sets could be drawn from the same distribution, without having to assume anything about what the distribution actually is. That's my understanding, anyway - I would also be happy to see a "Frequentism for Bayesians" post.

Frequentist statistics is the attempt to do probabilistic inference without using a prior.

Without acknowledging a prior.

Some frequentist techniques are strictly incoherent from a Bayesian point of view. In that case there is no prior.

I believe you and would like to know some examples for future reference.
The OP is one such -- Bayesians aren't permitted to ignore any part of the data except those which leave the likelihood unchanged. One classic example is that in some problems, a confidence interval procedure can return the whole real line. A mildly less pathological example also concerning a wacky confidence interval is here.
Yes; in Bayesian terms, many frequentist testing methods tend to implicitly assume a prior of 50% for the null hypothesis.
A prior gives you as much information as the mean of a distribution. So, can't I by the same token accuse both frequentist and Bayesian statistics of attempting to do probabilistic inference without using a distribution? I mean, the frequentist uses the U-test to ask whether 2 data sets could be drawn from the same distribution, without assuming what the mean of the distribution is. The Bayesian would use some other test, assuming a prior or perhaps a mean for the distribution, but not assuming a shape for the distribution. And some other, uninvented, and (by the standards of LW) superior statistical methodology would use another test, assuming a mean and a shape for the distribution.
No, not in general, it can give much more or much less; it depends entirely on how detailed you can make your prior. Expanding out e.g. as a series of central moments can give you as detailed a shape as you want. It may reduce to knowing only the mean in certain very special inference problems. In other problems, you may know that the distribution is very definitely Cauchy (EDIT: which doesn't even have a well-defined mean), but not know the parameters, and put some reasonable prior on them -- flat for the center over some range, and approximately using a (1/x) improper prior for the width, perhaps cutting it off at physically relevant length scales. All that information can be encoded in the prior. The prior covers your probabilities over the space of your hypotheses, not a direct probabilistic encoding of what you think one sample will be.
Eliezer Yudkowsky (13y):
No no no. That would be a hundred times saner than frequentism. What you actually do is take the real data e-12 and put it into a giant bin E that also contains e-1, e-3, and whatever else you can make up a plausible excuse to include or exclude, and then you calculate P(E|~H). This is one of the key points of flexibility that enables frequentists to get whatever answer they like, the other being the choice of control variables in multivariate analyses. See e.g. this part of the article:

This seems to use "frequentist" to mean "as statistics are actually practiced." It is unreasonable to compare the implementation of A to the ideal form of B. In particular, the problem with the Mann-Whitney test seems to me to be that the authors looked up a recipe in a cookbook without understanding it, which they could have done just as easily in a bayesian cookbook.

Can you elaborate on that?
Eliezer Yudkowsky (13y):
Well, the blatant version would be to take 5 possible control variables and try all 32 possible omissions and inclusions to see if any of the combinations turns up "statistically significant". This might look a little suspicious if you collected the data and then threw some of it away. If you were running regressions on an existing database with lots of potential control variables, why, they'll just have to trust that you never secretly picked and chose. Someone who did that might not be able to convince themselves they weren't cheating... but someone who, somehow or other, got an idea of which variables would be most convenient to control for, might well find themselves influenced just a bit in that direction.
I don't see how being a Bayesian gets you out of cherry-picking your causal structure from a large set. You still have to decide which variables are conditional on which other variables.
You put in all the variables, use a hierarchical structure for the prior, use a weakly informative hyperprior, and let the data sort itself out if it can. Key phrase: automatic relevance determination; David MacKay originated the term while doing Bayesian inference for neural nets.
Is that a 'were not'?
Is that not precisely the problem? Often, the H you are interested in is so vague ("there is some kind of effect in a certain direction") that it is very difficult to estimate P(E / H) - or even to define it. OTOH, P(E / ~H) is often very easy to compute from first principles, or to obtain through experiments (since conditions where "the effect" is not present are usually the most common). Example: I have a coin. I want to know if it is "true" or "biased". I flip it 100 times, and get 78 tails. Now how do I estimate the probability of obtaining this many tails, knowing that the coin is "biased"? How do I even express that analytically? By contrast, it is very easy to compute the probability of this sequence (or any other) with a "non-biased" coin. So there you have it. The whole concept of "null hypotheses" is not a logical axiom, it simply derives from real-world observation: in the real world, for most of the H we are interested in, estimating P(E / ~H) is easy, and estimating P(E / H) is either hard or impossible. P(H) is silently set to .5. If you know P(E / ~H), this makes P(E / H) unnecessary to compute the real quantity of interest, P(H / E) / P(~H / E). I think.
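One hedged way to answer "how do I even express that analytically?" is to give "biased" a concrete meaning. The Python sketch below assumes (this is an assumption of the example, not something the comment states) that a biased coin has a tails probability theta drawn uniformly from [0, 1]. Under that assumption the probability of any particular tails count has a closed form, 1 / (n + 1), which the code also checks numerically. It works with the count of tails rather than the exact sequence.

```python
from math import comb

n, k = 100, 78  # 100 flips, 78 tails, as in the comment above

# P(exactly k tails | fair coin): a plain binomial probability.
p_fair = comb(n, k) * 0.5 ** n

# P(exactly k tails | "biased"), ASSUMING theta is uniform on [0, 1].
# The integral of C(n, k) * theta**k * (1 - theta)**(n - k) over [0, 1]
# has the closed form 1 / (n + 1).
p_biased = 1 / (n + 1)

# Numerical check of the closed form with a midpoint Riemann sum.
steps = 10_000
numeric = sum(
    comb(n, k) * t ** k * (1 - t) ** (n - k)
    for t in ((i + 0.5) / steps for i in range(steps))
) / steps

# Ratio of the two marginal likelihoods (a Bayes factor, biased vs. fair).
bayes_factor = p_biased / p_fair
print(p_fair, p_biased, numeric, bayes_factor)
```

A different prior over theta would give a different P(E | biased); the point of the sketch is only that the quantity becomes computable once "biased" is spelled out.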
There needs to be a post specifically devoted to arguments of the form "It's okay to do things wrong, because doing them right would be hard". I've seen this so many times, in so many places, in so many subjects, that I have to conclude that people just don't see what is wrong with it. (No, I'm not talking about making simplifying assumptions or idealizations in models. More like presenting a collection of sometimes-useful ad-hoc tricks as a competing theory, which is then argued for as a theory against its competitors on the basis of its being "easier to apply".) Bayes' Theorem says that P(H|E) = P(H)P(E|H)/P(E). That's, like, the law. You don't get to take P(E|H) out of the equation, or pretend it isn't there, just because it's difficult to estimate. As I've said elsewhere, if you have a belief, then you've done a Bayesian update -- which means you have some assumption about each of those quantities appearing in the formula, whether you choose to confront these assumptions or not. As a matter of fact, if you find P(E|H) overly difficult to estimate, that means your H isn't paying its rent.
Not necessarily better. Just more convenient for the thumbs up/thumbs down way of looking at evidence that scientists tend to like. It's a convention. The point is to have a pre-agreed, low significance level so that testers can't screw with the result of a test by arbitrarily jacking the significance level up (if they want to reject a hypothesis) or turning it down (if they don't). The significance level has to be low to minimize the risk of a type I error. The certainty level is effectively communicated via the significance level and p-value itself. (And the use of a reject vs. don't reject dichotomy can be desirable if one wishes to decide between performing some action and not performing it based on some data.) A frequentist can deal in likelihoods, for example by doing hypothesis tests of likelihood ratios. As for priors, a frequentist encapsulates them in parametric and sampling assumptions about the data. A Bayesian might give a low weight to a positive result from a parapsychology study because of their "low priors", but a frequentist might complain about sampling procedures or cherrypicking being more likely than a true positive. As I see it, the two say essentially the same thing; the frequentist is just being more specific than the Bayesian.
No. P-values are not equivalent when they are calculated using different statistics, or even the same statistic but a different sample size. On the latter point see Royall, 1986. I'd say the frequentist is using Bayesian reasoning informally; Jaynes discusses this exact problem from a Bayesian perspective at the beginning of Chapter 5 of his magnum opus.
Sorry. You are quite right, and I was sloppy. I had in mind the implicit idea that holding the choices of statistical test and data collection procedure constant, different p-values suggest how strongly one should reject the null hypothesis, and I should have made that explicit. It is absolutely true that if I just ask someone, "Test A gave me p = 0.008 and Test B gave me p = 0.4, which test's null hypothesis is worse off?", the correct answer is "how should I know?" Yep. I think this is an example of the frequentist encapsulating what a Bayesian would call priors in their sampling assumptions.
I too would like to see a good explanation of frequentist techniques, especially one that also explains their relationships (if any) to Bayesian techniques. Based on the tiny bit I know of both approaches, I think one appealing feature of frequentist techniques (which may or may not make up for their drawbacks) is that your initial assumptions are easier to dislodge the more wrong they are. It seems to be the other way around with Bayesian techniques because of a stronger built-in assumption that your assumptions are justified. You can immunize yourself against any particular evidence by having a sufficiently wrong prior. EDIT: Grammar
But you won't be able to convince other Bayesians who don't share that radically wrong prior. Similarly, there doesn't seem to be something intrinsic to frequentism that keeps you from being persistently wrong. Rather, frequentists are kept in line because, as Cyan said, they have to persuade each other. Fortunately, for Bayesians and frequentists alike, a technique's being persuasive to the community correlates with its being liable to produce less wrong answers.
The ability to get a bad result because of a sufficiently wrong prior is not a flaw in Bayesian statistics; it is a flaw in our ability to perform Bayesian statistics. Humans tend to overestimate their confidence of probabilities with very low or very high values. As such, the proper way to formulate a prior is to imagine hypothetical results that will bring the probability into a manageable range, ask yourself what you would want your posterior to be in such cases, and build your prior from that. These hypothetical results must be constructed and analyzed before the actual result is obtained to eliminate bias. As Tyrrell said, the ability of a wrong prior to result in a bad conclusion is a strength because other Bayesians will be able to see where you went wrong by disputing the prior.
Someone correct me if I'm wrong here, but I don't think even having a strong prior P(H) against the evidence is much help, because that makes your likelihood ratio on the evidence P(E|H)/P(E|~H) that much stronger. (This issue is one of my stumbling blocks in Bayescraft.)
The likelihood ratio P(E|H)/P(E|~H) is entirely independent of the prior P(H). Or did I misunderstand what you said?
In theory, yes, but we're talking about a purported "unswayable Bayesian". If someone strongly believes leprechauns don't exist (low P(H), where H is "leprechauns exist" ), they should strongly expect not to see evidence of leprechauns (low P(E|~H), where E is direct evidence of leprechauns, like finding one in the forest), which suggests a high likelihood ratio P(E|H)/P(E|~H). I remember Eliezer Yudkowsky referring to typical conversations that go like: Non-rationalist: "I don't think there will ever be an artificial general intelligence, because my religion says that can't happen." EY: "So if I showed you one, that means you'd leave your religion?"
He did mention pulling that off once, but I don't believe he said it was typical.
Thanks, that was what I had in mind.
I'm not entirely sure I understand your point. The example you're citing is more the guy saying "I believe X, and X implies ~Y, therefore ~Y", so Eliezer is saying "So Y implies ~X then?" But the "X implies ~Y" belief can happen when one has low belief in X or high belief in X. Or are you saying "the likelihoods assigned led to past interpretation of analogous (lack of) evidence, and that's why the current prior is what it is"?
komponisto nailed the intuition I was going from: the likelihood ratio is independent of the prior, but an unswayable Bayesian fixes P(E), forcing extreme priors to have extreme likelihood ratios.
*blinks* I think I'm extra confused. The law of conservation of probability is basically just saying that the change in belief may be large or small, so evidence may be strong or weak in that sense. But that doesn't leave the likelihoods up for grabs, (well, okay, P(E|~H) could depend on how you distribute your belief over the space of hypotheses other than H, but... I'm not sure that was your point)
Okay, point conceded ... that still doesn't generate a result that matches the intuition I had. I need to spend more time on this to figure out what assumptions I'm relying on to claim that "extremely wrong beliefs force quick updates".
Remember, though, that even fixing both P(E) and P(H), you can still make the ratio P(E|H)/P(E|~H) anything you want; the equation a = bx + (1-b)(cx) is guaranteed to have a solution for any a,b,c.
P(E) = P(E|H) P(H) + P(E|~H)P(~H) The quantities P(H), P(E|H), and P(E|~H) are in general independent of each other, in the sense that you can move any one of them without changing the others -- provided you adjust P(E) accordingly.
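A small Python sketch of that identity, with made-up numbers. The function names `p_evidence` and `p_posterior` are hypothetical, chosen for this example. Holding the two likelihoods fixed and varying only the prior shows the likelihood ratio staying put while P(E) and the posterior move.

```python
def p_evidence(p_h, p_e_given_h, p_e_given_not_h):
    """Law of total probability: P(E) = P(E|H) P(H) + P(E|~H) P(~H)."""
    return p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

def p_posterior(p_h, p_e_given_h, p_e_given_not_h):
    """Bayes' theorem: P(H|E) = P(E|H) P(H) / P(E)."""
    return p_e_given_h * p_h / p_evidence(p_h, p_e_given_h, p_e_given_not_h)

# Illustrative likelihoods, invented for this example.
p_e_given_h, p_e_given_not_h = 0.8, 0.2
likelihood_ratio = p_e_given_h / p_e_given_not_h  # does not involve P(H)

results = {}
for p_h in (0.01, 0.5, 0.99):
    results[p_h] = (
        p_evidence(p_h, p_e_given_h, p_e_given_not_h),
        p_posterior(p_h, p_e_given_h, p_e_given_not_h),
    )
    print(p_h, likelihood_ratio, results[p_h])
```

With the prior at 0.5 these numbers give P(E) = 0.5 and P(H|E) = 0.8; at the other two priors the ratio is unchanged but P(E) and the posterior are not.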
Thanks, that helps. See how I apply that point in my reply to Psy-Kosh here.
Well, P(E|H) is actually pretty easy to calculate under a frequentist framework. That's the basis of power analysis, a topic covered in any good intro stat course. The real missing ingredient, as you point out, is P(H). I'm not fully fluent in Bayesian statistics, so while I'm on the topic I have a question: do Bayesian methods involve any decision making? In other words, once we've calculated P(H|E), do we just leave it at that? No criteria to decide on, just revising of probabilities? This is my current understanding, but it just seems so contrary to everyday human reasoning. What we would really like to say at the end of the day (or, rather, research program) is something like "Aha! Given the accumulated evidence, we can now cease replication. Hypothesis X must be true." Being humans, we want to make a decision. But decision making necessarily involves the ultimately arbitrary choice of where to set the criterion. Is this anti-Bayesian?
The formal decision-making machinery involves picking a loss function and minimizing posterior expected loss.
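As a minimal sketch of what that looks like, here is a two-action example in Python. The actions, the loss values, and the names `LOSS`, `expected_loss` and `best_action` are all invented for illustration; nothing here comes from a particular paper or library.

```python
# Hypothetical losses: wrongly accepting H costs 10, while another round
# of replication costs 1 whatever the truth turns out to be.
LOSS = {
    "accept_H": {"H": 0.0, "not_H": 10.0},
    "keep_replicating": {"H": 1.0, "not_H": 1.0},
}

def expected_loss(action, p_h):
    """Posterior expected loss of an action, given P(H|E) = p_h."""
    return p_h * LOSS[action]["H"] + (1 - p_h) * LOSS[action]["not_H"]

def best_action(p_h):
    """Pick the action with the smallest posterior expected loss."""
    return min(LOSS, key=lambda action: expected_loss(action, p_h))

decision_at_95 = best_action(0.95)
decision_at_80 = best_action(0.80)
print(decision_at_95, decision_at_80)  # accept_H keep_replicating
```

With these particular losses the switch happens at P(H|E) = 0.9, so the decision criterion is not picked by hand; it falls out of the loss function, which is where the judgment call now lives.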
Okay, but is it a part of the typical Bayesian routine to wield formal decision theory, or do we just calculate P(H|E) and call it a day?
I don't think formal decision theory is common in applied Bayesian stats in science; the only paper I can quickly recall that did a decision analysis is Andrew Gelman's radon remediation study. Maybe econometrics is different, since it's a lot easier to define losses in that context.

I'm not seeing why what you call "the real WTF" is evidence of a problem with frequentist statistics. The fact that the hypothesis test would have given a statistically insignificant p-value whatever the actual 6 data points were just indicates that whatever the population distributions, 6 data points are simply not enough to disconfirm the null hypothesis. In fact you can see this if you look at Mann & Whitney's original paper! (See the n=3 subtable in table I, p. 52.)

I can picture someone counterarguing that this is not immediately obvious from the details of the statistical test, but I would hope that any competent statistician, frequentist or not, would be sceptical of a nonparametric comparison of means for samples of size 3!

I'm an econometrician by training and when I was taught non-parametric testing I was told the minimum sample size to get a useful result was 10. Either the authors of the article had forgotten this, or there is something very wrong with how they were taught this test.
Thanks for the pointer to the original paper. Check out the title: abuse of frequentist statistics. Yes, at the end, I argue from a Bayesian perspective, but you don't have to be a Bayesian to see the structural problems with frequentist statistics as currently taught to and practiced by working scientists. Me too. But not all papers with shoddy statistics are sent to statisticians for review. Experimental biologists in particular have a reputation for math-phobia. (Does the fact that when I saw the sample size the word "underpowered" instantly jumped into my head count as evidence that I am competent?)

Check out the title: abuse of frequentist statistics. Yes, at the end, I argue from a Bayesian perspective, but you don't have to be a Bayesian to see the structural problems with frequentist statistics as currently taught to and practiced by working scientists.

I agree that frequentist statistics are often poorly taught and understood, and that this holds however you like to do your statistics. Still, the main post feels to me like a sales pitch for Bayes brand chainsaws that's trying to scare me off Neyman-Pearson chainsaws by pointing out how often people using Neyman-Pearson chainsaws accidentally cut off a limb with them. (I am aware that I may be the only reader who feels this way about the post.)

(Does the fact that when I saw the sample size the word "underpowered" instantly jumped into my head count as evidence that I am competent?)

Yes, but it is not sufficient evidence to reject the null hypothesis of incompetence at the 0.05 significance level. (I keed, I keed.)

I get that impression a lot around here.
It's a fair point; I'm not exactly attacking the strongest representative of frequentist statistical practice. My only defense is that this actually happened, so it makes a good case study.
That's true, and having been reminded of that, I think I may have been unduly pedantic about a fine detail at the expense of the main point.
It's a good case study, but it's not evidence of a problem with frequentist statistics.
I assert that it is evidence in my concluding paragraph, but it's true that I don't give an actual argument. Whether one counts it as evidence would seem to depend on the causal assumptions one makes about the teaching and practice of statistics.
Perhaps it's frequentist evidence against frequentist statistics.
I think this is just a glib rejoinder, but if there's a deeper thought there, I'd be interested to hear it.
The critique of frequentist statistics, as I understand it - and I don't think I do - is that frequentists like to count things, and trust that having large sample sizes will take care of biases for them. Therefore, a case in which frequentist statistics co-occurs with bad results counts against use of frequentist statistics, and you don't have to worry about why the results were bad. The whole Bayesian vs. frequentist argument seems a little silly to me. It's like arguing that screws are better than nails. It's true that, for any particular individual joint you wish to connect, a screw will probably connect it more securely and reversibly than a nail. That doesn't mean there's no use for nails.
I think that, in this case, the underlying problem was not caused by the way frequentist statistics are commonly taught and practiced by working scientists: I'm no statistician, but I'm pretty sure you're not supposed to make your favored hypothesis the null hypothesis. That's a pretty simple rule and I think it's drilled into students and enforced in peer review. I see that as the underlying problem because it reverses the burden of proof. If they had done it the right way around, six data points would have been not enough to support their method instead of being not enough to reject it. Making your favored hypothesis the null hypothesis can allow you, in the extreme, to rely on a single data point.
In the OP I did refer to that when I wrote: You wrote: Not all papers are reviewed by people who know the rule. I was taught that rule over ten years ago, and I didn't remember it when my colleague showed me the analysis. (I did recall it eventually, just after I ran the sanity check. Evidence against my competence!) My colleague whose job it was to review the paper didn't know/recall the rule either.
Well, I don't see the structural problems. (I don't even know what a structural problem is.) Somebody, please write a top-level post addressing this. Stop saying "Frequentists are bad" and leaving it at that. This is a great story; but it's not valid argumentation to try to convert it into an anti-frequentist tract.
I'd love to see a top-level post where someone suggests the best and/or most realistic way for scientists to do their statistics. I'm actually rather ignorant with regards to probability theory. I got a D in second semester frequentist statistics (hard teacher + I didn't go to class or try very hard on the homework) which is indicative of how little I learned in that class. I did better in my applied statistics classes. When is it good for scientists to do null hypothesis testing?
What specifically is the "this" you want addressed? I'm not sure what its referent is.
Right - show us how you would have done this test correctly using Bayesian statistics.
That did come up in comments; you can find the discussion here.

Now that I understand the situation better, I thought I'd share this interesting coincidence.

Last Friday, I was talking to my friend's brother, a PhD candidate in electrical engineering (by all I could tell, a very intelligent man), who had to teach a statistics class to grad students in education. He told me what it was like to teach one of their 8-hour lessons, where he had to explain many of the things in the frequentist toolbox.

I told him about my interest in information theory and Bayesian statistics and asked if the course covers any of that. While he sh...

Statistics actually aren't used very much in many areas of physics and engineering. It mainly only comes into your work if you are an experimentalist, and even then, a very modest knowledge is often sufficient.

How difficult would it be to rewrite the Mann-Whitney U test to give a Bayesian likelihood ratio?

There is no direct translation -- throwing out the actual data is not legitimate. Even if all you were given were the ranks, the likelihood of that ordering under the alternative hypothesis depends on the underlying probability densities, so you haven't simplified the problem. Also, there's nothing exclusively Bayesian about likelihood ratios, at least when the null and alternative hypotheses are completely specified.
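To illustrate the point about the ordering's likelihood, here is a rough Monte Carlo sketch. The distributions, the shift of 1, and all names are assumptions chosen for illustration, not anything from the paper under review. It estimates the probability of one particular ordering (all three new-method values above all three old-method values) under the null and under two alternatives that have the same shift and the same variance but different shapes.

```python
import math
import random

def prob_new_all_above(sample_old, sample_new, n=3, trials=100_000, seed=0):
    """Estimate P(every new value exceeds every old value) by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        old = [sample_old(rng) for _ in range(n)]
        new = [sample_new(rng) for _ in range(n)]
        if min(new) > max(old):
            hits += 1
    return hits / trials

HALF_WIDTH = math.sqrt(3)  # uniform on [-sqrt(3), sqrt(3)] has unit variance

def normal(shift):
    return lambda rng: rng.gauss(shift, 1.0)

def uniform(shift):
    return lambda rng: rng.uniform(-HALF_WIDTH, HALF_WIDTH) + shift

# Null: same distribution, so each of the 20 orderings has probability 1/20.
p_null = prob_new_all_above(normal(0.0), normal(0.0))
# Two alternatives with the same shift and variance but different densities.
p_normal_shift = prob_new_all_above(normal(0.0), normal(1.0))
p_uniform_shift = prob_new_all_above(uniform(0.0), uniform(1.0))

print(p_null, p_normal_shift, p_uniform_shift)
```

The null estimate should land near 1/20 whatever the density, while the estimates under the alternatives depend on the assumed shape, which is why the ranks alone don't pin down a likelihood ratio.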
Fair enough - in that case, how would you have produced a statistical measure of how well the distributions matched? What kind of assumptions would you employ?
I have three answers for that. The first is I wouldn't have bothered -- in this field, data collection is not super-precise, so three data points per method isn't going to tell me shit. But this really avoids the question, so... The second answer is that Bayesian analogues to this approach do exist. But we know a priori that the two methods won't generate data from the same distribution -- there's really no need to even formulate the null hypothesis. What we really care about is the accuracy and precision of the new method, so... The third answer is given enough data, I would have set up a hierarchical Bayesian model to estimate the accuracy and precision of the new method, where accuracy is defined as "matching the old method as closely as possible".

Nice analysis. But:

Technologos's reply to Douglas Knight got it exactly right; I almost jokingly requested a spoiler warning.

If this is important enough to use as the summary for your post, it's important enough for you to either quote it, or link to it, and thus save us, cumulatively, hours of time searching for it.

It's the link on the word "wrote" in the first sentence of the last paragraph.

even before the data were collected it could have been known that this analysis would give the result the authors wanted.

How does that follow? The authors didn't know what data they would get, right? So they didn't know the ranking that would show up. They would know in advance that result X would pass the test, but they wouldn't know that they would get result X.

What the OP is saying is that "no matter what the outcome would have been, the mathematics of this particular test would have, in this case, failed to reject the null hypothesis. There is no possible outcome that would have led to rejecting the null hypothesis given the way this was done." EDIT: blargh, meant to say that it was set up so that no matter what, no possible outcome would have led to rejecting the null hypothesis, not "no possible outcome would have led to it being accepted".
Yes, but I don't understand from the details presented why that follows. Why couldn't result X contain the new method ranking below the standard method?
What the OP was saying was that this test only depends on the rankings. So to check for sanity, he calculated what the p values would have been for all possible rankings and found that none of those p values would be below .05. In other words, it was a mathematical impossibility for this test, when treated this way, to result in a rejection of the null hypothesis. There was no possible outcome, given this many data points analyzed using this method, that would have produced a rejection. (In other words, it was a "heads I win, tails you lose" situation.)
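That sanity check can be sketched in a few lines. This is a reconstruction under stated assumptions (three points per group, no ties, exact two-sided p-values), not the OP's actual code, and the function names are made up for illustration.

```python
from itertools import combinations

def u_statistic(new_ranks):
    """Mann-Whitney U for the new method: number of (new, old) pairs with new > old."""
    n_new = len(new_ranks)
    return sum(new_ranks) - n_new * (n_new + 1) // 2

def exact_two_sided_p_values(n_new=3, n_old=3):
    """Map each attainable U to its exact two-sided p-value under the null."""
    ranks = range(1, n_new + n_old + 1)
    # Under the null, every choice of ranks for the new method is equally likely.
    all_u = [u_statistic(c) for c in combinations(ranks, n_new)]
    center = n_new * n_old / 2
    p_values = {}
    for u in sorted(set(all_u)):
        # Count the rankings at least as far from the center as this one.
        as_extreme = sum(1 for v in all_u if abs(v - center) >= abs(u - center))
        p_values[u] = as_extreme / len(all_u)
    return p_values

p_values = exact_two_sided_p_values()
smallest = min(p_values.values())
print(p_values)
print("smallest attainable two-sided p:", smallest)
```

Under these assumptions there are only 20 rankings, and even the two fully separated ones give p = 2/20 = 0.10, so nothing can come in below .05.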
Okay, I think that makes sense. Let me put it into my own words: The test is guaranteed to be not statistically significant merely by virtue of cutting up the outcome space into pieces, each of which has at least 5% chance of happening. And further, because the null hypothesis has been (arbitrarily) defined to be "the two methods are the same", statistical insignificance means a favorable result. Does that about cover it? If so, that's pretty bad.
That part isn't right, but the rest is.
So I should have said "for the nine outcomes they considered, they all had at least 5% chance of happening"?
The p-value is the probability of getting a result "at least this extreme" given the null hypothesis, where "extreme" means "deviating from the null hypothesis", however that's defined. So, the test cut the outcome space into pieces, the most extreme of which had at least a 5% chance of happening. I think.
... under the null hypothesis. I actually forgot this detail when replying to komponisto.
Wait... actually it may even be worse than that. I'm not even sure it's cleanly partitioning the outcome space. 1/20 = .05, so if some outcomes are above .05, then other outcomes would have to be below .05, right? So the calculation to get the final result doesn't even really do a proper partitioning of the outcomes if some of the outcomes can be greater than .05 and none less than .05. EDIT: so yeah, it's cutting up not just the outcome space into pieces corresponding to rankings, but mushing some of those together (at best).
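One way to untangle the 1/20 question is to separate the probability of a single ranking from a p-value, which is a tail sum over rankings. The sketch below assumes three points per group and no ties; the variable names are only illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

n = 3  # points per group (assumed equal)
rank_sets = list(combinations(range(1, 2 * n + 1), n))
total = len(rank_sets)  # 20 equally likely rank assignments under the null

# Several rankings share the same U, so U values are not equally likely.
counts = Counter(sum(r) - n * (n + 1) // 2 for r in rank_sets)
prob = {u: Fraction(c, total) for u, c in sorted(counts.items())}

# One-sided p-value for each U: probability of a U this small or smaller.
lower_tail = {}
running = Fraction(0)
for u, pr in prob.items():
    running += pr
    lower_tail[u] = running

for u in prob:
    print(u, prob[u], lower_tail[u])
```

Each individual ranking has probability exactly 1/20, but a p-value accumulates every ranking at least as extreme, so p-values are never smaller than 1/20 here and are usually much larger. They are tail probabilities, not a partition that has to sum to one.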
That's more or less my understanding of the situation. And yes... that is indeed pretty bad. :)
More of a double-headed coin situation, actually.
Well... different ranking outcomes (different sides of the coin) are possible. Just that the interpretation will always be "don't reject the null hypothesis" but yeah. :) Either way, my overall reaction to your post is "yuck" (not your post itself! That I upvoted. I mean the whole situation... That a relatively standard statistical test could allow this sort of madness. I mean, I know frequentist stats isn't the Bayesian way, but that relatively standard methods in it can be this pathological does not at all give me warm fuzzies)
I concur with your "yuck", but would phrase it slightly differently. The specific type of statistical test applied, plus the number of samples taken, has the effect, as Cyan said, of guaranteeing the results that the authors wanted. Note that, more generally, the fact that the authors chose to phrase their analysis so that accepting the null hypothesis was the result they wanted, plus their choice of a nonparametric statistical test (which is always weaker than a parametric one), is in and of itself suspicious. If they had had enough samples so that it would be theoretically possible for the null hypothesis to be rejected (say, if they had taken more samples), but they had still wanted the null result and had still chosen a nonparametric test, I would still be suspicious. As Cyan said, the nonparametric tests throw away most of the information.
It's not the fault of the method if someone abuses it.
In general, no. However, if a method is more easily abused than others, that is something worth pointing out.
