Like The Cognitive Science of Rationality, this is a post for beginners. Send the link to your friends!

Science is broken. We know why, and we know how to fix it. What we lack is the will to change things.

In 2005, several analyses suggested that most published results in medicine are false. A 2008 review showed that perhaps 80% of academic journal articles mistake "statistical significance" for "significance" in the colloquial meaning of the word, an elementary error every introductory statistics textbook warns against. This year, a detailed investigation showed that half of published neuroscience papers contain one particular simple statistical mistake.

Also this year, a respected senior psychologist published in a leading journal a study claiming to show evidence of precognition. The editors explained that the paper was accepted because it was written clearly and followed the usual standards for experimental design and statistical methods.

Science writer Jonah Lehrer asks: "Is there something wrong with the scientific method?"

Yes, there is.

This shouldn't be a surprise. What we currently call "science" isn't the *best* method for uncovering nature's secrets; it's just the first set of methods we've collected that *wasn't totally useless* like personal anecdote and authority generally are.

As time passes we learn new things about how to do science better. The Ancient Greeks practiced some science, but few scientists tested hypotheses against mathematical models before Ibn al-Haytham's 11th-century *Book of Optics* (which also contained hints of Occam's razor and positivism). Around the same time, Al-Biruni emphasized the importance of repeated trials for reducing the effect of accidents and errors. Galileo brought mathematics to greater prominence in scientific method, Bacon described eliminative induction, Newton demonstrated the power of consilience (unification), Peirce clarified the roles of deduction, induction, and abduction, and Popper emphasized the importance of falsification. We've also discovered the usefulness of peer review, control groups, blind and double-blind studies, plus a variety of statistical methods, and added these to "the" scientific method.

In many ways, the best science done today is better than ever — but it still has problems, and most science is done poorly. The good news is that we know what these problems are and we know multiple ways to fix them. What we lack is the *will* to change things.

This post won't list all the problems with science, nor will it list all the promising solutions for any of these problems. (Here's one I left out.) Below, I only describe a few of the basics.

### Problem 1: Publication bias

When the study claiming to show evidence of precognition was published, psychologist Richard Wiseman set up a registry for advance announcement of new attempts to replicate the study.

Carl Shulman explains:

A replication registry guards against publication bias, and at least 5 attempts were registered. As far as I can tell, all of the subsequent replications have, unsurprisingly, failed to replicate Bem's results. However, JPSP and the other high-end psychology journals refused to publish the results, citing standing policies of not publishing straight replications.

From the journals' point of view, this (common) policy makes sense: bold new claims will tend to be cited more and raise journal prestige (which depends on citations per article), even though this means most of the 'discoveries' they publish will be false despite their low p-values (high statistical significance). However, this means that overall the journals are giving career incentives for scientists to massage and mine their data for bogus results, but not to challenge bogus results presented by others.

This is an example of publication bias:

Publication bias is the term for what occurs whenever the research that appears in the published literature is systematically unrepresentative of the population of completed studies. Simply put, when the research that is readily available differs in its results from the results of *all* the research that has been done in an area, readers and reviewers of that research are in danger of drawing the wrong conclusion about what that body of research shows. In some cases this can have dramatic consequences, as when an ineffective or dangerous treatment is falsely viewed as safe and effective. [Rothstein et al. 2005]

Sometimes, publication bias can be more deliberate. The anti-inflammatory drug Rofecoxib (Vioxx) is a famous case. The drug was prescribed to 80 million people, but it was later revealed that its maker, Merck, had withheld evidence of the drug's risks. Merck was forced to recall the drug, but it had already resulted in 88,000-144,000 cases of serious heart disease.
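To see how a literature that publishes mainly significant results can end up mostly false, here is a rough sketch of the arithmetic. The function name and the input numbers (prior, power, threshold) are illustrative assumptions, not figures from the studies cited in this post.

```python
def positive_predictive_value(prior: float, power: float, alpha: float) -> float:
    """Share of statistically significant findings that reflect a real effect."""
    true_positives = power * prior            # real effects that reach significance
    false_positives = alpha * (1.0 - prior)   # null effects that reach significance by chance
    return true_positives / (true_positives + false_positives)


# Assumed for illustration: 1 in 20 tested hypotheses is true,
# studies have 50% power, and the significance threshold is 0.05.
ppv = positive_predictive_value(prior=0.05, power=0.5, alpha=0.05)
print(f"Share of significant results that are true: {ppv:.1%}")
```

Under these assumed inputs only about a third of the significant results are true, and if journals print only the significant ones, that is the mix readers see. Different inputs give very different answers, so treat this as a way to explore the claim rather than a measurement.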

#### Example partial solution

One way to combat publication bias is for journals to only accept experiments that were registered in a public database before they began. This allows scientists to see which experiments were conducted but never reported (perhaps due to negative results). Several prominent medical journals (e.g. *The Lancet* and *JAMA*) now operate this way, but this protocol is not as widespread as it could be.

### Problem 2: Experimenter bias

Scientists are humans. Humans are affected by cognitive heuristics and biases (or, really, humans just *are* cognitive heuristics and biases), and they respond to incentives that may not align with an optimal pursuit of truth. Thus, we should expect *experimenter bias* in the practice of science.

There are many stages in research during which experimenter bias can occur:

- in reading-up on the field,
- in specifying and selecting the study sample,
- in [performing the experiment],
- in measuring exposures and outcomes,
- in analyzing the data,
- in interpreting the analysis, and
- in publishing the results. [Sackett 1979]

Common biases have been covered elsewhere on Less Wrong, so I'll let those articles explain how biases work.

#### Example partial solution

There is some evidence that the skills of rationality (e.g. cognitive override) are teachable. Training scientists to notice and meliorate biases that arise in their thinking may help them to reduce the magnitude and frequency of the thinking errors that may derail truth-seeking attempts during each stage of the scientific process.

### Problem 3: Bad statistics

I remember when my statistics professor first taught me the reasoning behind "null hypothesis significance testing" (NHST), the standard technique for evaluating experimental results. NHST uses "p-values," which are statements about the probability of getting some data (e.g. one's experimental results) *given* the hypothesis being tested. I asked my professor, "But don't we want to know the probability of the hypothesis we're testing *given* the data, not the other way around?" The reply was something about how this was the best we could do. (But that's false, as we'll see in a moment.)

Another problem is that NHST computes the probability of getting data as unusual as the data one collected by considering what might be expected if that particular experiment was repeated many, many times. But how do we know anything about these imaginary repetitions? If I want to know something about a particular earthquake, am I supposed to imagine a few dozen repetitions of that earthquake? What does that even *mean*?

I tried to answer these questions on my own, but all my textbooks assumed the soundness of the mistaken NHST framework for scientific practice. It's too bad I didn't have a class with biostatistician Steven Goodman, who says:

The p-value is almost nothing sensible you can think of. I tell students to give up trying.

The sad part is that the logical errors of NHST are old news, and have been known ever since Ronald Fisher began advocating NHST in the 1920s. By 1960, Fisher had out-advocated his critics, and philosopher William Rozeboom remarked:

Despite the awesome pre-eminence [NHST] has attained... it is based upon a fundamental misunderstanding of the nature of rational inference, and is seldom if ever appropriate to the aims of scientific research.

There are many more problems with NHST and with "frequentist" statistics in general, but the central one is this: NHST does not follow from the axioms (foundational logical rules) of probability theory. It is a grab-bag of techniques that, depending on how those techniques are applied, can lead to *different* results when analyzing the *same data* — something that should horrify every mathematician.
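One standard textbook illustration of the "same data, different results" complaint (it is not taken from this post) is the dependence of a p-value on the stopping rule. The sketch below assumes a coin that came up 9 heads and 3 tails, and computes the one-sided p-value against a fair coin under two hypothetical experimental designs.

```python
from fractions import Fraction
from math import comb

HEADS, TAILS = 9, 3  # the same observed data under both designs

# Design A: flip exactly 12 times and count heads.
# p-value = P(9 or more heads in 12 fair flips).
n = HEADS + TAILS
p_fixed_n = Fraction(sum(comb(n, k) for k in range(HEADS, n + 1)), 2 ** n)

# Design B: keep flipping until the 3rd tail appears.
# p-value = P(9 or more heads before the 3rd tail)
#         = P(at most 2 tails in the first 11 fair flips).
m = HEADS + TAILS - 1
p_stop_at_tails = Fraction(sum(comb(m, k) for k in range(TAILS)), 2 ** m)

print("fixed number of flips:", p_fixed_n, float(p_fixed_n))
print("stop at third tail:   ", p_stop_at_tails, float(p_stop_at_tails))
```

The two p-values fall on opposite sides of the conventional 0.05 threshold even though the sequence of flips is identical; only the experimenter's intended stopping rule differs.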

The inferential method that solves the problems with frequentism — and, more importantly, follows deductively from the axioms of probability theory — is Bayesian inference.
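As a minimal sketch of what "the probability of the hypothesis given the data" looks like in practice, here is Bayes' theorem applied to a toy problem. The two candidate hypotheses, the equal priors, and the coin-flip data are all assumptions chosen for illustration.

```python
from math import comb


def binomial_likelihood(heads: int, flips: int, p_heads: float) -> float:
    """P(data | hypothesis): chance of exactly `heads` heads in `flips` flips."""
    return comb(flips, heads) * p_heads ** heads * (1 - p_heads) ** (flips - heads)


heads, flips = 9, 12  # assumed data

# Two hypothetical hypotheses about the coin, given equal prior probability.
hypotheses = {"fair (p=0.5)": 0.5, "biased (p=0.75)": 0.75}
priors = {"fair (p=0.5)": 0.5, "biased (p=0.75)": 0.5}

likelihoods = {name: binomial_likelihood(heads, flips, p) for name, p in hypotheses.items()}

# P(data) is the prior-weighted sum of the likelihoods.
evidence = sum(priors[name] * likelihoods[name] for name in hypotheses)

# Bayes' theorem: P(hypothesis | data) = P(data | hypothesis) * P(hypothesis) / P(data)
posteriors = {name: priors[name] * likelihoods[name] / evidence for name in hypotheses}

for name in hypotheses:
    print(f"{name}: P(data|H) = {likelihoods[name]:.4f}, P(H|data) = {posteriors[name]:.4f}")
```

The output makes the distinction concrete: P(data | H) and P(H | data) are different numbers, and the second depends on the priors and on which alternatives are on the table.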

So why aren't *all* scientists using Bayesian inference instead of frequentist inference? Partly, we can blame the vigor of NHST's early advocates. But we can also attribute NHST's success to the simple fact that Bayesian calculations can be *more difficult* than frequentist calculations. Luckily, new software tools like WinBUGS let computers do most of the heavy lifting required for Bayesian inference.

There's also the problem of sheer momentum. Once a practice is enshrined, it's hard to dislodge it, even for good reasons. I took three statistics courses in university and *none* of my textbooks mentioned Bayesian inference. I didn't learn about it until I dropped out of university and studied science and probability theory on my own.

Remember the study about precognition? Not surprisingly, it was done using NHST. A later Bayesian analysis of the data disconfirmed the original startling conclusion.

#### Example partial solution

This one is obvious: teach students probability theory instead of NHST. Retrain current scientists in Bayesian methods. Make Bayesian software tools easier to use and more widespread.

### Conclusion

If I'm right that there is unambiguous low-hanging fruit for improving scientific practice, this suggests that particular departments, universities, or private research institutions can (probabilistically) out-perform their rivals (in terms of actual discoveries, not just publications) given similar resources.

I'll conclude with one particular *specific* hypothesis. If I'm right, then a research group should be able to hire researchers trained in Bayesian reasoning and in catching publication bias and experimenter bias, and have them extract from the existing literature valuable medical truths that the mainstream medical community doesn't yet know about. This prediction, in fact, is about to be tested.

I only had time to double-check one of the scary links at the top, and I wasn't too impressed with what I found:

But the careful review you link to claims that studies funded by the industry report 85% positive results, compared to 72% positive by independent organizations and 50% positive by government - which is not what I think of when I hear four times! They also give a lot of reasons to think the difference may be benign: industry tends to do different kinds of studies than independent orgs. The industry studies are mainly Phase III/IV - a part of the approval process where drugs that have already been shown to work in smaller studies are tested on a larger population; the nonprofit and government studies are more often Phase I/II - the first check to see whether a promising new chemical works at all. It makes sense that studies on a drug which has already been found to probably work are more positive than the first studies on a tota...

Yes, "four times as likely" is not the same as an odds ratio of four. And the problem here is the same as the problem in army1987's LL link that odds ratios get mangled in transmission.

But I like odds ratios. In the limit of small probability, odds ratios are the same as "times as likely." But there's nothing 4x as likely as 50%. Does that mean that 50% is very similar to all larger probabilities? Odds ratios are unchanged (or inverted) by taking complements: 4% to 1% is an odds ratio of about 4; 99% to 96% is also 4 (actually 4.1 in both cases). Complementation is exactly what's going on here. The drug companies get 1.2x-1.3x more positive results than the independent studies. That doesn't sound so big, but everyone is likely to get positive results. If we speak in terms of negative results, the independent studies are 2-3x as likely to get negative results as the drug companies. Now it sounds like a big effect.

Odds ratios give a canonical distance between probabilities that doesn't let people cherry-pick between 34% more positives and 3x more negatives. They give us a way to compare any two probabilities that is the obvious one for very small probabilities and is related to the obvi...
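The arithmetic in this comment can be checked with a short sketch. The helper names are made up for illustration; the percentages are the ones quoted in the thread above (4% vs 1%, 99% vs 96%, and 85% vs 72% positive results).

```python
def odds(p: float) -> float:
    """Convert a probability into odds in favour."""
    return p / (1.0 - p)


def odds_ratio(p1: float, p2: float) -> float:
    return odds(p1) / odds(p2)


def times_as_likely(p1: float, p2: float) -> float:
    """Plain ratio of probabilities ("N times as likely")."""
    return p1 / p2


# Complementary pairs give the same odds ratio.
print(odds_ratio(0.04, 0.01), odds_ratio(0.99, 0.96))

# Industry-funded (85% positive) vs independent (72% positive) studies.
industry, independent = 0.85, 0.72
print("positives, times as likely:", times_as_likely(industry, independent))
print("negatives, times as likely:", times_as_likely(1 - independent, 1 - industry))
print("odds ratio:", odds_ratio(industry, independent))
```

The ratio of positives and the ratio of negatives tell different-sounding stories about the same two numbers, while the odds ratio is the same whichever way the outcome is framed.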

Thanks for this. I've removed the offending sentence.

Language Log: Thou shalt not report odds ratios

Or if you want to appropriate a different popular phrase, "Never tell me the odds ratio!"

I object, for reasons wonderfully stated by gwern here ...

That was actually just a slightly-edited-for-Hacker-News excerpt from my standing mini-essay explaining why we can't trust science too much; the whole thing currently lives at http://www.gwern.net/DNB%20FAQ#fn51

I am skeptical of the teaching solution section under 2), relative to institutional shifts (favoring confirmatory vs exploratory studies, etc). Section 3 could also bear mention of some of the many ways of abusing Bayesian statistical analyses (e.g. reporting results based on gerrymandered priors, selecting which likelihood ratio to highlight in the abstract and get media attention for, etc). Cosma Shalizi would have a lot to say about it.

I do like the spirit of the post, but it comes across a bit boosterish.

On this note, I predict that if Bayesian statistical analyses ever displaced NHST as the mainstream standard, they would be misused about as much as NHST.

Currently there's a selection bias: NHST is much more widely taught than Bayesian analyses, so NHST users are much more likely to be lowest common denominator crank-turners who don't really understand statistics generally. By contrast, if you've managed to find out how to do Bayesian inference, you're probably better at statistics than the average researcher and therefore less likely to screw up whatever analysis you choose to do. If every researcher were taught Bayesian inference this would no longer be true.

As a biologist, I can say that most statistical errors are just that: errors. They are not tricks. If researchers understand the statistics that they are using, a lot of these problems will go away.

A person has to learn a hell of a lot before they can do molecular biology research, and statistics happens to be fairly low on the priority list for most molecular biologists. In many situations we are able to get around the statistical complexities by generating data with very little noise.

Please don't insert gratuitous politics into LessWrong posts.

I removed the global warming phrase.

What David_G said. Global warming is a scientific issue. Maybe "what we lack is the will to change things" is the right analysis of the policy problems, but among climate change experts there's a whole lot more consensus about global warming than there is among AI researchers about the Singularity. "You can't say controversial things about global warming, but can say even more controversial things about AI" is a rule that makes about as much sense as "teach the controversy" about evolution.

Evolution is also a political issue. Shall we now refrain from talking about evolution, or mentioning what widespread refusal to accept evolution, up to the point of there being a strong movement to undermine the teaching of evolution in US schools, says about human rationality?

I get that it can be especially hard to think rationally about politics. And I agree with what Eliezer has written about government policy being complex and almost always involving some trade-offs, so that we should be careful about thinking there's an obvious "rationalist view" on policy questions.

However, a ban on discussing issues that *happen to be politicized* is idiotic, because it puts us at the mercy of contingent facts about what forms of irrationality happen to be prevalent in political discussion at this time. Evolution is a prime example of this. Also, if the singularity became a political issue, would we ban discussion of that from LessWrong?

We should not insert political issues which are not relevant to the topic, because the more political issues one brings to the discussion, the less rational it becomes. It would be most safe to discuss all issues separately, but sometimes it is not possible, e.g. when the topic being discussed relies heavily on evolution.

One part of trying to be rational is to accept that people are not rational, and act accordingly. For every political topic there are a number of people whose minds will turn off if they read something they disagree with. It does not mean we should be quiet on the topic, but we should not insert it where it is not relevant.

Explaining why X is true, in a separate article, is the correct approach. Saying or suggesting something like "by the way, people who don't think X is true are wrong" in an unrelated topic, is the wrong approach. Why is it so? In the first example you expect your proof of X to be discussed in the comments, because it is the issue. In the second example, discussions about X in comments are off-topic. Asserting X in a place where discussion of X is unwelcome, is a kind of Dark Arts; we should avoid it even if we think X is true.

The topic of evolution, unlike the topic of climate change, is entangled with human psychology, AI, and many other important topics; not discussing it would be highly costly. Moreover, if anyone on LessWrong disagrees with evolution, it's probably along Newsomian eccentric lines, not along tribal political lines. Also, lukeprog's comments on the subject made implicit claims about the policy implications of the science, not just about the science itself, which in turn is less clear-cut than the scientific case against a hypothesis requiring a supernatural agent, though for God's sake please nobody start arguing about exactly how clear-cut.

As a matter of basic netiquette, please use words like "mistaken" or "harmful" instead of "idiotic" to describe views you disagree with.

But two Bayesian inferences from the same data can also give different results. How could this be a non-issue for Bayesian inference while being indicative of a central problem for NHST? (If the answer is that Bayesian inference is rigorously deduced from probability theory's axioms but NHST is not, then the fact that NHST can give different results for the same data is not a true objection, and you might want to rephrase.)

By a coincidence of dubious humor, I recently read a paper on exactly this topic, how NHST is completely misunderstood and employed wrongly and what can be improved! I was only reading it for a funny & insightful quote, but Jacob Cohen (as in, 'Cohen's d') in pg 5-6 of "The Earth Is Round (*p* < .05)" tells us that we shouldn't seek to replace NHST with a "magic alternative" because "it doesn't exist". What we should do is focus on understanding the data with graphics and datamining techniques; report confidence limits on effect sizes, which gives us various things I haven't looked up; and finally, place way more emphasis on replication than we currently do.

An admirable program; we don't have to shift all the way to Bayesian reasoning to improve matters. Incidentally, what Bayesian inferences are you talking about? I thought the usual proposals/methods involved principally reporting log odds, to avoid exactly the issue of people having varying priors and updating on trials to get varying posteriors.

It's ridiculous to call non-scientific methods "useless". Our civilization is based on such non-scientific methods. Observation, anecdotal evidence, trial and error, markets etc. are all deeply unscientific and extremely useful ways of gaining useful knowledge. Next to these Science is really a fairly minor pursuit.

One senses that the author (the one in the student role) neither has understood the relative-frequency theory of probability nor has performed any empirical research using statistics--lending the essay the tone of an arrogant neophyte. The same perhaps for the professor. (Which institution is on report here?) Frequentists reject the very concept of "the probability of the theory given the data." They take probabilities to be objective, so they think it a category error to remark about the probability of a theory: the theory is either true or false, and probability has nothing to do with it.

You can reject relative-frequentism (I do), but you can't successfully understand it in Bayesian terms. As a first approximation, it may be better understood in falsificationist terms. (Falsificationism keeps getting trotted out by Bayesians, but *that* construct has no place in a *Bayesian* account. These confusions are embarrassingly amateurish.) The Fisher paradigm is that ...

You are reading "the probability of the data D given the theory T" to mean p(D | T), which in turn is short for a ratio p(D & T)/p(T) of probabilities with respect to some universal prior p. But, for the frequentist, there is no universal prior p being invoked.

Rather, each theory comes with its own probability distribution p_T over data, and "the probability of the data D given the theory T" just means p_T(D). The different distributions provided by different theories don't have any relationship with one another. In particular, the different distributions are not the result of conditioning on a common prior. They are incommensurable, so to speak.

The different theories are just more or less correct. There is a "true" probability of the data, which describes the objective propensity of reality to yield those data. The different distributions from the different theories are comparable only in the sense that they each get that true distribution more or less right.

I hope they're not using that landing page for anything important. It's not clear what product (if any) they're selling, there's no call to action, and in general it looks to me like it's doing a terrible job of overcoming inferential distances. I'd say you did a far better job of selling them than they did. Someone needs to read a half a dozen blog posts about how customers only think of themselves, etc.

Great post by the way, Luke.

What should I read to get a good defense of Bayesianism--that isn't just pointing out difficulties with frequentism, NHST, or whatever? I understand the math, but am skeptical that it can be universally applied, due to problems with coming up with the relevant priors and likelihoods.

It's like the problem with simple deduction in philosophy. Yes, if your premises are right, valid deductions will lead you to true conclusions, but the problem is knowing whether the premises used by the old metaphysicians (or modern ones, for that matter) are true. Bayesianis...

You seem to be conflating Bayesian inference with Bayes Theorem. Bayesian inference is a method, not a proposition, so cannot be the conclusion of a deductive argument. Perhaps the conclusion you have in mind is something like "We should use Bayesian inference for..." or "Bayesian inference is the best method for...". But such propositions cannot follow from mathem...

disclaimer: I'm not very knowledgeable in this subject to say the least.

This seems relevant: Share likelihood ratios, not posterior beliefs

It would seem useful for them to publish p(data|hypothesis) because then I can use my priors for p(hypothesis) and p(data) to calculate p(hypothesis|data).

Otherwise, depending on what information they updated on to get their priors I might end up updating on something twice.
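The update this comment describes can be sketched in odds form, where p(data) cancels and each reader only needs the published likelihood ratio plus their own prior. The function name, the likelihood ratio of 4, and the two readers' priors are hypothetical values chosen for illustration.

```python
def update(prior: float, likelihood_ratio: float) -> float:
    """Return P(hypothesis | data) from a prior and a likelihood ratio.

    likelihood_ratio = P(data | hypothesis) / P(data | not hypothesis)
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio  # Bayes' theorem in odds form
    return posterior_odds / (1.0 + posterior_odds)


published_likelihood_ratio = 4.0  # assumed value reported by a study

# Two hypothetical readers who start from different priors.
for reader, prior in {"open-minded reader": 0.5, "skeptical reader": 0.1}.items():
    print(f"{reader}: prior {prior:.2f} -> posterior {update(prior, published_likelihood_ratio):.2f}")
```

Both readers use the same published number but reach different posteriors, and neither risks counting the authors' background information twice.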

Cigarette smoking: an underused tool in high-performance endurance training

musical contrast and chronological rejuvenation

Effects of remote, retroactive intercessory prayer on outcomes i...

Prior methods weren't completely useless. Humans went from hunter-gatherers to civilization without the scientific method or a general notion of science. It is probably more fair to say that science was just much better than all previous methods.

Wait, that confused me. I thought the p-value was the chance of the data *given the null hypothesis*.

In the vernacular, when "testing a hypothesis" we refer to the hypothesis *of interest* as the one being tested, i.e. the alternative to the null - not the null itself. (For instance, we say things like "test the effect of gender", not the more cumbersome "test the null hypothesis of the absence of an effect of gender".)

In any case it wouldn't hurt the OP, and could only make it clearer, to reword it to remove the ambiguity.

I really like the discussions of the problems, but I would have loved to see more discussions of the solutions. How do we know, more specifically, that they will solve things? What are the obstacles to putting them into effect -- why, more specifically, do people just not want to do it? I assume it's something a bit more complex than a bunch of people going around saying "Yeah, I know science is flawed, but I don't really feel like fixing it." (Or maybe it isn't?)

I know this is stating the obvious, but the next stage after this is for people to regard "science" as what's in the database rather than what's in the journals. Otherwise there's still publication bias (unless people like writing up boring results and journals like publishing them)

This, I think, is just one symptom of a more general problem with scientists: they don't emphasize rigorous logic as much as they should. Science, after all, is not only about (a) observation but about (b) making logical inferences from observation. Scientists need to take (b) far more seriously (not that all don't, but many do not). You've heard the old saying "Scientists make poor philosophers." It's true (or at least, true more often than it should be). That has to change. Scientists ought to be amongst the best philosophers in the world, precisely because they ought to be masters of logic.

It is important here to distinguish two roles of statistics in science: exploration and confirmation. It seems likely that Bayesian methods are more powerful (and less prone to misuse) than non-Bayesian methods in the exploratory paradigm.

However, for the more important issue of confirmation, the primary importance of statistical theory is to: 1) provide a set of quantitative guidelines for scientists to design effective (confirmatory) experiments and avoid being misled by the results of poorly designed experiments or experiments with inadequate sample size...

Thanks for putting this together. There are many interesting links in there.

I am hopeful that Bayesian methods can help to solve some of our problems, and there is constant development of these techniques in biology.

Scientists should pay more attention to their statistical tests, and I often find myself arguing with others when I don't like their tests. The most important thing that people need to remember is what "NHST" actually does -- it rejects the null hypothesis. Once they think about what the null hypothesis is, and realize that they have done nothing more than reject it, they will make a lot of progress.

Not only can't I get my head around explanations for Bayes formalisms, I have no idea how to apply them to my science. And that's as a Lesswronger. WinBUGS looks 1000x more complicated than those 'intuitive' explanations of Bayes like 'update your beliefs' and 'your priors should affect your expectations, then be updated'.

Nitpick: actually last year (March 2011, per http://www.ncbi.nlm.nih.gov/pubmed/21280961 ).

I imagine you intended to link to consilience the concept, not the book. Then again you may just be trying to be subtle.