Like The Cognitive Science of Rationality, this is a post for beginners. Send the link to your friends!

Science is broken. We know why, and we know how to fix it. What we lack is the will to change things.


In 2005, several analyses suggested that most published results in medicine are false. A 2008 review showed that perhaps 80% of academic journal articles mistake "statistical significance" for "significance" in the colloquial meaning of the word, an elementary error every introductory statistics textbook warns against. This year, a detailed investigation showed that half of published neuroscience papers contain one particular simple statistical mistake.

Also this year, a respected senior psychologist published in a leading journal a study claiming to show evidence of precognition. The editors explained that the paper was accepted because it was written clearly and followed the usual standards for experimental design and statistical methods.

Science writer Jonah Lehrer asks: "Is there something wrong with the scientific method?"

Yes, there is.

This shouldn't be a surprise. What we currently call "science" isn't the best method for uncovering nature's secrets; it's just the first set of methods we've collected that wasn't totally useless, as personal anecdote and authority generally are.

As time passes we learn new things about how to do science better. The Ancient Greeks practiced some science, but few scientists tested hypotheses against mathematical models before Ibn al-Haytham's 11th-century Book of Optics (which also contained hints of Occam's razor and positivism). Around the same time, Al-Biruni emphasized the importance of repeated trials for reducing the effect of accidents and errors. Galileo brought mathematics to greater prominence in scientific method, Bacon described eliminative induction, Newton demonstrated the power of consilience (unification), Peirce clarified the roles of deduction, induction, and abduction, and Popper emphasized the importance of falsification. We've also discovered the usefulness of peer review, control groups, blind and double-blind studies, plus a variety of statistical methods, and added these to "the" scientific method.

In many ways, the best science done today is better than ever — but it still has problems, and most science is done poorly. The good news is that we know what these problems are and we know multiple ways to fix them. What we lack is the will to change things.

This post won't list all the problems with science, nor will it list all the promising solutions for any of these problems. (Here's one I left out.) Below, I only describe a few of the basics.


Problem 1: Publication bias

When the study claiming to show evidence of precognition was published, psychologist Richard Wiseman set up a registry for advance announcement of new attempts to replicate the study.

Carl Shulman explains:

A replication registry guards against publication bias, and at least 5 attempts were registered. As far as I can tell, all of the subsequent replications have, unsurprisingly, failed to replicate Bem's results. However, JPSP and the other high-end psychology journals refused to publish the results, citing standing policies of not publishing straight replications.

From the journals' point of view, this (common) policy makes sense: bold new claims will tend to be cited more and raise journal prestige (which depends on citations per article), even though this means most of the 'discoveries' they publish will be false despite their low p-values (high statistical significance). However, this means that overall the journals are giving career incentives for scientists to massage and mine their data for bogus results, but not to challenge bogus results presented by others.

This is an example of publication bias:

Publication bias is the term for what occurs whenever the research that appears in the published literature is systematically unrepresentative of the population of completed studies. Simply put, when the research that is readily available differs in its results from the results of all the research that has been done in an area, readers and reviewers of that research are in danger of drawing the wrong conclusion about what that body of research shows. In some cases this can have dramatic consequences, as when an ineffective or dangerous treatment is falsely viewed as safe and effective. [Rothstein et al. 2005]
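A small simulation can make this definition concrete. The Python sketch below is illustrative only. Every number in it is an assumption: how many tested hypotheses are true, the true effect size, the sample size, and the 0.05 cutoff. The hypothetical "journal" publishes every study with p < 0.05 and nothing else.

```python
import math
import random
import statistics


def simulate_literature(n_studies=2000, share_true=0.1, effect=0.3,
                        n_per_group=30, alpha=0.05, seed=1):
    """Simulate two-group studies and 'publish' only the significant ones.

    All default parameter values are illustrative assumptions.
    Each study is recorded as (effect_is_real, observed_difference, p_value).
    """
    rng = random.Random(seed)
    completed = []
    for _ in range(n_studies):
        effect_is_real = rng.random() < share_true
        true_difference = effect if effect_is_real else 0.0
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_difference, 1.0) for _ in range(n_per_group)]
        observed = statistics.fmean(treated) - statistics.fmean(control)
        # Both groups have known unit variance, so a simple two-sided z-test applies.
        z = observed / math.sqrt(2 / n_per_group)
        p_value = math.erfc(abs(z) / math.sqrt(2))
        completed.append((effect_is_real, observed, p_value))
    published = [study for study in completed if study[2] < alpha]
    return published, completed


published, completed = simulate_literature()
false_share = sum(not real for real, _, _ in published) / len(published)
real_estimates = [diff for real, diff, _ in published if real]

print(f"published {len(published)} of {len(completed)} completed studies")
print(f"share of published findings with no real effect: {false_share:.0%}")
print(f"true effect 0.30; mean published estimate of it: {statistics.fmean(real_estimates):.2f}")
```

Under these assumptions, false positives make up too large a share of the published record, and the record overstates the size of the real effect, even though no individual study was fudged. That is the gap between "the published literature" and "the population of completed studies" that the quote describes.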

Sometimes, publication bias can be more deliberate. The anti-inflammatory drug Rofecoxib (Vioxx) is a famous case. The drug was prescribed to 80 million people, but it was later revealed that its maker, Merck, had withheld evidence of the drug's risks. Merck was forced to recall the drug, but it had already resulted in 88,000-144,000 cases of serious heart disease.


Example partial solution

One way to combat publication bias is for journals to only accept experiments that were registered in a public database before they began. This allows scientists to see which experiments were conducted but never reported (perhaps due to negative results). Several prominent medical journals (e.g. The Lancet and JAMA) now operate this way, but this protocol is not as widespread as it could be.


Problem 2: Experimenter bias

Scientists are humans. Humans are affected by cognitive heuristics and biases (or, really, humans just are cognitive heuristics and biases), and they respond to incentives that may not align with an optimal pursuit of truth. Thus, we should expect experimenter bias in the practice of science.

There are many stages in research during which experimenter bias can occur:

  1. in reading-up on the field,
  2. in specifying and selecting the study sample,
  3. in [performing the experiment],
  4. in measuring exposures and outcomes,
  5. in analyzing the data,
  6. in interpreting the analysis, and
  7. in publishing the results. [Sackett 1979]

Common biases have been covered elsewhere on Less Wrong, so I'll let those articles explain how biases work.


Example partial solution

There is some evidence that the skills of rationality (e.g. cognitive override) are teachable. Training scientists to notice and meliorate biases that arise in their thinking may help them to reduce the magnitude and frequency of the thinking errors that may derail truth-seeking attempts during each stage of the scientific process.


Problem 3: Bad statistics

I remember when my statistics professor first taught me the reasoning behind "null hypothesis significance testing" (NHST), the standard technique for evaluating experimental results. NHST uses "p-values," which are statements about the probability of getting data at least as extreme as one's experimental results, assuming the hypothesis being tested (the "null hypothesis") is true. I asked my professor, "But don't we want to know the probability of the hypothesis we're testing given the data, not the other way around?" The reply was something about how this was the best we could do. (But that's false, as we'll see in a moment.)
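To see how far apart the two quantities can be, here is a minimal Python sketch for a hypothetical guessing experiment with 60 correct guesses out of 100, where chance predicts 50. The skeptical 1% prior and the single alternative hit rate of 60% are assumptions chosen for illustration. A real analysis would need to justify both.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, hits = 100, 60  # hypothetical experiment: 60 correct guesses out of 100

# NHST: probability of data at least this extreme, assuming H0 (pure chance, p = 0.5).
p_value = sum(binom_pmf(k, n, 0.5) for k in range(hits, n + 1))

# Bayes: probability of H0 given the data, against one assumed alternative.
prior_h1 = 0.01     # assumed skeptical prior that the effect is real
rate_if_h1 = 0.6    # assumed hit rate if the effect is real
like_h0 = binom_pmf(hits, n, 0.5)
like_h1 = binom_pmf(hits, n, rate_if_h1)
posterior_h0 = like_h0 * (1 - prior_h1) / (like_h0 * (1 - prior_h1) + like_h1 * prior_h1)

print(f"P(data at least this extreme | H0) = {p_value:.3f}")
print(f"P(H0 | data) = {posterior_h0:.3f}")
```

With these assumptions, the data are fairly surprising under H0, yet H0 remains by far the more probable hypothesis. The p-value answers a question about P(data | H0), while the question put to the professor was about P(H | data).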

Another problem is that NHST computes the probability of getting data as unusual as the data one collected by considering what might be expected if that particular experiment was repeated many, many times. But how do we know anything about these imaginary repetitions? If I want to know something about a particular earthquake, am I supposed to imagine a few dozen repetitions of that earthquake? What does that even mean?
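These "imaginary repetitions" can be made literal with a simulation. The sketch below reruns a hypothetical 100-guess experiment many times under the null hypothesis, with each guess right with probability 0.5, and counts how often chance alone does at least as well as an observed 60 hits. Note the assumption baked into it: every imaginary repetition has exactly 100 trials, because that is the design we suppose the experimenter intended.

```python
import random


def simulated_p_value(observed_hits, n_trials=100, repetitions=100_000, seed=0):
    """Estimate a one-sided p-value by simulating repetitions under the null.

    Each imaginary repetition draws n_trials fair coin flips (the null
    hypothesis), and we count how often the number of hits is at least
    as large as the observed number.
    """
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(repetitions):
        # The number of 1-bits in n_trials random bits is Binomial(n_trials, 0.5).
        hits = bin(rng.getrandbits(n_trials)).count("1")
        if hits >= observed_hits:
            at_least_as_extreme += 1
    return at_least_as_extreme / repetitions


print(f"simulated p-value for 60/100 hits: {simulated_p_value(60):.3f}")
```

Change the assumed design, for example by stopping at a different point, and the set of imaginary repetitions changes with it, even though the observed data do not.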

I tried to answer these questions on my own, but all my textbooks assumed the soundness of the mistaken NHST framework for scientific practice. It's too bad I didn't have a class with biostatistician Steven Goodman, who says:

The p-value is almost nothing sensible you can think of. I tell students to give up trying.

The sad part is that the logical errors of NHST are old news, and have been known ever since Ronald Fisher began advocating NHST in the 1920s. By 1960, Fisher had out-advocated his critics, and philosopher William Rozeboom remarked:

Despite the awesome pre-eminence [NHST] has attained... it is based upon a fundamental misunderstanding of the nature of rational inference, and is seldom if ever appropriate to the aims of scientific research.

There are many more problems with NHST and with "frequentist" statistics in general, but the central one is this: NHST does not follow from the axioms (foundational logical rules) of probability theory. It is a grab-bag of techniques that, depending on how those techniques are applied, can lead to different results when analyzing the same data — something that should horrify every mathematician.
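One classic illustration of "different results from the same data" is the stopping-rule problem. Suppose a coin lands heads 9 times and tails 3 times in 12 flips. The sketch below computes a one-sided p-value for "the coin is fair" under two assumed experimental designs that could have produced exactly the same sequence: fixing 12 flips in advance, or flipping until the third tail.

```python
from math import comb

heads, tails = 9, 3  # the same 12 coin flips, analyzed two ways

# Design A: 12 flips planned in advance (binomial sampling).
# One-sided p-value: P(at least 9 heads in 12 fair flips).
p_fixed_n = sum(comb(12, k) for k in range(heads, 13)) / 2**12

# Design B: flip until the 3rd tail (negative binomial sampling).
# "At least as extreme" now means needing 12 or more flips to see 3 tails,
# i.e. at most 2 tails among the first 11 flips.
p_stop_at_3_tails = sum(comb(11, k) for k in range(0, tails)) / 2**11

print(f"fixed n = 12:     p = {p_fixed_n:.4f}")
print(f"stop at 3rd tail: p = {p_stop_at_3_tails:.4f}")
```

The data are identical, but the first p-value is above the conventional 0.05 cutoff and the second is below it. The p-value depends on which unobserved outcomes count as "more extreme," and that depends on the experimenter's intentions.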

The inferential method that solves the problems with frequentism — and, more importantly, follows deductively from the axioms of probability theory — is Bayesian inference.
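Here is a hedged sketch of what following from the axioms buys in practice. Take a hypothetical coin that lands heads 9 times in 12 flips, and suppose the experimenter either fixed 12 flips in advance or planned to flip until the third tail. The two designs give likelihoods that differ only by a constant factor, so Bayes's theorem yields the same posterior under both. The uniform prior and the grid approximation are simplifying assumptions.

```python
from math import comb


def posterior_prob_heads_favored(likelihood, steps=10_000):
    """P(theta > 0.5 | data) for a coin's heads-probability theta.

    Uses a uniform prior on theta and a midpoint grid approximation.
    """
    thetas = [(i + 0.5) / steps for i in range(steps)]
    weights = [likelihood(t) for t in thetas]  # uniform prior: weight = likelihood
    total = sum(weights)
    return sum(w for t, w in zip(thetas, weights) if t > 0.5) / total


def fixed_n_likelihood(theta):
    # Design A: exactly 12 flips planned; 9 heads and 3 tails observed.
    return comb(12, 9) * theta**9 * (1 - theta) ** 3


def stop_at_third_tail_likelihood(theta):
    # Design B: flip until the 3rd tail; the 12th flip is that tail,
    # so the other 2 tails fall somewhere in the first 11 flips.
    return comb(11, 2) * theta**9 * (1 - theta) ** 3


p_a = posterior_prob_heads_favored(fixed_n_likelihood)
p_b = posterior_prob_heads_favored(stop_at_third_tail_likelihood)
print(f"fixed 12 flips:   P(theta > 0.5 | data) = {p_a:.4f}")
print(f"stop at 3rd tail: P(theta > 0.5 | data) = {p_b:.4f}")
```

The constant factors (comb(12, 9) versus comb(11, 2)) cancel when the posterior is normalized. A p-value, by contrast, changes with the design, because the two designs disagree about which unobserved outcomes count as more extreme.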

So why aren't all scientists using Bayesian inference instead of frequentist inference? Partly, we can blame the vigor of NHST's early advocates. But we can also attribute NHST's success to the simple fact that Bayesian calculations can be more difficult than frequentist calculations. Luckily, new software tools like WinBUGS let computers do most of the heavy lifting required for Bayesian inference.

There's also the problem of sheer momentum. Once a practice is enshrined, it's hard to dislodge it, even for good reasons. I took three statistics courses in university and none of my textbooks mentioned Bayesian inference. I didn't learn about it until I dropped out of university and studied science and probability theory on my own.

Remember the study about precognition? Not surprisingly, it was done using NHST. A later Bayesian analysis of the data disconfirmed the original startling conclusion.


Example partial solution

This one is obvious: teach students probability theory instead of NHST. Retrain current scientists in Bayesian methods. Make Bayesian software tools easier to use and more widespread.



If I'm right that there is unambiguous low-hanging fruit for improving scientific practice, this suggests that particular departments, universities, or private research institutions can (probabilistically) out-perform their rivals (in terms of actual discoveries, not just publications) given similar resources.

I'll conclude with one specific hypothesis. If I'm right, then a research group should be able to hire researchers trained in Bayesian reasoning and in catching publication bias and experimenter bias, and have them extract from the existing literature valuable medical truths that the mainstream medical community doesn't yet know about. This prediction, in fact, is about to be tested.


I only had time to double-check one of the scary links at the top, and I wasn't too impressed with what I found:

In 2010, a careful review showed that published industry-sponsored trials are four times more likely to show positive results than published independent studies, even though the industry-sponsored trials tend to use better experimental designs.

But the careful review you link to claims that studies funded by the industry report 85% positive results, compared to 72% positive by independent organizations and 50% positive by government - which is not what I think of when I hear four times! They also give a lot of reasons to think the difference may be benign: industry tends to do different kinds of studies than independent orgs. The industry studies are mainly Phase III/IV - a part of the approval process where drugs that have already been shown to work in smaller studies are tested on a larger population; the nonprofit and government studies are more often Phase I/II - the first check to see whether a promising new chemical works at all. It makes sense that studies on a drug which has already been found to probably work are more positive than the first studies on a tota...

Yes, "four times as likely" is not the same as an odds ratio of four. And the problem here is the same as the problem in army1987's LL link that odds ratios get mangled in transmission.

But I like odds ratios. In the limit of small probability, odds ratios are the same as "times as likely." But there's nothing 4x as likely as 50%. Does that mean that 50% is very similar to all larger probabilities? Odds ratios are unchanged (or inverted) by taking complements: 4% to 1% is an odds ratio of about 4; 99% to 96% is also 4 (actually 4.1 in both cases). Complementation is exactly what's going on here. The drug companies get 1.2x-1.3x more positive results than the independent studies. That doesn't sound so big, but everyone is likely to get positive results. If we speak in terms of negative results, the independent studies are 2-3x as likely to get negative results as the drug companies. Now it sounds like a big effect.

Odds ratios give a canonical distance between probabilities that doesn't let people cherry-pick between 34% more positives and 3x more negatives. They give us a way to compare any two probabilities that is the obvious one for very small probabilities and is related to the obvi...
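The odds-ratio arithmetic in this comment is easy to check. The following minimal Python sketch uses helper names made up for illustration. It reproduces the 4% vs 1% and 99% vs 96% comparisons, then applies the same measures to the 85% and 72% positive-result rates quoted from the review. "Times as likely" changes depending on whether you count positives or negatives, while the odds ratio only inverts.

```python
import math


def odds(p):
    return p / (1 - p)


def odds_ratio(p1, p2):
    return odds(p1) / odds(p2)


def log_odds(p):
    # The "logit": log odds are the scale that logistic regression works on.
    return math.log(odds(p))


print(f"4% vs 1%:   odds ratio {odds_ratio(0.04, 0.01):.3f}")
print(f"99% vs 96%: odds ratio {odds_ratio(0.99, 0.96):.3f}")

industry, independent = 0.85, 0.72  # positive-result rates quoted from the review
print(f"ratio of positive rates: {industry / independent:.2f}")
print(f"ratio of negative rates: {(1 - independent) / (1 - industry):.2f}")
print(f"odds ratio, positives:   {odds_ratio(industry, independent):.2f}")
print(f"odds ratio, negatives:   {odds_ratio(1 - industry, 1 - independent):.2f}")  # the reciprocal
print(f"log odds ratio:          {log_odds(industry) - log_odds(independent):.2f}")
```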

Thank you for this. I've always been frustrated with odds ratios, but somehow it never occurred to me that they have the beautiful and useful property you describe.
I don't know as much about odds ratios as I would like to, but you've convinced me that they're something I should learn thoroughly, ASAP. Does anybody have a link to a good explanation of them?
would be helpful for you, I think, since an explanation/introduction was the stated goal.
Sorry, I don't have any sources. If you want suggestions from other people, you should try the open thread. Some related words that may be helpful in searching for material are logit and logistic (regression).

Thanks for this. I've removed the offending sentence.

Or if you want to appropriate a different popular phrase, "Never tell me the odds ratio!"

At the least, it allows one to argue that the claim "scientific papers are generally reliable" is self-undermining. The prior probability is also high, given the revolving door of "study of the week" science reporting we all are regularly exposed to.
A lot of the literature on cognitive biases is itself among the best examples of how biased people are (though unfortunately not usually in ways that would prove their point, with the obvious exception of confirmation bias).
Seems like both teaching about biases and learning about biases is dangerous.

We've also discovered the usefulness of peer review

I object, for reasons wonderfully stated by gwern here

Why do we need the process of peer review? Peer review is not robust against even low levels of collusion. Scientists who win the Nobel Prize find their other work suddenly being heavily cited, suggesting either that the community badly failed in recognizing the work's true value or that they are now sucking up & attempting to look better by the halo effect. (A mathematician once told me that often, to boost a paper's acceptance chance, they would add citations to papers by the journal's editors, a practice that will surprise none familiar with Goodhart's law and the use of citations in tenure & grants.) Physicist Michael Nielsen points out that peer review is historically rare (just one of Einstein's 300 papers was peer reviewed! the famous Nature did not institute peer review until 1967), has been poorly studied & not shown to be...

That was actually just a slightly-edited-for-Hacker-News excerpt from my standing mini-essay explaining why we can't trust science too much; the whole thing currently lives at

That link points to your Dual N-Back piece. I think you meant

I am skeptical of the teaching solution section under 2), relative to institutional shifts (favoring confirmatory vs exploratory studies, etc). Section 3 could also bear mention of some of the many ways of abusing Bayesian statistical analyses (e.g. reporting results based on gerrymandered priors, selecting which likelihood ratio to highlight in the abstract and get media attention for, etc). Cosma Shalizi would have a lot to say about it.

I do like the spirit of the post, but it comes across a bit boosterish.

Section 3 could also bear mention of some of the many ways of abusing Bayesian statistical analyses

On this note, I predict that if Bayesian statistical analyses ever displaced NHST as the mainstream standard, they would be misused about as much as NHST.

Currently there's a selection bias: NHST is much more widely taught than Bayesian analyses, so NHST users are much more likely to be lowest-common-denominator crank-turners who don't really understand statistics generally. By contrast, if you've managed to find out how to do Bayesian inference, you're probably better at statistics than the average researcher and therefore less likely to screw up whatever analysis you choose to do. If every researcher were taught Bayesian inference this would no longer be true.

Still, I think Bayesian methods are superior enough that the net benefit of that would be positive. (Also, proper Bayesian training would also cover how to construct ignorance priors, and I suspect nefariously chosen priors would be easier to spot than nefarious frequentist mistreatment of data.)
Bayesian methods are better in a number of ways, but ignorant people using a better tool won't necessarily get better results. I don't think the net effect of a mass switch to Bayesian methods would be negative, but I do think it'd be very small unless it involved raising the general statistical competence of scientists. Even when Bayesian methods get so commonplace that they could be used just by pushing a button in SPSS, researchers will still have many tricks at their disposal to skew their conclusions. Not bothering to publish contrary data, only publishing subgroup analyses that show a desired result, ruling out inconvenient data points as "outliers", wilful misinterpretation of past work, failing to correct for doing multiple statistical tests (and this can be an issue with Bayesian t-tests, like those in the Wagenmakers et al. reanalysis lukeprog linked above), and so on.

As a biologist, I can say that most statistical errors are just that: errors. They are not tricks. If researchers understand the statistics that they are using, a lot of these problems will go away.

A person has to learn a hell of a lot before they can do molecular biology research, and statistics happens to be fairly low on the priority list for most molecular biologists. In many situations we are able to get around the statistical complexities by generating data with very little noise.

Hanlon's Razor FTW.
ISTM a large benefit of commonplace Bayes would be that competent statisticians could do actually meaningful meta-analyses...? Which would counteract widespread statistical ineptitude to a significant extent...?
I'm not sure it'd make much difference. From reading & skimming meta-analyses myself I've inferred that the main speedbumps with doing them are problems with raw data themselves or a lack of access to raw data. Whether the data were originally summarized using NHST/frequentist methods or Bayesian methods makes a lot less difference. Edit to add: when I say "problems with raw data themselves" I don't necessarily mean erroneous data; a problem can be as mundane as the sample/dataset not meeting the meta-analyst's requirements (e.g. if the sample were unrepresentative, or the dataset didn't contain a set of additional moderator variables).
I think that teaching Bayesian methods would itself raise the general statistical competence of scientists as a side effect, among other things because the meaning of p-values is seriously counter-intuitive (so more scientists would actually grok Bayesian statistics in such a world than actually grok frequentist statistics right now).
You could well be right. I'm pessimistic about this because I remember seeing lots of people at school & university recoiling from any statistical topic more advanced than calculating means and drawing histograms. If they were being taught about conjugate priors & hyperparameters I'd expect them to react as unenthusiastically as if they were being taught about confidence levels and maximum likelihood. But I don't have any rock solid evidence for that hunch.

But as with the problem of global warming and its known solutions, what we lack is the will to change things.

Please don't insert gratuitous politics into LessWrong posts.

I removed the global warming phrase.


What David_G said. Global warming is a scientific issue. Maybe "what we lack is the will to change things" is the right analysis of the policy problems, but among climate change experts there's a whole lot more consensus about global warming than there is among AI researchers about the Singularity. "You can't say controversial things about global warming, but can say even more controversial things about AI" is a rule that makes about as much sense as "teach the controversy" about evolution.

...and what to do about it is a political issue.
It's also a political issue, to a much greater extent than the possibility and nature of a technological singularity.

Evolution is also a political issue. Shall we now refrain from talking about evolution, or mentioning what widespread refusal to accept evolution, up to the point of there being a strong movement to undermine the teaching of evolution in US schools, says about human rationality?

I get that it can be especially hard to think rationally about politics. And I agree with what Eliezer has written about government policy being complex and almost always involving some trade-offs, so that we should be careful about thinking there's an obvious "rationalist view" on policy questions.

However, a ban on discussing issues that happen to be politicized is idiotic, because it puts us at the mercy of contingent facts about what forms of irrationality happen to be prevalent in political discussion at this time. Evolution is a prime example of this. Also, if the singularity became a political issue, would we ban discussion of that from LessWrong?

We should not insert political issues which are not relevant to the topic, because the more political issues one brings to the discussion, the less rational it becomes. It would be safest to discuss all issues separately, but sometimes it is not possible, e.g. when the topic being discussed relies heavily on evolution.

One part of trying to be rational is to accept that people are not rational, and act accordingly. For every political topic there is a number of people whose minds will turn off if they read something they disagree with. It does not mean we should be quiet on the topic, but we should not insert it where it is not relevant.

Explaining why X is true, in a separate article, is the correct approach. Saying or suggesting something like "by the way, people who don't think X is true are wrong" in an unrelated topic is the wrong approach. Why is that? In the first example you expect your proof of X to be discussed in the comments, because it is the issue. In the second example, discussions about X in comments are off-topic. Asserting X in a place where discussion of X is unwelcome is a kind of Dark Arts; we should avoid it even if we think X is true.

The topic of evolution, unlike the topic of climate change, is entangled with human psychology, AI, and many other important topics; not discussing it would be highly costly. Moreover, if anyone on LessWrong disagrees with evolution, it's probably along Newsomian eccentric lines, not along tribal political lines. Also, lukeprog's comments on the subject made implicit claims about the policy implications of the science, not just about the science itself, which in turn is less clear-cut than the scientific case against a hypothesis requiring a supernatural agent, though for God's sake please nobody start arguing about exactly how clear-cut.

As a matter of basic netiquette, please use words like "mistaken" or "harmful" instead of "idiotic" to describe views you disagree with.

This post is mostly directed at newbies, who aren't expected to be trained in keeping their brains from shutting down whenever the "politics" pattern matcher goes off. In other words, the mention could cause some readers to stop reading before they get to the gist of the post. Even at Hacker News, I sometimes see "I stopped reading at this point" posts. Also, I see zero benefit from mentioning global warming specifically in this post. Even a slight drawback outweighs zero benefit.
Oh dear... I admit I hadn't thought of the folks who will literally stop reading when they hit a political opinion they don't like. Yeah, I've encountered them. Though I think they have bigger problems than not knowing how to fix science, and don't think mentioning AGW did zero for this post.
(I don't necessarily disagree with your points, I was simply making a relevant factual claim; yet you seem to have unhesitatingly interpreted my factual claim as automatically implying all sorts of things about what policies I would or would not endorse. Hm...)
I didn't interpret it as anything about what gov. policies you'd endorse. I did infer you agreed with Steven's comment. But anyway, my first comment may not have been clear enough, and I think the second comment should be a useful explication of the first one. (Actually, I meant to type "Maybe... isn't the right analysis..." or "Maybe... is the wrong analysis..." That was intended as acknowledgement of the reasons to be cautious about talking policy. But I botched that part. Oops.)
By "policies" I meant "norms of discourse on Less Wrong". I don't have any strong opinions about them; I don't unhesitatingly agree with Steven's opinion. Anyway I'm glad this thread didn't end up in needless animosity; I'm worried that discussing discussing global warming, or more generally discussing what should be discussed, might be more heated than discussing global warming itself.
Yeah. I thought of making another thread for this issue.
As for the difference with the singularity, views on that are not divided much along tribal political lines (ETA: as you acknowledge), and LessWrong seems much better placed to have positive influence there because the topic has received much less attention, because of LessWrong's strong connection (sociological if nothing else) with the Singularity Institute, and because it's a lot more likely to amount to an existential risk in the eyes of most people here of any political persuasion, though again let's not discuss whether they're right.
The point of politics is the mind-killer is that one shouldn't use politically-charged examples when they're not on-topic. This is exactly that case. The article is not about global warming, so it should not make mention of global warming, because that topic makes some people go insane. This does not mean that there cannot be a post about global warming (to the extent that it's on-topic for the site).
Also, "will" may be the wrong concept. How about "not enough people with the power to change things see sufficient reasons to do so"?
Basic science isn't political here. Things like "Humans cause global warming; There is no God; Humans evolved from apes" are politicized some places but here they are just premises. There is no need to drag in political baggage by making "This Is Politics!" declarations in cases like this.
Do you see how the claim that "humans cause global warming" differs from the claim quoted in the grandparent comment?

There are many more problems with NHST and with "frequentist" statistics in general, but the central one is this: NHST does not follow from the axioms (foundational logical rules) of probability theory. It is a grab-bag of techniques that, depending on how those techniques are applied, can lead to different results when analyzing the same data — something that should horrify every mathematician.

The inferential method that solves the problems with frequentism — and, more importantly, follows deductively from the axioms of probability theory — is Bayesian inference.

But two Bayesian inferences from the same data can also give different results. How could this be a non-issue for Bayesian inference while being indicative of a central problem for NHST? (If the answer is that Bayesian inference is rigorously deduced from probability theory's axioms but NHST is not, then the fact that NHST can give different results for the same data is not a true objection, and you might want to rephrase.)

By a coincidence of dubious humor, I recently read a paper on exactly this topic, how NHST is completely misunderstood and employed wrongly and what can be improved! I was only reading it for a funny & insightful quote, but Jacob Cohen (as in, 'Cohen's d') in pg 5-6 of "The Earth Is Round (p < 0.05)" tells us that we shouldn't seek to replace NHST with a "magic alternative" because "it doesn't exist". What we should do is focus on understanding the data with graphics and datamining techniques; report confidence limits on effect sizes, which gives us various things I haven't looked up; and finally, place way more emphasis on replication than we currently do.

An admirable program; we don't have to shift all the way to Bayesian reasoning to improve matters. Incidentally, what Bayesian inferences are you talking about? I thought the usual proposals/methods involved principally reporting log odds, to avoid exactly the issue of people having varying priors and updating on trials to get varying posteriors.

This only works in extremely simple cases.
Could you give an example of an experiment that would be too complex for log odds to be useful?
Any example where there are more than two potential hypotheses. Note, that for example, "this coin is unbiased", "this coin is biased toward heads with p=.61", and "this coin is biased toward heads with p=.62" count as three different hypotheses for this purpose.
This is fair as a criticism of log-odds, but in the example you give, one could avoid the issue of people having varying priors by just reporting the value of the likelihood function. However, this likelihood function reporting idea fails to be a practical summary in the context of massive models with lots of nuisance parameters.
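Here is a minimal sketch of "just report the likelihood function" for the three coin hypotheses above, assuming made-up data of 62 heads in 100 flips. The author reports one likelihood per hypothesis, and each reader combines those numbers with their own prior. With only two hypotheses, the report collapses to a single log likelihood ratio, which is the log-odds shortcut discussed earlier in the thread.

```python
from math import comb, log

HEADS, FLIPS = 62, 100  # hypothetical data, not taken from the thread

HYPOTHESES = {"fair": 0.50, "biased .61": 0.61, "biased .62": 0.62}


def likelihood(p_heads):
    return comb(FLIPS, HEADS) * p_heads**HEADS * (1 - p_heads) ** (FLIPS - HEADS)


def reader_posterior(prior):
    """Combine the reported likelihoods with one reader's prior (name -> probability)."""
    unnormalized = {name: prior[name] * likelihood(p) for name, p in HYPOTHESES.items()}
    total = sum(unnormalized.values())
    return {name: w / total for name, w in unnormalized.items()}


# What the author would report: one number per hypothesis, no prior involved.
best = max(likelihood(p) for p in HYPOTHESES.values())
for name, p in HYPOTHESES.items():
    print(f"{name}: relative likelihood {likelihood(p) / best:.3f}")

# Two readers with different priors reach different posteriors from the same report.
skeptic = reader_posterior({"fair": 0.98, "biased .61": 0.01, "biased .62": 0.01})
agnostic = reader_posterior({"fair": 1 / 3, "biased .61": 1 / 3, "biased .62": 1 / 3})
print("skeptic: ", {k: round(v, 3) for k, v in skeptic.items()})
print("agnostic:", {k: round(v, 3) for k, v in agnostic.items()})

# With exactly two hypotheses, a single log likelihood ratio suffices.
print(f"log LR, .62 vs fair: {log(likelihood(0.62) / likelihood(0.50)):.2f}")
```

As the reply above notes, this stops being a practical summary once a model has many nuisance parameters, because the "report" becomes a high-dimensional function.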
I didn't have any specific examples in mind. But more generally, posteriors are a function of both priors and likelihoods. So even if one avoids using priors entirely by reporting only likelihoods (or some function of the likelihoods, like the log of the likelihood ratio), the resulting implied inferences can change if one's likelihoods change, which can happen by calculating likelihoods with a different model.
If the OP is read to hold constant everything not mentioned as a difference, that includes the prior beliefs of the person doing the analysis, as against the hypothetical analysis that wasn't performed by that person. Does "two Bayesian inferences" imply it is two different people making those inferences, with two people not possibly having identical prior beliefs? Could a person performing axiom-obeying Bayesian inference reach different conclusions than that same person hypothetically would have had they performed a different axiom-obeying Bayesian inference?
I think my reply to gwern's comment (sibling of yours) all but answers your two questions already. But to be explicit: Not necessarily, no. It could be two people who have identical prior beliefs but just construct likelihoods differently. It could be the same person calculating two inferences that rely on the same prior but use different likelihoods. I think so. If I do a Bayesian analysis with some prior and likelihood-generating model, I might get one posterior distribution. But as far as I know there's nothing in Cox's theorem or the axioms of probability theory or anything like those that says I had to use that particular prior and that particular likelihood-generating model. I could just as easily have used a different prior and/or a different likelihood model, and gotten a totally different posterior that's nonetheless legitimate.
The way I interpret hypotheticals in which one person is said to be able to do something other than what they will do, such as "depending on how those techniques are applied," all of the person's priors are to be held constant in the hypothetical. This is the most charitable interpretation of the OP because the claim is that, under Bayesian reasoning, results do not depend on how the same data is applied. This seems obviously wrong if the OP is interpreted as discussing results reached after decision processes with identical data but differing priors, so it's more interesting to talk about agents with other things differing, such as perhaps likelihood-generating models, than it is to talk about agents with different priors.

Can you give an example?
But even if we assume the OP means that data and priors are held constant but not likelihoods, it still seems to me obviously wrong. Moreover, likelihoods are just as fundamental to an application of Bayes's theorem as priors, so I'm not sure why I would have (or ought to have) read the OP as implicitly assuming priors were held constant but not likelihoods (or likelihood-generating models).

I didn't have one, but here's a quick & dirty ESP example I just made up. Suppose that out of the blue, I get a gut feeling that my friend Joe is about to phone me, and a few minutes later Joe does. After we finish talking and I hang up, I realize I can use what just happened as evidence to update my prior probability for my having ESP. I write down:

* my evidence: "I correctly predicted Joe would call" (call this E for short)
* the hypothesis H0 (that I don't have ESP) and its prior probability, 95%
* the opposing hypothesis H1 (that I have ESP) and its prior probability, 5%

Now let's think about two hypothetical mes. The first me guesses at some likelihoods, deciding that P(E | H0) and P(E | H1) were both 10%. Turning the crank, it gets a posterior for H1, P(H1 | E), that's proportional to P(H1) P(E | H1) = 5% × 10% = 0.5%, and a posterior for H0, P(H0 | E), that's proportional to P(H0) P(E | H0) = 95% × 10% = 9.5%. Of course its posteriors have to add to 100%, not 10%, so it multiplies both by 10 to normalize them. Unsurprisingly, as the likelihoods were equal, its posteriors come out at 95% for H0 and 5% for H1; the priors are unchanged.

When the second me is about to guess at some likelihoods, its brain is suddenly zapped by a stray gamma ray. The second me therefore decides that P(E | H0) was 2% but that P(E | H1) was 50%. Applying Bayes's theorem in precisely the same way as the first me, it gets a P(H1 | E) proportional to 5% × 50% = 2.5%, and a P(H0 | E) proportional to 95% × 2% = 1.9%. Normalizing (but this time multiplying by 100/(2.5+1.9)) gives posteriors of about 43% for H0 and 57% for H1.
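The two hypothetical mes can be reproduced in a few lines, which makes it easy to see that the only thing differing between them is the likelihoods. This is a minimal sketch: the function name `posterior` and the dict layout are illustrative choices, not anything from the comment; the priors and likelihoods are the ones given above.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem to a discrete set of hypotheses.

    priors and likelihoods are dicts keyed by hypothesis name
    (hypothetical layout); returns the normalized posteriors P(H | E).
    """
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())  # this sum is P(E)
    return {h: p / total for h, p in unnormalized.items()}


priors = {"H0": 0.95, "H1": 0.05}

# First "me": equal likelihoods, so the evidence leaves the priors unchanged.
first = posterior(priors, {"H0": 0.10, "H1": 0.10})

# Second "me": same prior, same evidence, different likelihoods.
second = posterior(priors, {"H0": 0.02, "H1": 0.50})

for name, post in [("first", first), ("second", second)]:
    print(name, {h: round(p, 3) for h, p in post.items()})
```

Both runs apply Bayes' theorem identically; the second ends with H1 favored (2.5/4.4, roughly 57%) purely because of the changed likelihoods.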

This shouldn't be a surprise. What we currently call "science" isn't the best method for uncovering nature's secrets; it's just the first set of methods we've collected that wasn't totally useless like personal anecdote and authority generally are.

It's ridiculous to call non-scientific methods "useless". Our civilization is based on such non-scientific methods. Observation, anecdotal evidence, trial and error, markets, etc. are all deeply unscientific and extremely useful ways of gaining knowledge. Next to these, science is really a fairly minor pursuit.

I'd say that existing folk practices and institutions (what I think you mean by "our civilization") are based on the non-survival of rival practices and institutions. Our civilization has the institutions it has for the same reason that we have two eyes and not three: not because two eyes are better than three, but because any three-eyed rivals to prototypical two-eyed ancestors happened not to survive. Folk practices have typically been selected at the speed of generations, with cultures surviving or dying out, the latter sometimes due to war or disease, but sometimes simply because the young choose to convert to a more successful culture. Science aims at improving knowledge at a faster rate than folk practice selection.

I asked my professor, "But don't we want to know the probability of the hypothesis we're testing given the data, not the other way around?" The reply was something about how this was the best we could do.

One senses that the author (the one in the student role) neither has understood the relative-frequency theory of probability nor has performed any empirical research using statistics--lending the essay the tone of an arrogant neophyte. The same perhaps for the professor. (Which institution is on report here?) Frequentists reject the very concept of "the probability of the theory given the data." They take probabilities to be objective, so they think it a category error to remark about the probability of a theory: the theory is either true or false, and probability has nothing to do with it.

You can reject relative-frequentism (I do), but you can't successfully understand it in Bayesian terms. As a first approximation, it may be better understood in falsificationist terms. (Falsificationism keeps getting trotted out by Bayesians, but that construct has no place in a Bayesian account. These confusions are embarrassingly amateurish.) The Fisher paradigm is that ...

Then they should also reject the very concept of "the probability of the data given the theory", since that quantity has "the probability of the theory" explicitly in the denominator.

You are reading "the probability of the data D given the theory T" to mean p(D | T), which in turn is short for a ratio p(D & T)/p(T) of probabilities with respect to some universal prior p. But, for the frequentist, there is no universal prior p being invoked.

Rather, each theory comes with its own probability distribution p_T over data, and "the probability of the data D given the theory T" just means p_T(D). The different distributions provided by different theories don't have any relationship with one another. In particular, the different distributions are not the result of conditioning on a common prior. They are incommensurable, so to speak.
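To make the frequentist reading concrete, here is a minimal sketch in which each theory is nothing but its own binomial distribution over coin-toss outcomes. The data (7 heads in 10 tosses) and the two theories are hypothetical, chosen only for illustration. Note that nothing in the code refers to a prior over theories: each number comes from one theory's distribution alone.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical data: 7 heads in 10 tosses.
k, n = 7, 10

# Each "theory" T is just its own distribution p_T over possible data.
theories = {"fair coin": 0.5, "biased coin": 0.7}

for name, p in theories.items():
    # p_T(D): computed from theory T alone, with no common prior involved.
    print(f"p_T(D) under {name}: {binomial_pmf(k, n, p):.4f}")
```

The two numbers can be compared with each other, but on this reading neither is obtained by conditioning a shared prior on a theory.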

The different theories are just more or less correct. There is a "true" probability of the data, which describes the objective propensity of reality to yield those data. The different distributions from the different theories are comparable only in the sense that they each get that true distribution more or less right.

Not LessWronger Bayesians, in my experience.
What about:
It would be more accurate to say that LW-style Bayesians consider falsificationism to be subsumed under Bayesianism as a sort of limiting case. Falsificationism as originally stated (i.e., confirmations are irrelevant; only falsifications advance knowledge) is an exaggerated version of a mathematically valid claim. From An Intuitive Explanation of Bayes' Theorem:
This seems the key step for incorporating falsification as a limiting case; I contest it. The rules of Bayesian rationality preclude assigning an a priori probability of 1 to a synthetic proposition: nothing empirical is so certain that refuting evidence is impossible. (Is that assertion self-undermining? I hope that worry can be bracketed.) As long as you avoid assigning probabilities of 1 or 0 to priors, you will never get an outcome at those extremes. But since P(X|A) is always "intermediate," observing X will never strictly falsify A, which is a good thing because the falsification prong of Popperianism has proven at least as scientifically problematic as the nonverification prong.

I don't think falsification can be squared with Bayes, even as a limiting case. In Bayesian theory, verification and falsification are symmetric (as the slider metaphor really indicates). In principle, you can't strictly falsify a theory empirically any more (or less) than you can verify one. Verification, as the quoted essay confirms, is blocked by the > 0 probability mandatorily assigned to unpredicted outcomes; falsification is blocked by the < 1 probability mandatorily assigned to the expected results. It is no less irrational to be certain that X holds given A than to be certain that X fails given not-A. You are no more justified in assuming absolutely that your abstractions don't leak than in assuming you can range over all explanations.
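The claim that interior priors and interior likelihoods never yield a posterior of exactly 0 or 1 is easy to check numerically. This sketch uses exact rational arithmetic (Python's standard `fractions` module) so that floating-point underflow cannot masquerade as a true zero. The prior of 1/2 and the likelihoods 1/100 and 99/100 are arbitrary illustrative values, and `update` is a hypothetical helper name.

```python
from fractions import Fraction


def update(prior, p_x_given_a, p_x_given_not_a):
    """Exact posterior P(A | X) from Bayes' theorem for a binary hypothesis A."""
    numerator = prior * p_x_given_a
    return numerator / (numerator + (1 - prior) * p_x_given_not_a)


p = Fraction(1, 2)  # illustrative prior for A

# Observe, twenty times in a row, an outcome that A deems very unlikely
# (illustrative likelihoods: 1/100 under A, 99/100 under not-A).
for _ in range(20):
    p = update(p, Fraction(1, 100), Fraction(99, 100))

print(float(p), 0 < p < 1)
```

Each observation multiplies the odds on A by 1/99, so after twenty of them the posterior is 1/(99^20 + 1): astronomically small, but never strictly zero.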
This throws the baby out with the bathwater; we can falsify and verify to degrees. Refusing the terms verify and falsify because we are not able to assign infinite credence seems like a mistake.
I agree; that's why "strictly." But you seem to miss the point, which is that falsification and verification are perfectly symmetric: whether you call the glass half empty or half full on either side of the equation wasn't my concern. Two basic criticisms apply to Popperian falsificationism: 1) it ignores verification (although the "verisimilitude" doctrine tries to overcome this limitation); and 2) it does assign infinite credence to falsification. No. 2 doesn't comport with the principles of Bayesian inference, but seems part of LW Bayesianism (your term): This allowance of a unitary probability assignment to evidence conditional on a theory is a distortion of Bayesian inference. The distortion introduces an artificial asymmetry into the Bayesian handling of verification versus falsification. It is irrational to pretend—even conditionally—to absolute certainty about an empirical prediction.
We all agree on this point. Yudkowsky isn't supposing that anything empirical has probability 1. In the line you quote, Yudkowsky is saying that even if theory A predicts data X with probability 1 (setting aside the question of whether this is even possible), confirming that X is true still wouldn't push our confidence in the truth of A past a certain threshold, which might be far short of 1. (In particular, merely confirming a prediction X of A can never push the posterior probability of A above p(A)/p(X), which might still be too small because too many alternative theories also predict X.) A falsification, on the other hand, can drive the probability of a theory very low, provided that the theory makes some prediction with high confidence (which needn't be equal to 1) that has a low prior probability. That is the sense in which it is true that falsifications tend to be more decisive than confirmations. So, a certain limited and "caveated", but also more precise and quantifiable, version of Popper's falsificationism is correct.

Yes, no observation will drive the probability of a theory down to precisely 0. The probability can only be driven very low. That is why I called falsificationism "an exaggerated version of a mathematically valid claim". As you say, getting to probability 0 is as impossible as getting to probability 1. But getting close to probability 0 is easier than getting equally close to probability 1.

This asymmetry is possible because different kinds of propositions are more or less amenable to being assigned extremely high or low probability. It is relatively easier to show that some data has extremely high or low probability (whether conditional on some theory or a priori) than it is to show that some theory has extremely high conditional probability. Fix a theory A. It is very hard to think up an experiment with a possible outcome X such that p(A | X) is nearly 1. To do this, you would need to show that no other possible theory, even among th
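The asymmetry can be put in numbers. Since p(X | A) ≤ 1, Bayes' theorem gives p(A | X) = p(X | A) p(A) / p(X) ≤ p(A)/p(X), so a confirmation is capped, while a failed high-confidence prediction can cut a theory's probability sharply. In the sketch below the prior (0.10) and the likelihoods (0.99 under A, 0.50 under its rivals) are hypothetical values picked only to illustrate the point, and `bayes` is an illustrative helper name.

```python
def bayes(prior_a, p_obs_given_a, p_obs_given_not_a):
    """Return (P(A | obs), P(obs)) for a binary hypothesis A."""
    joint_a = prior_a * p_obs_given_a
    p_obs = joint_a + (1 - prior_a) * p_obs_given_not_a
    return joint_a / p_obs, p_obs


prior_a = 0.10          # hypothetical prior for theory A
p_x_given_a = 0.99      # A predicts X with high (but not total) confidence
p_x_given_not_a = 0.50  # hypothetical: rival theories predict X half the time

# Confirmation: X is observed.
post_confirm, p_x = bayes(prior_a, p_x_given_a, p_x_given_not_a)

# Falsification: X fails to occur.
post_falsify, _ = bayes(prior_a, 1 - p_x_given_a, 1 - p_x_given_not_a)

print(f"P(A | X)     = {post_confirm:.3f}  (cap p(A)/p(X) = {prior_a / p_x:.3f})")
print(f"P(A | not X) = {post_falsify:.4f}")
```

Observing X only nudges A from 0.10 to about 0.18, just under the cap, because rivals also predict X; observing not-X drops A to about 0.002, low but still not zero.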
All these arguments are at best suggestive. Our abductive capacities are such as to suggest that proving a universal statement about all possible theories isn't necessarily hard. Your arguments, I think, flow from and then confirm a nominalistic bias: accept concrete data; beware of general theories. There are universal statements known with greater certainty than any particular data, e.g., life evolved from inanimate matter and mind always supervenes on physics.
I agree that 1. some universal statements about all theories are very probable, and that 2. some of our theories are more probable than any particular data. I'm not seeing why either of these facts is in tension with my previous comment. Would you elaborate?

The claims I made are true of certain priors. I'm not trying to argue you into using such a prior. Right now I only want to make the points that (1) a Bayesian can coherently use a prior satisfying the properties I described, and that (2) falsificationism is true, in a weakened but precise sense, under such a prior.

I hope they're not using that landing page for anything important. It's not clear what product (if any) they're selling, there's no call to action, and in general it looks to me like it's doing a terrible job of overcoming inferential distances. I'd say you did a far better job of selling them than they did. Someone needs to read half a dozen blog posts about how customers only think of themselves, etc.

Great post by the way, Luke.

The website is currently down and parked by GoDaddy. Archived snapshots exist, but they have all been 404s since 2012.
Their website used to have more content on it, I don't know why they changed it.

What should I read to get a good defense of Bayesianism--that isn't just pointing out difficulties with frequentism, NHST, or whatever? I understand the math, but am skeptical that it can be universally applied, due to problems with coming up with the relevant priors and likelihoods.

It's like the problem with simple deduction in philosophy. Yes, if your premises are right, valid deductions will lead you to true conclusions, but the problem is knowing whether the premises used by the old metaphysicians (or modern ones, for that matter) are true. Bayesianis...

Probability theory can be derived as the extension of classical logic to the case where propositions are assigned plausibilities rather than truth values, so it's not merely like the GIGO problem with simple deduction; it's the direct inheritance of that problem.
You're right. I'll make sure to say "is the same problem" in the future.
A philosophical treatise of universal induction.
This doesn't seem particularly actionable in general for testing scientific hypotheses (which is the general problem with proposing Bayes as a way to fix science).
You may want to check out John Earman's Bayes or Bust?.
I suspect that using only valid deductions, while manipulating terms that already have real meanings attached to them, probably poses at least as great a problem as avoiding untrue premises. I remember during a logic class I took, the teacher made an error of deduction, and I called her out on it. She insisted that it was correct, and every other student in the class agreed. I tried to explain the mistake to her after class, and wasn't able to get her to see the error until I drew a diagram to explain it. It was only an introductory level class, but I don't get the impression that most practicing philosophers are held to a higher standard.

The inferential method that solves the problems with frequentism — and, more importantly, follows deductively from the axioms of probability theory — is Bayesian inference.

You seem to be conflating Bayesian inference with Bayes Theorem. Bayesian inference is a method, not a proposition, so cannot be the conclusion of a deductive argument. Perhaps the conclusion you have in mind is something like "We should use Bayesian inference for..." or "Bayesian inference is the best method for...". But such propositions cannot follow from mathem...

It stands on the foundations of probability theory, and while foundational stuff like Cox's theorem takes some slogging through, once that's in place, it is quite straightforward to justify Bayesian inference.
It's actually somewhat tricky to establish that the rules of probability apply to the Frequentist meaning of probability. You have to mess around with long-run frequencies and infinite limits. Even once that's done, it's hard to make the case that the Frequentist meaning has anything to do with the real world: there is no such thing as an infinitely repeatable experiment. In contrast, a few simple desiderata for "logical reasoning under uncertainty" establish probability theory as the only consistent way of reasoning that satisfies those criteria. Sure, other criteria may suggest some other way of doing so, but no one has put forward any such reasonable way.
Could Dempster-Shafer theory count? I haven't seen anyone do a Cox-style derivation of it, but I would guess there's something analogous in Shafer's original book.
I would be quite interested in seeing such. Unfortunately I don't have any time to look for such in the foreseeable future.
P.S. Bayes Theorem is derived from a basic statement about conditional probability, such as the following: P(S|T) = P(S&T)/P(T). According to the SEP, this is usually taken as a "definition", not an axiom, and Bayesians usually give conditional probability some real-world significance by adding a Principle of Conditionalization. In that case it's the Principle of Conditionalization that requires justification in order to establish that Bayes Theorem is true in the sense that Bayesians require.
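For readers who want the step from the definition to the theorem spelled out, it takes two lines (writing the conditional bar as | and assuming P(S), P(T) > 0). Note this only establishes Bayes' theorem as an identity about whatever P is; as the comment says, whether updating by it is justified is the separate question of conditionalization.

```latex
\begin{aligned}
P(S \mid T) &= \frac{P(S \wedge T)}{P(T)}, \qquad P(T \mid S) = \frac{P(S \wedge T)}{P(S)} \\
\Rightarrow\; P(S \wedge T) &= P(T \mid S)\,P(S) \\
\Rightarrow\; P(S \mid T) &= \frac{P(T \mid S)\,P(S)}{P(T)}
\end{aligned}
```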
Just to follow up on the previous replies to this line of thought, see Wikipedia's article on Cox's theorem and especially reference 6 of that article. On the Principle of Conditionalization, it might be argued that Cox's theorem assumes it as a premise; the easiest way to derive it from more basic considerations is through a diachronic Dutch book argument.

Disclaimer: I'm not very knowledgeable about this subject, to say the least.

This seems relevant: Share likelihood ratios, not posterior beliefs

It would seem useful for them to publish p(data|hypothesis) because then I can use my priors for p(hypothesis) and p(data) to calculate p(hypothesis|data).

Otherwise, depending on what information they updated on to get their priors I might end up updating on something twice.
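A minimal sketch of why reporting the evidence rather than a posterior helps, using the odds form of Bayes' theorem (posterior odds = prior odds × likelihood ratio). The reported likelihood ratio of 4 and the two readers' priors are hypothetical; the ratio is between H and not-H, which sidesteps the need for each reader to estimate p(data) directly. The last lines show the double-counting worry: applying the same evidence twice squares its effect on the odds.

```python
def update_odds(prior_odds, likelihood_ratio):
    """Odds form of Bayes' theorem: posterior odds = prior odds x LR."""
    return prior_odds * likelihood_ratio


def odds_to_prob(odds):
    return odds / (1 + odds)


# Hypothetical published result: the data are 4x as likely under H as under not-H.
reported_lr = 4.0

# Two readers with different (hypothetical) priors apply the same reported ratio.
for prior_prob in (0.1, 0.5):
    prior_odds = prior_prob / (1 - prior_prob)
    post = odds_to_prob(update_odds(prior_odds, reported_lr))
    print(f"prior {prior_prob:.2f} -> posterior {post:.3f}")

# Double counting: if a reader's starting point already reflects this
# evidence, applying the ratio again squares its effect on the odds.
once = update_odds(1.0, reported_lr)
twice = update_odds(once, reported_lr)
print(f"counted once: {odds_to_prob(once):.3f}, counted twice: {odds_to_prob(twice):.3f}")
```

The reader starting at 10% ends near 31% and the reader at 50% ends at 80%: different conclusions from shared evidence, with no one's prior baked into what was published.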

Cigarette smoking: an underused tool in high-performance endurance training

In summary, existing literature supports the use of cigarettes to enhance endurance performance through weight loss and increased serum hemoglobin levels and lung volumes.

musical contrast and chronological rejuvenation

...people were nearly a year-and-a-half younger after listening to “When I’m Sixty-Four” (adjusted M = 20.1 years) rather than to “Kalimba” (adjusted M = 21.5 years), F(1, 17) = 4.92, p = .040.

Effects of remote, retroactive intercessory prayer on outcomes i...

The music link doesn't work. I will tentatively suggest that the difference reported is more about people hearing music which was popular when they were younger than about the details of the music.

What we currently call "science" isn't the best method for uncovering nature's secrets; it's just the first set of methods we've collected that wasn't totally useless like personal anecdote and authority generally are.

Prior methods weren't completely useless. Humans went from hunter-gatherers to civilization without the scientific method or a general notion of science. It is probably more fair to say that science was just much better than all previous methods.

NHST uses "p-values," which are statements about the probability of getting some data (e.g. one's experimental results) given the hypothesis being tested.

Wait, that confused me. I thought the p-value was the chance of the data given the null hypothesis.

Since NHST is "null hypothesis significance testing", the hypothesis being tested is the null hypothesis!

In the vernacular, when "testing a hypothesis" we refer to the hypothesis of interest as the one being tested, i.e. the alternative to the null - not the null itself. (For instance, we say things like "test the effect of gender", not the more cumbersome "test the null hypothesis of the absence of an effect of gender".)

In any case it wouldn't hurt the OP, and could only make it clearer, to reword it to remove the ambiguity.
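For a concrete case of what the p-value is conditioned on, consider a hypothetical coin experiment whose null hypothesis is "the coin is fair", with 9 heads observed in 10 tosses. The sketch computes a one-sided p-value; every probability in it is computed assuming the null is true, and none of them is a probability of any hypothesis. The function name is illustrative.

```python
from math import comb


def one_sided_p_value(heads, tosses, p_null=0.5):
    """P(at least `heads` heads in `tosses` tosses | null hypothesis).

    The probability is computed entirely under the null; nothing here
    is a probability *of* the null or of any alternative.
    """
    return sum(
        comb(tosses, k) * p_null**k * (1 - p_null) ** (tosses - k)
        for k in range(heads, tosses + 1)
    )


print(one_sided_p_value(9, 10))  # 11/1024 = 0.0107421875
```

A small value says the data would be surprising if the coin were fair; it does not by itself say how probable fairness is given the data.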

I really like the discussions of the problems, but I would have loved to see more discussions of the solutions. How do we know, more specifically, that they will solve things? What are the obstacles to putting them into effect -- why, more specifically, do people just not want to do it? I assume it's something a bit more complex than a bunch of people going around saying "Yeah, I know science is flawed, but I don't really feel like fixing it." (Or maybe it isn't?)

journals to only accept experiments that were registered in a public database

I know this is stating the obvious, but the next stage after this is for people to regard "science" as what's in the database rather than what's in the journals. Otherwise there's still publication bias (unless people like writing up boring results and journals like publishing them).

Well, the database wouldn't contain any results. What it does though is reduce the importance of published claims that have a large number of non-published (probably failed) attempts at showing the same effect. Ideally you want the literature review section of a paper to include a mention of all these related but unpublished experiments, not just other published results.
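Why the unpublished attempts matter can be shown with a toy simulation. It rests on a simplifying assumption: every study tests an effect that is truly zero, so each study's p-value is uniform on [0, 1] and it comes out "significant" with probability alpha. The journal is assumed to publish only significant results. All parameters are illustrative.

```python
import random


def simulate_literature(n_studies=1000, alpha=0.05, seed=0):
    """Toy model: studies of a true null effect, and a journal that
    publishes only results with p < alpha.

    Under a true null, p-values are uniform on [0, 1].
    Returns (number published, number of studies run).
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(n_studies)]
    published = [p for p in p_values if p < alpha]
    return len(published), n_studies


published, run = simulate_literature()
print(f"{published} of {run} null studies reach p < 0.05 and get published")
```

Every published paper reports a "positive" finding, while a registry of all attempts would show that roughly 95% of them found nothing; only the registry reveals that the effect is probably not real.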
Boredom is far from the only bad reason that some journals refuse some submissions. Every person in the chain of publication, and that of peer review, must be assumed at least biased and potentially dishonest. Therefore "science" can never be defined by just one database or journal, or even a fixed set of either. Excluded people must always be free to start their own, and their results judged on the processes that produced them. Otherwise whoever is doing the excluding is not to be trusted as an editor. I hasten to add that this kind of bias exists among all sides and parties.

This, I think, is just one symptom of a more general problem with scientists: they don't emphasize rigorous logic as much as they should. Science, after all, is not only about (a) observation but about (b) making logical inferences from observation. Scientists need to take (b) far more seriously (not that all don't, but many do not). You've heard the old saying "Scientists make poor philosophers." It's true (or at least, true more often than it should be). That has to change. Scientists ought to be amongst the best philosophers in the world, precisely because they ought to be masters of logic.

The problem is that philosophers also make poor philosophers. Less snarkily, "logical inference" is overrated. It does wonders in mathematics, but rarely does scientific data logically require a particular conclusion.
Well, of course one cannot logically and absolutely deduce much from raw data. But with some logically valid inferential tools in our hands (Occam's razor, Bayes' Theorem, Induction) we can probabilistically derive conclusions.
In what sense is Occam's razor "logically valid"?
Well, it is not self-contradictory, for one thing. For another thing, every time a new postulate or assumption is added to a theory we are necessarily lowering the prior probability because that postulate/assumption always has some chance of being wrong.
Just to clarify something: I would expect most readers here would interpret "logically valid" to mean something very specific: essentially, something is logically valid if it can't possibly be wrong, under any interpretation of the words (except for words regarded as logical connectives). Self-consistency is a much weaker condition than validity. Also, Occam's razor is about more than just conjunction. Conjunction says that "XY" has at least as high a probability as "XYZ"; Occam's razor says that (in the absence of other evidence), "XY" has a higher probability than "ABCDEFG".
Hi Giles, I think Occam's razor is logically valid in the sense that, although it doesn't always provide the correct answer, it is certain that it will probably provide the correct answer. Also, I'm not sure if I understand your point about conjunction. I've always understood "do not multiply entities beyond necessity" to mean that, all else held equal, you ought to make the fewest number of conjectures/assumptions/hypotheses possible.
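Giles's distinction can be checked on any probability distribution. In the sketch below the joint distribution over three propositions X, Y, Z is hypothetical (any non-negative weights summing to 1 would do). The conjunction rule P(XYZ) ≤ P(XY) holds for every such distribution, which is why it is a theorem; by contrast, nothing in the probability axioms alone ranks "XY" against an unrelated "ABCDEFG", which is the extra work Occam's razor does.

```python
from itertools import product

# Hypothetical joint distribution over three binary propositions X, Y, Z.
# Worlds are ordered (x, y, z) from (True, True, True) to (False, False, False).
weights = [0.10, 0.05, 0.20, 0.15, 0.05, 0.15, 0.10, 0.20]
worlds = list(product([True, False], repeat=3))
joint = dict(zip(worlds, weights))


def prob(event):
    """Total probability of the worlds where `event(x, y, z)` is true."""
    return sum(p for w, p in joint.items() if event(*w))


p_xy = prob(lambda x, y, z: x and y)
p_xyz = prob(lambda x, y, z: x and y and z)

# Adding a conjunct can never raise the probability.
print(p_xy, p_xyz, p_xyz <= p_xy)
```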
The problem is that the connotations of philosophy (in my mind at least) are more like how-many-angels mindwanking than like On the electrodynamics of moving bodies. (This is likely the effect of studying pre-20th-century philosophers for five years in high school.)
21st century philosophers aren't much different.
Saying that people should be better is not helpful. Like all people, scientists have limited time and need to choose how to allocate their efforts. Sometimes more observations can solve a problem, and sometimes more careful thinking is necessary. The appropriate allocation depends on the situation and the talents of the researcher in question. That being said, there may be a dysfunctional bias in how funding is allocated, creating an "all-or-none" environment where the best strategy for maintaining a basic research program (paying for one's own salary plus a couple of students) is to be the type of researcher who gets multi-million dollar grants and uses that money to generate gargantuan new datasets, which can then provide the foundation for a sensational publication that everyone notices.

It is important here to distinguish two roles of statistics in science: exploration and confirmation. It seems likely that Bayesian methods are more powerful (and less prone to misuse) than non-Bayesian methods in the exploratory paradigm.

However, for the more important issue of confirmation, the primary importance of statistical theory is to: 1) provide a set of quantitative guidelines for scientists to design effective (confirmatory) experiments and avoid being misled by the results of poorly designed experiments or experiments with inadequate sample size...
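The first role described above, designing confirmatory experiments with adequate sample size, is usually made concrete through a power calculation. This sketch uses the textbook normal-approximation formula for a two-sided, two-sample comparison of means with equal group sizes, n ≈ 2(z₁₋α/₂ + z₁₋β)² / d² per group, where d is the standardized effect size. That formula is my choice of illustration, not something the comment specifies, and exact t-based calculations give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means (normal approximation, equal group sizes).

    effect_size is the standardized mean difference (Cohen's d).
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

Under these assumptions, detecting a "medium" effect (d = 0.5) with 80% power takes about 63 subjects per group, and a small effect (d = 0.2) takes several hundred, which is why underpowered studies are so easy to run by accident.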

Thanks for putting this together. There are many interesting links in there.

I am hopeful that Bayesian methods can help to solve some of our problems, and there is constant development of these techniques in biology.

Scientists should pay more attention to their statistical tests, and I often find myself arguing with others when I don't like their tests. The most important thing that people need to remember is what "NHST" actually does -- it rejects the null hypothesis. Once they think about what the null hypothesis is, and realize that they have done nothing more than reject it, they will make a lot of progress.


Not only can't I get my head around explanations of Bayesian formalisms, I have no idea how to apply them to my science. And that's as a LessWronger. WinBUGS looks 1000 times more complicated than those "intuitive" explanations of Bayes like "update your beliefs" and "your priors should affect your expectations, then be updated".
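For simple models, "update your beliefs" can be done exactly without WinBUGS or any sampler. The sketch below uses a conjugate pair: a Beta prior on an unknown success rate combined with binomial data yields a Beta posterior whose parameters are just the prior's plus the observed counts. The experiment (14 of 20 samples responding) and the uniform Beta(1, 1) prior are hypothetical; models with many parameters generally do need MCMC tools like WinBUGS.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior + binomial data
    -> Beta(alpha + successes, beta + failures) posterior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical experiment: 14 of 20 treated samples respond.
prior = (1, 1)  # uniform prior on the response rate
post = beta_binomial_update(*prior, successes=14, failures=6)

print(post, round(beta_mean(*post), 3))
```

Here the posterior is Beta(15, 7), whose mean is 15/22. A different prior would simply start from different (alpha, beta) values, which makes the prior's influence easy to inspect.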

Also this year,

Nitpick: actually last year (March 2011).

I imagine you intended to link to consilience the concept, not the book. Then again you may just be trying to be subtle.