A friend recently posted a link on his Facebook page to an informational graphic about the alleged link between the MMR vaccine and autism. It said, if I recall correctly, that out of 60 studies on the matter, not one had indicated a link.

Presumably, with 95% confidence.

This bothered me. What are the odds, supposing there is no link between X and Y, of conducting 60 studies of the matter, and of all 60 concluding, with 95% confidence, that there is no link between X and Y?

Answer: .95 ^ 60 = .046. (Use the first term of the binomial distribution.)
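As a minimal sketch of that arithmetic, assuming the 60 studies are independent and each has exactly a 5% false-positive rate (the function name is just illustrative):

```python
def prob_no_false_positives(n_studies: int, alpha: float = 0.05) -> float:
    """Probability that none of n independent studies rejects a true null.

    Assumes every study has the same false-positive rate `alpha` and that
    the studies are statistically independent of one another.
    """
    return (1.0 - alpha) ** n_studies


p_all_negative = prob_no_false_positives(60)
print(f"P(all 60 find no link | there is no link) = {p_all_negative:.3f}")  # 0.046
```

This is just the k = 0 term of the binomial distribution with n = 60 and p = 0.05.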

So if it were in fact true that 60 out of 60 studies failed to find a link between vaccines and autism at 95% confidence, this would prove, with 95% confidence, that studies in the literature are biased against finding a link between vaccines and autism.

In reality, you should adjust your literature survey for known biases of the literature. Scientific literature has publication bias, so that positive results are more likely to be reported than negative results.

It also has a bias from errors. Many articles have some fatal flaw that makes their results meaningless. If the distribution of errors is random, I think--though I'm not sure--that we should assume this bias causes regression towards an equal likelihood of positive and negative results.

Given that both of these biases should result, in this case, in more positive results, having all 60 studies agree is even more incredible.

So I did a quick mini-review this morning, looking over all of the studies cited in 6 reviews on the results of studies on whether there is a connection between vaccines and autism:

National Academies Press (2004). Immunization safety review: Vaccines and autism.

National Academies Press (2011). Adverse effects of vaccines: Evidence and causality.

American Academy of Pediatrics (2013): Vaccine safety studies.

The current AAP webpage on vaccine safety studies.

The Immunization Action Coalition: Examine the evidence.

Taylor et al. (2014). Vaccines are not associated with autism: an evidence-based meta-analysis of case-control and cohort studies. *Vaccine* Jun 17;32(29):3623-9. Paywalled, but references given here.

I listed all of the studies that were judged usable in at least one of these reviews, removed duplicates, then went through them all and determined, either from the review article or from the study's abstract, what it concluded. There were 39 studies used, and all 39 failed to find a connection between vaccines and autism. 4 studies were rejected as methodologically unsound by all reviews that considered them; 3 of the 4 found a connection.

(I was, as usual, irked that if a study failed to prove the existence of a link given various assumptions, it was usually cited as having shown that there was no link.)

I understand that even a single study indicating a connection would immediately be seized on by anti-vaccination activists. (I've even seen them manage to take a study that indicated no connection, copy a graph in that study that indicated no connection, and write an analysis claiming it proved a connection.) Out there in the real world, maybe it's good to suppress any such studies. Maybe.

But here on LessWrong, where our job is not physical health, but mental practice, we shouldn't kid ourselves about what the literature is doing. Our medical research methodologies are not good enough to produce 39 papers and have them all reach the right conclusion. The chances of this happening are only .95 ^ 39 = 0.13, even before taking into account publication and error bias.

**Note: This does not apply in the same way to reviews that show a link between X and Y**

If the scientific community felt compelled to revisit the question of whether gravity causes objects to fall, and conducted studies using a 95% confidence threshold comparing apples dropped on Earth to apples dropped in deep space, we would not expect 5% of the studies to conclude that gravity has no effect on apples. 95% confidence means that, even if there is no link, there's a 5% chance the data you get will look as if there is a link. It does not mean that if there is a link, there's a 5% chance the data will look as if there isn't. (In fact, if you're wondering how small studies and large studies can all have 95% confidence, it's because, by convention, the extra power in large studies is spent on being able to detect smaller and smaller effects, not on higher and higher confidence that a detected effect is real. Being able to detect smaller and smaller effects means having a smaller and smaller chance that, if there is an effect, it will be too small for your study to detect. Having "95% confidence" tells you nothing about the chance that you're able to detect a link if it exists. It might be 50%. It might be 90%. This is the information black hole that priors disappear into when you use frequentist statistics.)
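To illustrate the difference between the 5% false-positive rate and the chance of detecting a real effect, here is a rough sketch. It assumes a two-sided, two-sample z-test with known unit variance and a hypothetical standardized effect size of 0.2; none of these numbers come from the vaccine studies, and the function name is made up for the example.

```python
from statistics import NormalDist


def power_two_sample_z(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Power of a two-sided, two-sample z-test with unit variance in each group.

    `effect_size` is the true difference in means in standard-deviation units.
    Returns the probability of rejecting the null at significance level `alpha`.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # The mean of the test statistic shifts by this amount when the effect is real.
    shift = effect_size * (n_per_group / 2.0) ** 0.5
    upper_tail = 1.0 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Same alpha in every row; only the sample size (and hence the power) changes.
for n in (20, 100, 500):
    print(f"n per group = {n:4d}  power = {power_two_sample_z(0.2, n):.2f}")
```

Every row uses the same 95% threshold, yet the chance of missing a real effect varies widely with sample size. With a true effect of zero, the function returns alpha itself.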

## Critiquing bias

One plausible mechanism is that people look harder for methodological flaws in papers they don't like than in papers that they like. If we allowed all 43 of the papers, we'd have 3 / 43 finding a link, which would still be surprisingly low, but possible.

To test this, I looked at Magnuson 2007, "Aspartame: A Safety Evaluation Based on Current Use Levels, Regulations, and Toxicological and Epidemiological Studies" (Critical Reviews in Toxicology, 37:629–727). This review was the primary--in fact, nearly the only--source cited by the most-recent FDA review panel to review the safety of aspartame. The paper doesn't mention that its writing was commissioned by companies who sell aspartame. Googling their names revealed that at least 8 of the paper's 10 authors worked for companies that sell aspartame, either at the time that they wrote it, or shortly afterwards.

I went to section 6.9, "Observations in humans", and counted the number of words spent discussing possible methodological flaws in papers that indicated a link between aspartame and disease, versus the number of words spent discussing possible methodological flaws in papers that indicated no link. I counted only words suggesting problems with a study, not words describing its methodology.

224 words were spent critiquing 55 studies indicating no link, an average of 4.1 words per study. 1375 words were spent critiquing 24 studies indicating a link, an average of 57.3 words per study.

(432 of those 1375 words were spent on a long digression arguing that formaldehyde isn't really carcinogenic, so that figure goes down to only 42.9 words per positive-result study if we exclude that. But that's... so bizarre that I'm not going to exclude it.)

OK so I got interested in this strong claim Phil is making and went to look at the original study he is critiquing so strenuously.

But there is no link to that original study or infographic or whatever.

I don't think there is any value to a strenuous criticism of a study or result when there is no link to that study in the critique.

I tried Google and found this image based on this bibliography. It took me a bit to figure out that it doesn't match Phil's description.

As for Phil's 60 studies, the fact that he gathered his 39 studies from 4 papers *that he does link* suggests that he was not able to find any actual list of 60 studies.

It doesn't matter whether the 39 studies come from the same bibliography or not. Either the 4 reviews are, on average, biased in their review of the literature, or the literature is itself subject to publication bias.

Boy, it's a real pity that there's no research into excess significance in which various authors do systematic samples of large numbers of papers to get field-wide generalizations and observations about whether this is a common phenomenon or not. As it stands, we have no idea whether Phil has cherry-picked a rare phenomenon or not.

Such a pity.

Presumably? I checked the definition of presumably:

So you take this uncertain confidence level of 95% and find:

OK so you presumed 95% confidence level and showed that that confidence level is inconsistent with unanimity across 60 studies.

Assuming the studies are good, what confidence level would be consistent with unanimity?

Answer: .99^60 ≈ 55%

So from this result we conclude either 1) there is a problem with at least some of the studies or 2) there is a problem with the presumption of 95% confidence level, but a 99% confidence level would work fine.
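One way to put a number on "what confidence level would be consistent with unanimity" is to invert the calculation. This is a small sketch, assuming independent studies that all sit at exactly the same confidence level; the helper name and the 50% target are arbitrary choices for illustration.

```python
def confidence_needed_for_unanimity(n_studies: int, target_prob: float) -> float:
    """Per-study confidence c such that c ** n_studies equals target_prob."""
    return target_prob ** (1.0 / n_studies)


# Confidence level at which 60 unanimous null results become a coin flip.
c = confidence_needed_for_unanimity(60, 0.5)
print(f"per-study confidence needed: {c:.4f}")
```

Under those assumptions, anything above roughly 98.9% per study makes 60-out-of-60 agreement more likely than not.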

For this post to have positive value, the case for picking only conclusion 1 above, and not considering conclusion 2, needs to be made. If the 95% confidence level is in fact EXPLICIT in these studies, then that needs to be verified, and the waffle-word "presumably" needs to be removed.

I don't think the "95% confidence" works that way. It's a lower bound: you never try to publish anything with less than 95% confidence (and if you do, your publication is likely to be rejected), but you don't always need to have exactly 95% (2 sigma).

Hell, I play enough RPGs to know that rolling 1 or 20 in a d20 is frequent enough ;) 95% is quite low confidence, it's really a minimum at which you can start working, but not something optimal.

I'm not sure exactly in medicine, but in physics it's frequent to have studies at 3 sigma (99.7%) or higher. The detection of the Higgs boson by the LHC for example was done within 5 sigma (one chance in a million of being wrong).

Especially in a field with a high risk of data being abused by ill-intentioned people, such as the "vaccine and autism" link, it would really surprise me if everyone just happily kept to 95% confidence and didn't aim for much higher confidence.

Careful! That's a one chance in a million of a fluke occurring (given the null hypothesis). Probability of being wrong is P(~H1 | 5 sigma) rather than P(5 sigma | H0), and on the whole unmeasurable. :)

Okay. Be surprised. It appears that I've read hundreds of medical journal articles and you haven't.

Medicine isn't like physics. The data is incredibly messy. High sigma results are often unattainable even for things you know are true.

Was that "exactly 95% confidence" or "at least 95% confidence"?

"Power" is a statistical term of art, and its technical meaning is neither (1 - alpha) nor (1 - p).

In your "critiquing bias" section you allege that 3/43 studies supporting a link is "still surprisingly low". This is wrong; it is actually surprisingly high. If B ~ Binom(43, 0.05), then P(B > 2) ~= 0.36.*

*As calculated by the following Python code:
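The snippet itself is not reproduced here. A minimal stand-in that arrives at the same figure using only the standard library (the function name is an assumption, not the commenter's) might look like this:

```python
from math import comb


def binom_prob_greater_than(k: int, n: int, p: float) -> float:
    """P(B > k) for B ~ Binomial(n, p), computed as 1 minus the CDF at k."""
    cdf_at_k = sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))
    return 1.0 - cdf_at_k


# Chance of more than 2 false positives among 43 studies at a 5% rate.
print(round(binom_prob_greater_than(2, 43, 0.05), 2))  # 0.36
```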

I don't think this is likely, but one possible explanation is that vaccines prevent autism.

This seems like a big leap. 95% confidence means *at least* 95% confidence, right? So if I reject the "vaccines cause autism" hypothesis with p = 0.001, that makes me 95% confident and I publish?

I question this assumption. The distribution of errors that people will make in public communication tracks closely what they can expect to get away with doing. Errors (and non-errors, for that matter) that would result in social sanction will be reviewed more closely ... (read more)

I don't think that it's *necessarily* suspicious in that, a priori, I wouldn't have a problem with 60 tests all being negative even though they're all only 95% confident.

The reason being, depending on the nature of the test, the probability of a false negative might indeed be 5% while the probability of a false positive could be tiny. Suppose this is indeed the case and let's consider the two cases that the true answer is either 'positive' or 'negative'.

(A) if the true conclusion is 'positive', any test can yield a negative with 5% probability. (this test wi... (read more)

What would have happened to results that vaccines *prevent* autism?

When it comes to studies of vaccine side effects, there is one thing that is very worrying. When a new vaccine enters the market there is testing for side effects. Those studies actually do find side effects and the Centers for Disease Control should be a trustworthy source for reporting them.

It turns out different vaccines have quite different side effects. They didn't find any nausea, vomiting, diarrhea, or abdominal pain as a side effect of the Hepatitis A vaccine, but 1 in 4 people who take the HPV--Cervarix vaccine get them. Maybe different vaccines work extremely d... (read more)

Stupid mathematical nitpick:

Actually, it is more correct to say that .95 ^ 39 = 0.14.

If we calculate it out to a few more decimal places, we see that .95 ^ 39 is ~0.135275954. This is closer to 0.14 than to 0.13, and the mathematical convention is to round accordingly.

Did you save a list of the *p*-values reported in the 39 (or 43) studies you looked at? I wonder what I'd get if I aggregated them with Fisher's method.

Well, in light of how the modern scientific process produces a bias against contrary views, the activists' seizing on any studies drawing contrary conclusions appears ... (read more)

First thing, if you put something in your body, it has some effect, even if that effect is small. "No effect" results just rule out effects above different effect sizes (both positive and negative) with high probability, and there's no point talking about "a link" like it's some discrete thing (you sort of jump back and forth between getting this one right and wrong).

Second, different studies will rule out different effect sizes with 95% confidence - or to put it another way, at a given effect size, different studies will have different... (read more)

Vaguely interesting, but a p=.13 result in pretty much any field gets a big fat "meh" from me. Sometimes p=.13 results happen. I'd want stronger evidence before I started to suspect bias.

Simple statistics, but eye-opening. I wonder if gwern would be interested enough to do a similar analysis, or maybe he already has.

Goetz is re-inventing a meta-analytic wheel here (which is nothing to be ashamed of). It certainly is the case that a body of results can be too good to be true. To Goetz's examples, I'll add acupuncture, but wait, that's not all! We can add *everything* to the list: "Do Certain Countries Produce Only Positive Results? A Systematic Review of Controlled Trials" is a fun** paper which finds

'Excess significance' is not a new concept (fun fact: people even use the phrase 'too good to be true' to summarize it, just like Goetz does) and is a valid sign of bias in whatever set of studies one is looking at, and as he says, you can treat it as a binomial to calculate the odds of *n* studies failing to hit their quota of 5% fa... (read more)

But none of the incentives seem particularly strong there. It's not offensive to any state religion, it's not objectionable to local landlords, it's not a subversive creed espoused by revolutionaries who want to depose the emperor. The bad incentives here seem to be small bureaucratic ones along the lines of it being easier to judge academics for promotion based on how many papers they publish. If genuine science can't survive *that* and will degenerate into cargo cult science when hit by such weak incentives...

Yes, that's suspicious. Good instinct. I'm sure there's some bias against publishing a marginally-significant result that's got a low (outside the framework of the paper's statistical model) prior. I'd bet some of the unlucky ones got file-drawered, and others (dishonestly or not) kept on collecting more data until the noise (I presume) was averaged down.

However, you might be missing that on an iso-P contour, false positives have diminishing effect size as sample size increases.

For those who don't know what a case control or cohort study is:

"...Essentially, for a cohort study, you start at the point of exposure, and then follow individuals to see who develops the outcome. In a retrospective cohort, you find information that has recorded participants' prior exposures. In a case control study, you start with a group of cases (those with the disease) and controls (those without disease) and then measure exposures. These two designs are similar, but they differ on the starting point (outcome or exposure)." - AL - UoM

What? No. Just... no. You can't say "Because P('result'|H0)=.05, P(~'result'|H1)=.05".