Too good to be true

[-]mwengler11y210

A friend recently posted a link on his Facebook page to an informational graphic about the alleged link between the MMR vaccine and autism. It said, if I recall correctly, that out of 60 studies on the matter, not one had indicated a link.

Presumably, with 95% confidence.

OK so I got interested in this strong claim Phil is making and went to look at the original study he is critiquing so strenuously.

But there is no link to that original study or infographic or whatever.

I don't think there is any value to a strenuous criticism of a study or result when there is no link to that study in the critique.

[-]Douglas_Knight11y130

I tried Google and found this image based on this bibliography. It took me a bit to figure out that it doesn't match Phil's description.

As for Phil's 60 studies, the fact that he gathered his 39 studies from 4 papers that he does link suggests that he was not able to find any actual list of 60 studies. It doesn't matter whether the 39 studies come from the same bibliography or not. Either the 4 reviews are, on average, biased in their review of the literature, or the literature is itself subject to publication bias.

7private_messaging11y

And of course, whenever Phils of this world encounter the example of results not being slightly too good to be true, they're just as likely to write an LW post about that.

[-]gwern11y100

whenever Phils of this world encounter the example of results not being slightly too good to be true

Boy, it's a real pity that there's no research into excess significance in which various authors do systematic samples of large numbers of papers to get field-wide generalizations and observations about whether this is a common phenomenon or not. As it stands, we have no idea whether Phil has cherry-picked a rare phenomenon or not.

Such a pity.

3private_messaging11y

Well, I don't see anyone writing about e.g. physics results not being too good to be true, or government-sponsored pharmaceutical studies not being too good to be true etc. Nor would it be particularly rare to obtain that sort of result anyway.

6gwern11y

Well, more generally people do apply that sort of reasoning in being skeptical of improbable results, like most people's reaction (especially on LW) to the neutrino FTL result was that the result was simply wrong, regardless of how many measurements they took. I'm not really familiar with how significance-testing is used in physics, but at least under the six-sigma level of alpha, it would take an enormous number of studies of a null hypothesis before the lack of statistical-significance would become 'too good to be true'. Then maybe you should look instead of talking out of your ass. People talk about problems with clinical trials all the time, and pharmaceutical & medicine in general is the home stomping grounds for a lot of meta approaches like excess significance.

1private_messaging11y

Physics is very diverse. There's those neutrino detectors which detect and fail to detect rare events, for example. Yes, and they don't seem to talk much about non problems.

6gwern11y

OK, so? Do they impose six-sigmas on the total result, subdivisions, or what? Yes, because almost all clinical trials stink. Publication bias is pervasive, and the methodological problems are almost universal. When you read through, say, Cochrane meta-analyses or reviews, it's normal to find that something like 90%+ of studies had to be discarded because they lacked such basic desiderata as 'blinding' or 'randomization' or simply didn't specify important things like sample sizes or intent-to-treat. That people are willing to cite studies at all is 'talking about non problems'.

[-]mwengler11y120

Presumably, with 95% confidence.

Presumably? I checked the definition of presumably:

used to convey that what is asserted is very likely though not known for certain.

So you take this uncertain confidence level of 95% and find:

What are the odds, supposing there is no link between X and Y, of conducting 60 studies of the matter, and of all 60 concluding, with 95% confidence, that there is no link between X and Y?

Answer: .95 ^ 60 = .046. (Use the first term of the binomial distribution.)

OK so you presumed 95% confidence level and showed that that confidence level is inconsistent with unanimity across 60 studies.

Assuming the studies are good, what confidence level would be consistent with unanimity?

Answer: .99^60 = .547, or about 55%

So from this result we conclude either 1) there is a problem with at least some of the studies or 2) there is a problem with the presumption of 95% confidence level, but a 99% confidence level would work fine.
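
The arithmetic in this comment can be checked with a short Python sketch. It assumes the studies are independent and that each study's false-positive rate is exactly one minus its confidence level; both are simplifications, and the helper name `prob_all_null` is made up for illustration.

```python
# Probability that every one of n independent studies of a true null
# avoids a false positive, for several per-study confidence levels.
# Assumption: each study's false-positive rate is exactly 1 - confidence.

def prob_all_null(confidence: float, n_studies: int) -> float:
    """P(no study out of n_studies reports a spurious link)."""
    return confidence ** n_studies

for confidence in (0.80, 0.95, 0.99):
    p = prob_all_null(confidence, 60)
    print(f"confidence={confidence:.2f}: P(all 60 find no link) = {p:.3g}")
```

Looping over several confidence levels makes the point of the comment visible: whether unanimity is suspicious depends heavily on the assumed per-study level.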

For this post to have positive value, the case for picking only conclusion 1 above, and not considering conclusion 2, needs to be made. If the 95% confidence level is in fact EXPLICIT in these studies, then that needs to be verified, and the waffle-word "presumably" needs to be removed.

4gwern11y

Is there any reason at all to think that these medical studies didn't use 95%? The universal confidence level, used pretty much everywhere in medicine and psychology except in rare subfields like genomics, so universal that authors of papers typically don't even bother to specify or justify the confidence level?

3private_messaging11y

There's all sorts of things one has to control for, e.g. parents' age, that may inflate the error bars (if the error in imperfectly controlling for a confounder is accounted for), putting zero within the error bars. Without looking at all the studies one can't really tell. Some studies ought to also have a chance of making a superfluous finding that 'vaccines prevent autism', but apparently that was not observed either.

3gwern11y

What does that have to do with whether the researchers followed the nigh-universal practice of setting alpha to 0.05?

3private_messaging11y

Example: I am measuring radioactivity with a Geiger counter. I have statistical error (with the 95% confidence interval), but I also have systematic error (e.g. the Geiger counter's sensitivity is 'guaranteed' to be within 5% of a specified value). If I am reporting an unusual finding, I'd want the result not to be explainable by the sum of statistical error and the bound on the systematic error. Bottom line is, generally there's no guarantee that "95% confidence" findings will go the other way 5% of the time. It is perfectly OK to do something that may inadvertently boost the confidence.

2wedrifid11y

I'd love to see a paper get published that justified the confidence level with "because if I wanted to do rigorous science I would have studied physics" or "because we only have enough jelly beans to run 30 studies, will only be given more jelly beans if we get a positive result and so need to be sure".

1mwengler11y

Suppose there were 60 studies that showed no correlation between autism and vaccines at a 99% confidence level. Then it would not be particularly surprising that there were indeed 60 studies with that result. Would you expect the authors to point out that their result was actually 99% confident even though the usual standard, which they were not explicitly claiming anyway, was 95%?

0mwengler11y

retracted

3DanielLC11y

That part was just him noticing his confusion. The only way to figure out what the real confidence levels were would be to try and find the studies, which is exactly what he did.

5private_messaging11y

I read his post twice and I still don't see him having figured out the real confidence levels or claiming to have. edit: besides, Phil's own claims don't even meet the 95% confidence, and god only knows out of how big of a pool he fished this bias example from, and how many instances of 'a few studies find a link but most don't' he ignored until he came up with this.

[-]kilobug11y120

I don't think the "95% confidence" works that way. It's a lower bound, you never try to publish anything with a lower than 95% confidence (and if you do, your publication is likely to be rejected), but you don't always need to have exactly 95% (2 sigma).

Hell, I play enough RPGs to know that rolling 1 or 20 in a d20 is frequent enough ;) 95% is quite low confidence, it's really a minimum at which you can start working, but not something optimal.

I'm not sure exactly in medicine, but in physics it's frequent to have studies at 3 sigma (99.7%) or higher. The detection of the Higgs boson by the LHC for example was done within 5 sigma (one chance in a million of being wrong).

Especially in a field with high risk of data being abused by ill-intentioned people such as "vaccine and autism" link, it would really surprise me that everyone just kept happily the 95% confidence, and didn't aim for much higher confidence.

[-]nshepperd11y160

5 sigma (one chance in a million of being wrong)

Careful! That's one chance in a million of a fluke occurring (given the null hypothesis). Probability of being wrong is P(~H1 | 5 sigma) rather than P(5 sigma | H0), and on the whole unmeasurable. :)
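
A rough Python sketch of the distinction: the tail probability P(5 sigma | H0) can be computed directly, but turning it into P(H0 | 5 sigma) needs a prior and a likelihood under H1. Those two inputs are invented below purely for illustration; nothing in the thread supplies them.

```python
import math

def one_sided_tail(sigma: float) -> float:
    """P(Z >= sigma) for a standard normal Z, i.e. P(result this extreme | H0)."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))

def posterior_h0(p_data_given_h0: float, p_data_given_h1: float, prior_h0: float) -> float:
    """Bayes' rule for P(H0 | data); all three inputs come from the analyst."""
    joint_h0 = p_data_given_h0 * prior_h0
    joint_h1 = p_data_given_h1 * (1.0 - prior_h0)
    return joint_h0 / (joint_h0 + joint_h1)

p_fluke = one_sided_tail(5.0)
print(f"P(5 sigma | H0) = {p_fluke:.3g}")

# Hypothetical inputs: the priors and P(data | H1) = 0.5 are made-up numbers.
for prior_h0 in (0.5, 0.999999):
    post = posterior_h0(p_fluke, p_data_given_h1=0.5, prior_h0=prior_h0)
    print(f"prior P(H0) = {prior_h0}: P(H0 | 5 sigma) = {post:.3g}")
```

With a sufficiently skeptical prior, the same 5-sigma result leaves a substantial probability that the null is true, which is why the two quantities should not be conflated.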

[-]PhilGoetz11y110

Especially in a field with high risk of data being abused by ill-intentioned people such as "vaccine and autism" link, it would really surprise me that everyone just kept happily the 95% confidence, and didn't aim for much higher confidence.

Okay. Be surprised. It appears that I've read hundreds of medical journal articles and you haven't.

Medicine isn't like physics. The data is incredibly messy. High sigma results are often unattainable even for things you know are true.

[-]dvasya11y120

Was that "exactly 95% confidence" or "at least 95% confidence"?

5dvasya11y

Also, different studies have different statistical power, so it may not be OK to simply add up their evidence with equal weights.

8A1987dM11y

p-values are supposed to be distributed uniformly from 0 to 1 conditional on the null hypothesis being true.
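
A small simulation sketch of this claim, using a two-sided z-test with known variance on data generated under the null. The study count, sample size, and seed are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist

def simulate_null_p_values(n_studies: int, n_per_study: int, seed: int = 0) -> list:
    """Two-sided z-test p-values for studies whose true mean is 0 (null is true)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_study)]
        # z statistic for the sample mean, with known sigma = 1
        z = (sum(sample) / n_per_study) * n_per_study ** 0.5
        p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
    return p_values

p_values = simulate_null_p_values(n_studies=5000, n_per_study=30)
for cutoff in (0.05, 0.25, 0.50, 0.75):
    frac = sum(p <= cutoff for p in p_values) / len(p_values)
    print(f"fraction of p-values <= {cutoff:.2f}: {frac:.3f}")
```

If the p-values are uniform, the fraction at or below each cutoff should sit close to the cutoff itself, up to simulation noise.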

1PhilGoetz11y

No; it's standard to set the threshold for your statistical test for 95% confidence. Studies with larger samples can detect smaller differences between groups with that same statistical power.

[-]Cyan11y160

No; it's standard to set the threshold for your statistical test for 95% confidence. That's its statistical power.

"Power" is a statistical term of art, and its technical meaning is neither (1 - alpha) nor (1 - p).

4PhilGoetz11y

Oops; you're right. Careless of me; fixed.

0DanielLC11y

It's times like this that I wish Doctor Seuss was a mathematician (or statistician in this case). If they were willing to make up new words, we'd be able to talk without accidentally using jargon that has technical meaning we didn't intend.

5Mass_Driver11y

I'm confused about how this works. Suppose the standard were to use 80% confidence. Would it still be surprising to see 60 of 60 studies agree that A and B were not linked? Suppose the standard were to use 99% confidence. Would it still be surprising to see 60 of 60 studies agree that A and B were not linked? Also, doesn't the prior plausibility of the connection being tested matter for attempts to detect experimenter bias this way? E.g., for any given convention about confidence intervals, shouldn't we be quicker to infer experimenter bias when a set of studies conclude (1) that there is no link between eating lithium batteries and suffering brain damage vs. when a set of studies conclude (2) that there is no link between eating carrots and suffering brain damage?

9PhilGoetz11y

"95% confidence" means "I am testing whether X is linked to Y. I know that the data might randomly conspire against me to make it look as if X is linked to Y. I'm going to look for an effect so large that, if there is no link between X and Y, the data will conspire against me only 5% of the time to look as if there is. If I don't see an effect at least that large, I'll say that I failed to show a link between X and Y." If you went for 80% confidence instead, you'd be looking for an effect that wasn't quite as big. You'd be able to detect smaller clinical effects--for instance, a drug that has a small but reliable effect--but if there were no effect, you'd be fooled by the data 20% of the time into thinking that there was. It would if the papers claimed to find a connection. When they claim not to find a connection, I think not. Suppose people decided to test the hypothesis that stock market crashes are caused by the Earth's distance from Mars. They would gather data on Earth's distance from Mars, and on movements in the stock market, and look for a correlation. If there is no relationship, there should be zero correlation, on average. That (approximately) means that half of all studies will show a negative correlation, and half will have positive correlation. They need to pick a number, and say that if they find a positive correlation above that number, they've proven that Mars causes stock market crashes. And they pick that number by finding the correlation just exactly large enough that, if there is no relationship, it happens 5% of the time by chance. If the proposition is very very unlikely, somebody might insist on a 99% confidence interval instead of a 95% confidence interval. That's how prior plausibility would affect it. Adopting a standard of 95% confidence is really a way of saying we agree not to haggle over priors.

1V_V11y

I think it is "only at most 5% of the time".

6Douglas_Knight11y

No, we are choosing the effect size before we do the study. We choose it so that if the true effect is zero, we will have a false positive exactly 5% of the time.
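
As a sketch of what choosing the threshold in advance looks like, the Python below computes the smallest sample correlation that a one-sided test would call significant, using the Fisher z approximation. The sample size of 100 and the function name are assumptions for illustration, not anything taken from the studies.

```python
import math
from statistics import NormalDist

def critical_correlation(confidence: float, n: int) -> float:
    """Smallest sample correlation declared 'significant' in a one-sided test.

    Uses the Fisher z approximation: atanh(r) is roughly normal with
    standard deviation 1/sqrt(n - 3) when the true correlation is zero.
    """
    z = NormalDist().inv_cdf(confidence)
    return math.tanh(z / math.sqrt(n - 3))

n = 100  # hypothetical number of paired observations
for confidence in (0.80, 0.95, 0.99):
    r = critical_correlation(confidence, n)
    print(f"{confidence:.0%} confidence: call it a link only if r > {r:.3f}")
```

The threshold depends only on the chosen confidence level and the sample size, so it can be fixed before any data are collected.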

2jbay11y

How does this work for a binary quantity? If your experiment tells you that [x > 45] with 99% confidence, you may in certain cases be able to confidently transform that to [x > 60] with 95% confidence. For example, if your experiment tells you that the mass of the Q particle is 1.5034(42) with 99% confidence, maybe you can say instead that it's 1.50344(2) with 95% confidence. If your experiment happens to tell you that [particle Q exists] is true with 99% confidence, what kind of transformation can you apply to get 95% confidence instead? Discard some of your evidence? Add noise into your sensor readings? Roll dice before reporting the answer?

3Douglas_Knight11y

We're not talking about a binary quantity.

2V_V11y

According to Wikipedia:

5Douglas_Knight11y

Quoting authorities without further commentary is a dick thing to do. I am going to spend more words speculating about the intention of the quote than are in the quote, let alone that you bothered to type. I have no idea what you think is relevant about that passage. It says exactly what I said, except transformed from the effect size scale to the p-value scale. But somehow I doubt that's why you posted it. The most common problem in the comments on this thread is that people confuse false positive rate with false negative rate, so my best guess is that you are making that mistake and thinking the passage supports that error (though I have no idea why you're telling me). Another possibility, slightly more relevant to this subthread, is that you're pointing out that some people use other p-values. But in medicine, they don't. They almost always use 95%, though sometimes 90%.

1V_V11y

My confusion is about "at least" vs. "exactly". See my answer to Cyan.

3Cyan11y

You want size, not p-value. The difference is that size is a "pre-data" (or "design") quantity, while the p-value is post-data, i.e., data-dependent.

3V_V11y

Thanks. So if I set size at 5%, collect the data, and run the test, and repeat the whole experiment with fresh data multiple times, should I expect that, if the null hypothesis is true, the test accepts exactly 5% of the time, or at most 5% of the time?

3Cyan11y

If the null hypothesis is simple (that is, if it picks out a single point in the hypothesis space), and the model assumptions are true blah blah blah, then the test (falsely) rejects the null with exactly 5% probability. If the null is composite (comprises a non-singleton subset of parameter space), and there is no nice reduction to a simple null via mathematical tricks like sufficiency or the availability of a pivot, then the test falsely rejects the null with at most 5% probability. But that's all very technical; somewhat less technically, almost always, a bootstrap procedure is available that obviates these questions and gets you to "exactly 5%"... asymptotically. Here "asymptotically" means "if the sample size is big enough". This just throws the question onto "how big is big enough," and that's context-dependent. And all of this is about one million times less important than the question of how well each study addresses systematic biases, which is an issue of real, actual study design and implementation rather than mathematical statistical theory.
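
The "exactly" versus "at most" distinction can be sketched in Python with a one-sided z-test of the composite null "effect <= 0". This is a textbook toy case chosen for illustration, not the design of any study discussed here.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def rejection_probability(true_effect_in_se_units: float, size: float = 0.05) -> float:
    """P(reject) for a one-sided z-test of H0: effect <= 0 at the given size.

    The test rejects when z exceeds the critical value, and z is
    distributed Normal(true effect, 1) in standard-error units.
    """
    critical = STD_NORMAL.inv_cdf(1.0 - size)
    return 1.0 - STD_NORMAL.cdf(critical - true_effect_in_se_units)

# Every effect <= 0 belongs to the composite null; only the boundary reaches 5%.
for effect in (0.0, -0.5, -1.0, -2.0):
    prob = rejection_probability(effect)
    print(f"true effect = {effect:+.1f} SE: P(false rejection) = {prob:.4f}")
```

At the boundary of the null the false rejection rate equals the size; everywhere else inside the null it falls below it.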

3dvasya11y

How does your choice of threshold (made beforehand) affect your actual data and the information about the actual phenomenon contained therein?

[-]benkuhn11y100

In your "critiquing bias" section you allege that 3/43 studies supporting a link is "still surprisingly low". This is wrong; it is actually surprisingly high. If B ~ Binom(43, 0.05), then P(B > 2) ~= 0.36.*

*As calculated by the following Python code:

from scipy.stats import binom
b = binom(43, 0.05)  # false positives among 43 studies of a true null
p_less_than_3 = sum(b.pmf(i) for i in [0, 1, 2])  # P(B <= 2)
print(1 - p_less_than_3)  # P(B > 2)

0PhilGoetz11y

I said "surprisingly low" because of publication & error bias.

3private_messaging11y

Which way do you think publication bias on the issue goes, anyway?

0PhilGoetz9y

I wrote a paragraph on that in the post. I predicted a publication bias in favor of positive results, assuming the community is not biased on the particular issue of vaccines & autism. This prediction is probably wrong, but that hypothesis (lack of bias) is what I was testing.

[-]DanielLC11y90

I don't think this is likely, but one possible explanation is that vaccines prevent autism.

3ChristianKl11y

If that's true, why didn't one of the researchers publish a paper on that thesis? It should show up in the data they gathered.

1DanielLC11y

Only if it's statistically significant. It could be a small enough effect that they don't notice unless they're looking for it (if you're going to publish a finding from either extreme, you're supposed to use a two-tailed test, so they'd presumably want something stronger than p = 0.05), but large enough to keep them from accidentally noticing the opposite effect.

5gwern11y

Or alternately, it's a large effect but the rarity of autism and of non-vaccinated kids makes it hard to reach statistical-significance given sampling error.

So let's see, the suggestion here is that the reason so few studies threw up a false positive was that the true effect was the opposite of the alternative, vaccines reduce autism. Autism is... what, 0.5% of the general population of kids these days? And unvaccinated kids are, according to a random Mother Jones article, ~1.8%. So let's imagine that vaccines halve the risk of autism down from the true 1.0% to the observed 0.5% (halving certainly seems like a 'large' effect to me), autism has the true base rate of 1.0% in unvaccinated, and the unvaccinated make up 1.8% of the population. If we randomly sampled the population in general, how much would we have to sample in order to detect a difference in autism rates between the vaccinated & unvaccinated?

The regular R function I'd use for this, power.prop.test, doesn't work since it assumes balanced sample sizes, not 1.8% in one group and 98.2% in the other. I could write a simulation to do the power calculation for a prop.test since the test itself handles imbalanced sample sizes, but then I googled and found someone had written something very similar for the Wilcoxon u-test, so hey, I'll use the samplesize library instead; filling in the relevant values, we find for a decent chance of detecting such a correlation of vaccination with reduced autism, it takes:

R> library(samplesize)
R> n.wilcox.ord(power = 0.8, alpha = 0.05, t = 0.018, c(0.005,0.995), c(0.010,0.990))
$`total sample size`
[1] 89947

$m
[1] 88328

$n
[1] 1619

a total n=90k. I'm guessing that most studies don't get near that.

Of course, a lot of that penalty is going towards picking up enough kids who are both autistic and unvaccinated, so one could do better by trying to preferentially sample either of those groups, but then one gets into thorny questions about whether one's convenience sample
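
For comparison, here is a rough Python sketch of the same question using a different technique: the usual normal-approximation sample size formula for a two-sided two-proportion z-test with unequal group sizes. This is not the Wilcoxon-based method quoted in the comment, so the answer will not match n=90k exactly; the function name and defaults are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def total_n_two_proportions(p1, p2, frac_group1, alpha=0.05, power=0.80):
    """Approximate total sample size for a two-sided two-proportion z-test
    when a fraction `frac_group1` of the sample falls in group 1.
    Normal approximation only; not the Wilcoxon-based calculation."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    f1, f2 = frac_group1, 1.0 - frac_group1
    p_bar = f1 * p1 + f2 * p2  # pooled rate under the null
    se_null = math.sqrt(p_bar * (1.0 - p_bar) * (1.0 / f1 + 1.0 / f2))
    se_alt = math.sqrt(p1 * (1.0 - p1) / f1 + p2 * (1.0 - p2) / f2)
    root_n = (z_alpha * se_null + z_beta * se_alt) / abs(p1 - p2)
    return math.ceil(root_n ** 2)

# Numbers from the comment: 1.8% unvaccinated, 1.0% vs 0.5% autism rates.
n_total = total_n_two_proportions(p1=0.010, p2=0.005, frac_group1=0.018)
print("approximate total n:", n_total)
print("of which unvaccinated:", round(n_total * 0.018))

# Same rates with balanced groups, to show the cost of the imbalance.
print("balanced design total n:", total_n_two_proportions(0.010, 0.005, 0.5))
```

Comparing against a balanced design shows how much of the required sample is driven by the scarcity of unvaccinated children rather than by the size of the effect.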

2DanielLC11y

As the original article says, if there was no effect, you'd expect a few studies to get p < 0.05 by chance. Similarly, if there was no effect, you'd expect a few studies to get p > 0.95 by chance, suggesting that vaccines prevent autism. If vaccines do prevent autism, then it would be even more likely to have p > 0.95.

1ChristianKl11y

Not all statistical analysis has to be preregistered. If the data has a trend that suggests vaccination might reduce autism I'm sure the researchers would run a test for it. If the study is underpowered to find an effect in that direction it's also likely to be underpowered to find an effect in the other direction. Can someone with more statistical expertise run a test to see whether the studies are underpowered to pick up effects in either direction?

-4NatPhilosopher11y

There is fairly extensive data (not published in the peer reviewed literature) that groups which are unvaccinated have far lower autism rates than the general public. UPI Reporter Dan Olmsted went looking for the autistic Amish. In a community where he should have found 50 profound autistics, he found 3. The first was an adopted Chinese girl who'd had vaccinations rushed before she was shipped from China and more here on the way to the adoptive parents. The second had been normal until developing classic autism symptoms within hours of being vaccinated. The third there was no information about. http://www.putchildrenfirst.org/media/e.4.pdf Olmsted continued his search for unvaccinated Amish with autism beyond that community, finding none for a long time, but eventually found a Doctor in Virginia who had treated 6 unvaccinated Amish people from various places with autism. 4 of them had very elevated levels of mercury. A telephone survey commissioned by the nonprofit group Generation Rescue compared vaccinated with unvaccinated boys in nine counties of Oregon and California [15]. The survey included nearly 12,000 households with children ranging in ages from 4 to 17 years, including more than 17,000 boys among whom 991 were described as completely unvaccinated. In the 4 to 11 year bracket, the survey found that, compared with unvaccinated boys, vaccinated boys were 155% more likely to have a neurological disorder, 224% more likely to have ADHD, and 61% more likely to have autism. For the older boys in the 11-17 year bracket, the results were even more pronounced with 158 % more likely to have a neurological disorder, 317% more likely to have ADHD, and with 112% more likely to have autism. [15] In addition to the Generation Rescue Survey, there are three autism-free oases in the United States. Most publicized are Amish communities, mainly studied in Ohio and Pennsylvania [16].The Amish are unique in their living styles in largely self-sustaining communities. They g

2wedrifid11y

He went looking for autistics in a community mostly known for rejecting Science and Engineering? It 'should' be expected that the rate of autism is the same as in the general population? That's... not what I would expect. Strong social penalties for technology use for many generations would be a rather effective way to cull autistic tendencies from a population.

-1NatPhilosopher11y

I don't reject the possibility there are other explanations for the observation that unvaccinated Amish have very low autism rates. I even offered one: that they also reject Glyphosate. However, when it turns out that the rare cases of Amish with autism that are found mostly turn out to be vaccinated, or have some very specific other cause obvious that's not present in the general population (high mercury), the case for vaccination being a cause becomes much much stronger. And when you realize that other groups of unvaccinated also have low autism rates, the case becomes stronger. And when you realize that injecting the aluminum into animal models causes behavioral deficits, and injecting vaccines into post-natal animals causes brain damage, in every study I've found, the case becomes stronger still. And when you discover that the safety surveys don't cite any empirical measurements whatsoever of the toxicity of injected aluminum in neo-nates, (or even injected aluminum in adults, for that matter), and don't generally address the issue of aluminum at all, and don't cite or rebut any of the many papers published in mainstream journals observing these things, or rebut or cite any of the half dozen or more epidemiological studies showing aluminum is highly correlated with autism, then I think you should conclude there is strong cognitive bias at work, if not worse.

0PhilGoetz11y

The Amish vary greatly from one place to another. Here in Mercer County, they don't grow much of their own food, and when they do, they can it. They do make their own milk, but they like fast food and packaged food. Storing ingredients without refrigeration, cooking fancy meals on a wood stove, and cleaning up after them with no hot running water, isn't so simple.

0DanielLC11y

Why are you responding to me? I just gave a possible explanation that I specifically said that I didn't believe. You could post this in the main discussion to give credence to the hypothesis of the publishing bias explanation. I could critique this if you want, although if you actually want to talk about whether or not vaccines cause autism I'd suggest posting in the open thread or starting your own post. This one is talking about publishing bias.

[-]John_Maxwell11y90

Presumably, with 95% confidence.

This seems like a big leap. 95% confidence means at least 95% confidence, right? So if I reject the "vaccines cause autism" hypothesis with p = 0.001, that makes me 95% confident and I publish?

-3DanielLC11y

There's a 5% chance of having at least 95% confidence if there's no correlation. If there's no correlation, p is a random number between zero and one. p = 0.001 would show that vaccines do cause autism. p = 0.999 would show that they prevent it.

[-]wedrifid11y90

They also have a bias from errors. Many articles have some fatal flaw that makes their results meaningless. If the distribution of errors is random, I think--though I'm not sure--that we should assume this bias causes regression towards an equal likelihood of positive and negative results.

I question this assumption. The distribution of errors that people will make in public communication tracks closely what they can expect to get away with doing. Errors (and non-errors, for that matter) that would result in social sanction will be reviewed more closely ...

[-]byrnema11y70

I don't think that it's necessarily suspicious in that, a priori, I wouldn't have a problem with 60 tests all being negative even though they're all only 95% confident.

The reason being, depending on the nature of the test, the probability of a false negative might indeed be 5% while the probability of a false positive could be tiny. Suppose this is indeed the case and let's consider the two cases that the true answer is either 'positive' or 'negative'.

(A) if the true conclusion is 'positive', any test can yield a negative with 5% probability. (this test wi...

8Douglas_Knight11y

No, that 5% is the probability of false positive, not the probability of false negative. Phil has the number he needs and uses it correctly.

1byrnema11y

Which 5%? No, "that" 5% is the probability from my cooked-up example, which was the probability of a false-negative. You're saying (and Phil says also in several places) that in his example the 5% is the probability of a false positive. I don't disagree, a priori, but I would like to know, how do we know this? This is a necessary component of the full argument that seems to be missing so far.

1byrnema11y

Another way of asking my question, perhaps more clearly, is: how do we know if the 60 considered studies were testing the hypothesis that there was a link or the hypothesis that there was not a link?

9Douglas_Knight11y

There is an asymmetry that makes it implausible that the null hypothesis would be that there is an effect. The null hypothesis has to be a definite value. The null hypothesis can be zero, which is what we think it is, or it could be some specific value, like a 10% increase in autism. But the null hypothesis cannot be "there is some effect of unspecified magnitude." There is no data that can disprove that hypothesis, because it includes effects arbitrarily close to zero. But that can be the positive hypothesis, because it is possible to disprove the complementary null hypothesis, namely zero. Another more symmetric way of phrasing it is that we do the study and compute a confidence interval, that we are 95% confident that the effect size is in that interval. That step does not depend on the choice of hypothesis. But what do we do with this interval? We reject every hypothesis not in the interval. If zero is not in the interval, we reject it. If a 10% increase is not in the interval, we can reject that. But we cannot reject all nonzero effect sizes at once.
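
The confidence-interval phrasing can be sketched in Python with a simple Wald interval for a difference in rates. The counts are invented for illustration and do not come from any of the studies.

```python
import math
from statistics import NormalDist

def risk_difference_ci(cases_a, n_a, cases_b, n_b, confidence=0.95):
    """Wald confidence interval for (rate in group A) - (rate in group B)."""
    p_a, p_b = cases_a / n_a, cases_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    diff = p_a - p_b
    return diff - z * se, diff + z * se

# Hypothetical counts, invented for illustration only.
low, high = risk_difference_ci(cases_a=52, n_a=10_000, cases_b=50, n_b=10_000)
print(f"95% CI for the difference in rates: [{low:.5f}, {high:.5f}]")

baseline = 50 / 10_000
hypotheses = (("no effect", 0.0),
              ("10% increase", 0.10 * baseline),
              ("100% increase", 1.00 * baseline))
for label, effect in hypotheses:
    verdict = "cannot reject" if low <= effect <= high else "reject"
    print(f"{label}: {verdict}")
```

Each specific effect size is checked against the same interval, which is the sense in which a study can reject particular values but never "all nonzero effects" at once.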

1byrnema11y

(I realize I'm confused about something and am thinking it through for a moment.)

3byrnema11y

I see. I was confused for a while, but in the hypothetical examples I was considering, a link between MMR and autism might be missed (a false negative with 5% probability) but isn't going to be found unless it was there (low false positive). Then Vaniver explains, above, that the canonical null-hypothesis framework assumes that random chance will make it look like there is an effect with some probability -- so it is the false positive rate you can tune with your sample size. I marginally understand this. For example, I can't really zoom out and see why you can't define your test so that the false positive rate is low instead. That's OK. I do understand your example and see that it is relevant for the null-hypothesis framework. (My background in statistics is not strong and I do not have much time to dedicate to this right now.)

8Vaniver11y

I think the answer to this is "because they're using NHST." They say "we couldn't detect an effect at the level that random chance would give us 5% of the time, thus we are rather confident there is no effect." But that we don't see our 5% false positives suggests that something about the system is odd.

2byrnema11y

OK, that sounds straightforward. How does one know that the 60 studies are these (rather than the others, e.g., studies that were designed to show an effect with 95% probability, but failed to do so and thus got a negative result)?

[-]private_messaging11y50

What would have happened to results that vaccines prevent autism?

1ChristianKl11y

They would have been highly cited academic papers and good for the researchers who made those findings.

3private_messaging11y

Yeah. I was asking a rhetorical question, actually.

[-]ChristianKl11y50

When it comes to studies of vaccine side effects there is one thing that is very worrying. When a new vaccine enters the market there is testing for side effects. Those studies actually do find side effects and the Centers for Disease Control should be a trustworthy source for reporting them.

It turns out different vaccines have quite different side effects. They didn't find any nausea, vomiting, diarrhea, or abdominal pain as a side effect in the Hepatitis A vaccine but 1 of 4 people who take the HPV--Cervarix vaccine get them. Maybe different vaccine work extremely d...

-2NatPhilosopher11y

It is generally believed that only something on the order of 1% of side effects reported to Doctors are reported by Doctors into the system, which would explain your last comment.

[-]ygert11y50

Stupid mathematical nitpick:

The chances of this happening are only .95 ^ 39 = 0.13, even before taking into account publication and error bias.

Actually, it is more correct to say that .95 ^ 39 = 0.14.

If we calculate it out to a few more decimal places, we see that .95 ^ 39 is ~0.135275954. This is closer to 0.14 than to 0.13, and the mathematical convention is to round accordingly.
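
For anyone who wants to check the rounding, here is a minimal Python sketch of the calculation. The variable names are mine, not from the original post, and the independence of the 39 studies is the post's assumption.

```python
# Probability that 39 independent studies, each with a 5% false-positive
# rate, all report "no link" when there really is no link.
n_studies = 39
p_all_negative = 0.95 ** n_studies

print(f"{p_all_negative:.9f}")     # value to nine decimal places
print(round(p_all_negative, 2))    # rounded to two decimal places
```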

[-]satt11y40

Did you save a list of the p-values reported in the 39 (or 43) studies you looked at? I wonder what I'd get if I aggregated them with Fisher's method.
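
A rough sketch of what that aggregation could look like, in case it helps. The p-values below are hypothetical placeholders (nobody in this thread has posted the real list), and the function name is my own. It relies on the fact that Fisher's statistic follows a chi-square distribution with 2k degrees of freedom under the joint null, for which the upper tail has a closed form.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) is chi-square distributed with
    2k degrees of freedom under the joint null, where k = len(p_values).
    For even degrees of freedom the upper tail has a closed form, so no
    external library is needed.
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    # P(chi2_{2k} > X) = exp(-X/2) * sum_{i < k} (X/2)^i / i!
    return math.exp(-half_x) * sum(
        half_x ** i / math.factorial(i) for i in range(k)
    )

# Hypothetical p-values, for illustration only (NOT taken from the studies).
example_p_values = [0.40, 0.62, 0.08, 0.91, 0.33]
print(fisher_combined_p(example_p_values))
```

This returns the usual upper-tail result, which is small when the p-values are collectively small. To ask whether the p-values are suspiciously large instead, one would presumably look at the other tail, i.e. one minus this value.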

[-]Azathoth12311y40

I understand that even a single study indicating a connection would immediately be seized on by anti-vaccination activists. (I've even seen them manage to take a study that indicated no connection, copy a graph in that study that indicated no connection, and write an analysis claiming it proved a connection.) Out there in the real world, maybe it's good to suppress any such studies. Maybe.

Well, in light of how the modern scientific process produces a bias against contrary views, the activists' seizing on any studies drawing contrary conclusions appears ... (read more)

[-]Manfred11y40

First thing, if you put something in your body, it has some effect, even if that effect is small. "No effect" results just rule out effects above different effect sizes (both positive and negative) with high probability, and there's no point talking about "a link" like it's some discrete thing (you sort of jump back and forth between getting this one right and wrong).

Second, different studies will rule out different effect sizes with 95% confidence - or to put it another way, at a given effect size, different studies will have different... (read more)

1PhilGoetz11y

My probability exercise was not about effect size. It was about the probability of all studies agreeing by chance if there is in fact no link, and so the 95% confidence is what is relevant. Again, not relevant to the point I'm making here.

7Douglas_Knight11y

Not relevant to the main point that you're making, but relevant to your parenthetical:

[-]lmm11y30

Vaguely interesting, but a p=.13 result in pretty much any field gets a big fat "meh" from me. Sometimes p=.13 results happen. I'd want stronger evidence before I started to suspect bias.

3gwern11y

Excess significance and publication bias in general are so common as to be the default; p=0.13 is pretty bad looking (with a single-tailed test, that'd be below the 0.10 threshold Ioannidis suggests for publication bias tests due to their low power to detect bias).

[-]Shmi11y20

Simple statistics, but eye-opening. I wonder if gwern would be interested enough to do a similar analysis, or maybe he already has.

[-]gwern11y270

Goetz is re-inventing a meta-analytic wheel here (which is nothing to be ashamed of). It certainly is the case that a body of results can be too good to be true. To Goetz's examples, I'll add acupuncture, but wait, that's not all! We can add everything to the list: "Do Certain Countries Produce Only Positive Results? A Systematic Review of Controlled Trials" is a fun** paper which finds

In studies that examined interventions other than acupuncture ['all papers classed as “randomized controlled trial” or “controlled clinical trial”'], 405 of 1100 abstracts met the inclusion criteria. Of trials published in England, 75% gave the test treatment as superior to control. The results for China, Japan, Russia/USSR, and Taiwan were 99%, 89%, 97%, and 95%, respectively. No trial published in China or Russia/USSR found a test treatment to be ineffective.

'Excess significance' is not a new concept (fun fact: people even use the phrase 'too good to be true' to summarize it, just like Goetz does) and is a valid sign of bias in whatever set of studies one is looking at, and as he says, you can treat it as a binomial to calculate the odds of n studies failing to hit their quota of 5% fa... (read more)

5IlyaShpitser11y

Science is not a magic bullet against bad incentives. I am more optimistic, we are getting a lot done despite bad incentives.

[-]gwern11y230

Science is not a magic bullet against bad incentives.

But none of the incentives seem particularly strong there. It's not offensive to any state religion, it's not objectionable to local landlords, it's not a subversive creed espoused by revolutionaries who want to depose the emperor. The bad incentives here seem to be small bureaucratic ones along the line of it being easier to judge academics for promotion based on how many papers they publish. If genuine science can't survive that and will degenerate into cargo cult science when hit by such weak incentives...

8IlyaShpitser11y

People respond strongly to this in the West also -- "least publishable units", etc. This is almost mystical wording. There is bad science in the West, and good science in the East. I would venture to guess that the crappy state of science in e.g. China is just due to the weak institutions/high corruption levels in their society. If you think you can get away with dumping plastic in milk, a little data faking is the least of your problems. As that gets better, science will get better too.

9gwern11y

And yet, at least clinical trials fail here, and we don't have peer-review rings being busted or people throwing bales of money out the window as the police raid them for assisting academic fraud. (To name some recent Chinese examples.) Again, what incentives? If science cannot survive some 'weak institutions' abroad, which don't strike me as any worse than, say, the Gilded Age in America (and keep in mind the relative per capita GDPs of China now and, say, the golden age of German science before WWII), how long can one expect it to last? It's gesturing to society-wide factors of morality, values, and personality, yes, since it doesn't seem to be related to more mundane factors like per capita GDP. Japan is a case in point here. Almost as bad as China on the trial metric despite over a century of Western-style science and a generally uncorrupt society which went through its growing pains decades ago.

8Azathoth12311y

That explains China and Russia/USSR, it doesn't explain Japan and Taiwan.

5private_messaging11y

The study was looking at English texts, not Russian, Chinese, or Japanese texts. edit: a study on foreign language bias in German speaking countries. And that's Germans, for whom it is piss easy to learn English (compared to Russians, Chinese, or Japanese).

1gwern11y

Why did you omit the part where a third of the sample was published in both English and German, and hence weakens the bias? (That is comparable to the overlap for Chinese & English publications.)

3private_messaging11y

There's something that just didn't get conveyed: English language. That paper, with its idiot finding, was looking at the studies downloaded from Medline and presumably published in English, or at least with an English abstract (the search was done for English terms and no translation efforts were mentioned). As long as researchers retain freedom to either write their study up in English or not, there's going to be an additional publication-in-a-very-foreign-language bias. With regards to acupuncture, one thing that didn't happen is the Soviet Union being full of acupuncture centres and posters about the awesomeness of acupuncture everywhere on the walls, something that would have happened if there was indeed such a high prevalence of positive findings in locally available literature.

3V_V11y

As a rule of thumb, I would say that any research published after the early 1990s in a language other than English is most likely crap.

2gwern11y

Why do you think it changed, and in the early 1990s specifically? (The original study I posted only examined '90s papers and so couldn't show any time-series like that, so it can't be why you think that.)

5V_V11y

I suppose that before the 1990s respectable Soviet scientists published primarily in Russian.

3gwern11y

Yes, but it's not sufficient to explain the results. To use your German example, even a doubling of significance rates in vernacular vs English doesn't give one ~100% success rate in evaluating treatments since their net success rate across the 3 categories is going to be something like 40%. Nor is publishing in English going to be a rare and special event, regardless of how hard English is to learn, because publishing in high-impact English-language journals is part of how Chinese universities are ranked and people are rewarded. Uh huh. But acupuncture is not part of the Russian cultural heritage. What I do see instead is, to name one example (what with not being a Russian and familiar with the particular pathologies of Russian science), tons of bogus nootropics studies (they come up on /r/nootropics periodically as people discover yet another translated abstract on Pubmed of a sketchy substance cursorily tested in animals), because interest in human enhancement is part of Russian culture. Unsurprisingly, pseudo-medicine and pseudo-science will vary by region - which is, after all, the point of comparing acupuncture studies in the West to studies in East Asia! (If there were millions of acupuncture fanatics in Russia and the UK and the USA just like in China/Korea/Japan, then what would we learn, exactly, from comparing studies?) We expect there to be regional differences and that the West will be less committed & more disinterested than East Asia, closer to the ground truth, and hence the difference gives us a lower bound on how big the biases are.

1private_messaging11y

Publication in general doesn't have to be rare and special, only the publication of negative results has to be uncommon. People just care less about publishing negative results and prefer to publish positive results; if there's X amount of effort for publication in a foreign language, and the positive studies already use up all of the X, no X is left for negative results... There are other issues, e.g. how many of those tests were re-testing simple, effective FDA-approved drugs and such? Also, for the Soviet Union, there would be a certain political advantage in finding no efficacy of drugs that are expensive to manufacture or import. And one big aspect of Soviet backwardness was always the disbelief that something actually works. Even assuming that the publications always found whatever the experimenter wanted to find, it wouldn't explain that predominantly an effect is found. What of the chemical safety studies? There's a very strong bias to fail to disprove the null hypothesis. Yet your paper somehow found a ridiculously high positive rate for acupuncture. The way I think it would work: first, it's very difficult to blind acupuncture studies, and inadequately blinded experiments should find a positive result from the placebo effect; second, because that's the case, nobody really cares about that effect; and third, de facto the system did not result in the construction of acupuncture centres. I haven't really noticed nootropics being a big thing, and various rat maze studies were and are largely complete crap anyway. To the point that the impact of the experimenter's gender only got discovered recently. edit: also if we're looking at Russia from 1991 to 1998, that was the time when scientists and other such government employees were literally not getting paid their wages. I remember that time; my parents were not paid for months at a time, and they were reselling shampoo on the side to get some cash.

5gwern11y

I realize that, and I've already pointed out why the difference in rates is not going to be that large & that your cite does not explain the excess significance in their sample. Doesn't matter that much. Power, usually quite low, sets the upper limit to how many of the results should have been positive even if we assume every single one was testing a known-efficacious drug (which hypothesis raises its own problems: how is that consistent with your claims about the language bias towards publishing cool new results?) So? I don't care why the Russian literature is biased, just that it is. Yes, but toxicology studies being done by industry are not aimed at academic publication, and the ones aimed at academic publication have the usual incentives to find something and so are part of the overall problem. Huh? The paper finds that acupuncture study rates vary by region. USA/Sweden/Germany 53%/59%/63%, China/Japan/Taiwan 100%, etc. How much have you looked? There's plenty of acupuncture centres in the USA despite a relatively low acupuncture success rate. Does a fish notice water? But fine, maybe you don't, feel free to supply your own example of Russian pseudoscience and traditional medicine. I doubt Russian science is a shining jewel of perfection with no faults given its 91% acupuncture success rate (admittedly on a small base). Not sure that's a good example, as Wikipedia seems to disagree about homebrew phage therapy not being applied: https://en.wikipedia.org/wiki/Phage_therapy#History Anyway, how do you see the unseen? Unless someone has done a large definitive RCT, how does one ever prove that a result was bogus? Nobody is ever going to take the time and resources to refute those shitty animal experiments with a much better experiment. Most scientific findings never get that sort of black-and-white refutation; they just get quietly forgotten and buried, and even the specialists don't know about them. Most bad science doesn't look like Lysenko. Or look at evidence

1A1987dM11y

Well, humans do notice air some of the time. (SCNR.)

-4private_messaging11y

If you want to persist in your mythical ideas regarding western civilization by postulating whatever you need and making shit up, there's nothing I or anyone else can do about it. Your study is making a more specific claim than mere bias in research, it's claiming bias in one particular direction. The point is that the SU was, mostly, using antibiotics (once production was set up, i.e. from some time after WW2). Well, and there weren't plenty in the Soviet Union despite a supposedly higher success rate. If you don't know the correct rate you can't tell which specific rate is erroneous. It's not realistically possible to construct a blind study of acupuncture, so, unlike, say, homoeopathy, it is a very shitty measure of research errors. I really doubt that 91% of Russian language acupuncture studies published in the Soviet Union found a positive effect (I dunno about 1991-1998 Russia, it was fucked up beyond belief at that time), and I don't know how many studies should have found a positive effect (followed by a note that more adequate blinding must be invented to study it properly). And we know that whatever was the case, there was no Soviet abandonment of normal medicine in favour of acupuncture - the system somehow worked out ok in the end.

2gwern11y

That's not a reply to what I wrote. Yes, that's what a bias is. A systematic tendency in one direction. As opposed to random error. And before that, they were using phages despite apparently pretty shaky evidence it was anything but a placebo. That said, pointing out the systematic bias of Russian science (among many other countries, and I'm fascinated, incidentally, how the only country you're defending like this is... your own. No love for Korea?) does not commit me to the premise that phages do or do not work - you're the one who brought them up as an example of how excellent Russian science is, not me. How many are there now? Shouldn't you have looked that up? Difference in rates is prima facie evidence of bias, due to the disagreement. If someone says A and someone else says not-A, you don't need to know what A actually is to observe the contradiction and know at least one party is wrong. Yes it is. And naturally, you have not looked for anything on the topic, you just doubt it. Strawman. No country engages in 'abandonment of normal medicine' - if you go to China, do you only find acupuncturists? Of course not. The problem is that you find acupuncturists sucking up resources in dispensing expensive placebos and you find that the scientific community is not strong enough to resist the cultural & institutional pressures and find that acupuncture doesn't work, resulting in real working medicine being intermeshed with pseudomedicine. Fortunately, normal medicine (after tremendous investments in R&D and evidence-based medicine) currently works fairly well and I think it would take a long time for it to decay into something as overall bad as pre-modern Western medicine was; I also think some core concepts like germ theory are sufficiently simple & powerful that they can't be lost, but that would be cold comfort in the hypothetical cargo cult scenario ('good news: doctors still know what infections are and how to fight epidemics; bad news: everything else they do

0A1987dM11y

Unless A contains indexicals that point to different things in the two cases. (Maybe Asian acupuncturists are better than European ones, or maybe East Asians respond better to acupuncture than Caucasians for some reason, or...) (I'm not saying that this is likely, just that it's possible.)

0private_messaging11y

I was referring to your other comment. That's the one I know most about, obviously. I have no clue about what's going on in China, Korea, or Japan. Look, it doesn't matter if phages work or don't work! The treatment, in favour of which there would be strong bias, got replaced with another treatment, which would have been biased against. Something that wouldn't have happened if science systematically failed to work in such an extreme and ridiculous manner. I keep forgetting that I really really need to spell out any conclusions when arguing with you. It's like you're arguing that a car is missing the wheels but I just drove here on it. Besides, the 90%+ proportion of positive results is also the case in the west (also, in the past we had stuff like lobotomy in the west)

5gwern11y

So why do you think your defense would not apply equally well (or poorly) to them? What's the phage of China? Oh wow. What a convincing argument. 'Look, some Russians once did this! Now they do that! No, it doesn't matter if they were right or wrong before or after!' Cool. So does that mean I get to point to every single change of medical treatment in the USA as evidence it's just peachy there? 'Look, some Americans once did lobotomy! Now they don't! It doesn't matter if lobotomies work or don't work!' You didn't drive shit anywhere. That's on a different dataset, covering more recent time periods, which, as the abstract says, still shows serious problems in East Asia (compromised by relatively small sample: trying to show trends in 'AS' using 204 studies over 17 years isn't terribly precise compared to the 2627 they have for the USA) with the latest data being 85% vs 100%. And 100% significance is a ceiling, so who knows how bad the East Asian research has actually gotten during the same time period Western numbers continue to deteriorate...

-2wedrifid11y

You are trying to pull a "everyone and me is against you" stunt against Gwern? Do you have any idea how dumbfoundingly absurd this would sound to most of those of the class "anyone else" who happens to see this exchange?

5private_messaging11y

Ohh and to add. One big 'thing' in the Soviet Union was research in phage therapy, hoping to replace antibiotics with it, but somehow they didn't end up replacing antibiotics with homebrew phage therapy, something that I'd expect to happen if they were simply finding what they wanted to find, and otherwise not doing science. To summarize, I see this allegation of some grave fault but I fail to see the consequences of this fault. Nor did they end up having all the workers take some 'nootropics' that don't work, or anything likewise stupid.

0dvasya11y

Well, perhaps a bit too simple. Consider this. You set your confidence level at 95% and start throwing a coin. You observe 100 tails out of 100. You publish a report saying "the coin has tails on both sides at a 95% confidence level" because that's what you chose during design. Then 99 other researchers repeat your experiment with the same coin, arriving at the same 95%-confidence conclusion. But you would expect to see about 5 reports claiming otherwise! The paradox is resolved when somebody comes up with a trick using a mirror to observe both sides of the coin at once, finally concluding that the coin is two-tailed with a 100% confidence. What was the mistake?

7Douglas_Knight11y

I don't know if the original post was changed, but it explicitly addresses this point:

1Vaniver11y

The actual situation is described this way: I have a coin which I claim is fair: that is, there is equal chance that it lands on heads and tails, and each flip is independent of every other flip. But when we look at 60 trials of the coin flipped 5 times (that is, 300 total flips), we see that there are no trials in which either 0 heads were flipped or 5 heads were flipped. Every time, it's 1 to 4 heads. This is odd- for a fair coin, there's a 6.25% chance that we would see 5 tails in a row or 5 heads in a row in a set of 5 flips. To not see that 60 times in a row has a probability of only 2.1%, which is rather unlikely! We can state with some confidence that this coin does not look fair; there is some structure to it that suggests the flips are not independent of each other.
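
The two figures in this coin example can be checked with a few lines of Python. This is only a sketch with names of my choosing, and it assumes, as the example does, that the 60 trials are independent.

```python
flips_per_trial = 5
n_trials = 60

# Chance that a single trial of a fair coin comes up all heads or all tails.
p_extreme = 2 * 0.5 ** flips_per_trial

# Chance that none of the independent trials is all heads or all tails.
p_no_extreme = (1 - p_extreme) ** n_trials

print(p_extreme)      # the 6.25% figure
print(p_no_extreme)   # roughly the 2.1% figure
```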

1Caspian11y

One mistake is treating 95% as the chance of the study indicating two-tailed coins, given that they were two-tailed coins. More likely it was meant as the chance of the study not indicating two-tailed coins, given that they were not two-tailed coins. Try this: You want to test if a coin is biased towards heads. You flip it 5 times, and consider 5 heads as a positive result, 4 heads or fewer as negative. You're aiming for 95% confidence but have to get 31/32 = 96.875%. Treating 4 heads as a possible result wouldn't work either, as that would get you less than 95% confidence.
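
A small sketch of the arithmetic behind this five-flip test, assuming a fair coin under the null hypothesis. The function name is hypothetical.

```python
from math import comb

n_flips = 5

def false_positive_rate(min_heads):
    """P(at least min_heads heads in n_flips flips of a fair coin)."""
    favourable = sum(comb(n_flips, k) for k in range(min_heads, n_flips + 1))
    return favourable / 2 ** n_flips

# Compare "5 heads counts as positive" with "4 or more heads counts as positive".
for min_heads in (5, 4):
    alpha = false_positive_rate(min_heads)
    print(min_heads, alpha, 1 - alpha)  # threshold, false-positive rate, confidence
```

Requiring all 5 heads gives a confidence of 31/32, while also accepting 4 heads drops it to 26/32, which is why the threshold cannot be loosened without falling below 95%.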

1The_Duck11y

This doesn't seem like a good analogy to any real-world situation. The null hypothesis ("the coin really has two tails") predicts the exact same outcome every time, so every experiment should get a p-value of 1, unless the null-hypothesis is false, in which case someone will eventually get a p-value of 0. This is a bit of a pathological case which bears little resemblance to real statistical studies.

1dvasya11y

While the situation admittedly is oversimplified, it does seem to have the advantage that anyone can replicate it exactly at a very moderate expense (a two-headed coin will also do, with a minimum amount of caution). In that respect it may actually be more relevant to real world than any vaccine/autism study. Indeed, every experiment should get a pretty strong p-value (though never exactly 1), but what gets reported is not the actual p but whether it is above .95 (which is an arbitrary threshold proposed once by Fisher who never intended it to play the role it plays in science currently, but merely as a rule of thumb to see if a hypothesis is worth a follow-up at all.) But even the exact p-values refer to only one possible type of error, and the probability of the other is generally not (1-p), much less (1-alpha).

0Shmi11y

I don't see a paradox. After 100 experiments one can conclude that either the confidence level was set too low, or the papers are all biased toward two-tailed coins. But which is it?

1dvasya11y

(1) is obvious, of course--in hindsight. However changing your confidence level after the observation is generally advised against. But (2) seems to be confusing Type I and Type II error rates. On another level, I suppose it can be said that of course they are all biased! But, by the actual two-tailed coin rather than researchers' prejudice against normal coins.

-1wedrifid11y

Neglecting all of the hypotheses which would result in the mirrored observation which do not involve the coin being two tailed. The mistake in your question is the "the". The final overconfidence is the least of the mistakes in the story. Mistakes more relevant to practical empiricism: Treating ">= 95%" as "= 95%" is a reasoning error, resulting in overtly wrong beliefs. Choosing to abandon all information apart from the single boolean is a (less serious) efficiency error. Listeners can still be subjectively-objectively 'correct', but they will be less informed.

1dvasya11y

Hence my question in another thread: Was that "exactly 95% confidence" or "at least 95% confidence"? However when researchers say "at a 95% confidence level" they typically mean "p < 0.05", and reporting the actual p-values is often even explicitly discouraged (let's not digress into whether it is justified). Yet the mistake I had in mind (as opposed to other, less relevant, merely "a" mistakes) involves Type I and Type II error rates. Just because you are 95% (or more) confident of not making one type of error doesn't guarantee you an automatic 5% chance of getting the other.
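
To make the last point concrete, here is a hedged sketch with an entirely hypothetical test: 20 flips, declaring the coin "biased towards heads" on 15 or more heads, and a 60%-heads coin as the alternative. None of these numbers come from the thread; they only illustrate that controlling one error rate says little about the other.

```python
from math import comb

def tail_prob(n, threshold, p_heads):
    """P(heads >= threshold) in n flips of a coin with P(heads) = p_heads."""
    return sum(
        comb(n, k) * p_heads ** k * (1 - p_heads) ** (n - k)
        for k in range(threshold, n + 1)
    )

# Hypothetical design: flag the coin as biased if we see >= 15 heads in 20 flips.
n, threshold = 20, 15

alpha = tail_prob(n, threshold, 0.5)     # Type I: a fair coin gets flagged
beta = 1 - tail_prob(n, threshold, 0.6)  # Type II: a 60%-heads coin is not flagged

print(alpha, beta)
```

With these assumed numbers the Type I rate is below 5% while the Type II rate is far above it.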

[-]Jonathan_Graehl11y10

Yes, that's suspicious. Good instinct. I'm sure there's some bias against publishing a marginally-significant result that's got a low (outside the framework of the paper's statistical model) prior. I'd bet some of the unlucky ones got file-drawered, and others (dishonestly or not) kept on collecting more data until the noise (I presume) was averaged down.

However, you might be missing that on an iso-P contour, false positives have diminishing effect size as sample size increases.

[-][anonymous]11y00

For those who don't know what a case control or cohort study is:

"...Essentially, for a cohort study, you start at the point of exposure, and then follow individuals to see who develops the outcome. In a retrospective cohort, you find information that has recorded participants' prior exposures. In a case control study, you start with a group of cases (those with the disease) and controls (those without disease) and then measure exposures. These two designs are similar, but they differ on the starting point (outcome or exposure)." - AL - UoM

[This comment is no longer endorsed by its author]

[-][anonymous]11y00

This bothered me. What are the odds, supposing there is no link between X and Y, of conducting 60 studies of the matter, and of all 60 concluding, with 95% confidence, that there is no link between X and Y?

What? No. Just... no. You can't say "Because P('result'|H0)=.05, P(~'result'|H1)=.05".

[This comment is no longer endorsed by its author]

[+]NatPhilosopher11y-50
