People kept noticing that blood donors were healthier than non-donors. Could giving blood be good for you, perhaps by removing excess iron? Perhaps medieval doctors practicing blood-letting were onto something? Studies (1998, 2001) found that this does seem to be a real correlation, so you see articles like "Men Who Donate Blood May Reduce Risk Of Heart Disease."

While this sounds good, and it's nice when helpful things turn out to be healthy, the evidence is not very strong. When you notice A and B happen together it may be that A causes B, B causes A, or some hidden C causes both A and B. We may have good reasons to believe A might cause B, but it's very hard to rule out a potential C. If you instead intentionally manipulate A and observe what happens to B, then you can actually see how much of an effect A has on B.
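To make the hidden-C case concrete, here is a minimal simulation sketch. All the probabilities are made up for illustration, and the function name `simulate` is hypothetical. In this toy world A never affects B, yet the observational comparison still shows a large gap, while assigning A by coin flip makes the gap disappear.

```python
import random

def simulate(n, randomize, seed=0):
    """Return P(B | A=1) - P(B | A=0) in a toy world where A never affects B."""
    rng = random.Random(seed)
    counts = {0: [0, 0], 1: [0, 0]}  # a -> [people with this A, of whom have B]
    for _ in range(n):
        c = rng.random() < 0.5                       # hidden cause C
        if randomize:
            a = rng.random() < 0.5                   # A assigned by coin flip
        else:
            a = rng.random() < (0.8 if c else 0.2)   # C makes A more likely
        b = rng.random() < (0.6 if c else 0.1)       # B depends on C only
        counts[int(a)][0] += 1
        counts[int(a)][1] += int(b)
    rate = {a: with_b / total for a, (total, with_b) in counts.items()}
    return rate[1] - rate[0]

observational = simulate(200_000, randomize=False)
randomized = simulate(200_000, randomize=True)
print(f"observational difference: {observational:.3f}")
print(f"randomized difference:    {randomized:.3f}")
```

With these assumed numbers the observational difference comes out near 0.3 and the randomized one near zero, which is the pattern the soy formula example below follows.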

For example, people observed (2003) that infants fed soy-based formula were more likely to develop peanut allergies. So they recommended that "limiting soy milk or formula in the first 2 years of life may reduce sensitization." Here A is soy formula, B is peanut allergy, and we do see a correlation. When intentionally varying A (2008, n=620), however, B stays constant, which kind of sinks the whole theory. A likely candidate for a third cause, C, was a general predisposition to allergies: those infants were more likely to react to cow's-milk formula and so be given soy-based ones, and they were also more likely to react to peanuts.

To take another example, based on studies (2000, 2008, 2010) finding a higher miscarriage rate among coffee drinkers, pregnant women are advised to cut back their caffeine consumption. But a randomized controlled trial (2007, n=1207) found that people randomly assigned to drink regular or decaf coffee were equally likely to have miscarriages. [EDIT: jimrandomh points out that I misread the study and it didn't actually show this. Instead it was too small a study to detect an effect on miscarriage rates.] A potential third cause (2012) here is that lack of morning sickness is associated with miscarriage (2010) and when you're nauseated you're less likely to drink a morning coffee. This doesn't tell us the root of the problem (why would feeling less sick go along with miscarriages?) but it does tell us cutting back on caffeine is probably not helpful.

Which brings us back to blood donation. What if instead of blood donation making you healthy, healthier people are more likely to donate blood? There's substantial screening involved in becoming a blood donor, plus all sorts of cultural and economic factors that could lead to people choosing to donate blood or not, and those might also be associated with health outcomes. This was noted as a potential problem in 2011 but it's hard to test this with a full experiment because assigning people to give blood or not is difficult, you have to wait a long time, and the apparent size of the effect is small.

One approach that can work in places like this is to look for a "natural experiment," some way in which people might already be being divided into appropriate groups. A recent study (2013, n=12,357+50,889) took advantage of the situation where screening tests sometimes give false positives that disqualify people. These are nearly random, and give us a pool of people who are very similar to blood donors but don't quite make it to giving blood. When comparing the health of these disqualified donors to actual donors the health benefits vanish, supporting the "healthy donor hypothesis."
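As a rough sketch of why this comparison is informative, here is a toy simulation of the "healthy donor hypothesis." It is not the study's method or data: the probabilities, the group names, and the function `simulate_donors` are all assumptions for illustration. Donating has no effect on health in this toy world, healthier people are simply more likely to volunteer, and false positives strike volunteers at random.

```python
import random

def simulate_donors(n=400_000, seed=1):
    """Bad-outcome rate per group in a world where donating does nothing."""
    rng = random.Random(seed)
    totals = {"non_volunteer": 0, "donor": 0, "false_positive": 0}
    bad = dict.fromkeys(totals, 0)
    for _ in range(n):
        healthy = rng.random() < 0.5
        # Healthier people are more likely to show up to donate (assumed).
        volunteers = rng.random() < (0.6 if healthy else 0.2)
        if not volunteers:
            group = "non_volunteer"
        elif rng.random() < 0.05:       # random false positive on screening
            group = "false_positive"
        else:
            group = "donor"
        # The outcome depends on underlying health only, never on donating.
        bad_outcome = rng.random() < (0.05 if healthy else 0.20)
        totals[group] += 1
        bad[group] += bad_outcome
    return {g: bad[g] / totals[g] for g in totals}

rates = simulate_donors()
for group, rate in rates.items():
    print(f"{group:15s} {rate:.3f}")
```

Under these assumptions donors look clearly healthier than non-volunteers, but no healthier than the falsely disqualified volunteers, which is the same shape of result the study reports.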

This isn't to say you should never pay attention to correlations. If your tongue starts peeling after eating lots of citric acid you should probably have less in the future, and the discovery (1950) that smoking causes lung cancer was based on an observation of correlations. Negative results are also helpful: if we don't find a correlation between hair color and musical ability then it's unlikely that one causes the other. Even in cases where correlational studies only provide weak evidence, however, they're so much easier than randomized controlled trials that we still should do them if only to find problems to look into more deeply with a more reliable method. But if you see a news report that comes down to "we observed people with bad outcome X had feature Y in common," it's probably not worth trying to avoid Y.

I also posted this on my blog.


If only there were a snappy five-word phrase to this effect that got printed in pretty much every statistics textbook ever written...


There is, but there's still value to showing real world examples where 'correlation is not causation' and showing the real explanations (which in the absence of hard randomized trials, can come off as mindless nitpicking and motivated cognition by people who just don't want to accept the correlation*).

* I'm worried about this because looking back over the years of discussion on the dual n-back mailing list, I get the feeling that a lot of people wrote off the methodological criticisms of the n-back studies as basically theoretical 'perfect is the enemy of better' motivated-cognition nitpicking, even about the soundest criticisms like active control groups, until I finally sat down and did the meta-analysis showing the active control group criticism was right on the money.

What do the stats experts here think of "adjusting for confounders"?

I notice I often see correlational studies I would expect to be confounded by something (for example, a study showing that people who go to the doctor more often are less likely to get heart attacks might be confounded by income, since rich people can afford more doctor visits and also have different diets). Then the study says "We adjusted for confounders including income, class, race, and education, and the effect remained robust". Then they do a controlled experiment investigating the same thing and the effect disappears.

Is there any conventional wisdom in the stats community about how far to trust these kinds of adjustments?

This is a good thing to read: (chapter 22 in Judea's Festschrift). In particular the contrast between Fig. 1 and Fig. 2 is relevant.

What is going on here is that what we care about is some causal parameter, for instance "average causal effect (ACE) : E[Y | do(A=1)] - E[Y | do(A=0)]."

This parameter is sometimes identified, and sometimes not identified.

If it is NOT identified, it is simply not a function of the observed data. So any sort of number you get by massaging the observed data will not equal the ACE. Naturally, if we try to randomize (which will get us the ACE directly) we will not reproduce what our observational data massage got us.

If it IS identified, then it is a matter of which functional of the observed data equals the ACE. Maybe if we have treatment A, outcome Y, and a set of baseline confounders, the correct functional is:

\sum_{c} ( E[Y | A=1,c] - E[Y | A=0,c] ) p(c)

This is what "adjusting for confounders" means.
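To make that functional concrete, here is a small exact calculation, assuming the simplest graph where a single binary confounder C causes both A and Y and is the only confounder. The distributions and the helper names (`p_c`, `p_a1_given_c`, `e_y`, `naive_mean`) are invented for illustration; by construction the true ACE is 0.1.

```python
# Hypothetical distributions for a graph C -> A, C -> Y, A -> Y (all binary).
p_c = {0: 0.5, 1: 0.5}             # p(c)
p_a1_given_c = {0: 0.2, 1: 0.8}    # p(A=1 | c)

def e_y(a, c):
    """E[Y | A=a, C=c]; the treatment adds 0.1 in every stratum of C."""
    return 0.1 + 0.1 * a + 0.4 * c

def p_a_given_c(a, c):
    return p_a1_given_c[c] if a == 1 else 1 - p_a1_given_c[c]

def naive_mean(a):
    """E[Y | A=a], which weights strata by p(c | a) instead of p(c)."""
    p_a = sum(p_a_given_c(a, c) * p_c[c] for c in p_c)
    return sum(e_y(a, c) * p_a_given_c(a, c) * p_c[c] for c in p_c) / p_a

naive = naive_mean(1) - naive_mean(0)
adjusted = sum((e_y(1, c) - e_y(0, c)) * p_c[c] for c in p_c)

print(f"naive difference:    {naive:.2f}")
print(f"adjusted difference: {adjusted:.2f}")
```

The naive contrast mixes in the effect of C, while the adjusted one recovers the 0.1 that was built in. This only works because the assumed graph makes C a sufficient adjustment set.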

However, maybe that's not the right functional at all! Maybe you have a mediating variable M between A and Y, and the right functional is:

\sum_{m} \sum_{a'} E[Y | m,a'] p(a') p(m | A=1) - \sum_{m} \sum_{a'} E[Y | m,a'] p(a') p(m | A=0)

How do we tell what functional is right? We have to agree on what the right causal graph is for our problem, and then consult an algorithm that will either give us the right functional for the ACE given the graph, or tell us the ACE is not identifiable given the graph we got. This algorithm was what a part of my thesis was about.

There is one important historical example of people ignoring graph structure to their peril. In epidemiology people worry about something called the "healthy worker survivor effect." Say we have workers who work with asbestos, which is a bad chemical. We want to get an idea of how bad it is by running a study. The longer you work with asbestos, the worse your outcome. However, if you are sick, you will probably terminate employment early, which means you will not get more exposure to asbestos. So people who get more asbestos are also healthier. So it might seem based on observational data that even though we suspect asbestos is very bad for you, it seems to have a protective effect on workers. This is the "healthy worker survivor effect."

If we were to draw a simple graph with two time slices for this, we would get:

A1 -> H -> A2 -> D

where A1 and A2 are asbestos exposure, H is health status after A1, and D is death (or not). H and D are confounded by a common cause we do not see: H <- U -> D. A1 determines H. If H is bad enough, it will cause the worker to leave, and thus set A2 to 0. A1 and A2 determine D.

What we want here is E[D | do(a1,a2)]. The point is that blindly adjusting for H is incorrect, because of the particular graph structure where H arises. H is a standard confounder for A2, but is NOT a standard confounder for A1 (H is what is called a "time-varying confounder.") So you need to use a particular form of adjustment called "g-computation":

\sum_{h} E[D | a1,a2,h] p(h | a1)

If you use the standard adjustment

\sum_{h} E[D | a1,a2,h] p(h)

you will get a biased answer. Jamie Robins wrote a giant 120 page paper in 1986 (that no one ever reads) on (among many many other things) this precise issue:

(edit: the reason you get bias with standard adjustment is because A1 -> H <- U is in your graph. If you condition on H, A1 and U become dependent: this is the so-called Berkson's bias, also known as selection bias, collider stratification bias, or the "explaining away" phenomenon. So standard adjustment creates a non-causal path A1 -> H <- U -> D between a treatment and the outcome which accounts for part of the magnitude of the effect, and thus creates bias.)
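Here is a small exact check of the two formulas on a toy version of the asbestos graph. Everything numeric is an assumption for illustration, as are the helper names. The sketch uses binary variables, adds the edge A1 -> D (since A1 and A2 both determine D), and makes sick workers leave with high probability rather than always, so that E[D | a1,a2,h] is defined in every stratum. U is summed out before either formula is applied, so both only see the observed data.

```python
from itertools import product

P_U1, P_A1 = 0.5, 0.5                               # p(U=1), p(A1=1)
def p_h1(a1, u): return 0.1 + 0.3 * a1 + 0.4 * u    # p(H=1 sick | a1, u)
def p_a2_1(h): return 0.1 if h else 0.9             # sick workers mostly leave
def p_d1(a1, a2, u): return 0.05 + 0.1 * a1 + 0.1 * a2 + 0.5 * u

def bern(p1, x):
    return p1 if x == 1 else 1 - p1

# Observed joint over (a1, h, a2, d): the unobserved U is marginalized away.
NAMES = ("a1", "h", "a2", "d")
obs = {}
for u, a1, h, a2, d in product((0, 1), repeat=5):
    p = (bern(P_U1, u) * bern(P_A1, a1) * bern(p_h1(a1, u), h)
         * bern(p_a2_1(h), a2) * bern(p_d1(a1, a2, u), d))
    obs[(a1, h, a2, d)] = obs.get((a1, h, a2, d), 0.0) + p

def prob(**fixed):
    return sum(p for key, p in obs.items()
               if all(key[NAMES.index(k)] == v for k, v in fixed.items()))

def e_d(a1, a2, h):                                 # E[D | a1, a2, h]
    return prob(a1=a1, a2=a2, h=h, d=1) / prob(a1=a1, a2=a2, h=h)

def g_formula(a1, a2):                              # weights by p(h | a1)
    return sum(e_d(a1, a2, h) * prob(a1=a1, h=h) / prob(a1=a1) for h in (0, 1))

def standard(a1, a2):                               # weights by p(h)
    return sum(e_d(a1, a2, h) * prob(h=h) for h in (0, 1))

def truth(a1, a2):                                  # E[D | do(a1, a2)], uses U
    return sum(bern(P_U1, u) * p_d1(a1, a2, u) for u in (0, 1))

for name, f in (("truth", truth), ("g-formula", g_formula), ("standard", standard)):
    print(f"{name:10s} {f(1, 1) - f(0, 0):.3f}")
```

In this toy setup the g-formula contrast matches the interventional truth exactly, while the standard adjustment understates the harm of exposure, which is the direction the healthy worker survivor effect would predict.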

What happens in practice is if you try to get the ACE from observed data, you will have too much confounding to get identification by any method (adjustment or anything else, really). So you need some sort of extra "trick." Maybe you can find a good instrumental variable. Or maybe you have a natural experiment. Or maybe you had really good data collection that really observed most important confounders. Or maybe the treatment variable only has observed parents (this happens in observational longitudinal studies sometimes). If you just blindly use covariate adjustment without thinking about your causal structure you will generally get garbage.

I'd very much like to know more about this too. I distinctly recall at least three separate papers by separate authors in different domains that mentioned "adjusting for confounders" on things that I thought would be confounded, so I thought "Oh, it's fine then, they did their thinking properly and there is an effect!". (At times like these I wish I remembered things or kept notes on research papers a fraction as diligently as Yvain or gwern...)

Then I read further because now I'm very interested in why, and upon finding the details of their adjustments, in small print under one of the many tables of Annex F, I discover that the "adjustment" was that they guessed at an effective rate for the confounders and plugged that percentage in their math. "Oh, there must be about 15% more rich people than poor people who go to the doctor for any given condition, so let's adjust the results by that amount and see if things work!"

(I'm exaggerating for dramatization, these guess numbers are rarely "hidden" in this way and rarely this important, but even tiny examples of such piss me off and I get angry at the paper for it every time.)

Is there any conventional wisdom in the stats community about how far to trust these kinds of adjustments?

In my experience there's no general answer other than the observation that if people did NOT adjust for confounders, it's usually a very bad sign. But if they did, you actually have to go read the paper and form your own opinion on whether their adjustments look reasonable, whether they did them correctly, whether they picked the right confounders (or just grabbed whatever characteristics they had handy), etc.

Typically people don't adjust properly because it's against their incentives to do so.

It's pretty easy to abuse controlling factors.

As a reader you don't always know how many different ways the people who made a study tried to adjust for different confounders till they got their result.


What I would like to see is a "meta" study examining the rate at which causal claims originally supported only by correlational studies are later found to be spurious when tested experimentally. It's hard to calibrate one's skepticism towards correlational studies without knowing what the relevant base rates are.

There are a few, but not many such studies, for obvious reasons. I list at least one in and IIRC, the correlation->causation rate was <10%.

"Contradicted and Initially Stronger Effects in Highly Cited Clinical Research" (Ioannidis 2005) is helpful:

Results: Of 49 highly cited original clinical research studies, 45 claimed that the intervention was effective. Of these, 7 (16%) were contradicted by subsequent studies, 7 others (16%) had found effects that were stronger than those of subsequent studies, 20 (44%) were replicated, and 11 (24%) remained largely unchallenged. Five of 6 highly-cited nonrandomized studies had been contradicted or had found stronger effects vs 9 of 39 randomized controlled trials (P = .008). Among randomized trials, studies with contradicted or stronger effects were smaller (P = .009) than replicated or unchallenged studies although there was no statistically significant difference in their early or overall citation impact. Matched control studies did not have a significantly different share of refuted results than highly cited studies, but they included more studies with “negative” results.

In the history of medicine, 5 valuable facts about the causes of disease have been established by correlation: smoking => lung cancer, heart disease; sun => skin cancer; HPV is STD; alcohol => rectal cancer. The claim with the 6th most correlation evidence is that alcohol protects from heart disease, but this is controversial.

I'll leave it to you to compute the denominator.

Also, the number 5 is itself controversial. Some people put it at 2.

I think it's way higher. Some off the top of my head (with a little reading to confirm details):

  • Child delivery by doctors in a hospital correlated with puerperal fever. Refined to a correlation between child delivery by someone who had recently performed an autopsy and puerperal fever. Experimentally testing handwashing (though not blind) confirmed effect, doctors wash their hands, dying in childbirth is now less common.

  • A student with anemia symptoms turns out to have some strangely shaped blood cells. This initial association is expanded by people looking at blood of other people with anemia, and several others also have these elongated red cells. Eventually we get enough of a correlation that we've discovered sickle cell anemia.

In fact I would go as far as to say that most of our medical knowledge comes from correlations, often relatively obvious ones like "getting run over by a car increases your chance of death".

There may still be something here, though: the kinds of studies we see with bad correlations being misleading, and the examples you give of successful ones, are generally small effects compared to the amount of time involved. Can we characterize better this area where correlations are especially suspect?

In fact I would go as far as to say that most of our medical knowledge comes from correlations, often relatively obvious ones like "getting run over by a car increases your chance of death".

Well, we have to be careful about definitions here. People generally don't talk about correlations when there is a known underlying mechanism.

I guess technically the phrase should look like this: Correlation by itself without known connecting mechanisms or relationships does not imply causation.

Correlation by itself without known connecting mechanisms or relationships does not imply causation.

The Bayesian approach would suggest that we assign a causation-credence to every correlation we observe. Of course detecting confounders is very important since it provides you with updates. However, a correlation without known connecting mechanisms does imply causation. In particular, it does so probabilistically. A Bayesian updater would prefer talking about credences in causation, which can be shifted upwards and downwards. It would be a (sometimes dangerous) simplification to deal in our map with discrete values like "just correlation" and "real causation". Such a simplification may be of use as a heuristic in everyday life, but I'd suggest not overgeneralizing it.

Correlation by itself without known connecting mechanisms or relationships does not imply causation

This does separate out the "getting run over by a car" case, but it doesn't handle the handwashing one. Germ theory hadn't been invented yet and Semmelweis's proposed mechanism was both medically unlikely and wrong. With sickle cell anemia it kind of handles it, in that you can think of all sorts of ways weirdly shaped blood cells might be a problem, but I think it's a stretch to say that the first people looking at the blood and saying "that's weird, it's probably the problem" understood the "connecting mechanisms or relationships".

More generally, correlation is some evidence and if it's not expected someone should probably look more closely to try to understand why we're seeing it, which generally means some kind of controlled experiment.

More generally, correlation is some evidence

Well, to start with correlation is data. This data might be used to generate hypotheses. Once you have some hypotheses you can start talking about evidence and yes, correlation can be promoted to the rank of evidence supporting some hypothesis.

I don't think any of that is controversial. The only point is that pure correlation without anything else is pretty weak evidence, that's all. However if you want to use it to generate hypotheses, sure, no problems with it whatsoever.

Are you using Semmelweis as an example of the medical community properly assessing and synthesizing data?

I'm using it as an example of a valuable fact about disease being established by correlation.

Your paragraph speaks about correlation providing a hypothesis while the "fact about disease" was established by an experimental intervention study.

I think we're getting into a discussion about what it means for something to be established as a fact, which doesn't sound very useful.

Can we characterize better this area where correlations are especially suspect?

Epidemiological studies of diets (that is, health consequences of particular patterns of food intake) are all based on correlations and the great majority of them are junk.

These days epi people mostly use g methods which are not junk (or rather, give correct answers given assumptions they make, and are quite a bit more sophisticated than just using conditional probabilities). How much epi do you know?

edit: Correction: not everyone uses g methods. There is obviously the "changing of the guard" issue. But g methods are very influential now. I also agree there is a lot of junk in data analysis. But I think the "junk" issue is not always (or even usually) due to the fact that the study was "based on correlations" (you are not being precise about what you mean here, but I interpreted you to mean that "people are not using correct methods for getting causal conclusions from observational data.")

Not much. I've read a bunch of papers and some critiques... And I'm talking not so much about the methods as about the published claims and conclusions. Sophisticated methods are fine, the issue is their fragility. And, of course, you can't correct for what you don't know.

Thinking about why things do or don't belong on your list, I think they basically have to be very harmful. If they're good we do an experiment and find out, but if they're bad we just declare it established with a correlation. For example I think Thalidomide could go on your list, in that the evidence was basically "people who took Thalidomide were far more likely to have babies with major birth defects." Probably lead (paint, fuel)? Our sense of what's a safe dose of radiation?

sun => skin cancer

I don't know whether the last one is valuable. It might have resulted in a lot of people getting less sun than is healthy for them because of Vitamin D production.

And I gave both positive and negative effects of alcohol. So what?

By "valuable" I mean an easily manipulable causal mechanism that explains a relatively large amount of the population variance of health. I don't mean that it has actually been manipulated, let alone manipulated correctly. And I certainly don't mean that this is all we know about medicine. We understand vitamin D and inebriation because of experiments.

Um. Really?

I'm no expert, but this sounds way off. So, we know essentially nothing about how to avoid disease, apart from these 5 or 6 (or 2) causes?

There are lots of ways to gain knowledge other than by looking at correlations. For example you can run experiments. There was a guy named Edward Jenner who was interested in avoiding smallpox. He ran an experiment and it worked. The world learned how to avoid smallpox and there were no correlations in sight...


At the age of 13, Jenner was apprenticed to Dr. Ludlow in Sodbury. He observed that people who caught cowpox while working with cattle were known not to catch smallpox. He assumed a causal connection. The idea was not taken up by Dr. Ludlow at that time. After Jenner returned from medical school in London, a smallpox epidemic struck his home town of Berkeley, England. When he advised the local cattle workers to be inoculated, the farmers told him that cowpox prevented smallpox. This confirmed his childhood suspicion, and he studied cowpox further, presenting a paper on it to his local medical society.

Saying "He ran an experiment and it worked" hides the initial correlational observation that led him to try that experiment.

the initial correlational observation

It seems to me that you want to call all observational data "correlations".

I think so. If you want to separate them how would you say "people who get pustules from working with cattle are less likely to catch smallpox" differs from "people who give blood are less likely to have heart disease"?

It's hard to calibrate one's skepticism towards correlational studies without knowing what the relevant base rates are.

The relevant base rates are likely to be much different between different subject domains.

Speaking of which, I'm helping out with a randomized controlled trial on veg advocacy. If you're in the Boston area and have some time tomorrow, Thursday November 11th, I'd love some help getting people to fill out surveys about their eating habits. I'm planning to be in Harvard Square from 9am to 10am and 12:15pm to 1:15pm, but there will be other people surveying too.

This doesn't tell us the root of the problem (why would feeling less sick go along with miscarriages?) but it does tell us cutting back on caffeine is probably not helpful.

Suppose everyone is subject to the same amount of environmental toxin, and some people react more strongly (by morning sickness, which vomits out those toxins). Then we would expect that the people less averse to toxins would have more toxins, which would lead to more miscarriage.

In general, morning sickness seems to be good for pregnancy, but I don't think I've seen any evidence that it's adding good instead of removing bad (but, it would just be weird if it added good, so evidence may have been overlooked).

There's a very interesting study that finds that parents with more morning sickness have babies with better neurodevelopmental outcomes (20% had IQ > 130, compared to 7% of those with less morning sickness).

The researchers suggest that for some reason the hormone involved in producing morning sickness also aids child development. It's possible, but a better explanation is the one in your first paragraph - everyone gets exposed to toxins but people vary in how diligent they are in eliminating them, and extra diligence helps.

Has there been a check on whether the foods which are more likely to cause morning sickness are also more likely to be bad for fetuses?

I'm rather skeptical of this, because it seems that morning sickness eliminates nutrients as much or more than toxins. If the mother can't keep much down, or only a few select foods, is it likely that the baby is getting needed nutrients? And wouldn't this effect be as big as the toxin eliminating one, but in the opposite direction?

I'm rather skeptical of this, because it seems that morning sickness eliminates nutrients as much or more than toxins.

Harm and benefit are not equally balanced. If you snarf down some hemlock, the minimal calories you gain from digesting it do not offset the poison you've ingested. One bad meal can kill you (the Buddha died of some bad pork, legendarily), but skipping one meal certainly won't kill you. You would have to skip a lot of meals to equal one bad poisoning episode.

Is there some indication that morning sickness is related to ingesting harmful toxins, now or in the evolutionary past? From my experience (close second hand only), morning sickness doesn't seem to be just increased sensitivity to certain foods, but rather a general nausea that makes many nutritious foods difficult to keep down.

"You would have to skip a lot of meals to equal one bad poisoning episode."

Such as one or two a day for 2-3 months?

Morning sickness can get that bad, but it's rare.

‘That's got to have something to do with the fact that donors get blood tests for free’ was the very first thing I thought upon reading the first sentence of this post.

WRT blood donation: for cardiac ischemia, sure, the effect is small. For myocardial infarction the effect size was huge. My impression when reviewing the evidence is that a lot of sloppy reasoning is being done in the studies.