Imagine Bob's graph of health over time is sin(t) and his graph of exercise is cos(t), with negative values corresponding to being a couch potato. At first glance, health and exercise have nothing to do with each other, since their correlation is zero. At second glance, maybe exercise determines health with a π/2 lag. At third glance, maybe it determines health with a 3π/2 lag and a minus sign. At fourth glance, maybe exercise determines the derivative of health. You'll find similar problems with any bounded functions over a long timespan, not just sin and cos.
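A quick numerical check of the zero-correlation claim, sampling both curves over many full periods (a minimal sketch; the variable names are just the analogy from above):

```python
import numpy as np

# Sample Bob's two curves over many full periods.
t = np.linspace(0, 1000 * 2 * np.pi, 1_000_000)
health = np.sin(t)
exercise = np.cos(t)

r = np.corrcoef(health, exercise)[0, 1]
print(round(r, 6))  # ~0 despite the exact functional relationship
```

The sample correlation comes out indistinguishable from zero, even though each curve determines the other exactly.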


Instead of correlation fishing, let's investigate causation like real scientists, with graphs. Smoking causes tar in the lungs, tar is correlated with cancer, and for any fixed value of "tar" the values of "smoking" and "cancer" are conditionally uncorrelated. So there cannot be a trait that causes both smoking and cancer independently: the causation must go through tar, which we know is caused by smoking. Therefore smoking causes cancer.
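The conditional-uncorrelation signature is easy to reproduce in a toy linear model of the chain (a sketch with made-up coefficients, not real epidemiology):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical linear chain smoking -> tar -> cancer,
# with independent noise entering at each node (coefficients made up).
smoking = rng.normal(size=n)
tar = smoking + 0.1 * rng.normal(size=n)
cancer = tar + 0.1 * rng.normal(size=n)

def partial_corr(x, y, z):
    """Correlate x and y after linearly regressing z out of both."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

marginal = np.corrcoef(smoking, cancer)[0, 1]
conditional = partial_corr(smoking, cancer, tar)
print(round(marginal, 3), round(conditional, 3))  # strong marginal link, ~0 given tar
```

Marginally, smoking and cancer are almost perfectly correlated; holding tar fixed, the residual correlation vanishes, which is the signature the argument relies on.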

But imagine there's a trait that gives you a "set point" of X tar, continually adjusting your desire to smoke and compensating for external factors, and also gives you X risk of cancer. Then the above will still hold, but quitting smoking won't help you.
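The set-point scenario can be simulated too, and it produces the same statistical signature as the chain while having zero causal effect of smoking (again a sketch; every coefficient is invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical "set point" model: a hidden trait fixes the tar target and
# the cancer risk; smoking adjusts to cancel any external push, so tar
# always lands on the trait's setting.
trait = rng.normal(size=n)
external = rng.normal(size=n)              # outside pressures on smoking
smoking = trait - external                 # compensates for the external push
tar = smoking + external                   # = trait exactly: the thermostat wins
cancer = trait + 0.1 * rng.normal(size=n)  # depends only on the trait

def partial_corr(x, y, z):
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

marginal = np.corrcoef(smoking, cancer)[0, 1]
conditional = partial_corr(smoking, cancer, tar)
print(round(marginal, 3), round(conditional, 3))
# Same signature as "smoking -> tar -> cancer", yet cancer's structural
# equation never mentions smoking, so quitting changes nothing.
```

Both models give a clear marginal correlation and near-zero correlation given tar, yet only in the first one would intervening on smoking change cancer.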


Forget graphs. Let's use the gold standard of determining causation: a randomized controlled experiment. Does skydiving increase risk of death? Yes: if we force a random half of all people to skydive, more of them will die than in the control group.

But now take a random half of all people and force them not to skydive. Those who wouldn't have skydived anyway will be unaffected, but those who would have might change their behavior. What if they switch to something more dangerous, like basejumping, because getting the same thrill from the safer option, skydiving, is no longer available? Then this experiment also leads to more deaths, and our question about causation remains unanswered.

If we can't trust correlations, graphs, or experiments - what can we trust? What, in the end, is causation?


I think this overstates the problems.  Causation is a model.  All models are wrong.  Some models are useful.

For #1, it's well-understood that pure correlation does not tell us anything about causality.  Perhaps health determines exercise with a π/2 lag, rather than the reverse.  For most real-world phenomena, the graphs aren't that regular nor repeating, so there are more hints about direction and lag.  Additionally, there are often "natural experiments", where an intervention changes one, and we can then see the directional correlation, which is a pretty strong hint to causation.

#2 is a case of incomplete rather than incorrect causation.  This is true for any causal model - one can ask further upstream.  X causes Y.  What causes X?  What causes the cause of X?  In the pure reduction, the only cause of anything is the quantum state of the universe.

#3 is a communication failure - we forgot to say "compared to what" when we say "increases" risk of death.  If we instead said that "intentionally jumping out of a plane carries a 0.0060% risk of death", that's clearer.  It doesn't matter that crossing the street is more dangerous.  

For most real-world phenomena, the graphs aren’t that regular nor repeating, so there are more hints about direction and lag.

Yeah, though I think "at fourth glance" stands as it is: in the long run any bounded function will have zero correlation with its derivative.
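A quick sketch of why that holds for any bounded differentiable f (my gloss on the claim):

```latex
\frac{1}{T}\int_0^T f(t)\,f'(t)\,dt \;=\; \frac{f(T)^2 - f(0)^2}{2T} \;\longrightarrow\; 0
\quad (T \to \infty),
\qquad
\frac{1}{T}\int_0^T f'(t)\,dt \;=\; \frac{f(T) - f(0)}{T} \;\longrightarrow\; 0,
```

so both the mean of f′ and the time-average of f·f′ vanish in the long run, and with them the covariance of f with its derivative (hence the correlation, as long as the variances don't degenerate).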

#3 is a communication failure—we forgot to say “compared to what” when we say “increases” risk of death.

Compared to the control group. People often measure the effect of variable X on variable Y by randomly dividing a population into experiment and control groups, intervening on X in the experiment group, and measuring the difference in Y between groups. Well, I tried to show an example where intervening on X in either direction will increase Y.

A few minor comments. Regarding I, it's known that the direction of (or lack of) an arrow in a generic two-node causal graph is unidentifiable, although there's some recent work solving this in restricted cases.

Regarding II, if I understand correctly, the second sub-scenario is one in which we'd have a graph that looks like the following DAG: Trait → Smoking → Tar, with a separate edge Trait → Cancer.

What I'm confused about is that if we condition on a level of tar in a big population, we'll still see a correlation between smoking and cancer via the trait, assuming there's independent noise feeding into each of these nodes. More concretely, presumably people will smoke different amounts based on some unobserved factors outside this trait. So, at least at certain levels of tar in the lungs, we'll have people who do and don't have the trait, meaning there'll be a correlation between smoking and cancer even within tar-level sub-populations. That said, in the purely deterministic simplified scenario, I see your point.

Alternatively, I'm pretty sure applying the front-door criterion (explanation) would properly identify the zero causal effect of smoking on cancer in this scenario (again assuming all the relationships aren't purely deterministic).
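The front-door adjustment can be checked on a discretised, acyclic, noisy version of the scenario (a sketch; every probability here is made up, and the variable names just follow the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Hypothetical DAG version of the set-point story:
# Trait -> Smoking -> Tar, and Trait -> Cancer. Tar does NOT cause cancer.
trait = rng.random(n) < 0.5
smoking = rng.random(n) < np.where(trait, 0.8, 0.2)
tar = rng.random(n) < np.where(smoking, 0.9, 0.1)
cancer = rng.random(n) < np.where(trait, 0.7, 0.1)

# Naive comparison: smokers get cancer far more often (trait confounds).
naive = cancer[smoking].mean() - cancer[~smoking].mean()

# Front-door adjustment: P(c|do(s)) = sum_t P(t|s) * sum_s' P(c|t,s') P(s')
def front_door(s_val):
    total = 0.0
    for t_val in (False, True):
        p_t_given_s = (tar[smoking == s_val] == t_val).mean()
        inner = 0.0
        for s2 in (False, True):
            cell = cancer[(tar == t_val) & (smoking == s2)]
            inner += cell.mean() * (smoking == s2).mean()
        total += p_t_given_s * inner
    return total

effect = front_door(True) - front_door(False)
print(round(naive, 3), round(effect, 3))  # naive is large, effect is ~0
```

The naive conditional comparison shows a big gap, while the front-door estimate of P(cancer | do(smoking)) comes out near zero, matching the claim that smoking has no causal effect in this graph.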

Yeah. Thanks for the front door link, I'll take some time learning this!

Maybe to reformulate a bit, in the second sub-scenario my idea was that each person has a kind of "tar thermostat", which sets the desired level of tar and continually adjusts your desire to smoke. If some other factor makes you smoke more or less, it will compensate until your level of tar again matches the "thermostat setting". And the trait that determines someone's "thermostat setting" would also determine their cancer risk. Basically the system would counteract any external noise, making the statistician's job harder (though not impossible, you're right).

The third scenario, about skydiving, hints at a similar idea. The "thermostat" there is the person's desire for thrill, so if you take away skydiving, it will try to find something else.

Oh I see, yeah, this sounds hard. The causal graph wouldn't be a DAG because it's cyclic, in which case there may be something you can do, but the "standard" machinery (read: what you'd find in Pearl's Causality) won't help you, unless I'm forgetting something.

An apparently real hypothesis that fits this pattern is that people take more risks / do more unhealthy things the more they know healthcare can heal them / keep them alive.

The thermostat pattern is everywhere, from biology to econ to climate etc. I learned about it years ago from this article and it affected me a lot.