Some counterexamples:
Two independent Wiener processes (Brownian motion) have an expected correlation of zero, but the distribution of correlations is wide. You can easily see correlations of magnitude above 0.8 between two independent realisations of the standard Wiener process, independently of sample size. The problem is the strong autocorrelation of the Wiener process, which drastically broadens the distribution of the correlation coefficient.
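A quick way to see this (my own sketch, not from the original comment; the walk length, trial count, and threshold are arbitrary choices) is to simulate many pairs of independent Gaussian random walks and look at how spread out their sample correlations are:

```python
import random

def pearson_r(xs, ys):
    # Sample Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def random_walk(n, rng):
    # Discrete approximation of a standard Wiener process
    pos, path = 0.0, []
    for _ in range(n):
        pos += rng.gauss(0, 1)
        path.append(pos)
    return path

rng = random.Random(0)
n, trials = 1000, 200
rs = [pearson_r(random_walk(n, rng), random_walk(n, rng)) for _ in range(trials)]

frac_large = sum(1 for r in rs if abs(r) > 0.5) / trials
print(f"fraction of independent pairs with |r| > 0.5: {frac_large:.2f}")
```

Even with 1,000 time steps per walk, a sizeable fraction of independent pairs show |r| above 0.5, and lengthening the walks doesn't help: the autocorrelation means the effective number of independent observations doesn't grow with the sample size.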
The Moon is slowly receding from the Earth; the Andromeda galaxy is approaching the Milky Way. Therefore the respective distances between them have a large negative correlation. But there is no causal connection. More generally, any two time series, each of which exhibits a monotonic trend over time, will have a substantial correlation.
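As an illustration (the distances, rates, and noise levels below are made-up stand-ins, not real astronomical figures): any rising trend plotted against any falling trend produces a strong negative correlation with no causal link at all.

```python
import random

def pearson_r(xs, ys):
    # Sample Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(42)
years = range(200)
# Hypothetical units: one quantity slowly rising, one slowly falling, plus noise
moon_distance = [384_400 + 0.5 * y + rng.gauss(0, 10) for y in years]
andromeda_distance = [2_500_000 - 3.0 * y + rng.gauss(0, 10) for y in years]

r = pearson_r(moon_distance, andromeda_distance)
print(f"r = {r:.3f}")  # strongly negative, despite no causal connection
```

Detrending (regressing out the time trend from each series before correlating) removes exactly this artefact.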
Control systems often exhibit large correlations between variables with no direct causal connection (and near-zero correlation between variables that do have a direct causal connection). Control systems are ubiquitous in the life sciences, social sciences, and technology. The causal relationships within them are cyclic, putting them outside the scope of Reichenbach's principle and Pearl-style causal analysis.
(1) and (2) are well-known to people who analyse time series data, and there are standard methods (prewhitening and detrending respectively) for dealing with them. While (3) gets a mention from time to time, none of the papers I have seen on extending causal analysis to cyclic systems have (IMO) made much progress.
The causal relationships within them are cyclic, putting them outside the scope of Reichenbach's principle and Pearl-style causal analysis.
When we start introducing time into the mix, I think it can be helpful to be somewhat more particular about how we define variables in a causal setting. When you have limited-time-resolution measurements of a quantity over time, you can view it as a single "variable" that is involved in a causal cycle. But you could also "unroll" this cycle in time and view the quantity at each time point as a separate variable. If you do this, it seems to me that Pearl-style causal analysis generally holds. Even if X and Y have a monotonic pattern over time, with prior time points of each series causing its later time points but no causal interaction between the two series, the X(t)s and Y(t)s aren't correlated with each other, and the pattern over time is fully explained by the causal structure. This makes sense in a Pearl-style analysis because you would have an SCM for the series that looks something like X(t) = f(X(t-1), X(t-2), ...), which treats the X(t)s as separate variables. The correlation over time of X and Y isn't a correlation between variables in this model; it involves mixing different variables together. If we treat them separately, the Pearl-style analysis still works and makes no prediction errors, and in fact has the advantage of being robust to potential interventions.
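Here's a small simulation of that unrolling (my own sketch; the drift and noise parameters are arbitrary). Two series each follow X(t) = X(t-1) + drift + noise, with no interaction between them. Mixing time points within one realisation produces a large "correlation over time", but treating X(t) and Y(t) at a fixed t as separate variables sampled across independent realisations shows the near-zero correlation the SCM predicts:

```python
import random

def pearson_r(xs, ys):
    # Sample Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def trending_series(n, drift, rng):
    # X(t) = X(t-1) + drift + noise: each point is caused only by the
    # series' own past, never by the other series
    x, path = 0.0, []
    for _ in range(n):
        x += drift + rng.gauss(0, 1)
        path.append(x)
    return path

rng = random.Random(7)
n_time, n_runs = 50, 500
xs = [trending_series(n_time, 1.0, rng) for _ in range(n_runs)]
ys = [trending_series(n_time, 1.0, rng) for _ in range(n_runs)]

# Mixing time points within a run: a large spurious "correlation over time"
within_run = sum(pearson_r(xs[i], ys[i]) for i in range(20)) / 20

# Treating X(t) and Y(t) at fixed t as separate variables across runs: near zero
t = 25
across_runs = pearson_r([x[t] for x in xs], [y[t] for y in ys])

print(f"within-run r (mean over 20 runs): {within_run:.2f}")
print(f"across-run r at t={t}: {across_runs:.2f}")
```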
While (3) gets a mention from time to time, none of the papers I have seen on extending causal analysis to cyclic systems have (IMO) made much progress.
Although I won't claim to understand all the claims and concepts fully, I've found this paper to be interesting and helpful in this regard, and to me some of its concepts seem to have connections to the perspective I offer above.
Pearl-style SCMs assume that every single node in a graph is ontologically independent, which makes unrolled models as suggested not particularly great.
From a paper co-authored by Pearl himself:
The problem with using structural causal models is that the language of structural models is simply not expressive enough to capture certain intricate relationships that are important in causal reasoning. The ontological commitment of these models is to facts alone, assignments of values to random variables, much like propositional logic. Just as propositional logic is not a particularly effective tool for reasoning about dynamic situations, it is similarly difficult to express dynamic situations (with objects, relationships, and time) in terms of structural causal models.
( https://commonsensereasoning.org/2005/hopkins.pdf )
I haven't been active in causality research for about 5 years, but I'm not aware of any good solutions to the time problem. I do know there are proposals for models that make improvements for causality involving sets of related variables, e.g. plate models. I think our own work on counterfactual probabilistic programming has a pretty strong basis, although the philosophy is fairly abridged in the paper.
tl;dr: The basic version of this is Reichenbach's principle and is well-known. A lot of holes show up under a more advanced lens.
More precisely, if things are correlated, there exists a relatively short causal chain linking those things, with confidence one minus the p-value of the correlation.
Other than the p-value part (would like to see your reasoning there -- p-values are not probabilities, so this reads like a type error to me), this is Reichenbach's principle ( https://plato.stanford.edu/entries/physics-Rpcc/ ): a correlation between A and B implies that either A causes B, B causes A, or there's a common cause for both. (Or you are conditioning on something caused by both; c.f. Berkson's paradox.)
Reichenbach's principle is trivially true in Pearl-style causal networks, but philosophers argue that it's false for the causal network of the whole universe. I haven't found the short versions of the arguments convincing, but you can read about them in the linked SEP article.
There are a few things to keep in mind:
Thanks for a long and detailed comment! tl;dr - I wish I'd kept some of the more detailed footnotes from an earlier draft of the essay.
the p-value part... type error
My phrasing there was insufficiently precise. You're right to call out that (1- p-value) isn't the probability of coincidence, because that's ignoring the prior. I should have gone with something vaguer like "with the evidentiary strength you would expect based on the p-value of the correlation".
I do more or less interpret p-values as probabilities, but in a world of theory rather than the real world - the summary I generally give is "probability of getting a result this extreme or more under the null hypothesis", does that still seem like a type error?
The basic version of this is Reichenbach's principle and is well-known
I came across this SEP article while researching the essay, thinking surely I couldn't be the first to think of the idea. Ultimately I cut the section about it because a) I found the SEP article fairly long and confusing and thought I could explain it more simply without referencing it and b) having discussed this point with dozens of people, including many with degrees in math and physics, exactly zero have ever brought up Reichenbach's common cause, which led me to believe it wasn't very well known (I'd guess at least 2 OOMs fewer people are familiar with that compared to the classic injunction "correlation does not imply causation"). That said, you've convinced me to add the footnote referencing Reichenbach's common cause back in, thank you.
As the size of a causal network grows, the set of correlations grows far faster than the set of causal relationships, until almost all correlations become spurious.
This is another point featured in an earlier version of the essay, inspired by the spurious correlations charts. Correlations between thousands of variables do indeed grow extremely quickly, but to me this is just an argument for correctly adjusting for multiple comparisons: pretty much all the p-values there are reported as <0.01, but with tens of thousands of comparisons you'd of course find a ton of those by chance and need to adjust for that. My claim is that very few of those correlations would remain interesting after such an adjustment.
set of nucleotides used by humans and by jellyfish is explained by an extremely long causal chain.
This is the point about dominoes in footnote 4, I can necker-cube between the two views of it being an extremely long causal chain or a very short one, "they share common ancestry". I find the shorter view generally easier to reason about and it's the one I use most of the time.
https://people.math.ethz.ch/~peterbu/Files/Manuscripts/strong-faithfulness-aos.pdf
Can you recommend any non-technical summary of or examples of non-faithful networks? The closest semantic match I found after a couple iterations of search was this paper, but it's denser than I prefer. The core point
To dive into the weeds a bit: the phrasing "with the evidentiary strength you would expect based on the p-value of the correlation" works, but the issue with p-values is much stronger than "that's ignoring the prior." There's another type error there. p-values and priors do not mix. A p-value is a supremum over probabilities from a space of hypotheses. Suppose you have two possible null hypotheses: one generates your data with probability 10%, but you assign 0.000000000001% confidence to this hypothesis. The other generates your data with probability 1%, and you assign 99.999999999999% confidence to this hypothesis. Then you want your p-value to be approximately 0.01. But it's actually 0.1. Sorry. p-values are fundamentally frequentist, not Bayesian, and frequentism rejects the idea that you can express confidence as a probability.
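To make the arithmetic explicit (same numbers as above):

```python
# Two candidate null hypotheses, the probability each assigns to data at least
# this extreme, and a (Bayesian) confidence assigned to each hypothesis
tail_prob = {"H1": 0.10, "H2": 0.01}
confidence = {"H1": 1e-14, "H2": 1.0 - 1e-14}

# A Bayesian would average the tail probabilities by confidence: ~0.01
averaged = sum(confidence[h] * tail_prob[h] for h in tail_prob)

# The p-value for the composite null is a supremum, ignoring confidence: 0.1
p_value = max(tail_prob.values())

print(round(averaged, 4))  # 0.01
print(p_value)             # 0.1
```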
having discussed this point with dozens of people, including many with degrees in math and physics, exactly zero have ever brought up Reichenbach's common cause
Not surprised. The mathematics of causality is still not part of a standard stats curriculum. I can't fully blame them -- I dove deep into this area in my first year of grad school, in 2015, and concluded the field is still fairly primitive. But a disproportionate number of people here are familiar with it. (Indeed, I was excited to dive into causality in grad school when I saw an excuse to do so precisely because of my exposure to the field's existence through LessWrong.)
This is the point about dominoes in footnote 4, I can necker-cube between the two views of it being an extremely long causal chain or a very short one, "they share common ancestry". I find the shorter view generally easier to reason about and it's the one I use most of the time.
Indeed, just as I could describe the causal chain linking humans and jellyfish in a few words. I would like there to be a formal way to render both DNA and dominoes as a short causal chain. But I don't have one.
Can you recommend any non-technical summary of or examples of non-faithful networks? The closest semantic match I found after a couple iterations of search was this paper, but it's denser than I prefer. The core point
Think this got cut off.
A classic example from Pearl's 2009 book: A and B are fair 0/1 coins, and C is their xor. Then the sets {A, C} and {B, C} each have pairwise independency, even though there are causal links A->C and B->C.
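The example is small enough to check by exact enumeration (for 0/1 variables, zero covariance is equivalent to independence, so covariance suffices here):

```python
from itertools import product

# Joint distribution of two fair coins A, B and their xor C:
# four equally likely outcomes (a, b, a ^ b)
outcomes = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]

def cov(i, j):
    # Exact covariance between coordinates i and j of the joint distribution
    n = len(outcomes)
    mi = sum(o[i] for o in outcomes) / n
    mj = sum(o[j] for o in outcomes) / n
    return sum((o[i] - mi) * (o[j] - mj) for o in outcomes) / n

A, B, C = 0, 1, 2
print(cov(A, C), cov(B, C))  # 0.0 0.0 despite the causal links A -> C, B -> C
print(cov(A, B))             # 0.0: A and B are independent by construction
```

So the causal structure is invisible to pairwise correlations: the dependence only shows up when all three variables are considered jointly.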
I'm using 'imply' in an empirical rather than logical sense
I feel like this is using the word somewhat differently than is meant by the phrase you are discussing. I've always interpreted "correlation doesn't imply causation" to mean that if X and Y are correlated you can't necessarily say that X -> Y or Y -> X (probably based on some prior about the direction like timing), not that correlation is somehow completely unrelated to causation.
Congrats on always interpreting the phrase that way! Of the folks I've discussed this with in person, every one of them had the interpretation that things can be correlated without either one causing the other or a common cause.
I would be surprised if "correlation doesn't imply causation" would have become so popular if most people interpreted it as strictly expanding the options from X -> Y or Y -> X to include a common cause (and I might now ask a stats-professor friend to run a poll to see which of these interpretations most people hold in their intro stats class), but I think your interpretation is correct. I certainly don't want to attack a strawman version of the phrase, but if >30% of people interpret it that way I'll conclude it's not a strawman.
(Cross-posted from my personal website)
Epistemic status: I've become slowly convinced of this broad point over the last handful of years, but pretty much every time I mention it in conversation people think it's wrong. I wrote this in part looking for counterexamples (and to explain the point clearly enough that targeted counterexamples can be identified).
Perhaps the single most-quoted phrase about statistics[1] is that 'correlation does not imply causation.' It's a phrase I've spoken hundreds of times, even after the ideas that resulted in this essay were broadly developed. It's often a useful educational tool for beginner-level students, and it's convenient as a shorthand description of a failure of scientific reasoning that's disturbingly common: just because A correlates with B, it doesn't mean that A causes B. The classic example is that ice cream sales correlate with violent crime rates, but that doesn't mean ice cream fuels crime — and of course this is true, and anyone still making base-level errors is well-served by that catchphrase 'correlation does not imply causation'.
The thing is, our catchphrase is wrong — correlation does in fact imply[2] causation. More precisely, if things are correlated, there exists a relatively short causal chain linking those things, with confidence one minus the p-value of the correlation. Far too many smart people think the catchphrase is literally true, and end up dismissing correlation as uninteresting. It's of course possible for things to be correlated by chance, in the same way that it's possible to flip a coin and get 10 heads in a row[3], but as sample size increases this becomes less and less likely; that's the whole point of calculating the p-value when testing for correlation. In other words, there are only two explanations for a correlation: coincidence or causation.
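As a concrete illustration (my own sketch; the cutoff and sample sizes are arbitrary choices), here is how often two completely independent samples show a 'large' correlation by pure chance, at small versus large sample size:

```python
import random

def pearson_r(xs, ys):
    # Sample Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(1)

def chance_rate(n, trials=500, cutoff=0.3):
    # Fraction of trials in which two *independent* iid samples of size n
    # show a correlation of magnitude above the cutoff
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        hits += abs(pearson_r(xs, ys)) > cutoff
    return hits / trials

rate_small, rate_large = chance_rate(10), chance_rate(1000)
print(f"|r| > 0.3 by chance: n=10: {rate_small:.2f}, n=1000: {rate_large:.2f}")
```

At n=10 coincidental "correlations" are routine; at n=1000 they essentially never happen, which is exactly what the p-value quantifies.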
Let's return to the ice cream example. It doesn't take long to guess what's really going on here: warm weather causes both the increased occupancy of public space and irritability that leads to spikes in violent crime and to a craving for a cold treat. So no, ice cream does not cause violent crime. But they are causally linked, through a quite short causal pathway. There are three possible architectures for the pathway: A causes B, B causes A, and C causes both, either directly or indirectly[4].
I would hate to push anyone back to the truly naive position that A correlating with B means A causes B, but let's not say false things: correlation does in fact imply causation[5], it just doesn't show you which direction that causation flows.
Why do I care about correcting this phrase? Two reasons — it is bad as a community to have catchphrases that are factually false, and "correlation does not imply causation" can be and has been used for dark arts before. Rather famously, Ronald Fisher spent decades arguing that there was insufficient evidence to conclude that smoking causes lung cancer - because correlation does not imply causation. The tobacco industry was grateful. Meanwhile, the correlation was telling us exactly what we should have been doing: not dismissing it, but designing experiments to determine which of the three causal architectures explained it. The answer, of course, was the obvious one. Correlation was trying to tell us something, and we spent decades pretending it wasn't allowed to.
Notes
1. This one strikes closest to my heart as a longtime XKCD fan. Randall is almost gesturing at the point I make in this essay, but not quite. At the risk of thinking too hard about a joke (surely not a sin for this particular comic), the key flaw here is the tiny sample size — this isn't even correlation; the p-value is 0.5. If 1,000 people take a statistics class and we survey them before and after, then we could get a meaningful, statistically-robust correlation here — and unfortunately it would probably be the case that taking the class makes people more likely to believe this phrase.
2. I'm using 'imply' in an empirical rather than logical sense — it's not that correlation proves causation the way a mathematical proof does, but that it provides evidence for causation, with strength proportional to sample size.
3. p=0.00195, being generous and taking the two-tailed value.
4. That "indirectly" is pointing at a fourth option, actually an infinite set of options: C causes A and D which causes B, C causes A and E which causes D which causes B, etc. I'm not including these because it's natural to consider those as variants on C causes both. As an analogy: if one pushes over the first domino, did that cause the last domino to fall? A pedant might argue the actual cause of the last domino falling was the penultimate domino falling on it, and in some cases that precision can be useful, but most of the time it's natural to just say the person who pushed the first domino caused the last one to fall over. In practice the causal chain is probably pretty short, because interesting correlations tend to be well below one, and after a few intermediates the correlation strength drops below the noise threshold of detection.
5. With the evidentiary strength you would expect based on the p-value of the correlation. Coincidence is always a possibility, but becomes pretty unlikely for correlations with a large sample size.