58 comments

The actual relationship is: A causes B. Furthermore, there is no noise in the process. A is varying randomly, but B is deterministically caused by A and nothing else, and not by a complex process either.

[. . .] It does not matter what smooth waveform the signal generator puts out, it will have zero correlation with the current that it is the sole cause of.

This equivocates between the entire waveform A and the values of A at single points in time. The random value of the entire waveform A can be seen as the sole cause of the entire value of the waveform B, under one representation of the probability relations. But there is no representation under which the random value of A at a single point in time can be seen as the sole cause of the random value of B at that point in time. What could be a sole cause of the value of B at any point in time is the value of A at that time together with any one of three other variables: the value of a hidden low-pass-filtered white noise at that time, the value of A at an immediately preceding time in the continuum limit, or, if this is a second-order system, the value of B at an immediately preceding time in the continuum limit.

As entire waveforms, the random value of A is perfectly correlated with the random value of B (up to the rank of the covariance of B), because B is a deterministic linear transformation of A. As values at single points in time, the random value of A is uncorrelated with the random value of B.

So, marginalizing out the equivocation, either A is a sole deterministic cause of B, and A and B are perfectly correlated (but correlation is not logically necessary; see below), or A and B have zero correlation, and A is not a sole deterministic cause of B.

. . . Spirtes et al's Causation, Prediction, and Search. One of the axioms used in the last-mentioned is the Faithfulness Axiom. See the book for the precise formulation; informally put it amounts to saying that if two variables are *uncorrelated*, then they are causally independent. . . . The purpose of this article is to argue that this is not the case.

Emphasis added here and below.

Causation, Prediction, and Search, page 31:

Faithfulness Condition: Let G be a causal graph and P a probability distribution generated by G. <G, P> satisfies the Faithfulness Condition if and only if every conditional independence relation *true in* P is entailed by the Causal Markov Condition applied to G.

Wikipedia on correlation:

If the variables are independent then the correlation is 0, but the converse is not true because the correlation coefficient detects only linear dependencies between two variables. Here is an example: Suppose the random variable X is uniformly distributed on the interval from −1 to 1, and Y = X². Then Y is completely determined by X, so that X and Y are dependent, but their correlation is zero; they are uncorrelated.

Spirtes's example on page 71 looks like a linear Gaussian causal system. In a linear Gaussian causal system, uncorrelation is equivalent to simple marginal independence and can imply complete conditional independence.

Theorem: In the long run, a bounded, differentiable real function has zero correlation with its first derivative. [. . .] Notice that unlike the case that Spirtes considers, where the causal connections between two variables just happen to have multiple effects that exactly cancel, the lack of correlation between A and B is robust.

Yes, I think this is true for values of a function and its derivative sampled at single uniformly random times (for some limit sense of "uniform" and "a function").

Type error! Causal relationships between *booleans* imply correlation between them. Causal relationships between *numbers* imply correlation only if the relation is monotonic. Other types, such as strings, need not even have a meaningful definition of correlation, but they can nevertheless be causally related to each other.

There is a really interesting discussion/debate about Pearl's and Rubin's approaches to causal inference going on at Andrew Gelman's Blog. Part One. Part two. Part three.

Pearl is contributing in the comments.

As per my comment here, there is no *statistical correlation* between the PRNG's output and the seed value, but there is mutual information.

When someone says "no correlation" to mean "no statistical correlation", people hear "no correlation", which invokes that clump of conceptspace in their minds which implies "no mutual information". But that isn't true. There are other ways for variables to be *related* than statistical correlation, and mutual information is one way, and this is an important distinction to make before you get all giddy!

That is what A and B are: a randomly wandering variable A and its rate of change B.

Maybe I'm not quite understanding, but it seems to me that your argument relies on a rather broad definition of "causality". B may be dependent on A, but to say that A "causes" B seems to ignore some important connotations of the concept.

I think what bugs me about it is that "causality" implies a directness of the dependency between the two events. At first glance, this example *seems* like a direct relationship. But I would argue that B is not caused by A alone, but by both A's current and previous states. If you were to transform A so that a given B depended directly on a given A', I think you would indeed see a correlation.

I realize that I'm kind of arguing in a circle here; what I'm ultimately saying is that the term "cause" *ought* to imply correlation, because that is more useful to us than a synonym for "determine", and because that is more in line (to my mind, at least) with the generally accepted connotations of the word.

Maybe I'm not quite understanding, but it seems to me that your argument relies on a rather broad definition of "causality". B may be dependent on A, but to say that A "causes" B seems to ignore some important connotations of the concept.

Very true. Once again, in the context of a Richard Kennaway post, I'm going to have to recommend the use of more precise concepts. Instead of "correlation", we should be talking about "mutual information", and it would be helpful if we used Judea Pearl's definition of causality.

Mutual information between two variables means (among many equivalent definitions) how much you learn about one variable by learning the other. Statistical correlation is one *way* that there can be mutual information between two variables, but not the only way.

So, like what JGWeissman said, there can be mutual information between the two series even in the absence of a statistical correlation that directly compares time t in one to time t in the other. For example, there is mutual information between sin(t) and cos(t), even though d(sin(t))/dt = cos(t), and even though they're simultaneously uncorrelated (i.e. uncorrelated when comparing time t to time t). The reason there is mutual information is that if you know sin(t), a simple time-shift tells you cos(t).
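The sin/cos claim is easy to check numerically. Here is a small sketch (my own, using only the standard library, with a quarter-period index shift standing in for the time-shift):

```python
import math

# Sample a = sin(t) and b = cos(t) (= da/dt) at the same instants over one period.
N = 10_000
t = [2 * math.pi * k / N for k in range(N)]
a = [math.sin(u) for u in t]
b = [math.cos(u) for u in t]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Comparing time t to time t, the correlation is (numerically) zero...
print(abs(pearson(a, b)) < 1e-9)  # True

# ...yet shifting a by a quarter period recovers b exactly: sin(t + pi/2) = cos(t).
shifted = a[N // 4:] + a[:N // 4]
print(max(abs(x - y) for x, y in zip(shifted, b)) < 1e-9)  # True
```

So the simultaneous correlation is zero while the mutual information is, in this deterministic case, total.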

As for causation, the Pearl definition is (and my apologies I may not get this right) that:

"A causes B iff, after learning A, nothing else at the time of A or B gives you information about B. (and A is the minimal such set for which this is true)"

In other words, A causes B iff A is the minimal set of variables conditional on which B is independent of everything else.

So, anyone want to rephrase Kennaway's post with those definitions?

But I would argue that B is not caused by A alone, but by both A's current and previous states.

This is the right idea. For small epsilon, B(t) should have a weak negative correlation with A(t - epsilon), a weak positive correlation with A(t + epsilon), and a strong positive correlation with the difference A(t + epsilon) - A(t - epsilon).

The function A causes the function B, but the value of A at time t does not cause the value of B at time t. Therefore the lack of correlation between A(t) and B(t) does not contradict causation implying correlation.
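These epsilon correlations can be checked with a quick simulation (a sketch using my own discretisation: A is low-pass-filtered white noise, as in the post's signal-generator example, and B is its forward-difference derivative):

```python
import math
import random

random.seed(0)

# A: low-pass-filtered white noise (a bounded, randomly wandering signal);
# B: its discrete time derivative.
dt, tau, N = 0.01, 1.0, 100_000
A = [0.0]
for _ in range(N):
    # Ornstein-Uhlenbeck-style update: relax toward 0, add noise.
    A.append(A[-1] + dt * (-A[-1] / tau) + math.sqrt(dt) * random.gauss(0, 1))
B = [(A[k + 1] - A[k]) / dt for k in range(N)]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

ks = range(1, N - 1)
c_back = corr([B[k] for k in ks], [A[k - 1] for k in ks])
c_fwd  = corr([B[k] for k in ks], [A[k + 1] for k in ks])
c_diff = corr([B[k] for k in ks], [A[k + 1] - A[k - 1] for k in ks])

print(c_back < 0 < c_fwd)  # True: weakly negative, then weakly positive
print(c_diff > 0.5)        # True: strongly positive with the central difference
```

With these parameters the backward and forward correlations come out around -0.07 and +0.07, and the correlation with the central difference around 0.7, matching the pattern described above.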

Therefore the lack of correlation between A(t) and B(t) does not contradict causation implying correlation.

Only trivially. Since B = dA/dt, the correlation between B and dA/dt is perfect. Likewise for any other relationship B = F(A): B correlates perfectly with F(A). But you would only compare B and F(A) if you already had some reason to guess they were related, and having done so would observe they were the same and not trouble with correlations at all.

If you do not know that B = dA/dt and have no reason to guess this hypothesis, correlations will tell you nothing, especially if your time series data has too large a time step -- as positively recommended in the linked paper -- to see dA/dt at all.

But I would argue that B is not caused by A alone, but by both A's current and previous states.

Consider not the abstract situation of B = dA/dt, but the concrete example of the signal generator. It would be a perverse reading of the word "cause" to say that the voltage does not cause the current. You can make the current be anything you like by suitably manipulating the voltage.

But let this not degenerate into an argument about the "real" meaning of "cause". Consider instead what is being said about the systems studied by the authors referenced in the post.

Lacerda, Spirtes, et al. do not use your usage. They talk about time series equations in which the current state of each variable depends on the previous states of some variables, but still they draw causal graphs which do not have a node for every time instant of every variable, but a node for every variable. When x(i+1) = b y(i) + c z(i), they talk about y and z causing x.

The reason that none of their theorems apply to the system B = dA/dt is that when I discretise time and put this in the form of a difference equation, it violates the precondition they state in section 1.2.2. This will be true of the discretisation of any system of ordinary differential equations. It appears to me that that is a rather significant limitation of their approach to causal analysis.

Consider not the abstract situation of B = dA/dt, but the concrete example of the signal generator. It would be a perverse reading of the word "cause" to say that the voltage does not cause the current. You can make the current be anything you like by suitably manipulating the voltage.

But you can make a similar statement for just about *any* situation where B = dA/dt, so I think it's useful to talk about the abstract case.

For example, you can make a car's velocity anything you like by suitably manipulating its position. Would you then say that the car's position "causes" its velocity? That seems awkward at best. You can control the car's acceleration by manipulating its velocity, but to say "velocity causes acceleration" actually sounds *backwards*.

But let this not degenerate into an argument about the "real" meaning of "cause". Consider instead what is being said about the systems studied by the authors referenced in the post.

But isn't this really the whole argument? If the authors implied that every relationship between two functions implies correlation between their raw values, then that is, I think, self-evidently wrong. The question then, is *do* we imply correlation when we refer to causation? I think the answer is generally "yes".

I think intervention is the key idea missing from the above discussion of which of the derivative function and the integrated function is the cause and which is the effect. In the signal generator example, voltage is a cause of current because we can intervene directly on the voltage. In the car example, acceleration is a cause of velocity because we can intervene directly on acceleration. This is not too helpful on its own, but maybe it will point the discussion in a useful direction.

In a further article I will exhibit time series for three variables, A, B, and C, where the joint distribution is multivariate normal, the correlation of A with C is below -0.99, and each has zero correlation with B. ...

And in the current comment section, I'm going to give away the answer, since I've run through the PCT demos. (Sorry, I don't know how to format for spoilers; will edit once I figure it out or someone tells me.)

You sure you didn't want to figure it out on your own? Okay, here goes. Kennaway is describing a feedback control system: a system that observes a variable's current value and outputs a signal that attempts to bring it back towards a reference value. A is an external disturbance. B is the deviation of the system from the reference value (the error). C is the output of the controller.

The controller C will push in the opposite direction of the disturbance A, so A and C will be almost perfectly anti-correlated. Their combined effect is to keep B very close to zero with random deviations, so B is uncorrelated with both.

The disturbance and the controller jointly cause the error. So, we have A->B and C->B. The error also causes the controller to output what it does, so B->C. (I assume directed cycles are allowed since there are four possible connections and you said there are 16 possible graphs.)

Together, that's A-->B<-->C

(In other news, Kennaway or pjeby will suggest I'm not giving due attention to Perceptual Control Theory.)

(Edit: some goofs)
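This sketch can be simulated directly. Below is a minimal version (my own discretisation and parameters, not Kennaway's promised example): a smooth disturbance A, an error B = A + C about a reference of 0, and an integral controller producing C.

```python
import math
import random

random.seed(1)

dt, gain, N = 0.005, 50.0, 400_000
x = a = c = 0.0
A, B, C = [], [], []
for _ in range(N):
    # Doubly low-pass-filtered noise: a smooth, bounded, wandering disturbance.
    x += dt * (-x) + math.sqrt(dt) * random.gauss(0, 1)
    a += dt * (x - a)
    b = a + c            # error = disturbance + controller output
    c -= gain * b * dt   # integral control pushes the error toward zero
    A.append(a); B.append(b); C.append(c)

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(xs, ys))
    vx = sum((p - mx) ** 2 for p in xs)
    vy = sum((q - my) ** 2 for q in ys)
    return cov / math.sqrt(vx * vy)

print(corr(A, C) < -0.99)     # True: controller output mirrors the disturbance
print(abs(corr(A, B)) < 0.1)  # True: error is nearly uncorrelated with A
print(abs(corr(B, C)) < 0.1)  # True: error is nearly uncorrelated with C
```

The controller output tracks the disturbance almost perfectly in anti-phase, while the residual error carries essentially no linear trace of either, exactly the correlation pattern described above.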

You have read my mind perfectly and understood the demos! But I'll go ahead and make the post anyway, when I have time, because there are some general implications to draw from the disconnect between causality and correlation. Such as, for example, the impossibility of arriving at A-->B<-->C for this example from any existing algorithms for deriving causal structure from statistical information.

the impossibility of arriving at A-->B<-->C for this example from any existing algorithms for deriving causal structure from statistical information.

Correct me if I'm wrong, but I think I already know the insight behind what you're going to say.

It's this: there is no fully general way to detect all mutual information between variables, because that would be equivalent to being able to compute Kolmogorov complexity (the minimum program length needed to output a given string), which would in turn be equivalent to solving the Halting problem.

Do you actually have a proof?

Yes. But it's not deep; I recommend trying it yourself before consulting the answer. It follows straightforwardly from the fact that the integral of x(dx/dt) is (x^2)/2. The rest is bookkeeping to eliminate edge cases.

I didn't trouble to state the result with complete precision in the OP. For reference, here is an exact formulation (Theorem 2 of the linked note):

*Let x be a differentiable real function. If the averages of x and dx/dt over the whole real line exist, and the correlation of x and dx/dt over the whole real line exists, then the correlation is zero.*
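The contrast between a short interval and the long run can be seen numerically. A sketch (my own discretisation, taking x(t) = sin(t) as the bounded, differentiable function):

```python
import math

def corr_on(T, n=200_000):
    # Pearson correlation of x = sin and dx/dt = cos, sampled uniformly on [0, T].
    ts = [T * k / n for k in range(n)]
    xs = [math.sin(t) for t in ts]
    ys = [math.cos(t) for t in ts]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# On a quarter period the function and its derivative correlate strongly...
print(round(corr_on(math.pi / 2), 2))  # -0.92

# ...but over a long interval the correlation vanishes.
print(abs(corr_on(2000 * math.pi)) < 1e-3)  # True
```

Nothing here depends on the function being periodic; periodicity just makes the long-run limit easy to evaluate.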

Let x be a differentiable real function.

I think precision would require you to state this in terms of a variable x and the function f(x). (EDIT: Sorry; please ignore this.)

If the averages of x and dx/dt over the whole real line exist,

This is a pretty harsh requirement! It will be true for constant functions, cyclic functions, symmetric functions, and maybe asymptotically-bounded functions. I don't think you can say it's true for y=x.

I think precision would require you to state this in terms of a variable x and the function f(x).

gjm has read the note I linked; I suggest you do the same. That is what a link is for.

This is a pretty harsh requirement!

Not particularly. The speed of a car, the temperature of a room, the height of an aircraft: such things are all around you. Stating the property of the whole real line is an idealisation, but Theorem 1 of the note treats of finite intervals also, and there is a version of the theorems for time series.

I don't think you can say it's true for y=x.

In keeping with the terminology established at the note I linked, I take this to mean x=t. Yes, it is not true of x=t. This does not have an average over the whole real line.

gjm has read the note I linked; I suggest you do the same. That is what a link is for.

I wish I hadn't made my comment about precision, which was too nitpicking and unhelpful. But as long as we're being snippy with each other:

To be excruciatingly precise: You just said you were being precise, then said "Let x be a differentiable real function." That isn't precise; you need to specify right there that it's a function of t. If you'd said the *link* stated it precisely, that would be different.

I admit that I would have interpreted it correctly by making the most-favorable, most-reasonable interpretation and assuming x was a function of t. But, because of the sorts of things I usually see done with x and t, I assumed that x was a function of time, and the function of interest was some function of x(t), and I jumped to the conclusion that you meant to say "Let f(x) be a differentiable real function." Which I would not have done had you in fact been precise, and said "Let x(t) be a differentiable real function."

Sorry that I sounded dismissive. It's a nice proof, and it wasn't obvious to me.

I am uncomfortable with using Pearson correlation to mean correlation. Consider y=sin(x), dy/dx = cos(x). These are "uncorrelated" according to Pearson correlation, but given one, there are at most 2 possibilities for the other. So knowing one gives you almost complete info about the other. So calling them "independent" seems wrong.

Correlation only looks for linear relationships. For example, suppose we have a random variable X that takes values -2, -1, 1, or 2 each with probability 1/4. Define the random variable Y=X^2. The correlation is 0. Despite a functional relationship (causality if I've ever seen it), the two variables are uncorrelated.
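The arithmetic of this example can be checked exactly with rational arithmetic:

```python
from fractions import Fraction

# X takes -2, -1, 1, 2 each with probability 1/4; Y = X^2.
xs = [-2, -1, 1, 2]
p = Fraction(1, 4)
e_x  = sum(p * x for x in xs)        # E[X]  = 0
e_y  = sum(p * x * x for x in xs)    # E[Y]  = 5/2
e_xy = sum(p * x ** 3 for x in xs)   # E[XY] = E[X^3] = 0
cov = e_xy - e_x * e_y
print(cov)  # 0: zero covariance (hence zero correlation), yet Y = X^2 exactly
```

E[X] and E[X^3] both vanish by the symmetry of the distribution, so the covariance is exactly zero despite the deterministic functional relationship.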


That's actually an interesting issue in control systems. IIRC, if you set up a system so that some variable B is a function of the time-derivative of A, B=f( dA(t)/dt ), and it requires you to know dA(T)/dt to compute B(T), such a system is called "acausal". I believe this is because you can't know dA(T)/dt until you know A(t) after time T.

So any physically-realizable system that depends on the time-derivative of some other value actually depends on the time-derivative at a *previous* point in time.

In contrast, there is no such problem for the integral. If I only know the time series of A(t) up to time T, then I know the integral of A up to time T, and such a relationship is not acausal.

In the general case, for a relationship between two systems where B is a function of A, the transfer function from A to B, num(s)/den(s), must be such that deg(num) <= deg(den), where deg() denotes the degree of a polynomial.

(The transfer function is the ratio of B to A in the Laplace domain, usually given the variable s to replace t. Multiplying by s in the Laplace domain corresponds to differentiation in the time domain, and dividing by s is integration.)

(edit to clarify, then again to clarify some more)
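The asymmetry between differentiation and integration can be illustrated concretely (a toy sketch of my own, not standard control-systems code): perturb a single "future" sample of a signal and see which derived values at an *earlier* index change.

```python
import math

def running_integral(xs, dt):
    # Trapezoid integral up to index i: uses only samples 0..i.
    out, acc = [0.0], 0.0
    for i in range(1, len(xs)):
        acc += 0.5 * (xs[i - 1] + xs[i]) * dt
        out.append(acc)
    return out

def central_derivative(xs, dt):
    # out[i-1] estimates dA/dt at sample i: needs sample i+1 from the "future".
    return [(xs[i + 1] - xs[i - 1]) / (2 * dt) for i in range(1, len(xs) - 1)]

dt = 0.1
a = [math.sin(0.5 * k * dt) for k in range(100)]
b = list(a)
b[60] += 1.0  # change the signal only at index 60

# At index 59, before the change, the integral is untouched...
print(running_integral(a, dt)[59] == running_integral(b, dt)[59])      # True
# ...but the derivative estimate there already "sees" the future sample.
print(central_derivative(a, dt)[58] == central_derivative(b, dt)[58])  # False
```

The integral at time T is a functional of the past alone; any symmetric derivative estimate at T is not, which is the sense in which the derivative relationship is acausal.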

You mean, like the mechanical (well, electronic) one I described?

B = dA/dt doesn't by itself imply that A is the cause of B. As I pointed out, a current generator attached to a capacitor causes the voltage, the reverse of the first example, but the mathematical relation between voltage and current is the same.

"Cause" is an everyday concept that tends to dissolve when looked at too closely. The research on causal analysis of statistical data quite sensibly does not try to define it.

Ah, ok. Applying his definition ("variable *X* is a probabilistic-cause of variable *Y* if *P*(*y*|do(*x*)) != *P*(*y*) for some values *x* and *y*") to the signal generator, it says that the voltage causes the current; in the current-source version, that the current causes the voltage. That's exactly what I would say as well.

Of course, his limitation to acyclic relations excludes from his analysis systems that are only slightly more complicated, such as .

his limitation to acyclic relations

That's what dynamic Bayesian networks are for. The current values of state variables of a system near stable equilibrium are not caused by each other; they are caused by past values. Dynamic Bayesian networks express this distinction with edges that pass forward in time.

excludes from his analysis systems that are only slightly more complicated, such as

The continuous-time limit of a dynamic Bayesian network can be a differential equation such as this.

(ETA) A dynamic Bayesian network is syntactic sugar for an ordinary Bayesian network that has the same structure in each of a series of time slices, with edges from nodes in each time slice to nodes in the next time slice. The Bayesian network that is made by unrolling a dynamic Bayesian network is still completely acyclic. Therefore, Bayesian networks have at least the representation power of finitely iterated systems of explicit recurrence relations and are acyclic, and continuum limits of Bayesian networks have at least the representation power of systems of differential equations and are acyclic. (Some representation powers that these Bayesian networks do not have are the representation powers of systems of implicit recurrence relations, systems of differential algebraic equations without index reduction, and differential games. Something like hybrid Bayesian-Markovian networks would have some of these representation powers, but they would have unphysical semantics (if physics is causal) and would be hard to use safely.)

(Dynamic Bayesian networks at the University of Michigan Chemical Engineering Process Dynamics and Controls Open Textbook ("ControlWiki"))
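The unrolling construction can be sketched in a few lines (a toy of my own, using Python's standard graphlib). The variable-level graph even contains a B/C cycle, as in the control-loop example, yet the unrolled graph is acyclic because every edge passes forward in time:

```python
from graphlib import TopologicalSorter

# Variable-level edges u -> v, each with a time lag of one slice.
# Note the cycle at the variable level: B -> C and C -> B.
edges = [("A", "B", 1), ("C", "B", 1), ("B", "C", 1)]
T = 5  # number of time slices to unroll

# Unroll into an ordinary Bayesian-network structure: node (v, t+lag)
# has predecessor (u, t).
unrolled = {}
for t in range(T):
    for u, v, lag in edges:
        if t + lag < T:
            unrolled.setdefault((v, t + lag), set()).add((u, t))

# static_order() raises CycleError on a cyclic graph; here it succeeds,
# so the unrolled network is acyclic despite the variable-level cycle.
order = list(TopologicalSorter(unrolled).static_order())
print(("B", 4) in order)  # True: every unrolled node gets a topological rank
```

This is exactly the sense in which a dynamic Bayesian network is syntactic sugar for an acyclic one.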

As with CronoDAS's suggestion of a pseudorandom generator, this can easily yield variables possessing a strong causal connection but no correlation.

Correlations -- product-moment or any other statistical calculation -- are machines to detect relationships between variables that are obscured by passive fog. Random generators and cryptosystems are machines to defeat detection even by an adversary. It is not a surprise that crypto beats correlation.

More surprising is the existence of systems as simple as B = dA/dt which also defeat correlation. The scatter-plot looks like pure fog, yet there are no extraneous noise sources and no adversarial concealment. The relationship between the variables is simply invisible to the statistical tools used in causal analysis.

Whoa, what happened there? Someone made a comment here that made a lot of good points (others had made them, but this poster did it much better), and I voted it up, as did someone else, but now it's gone, and there's no "comment deleted" indication.

Is LW not always giving the most up-to-date site?

My use of the word "cause" was not technically accurate. I needed to find a way to word my comment that didn't use it that way.


*Theorem: In the long run, a bounded, differentiable real function has zero correlation with its first derivative.*

I don't understand the theorem. What does "in the long run" mean? Is it that in the limit a, b -> \infty,

(\int_a^b f(x) f'(x) dx)/(b-a) = (\int_a^b f(x) dx)(\int_a^b f'(y) dy)/(b-a)^2 ?

Sorry for the quasi-TEX notation, even the underscore doesn't appear here. Is there any elegant way to write formulae on LW?

If <f'> on (a,b) grows more rapidly than (b-a)

This cannot happen. f is assumed bounded. Therefore the average of f' over the interval [a,b] tends to zero as the bounds go to infinity.

The precise, complete mathematical statement and proof of the theorem does involve some subtlety of argument (consider what happens if f = sin(exp(x))) but the theorem is correct.
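For reference, the boundary-term computations at the heart of the proof can be sketched as follows (my paraphrase of the argument, assuming |f| <= M on the whole line):

```latex
% Both the mean of f' and the mean of f f' over [a,b] are boundary terms,
% so boundedness makes them vanish as the interval grows:
\[
  \frac{1}{b-a}\int_a^b f'(t)\,dt = \frac{f(b)-f(a)}{b-a},
  \qquad
  \left|\frac{f(b)-f(a)}{b-a}\right| \le \frac{2M}{b-a} \to 0,
\]
\[
  \frac{1}{b-a}\int_a^b f(t)\,f'(t)\,dt = \frac{f(b)^2-f(a)^2}{2(b-a)},
  \qquad
  \left|\frac{f(b)^2-f(a)^2}{2(b-a)}\right| \le \frac{M^2}{b-a} \to 0.
\]
% Hence the limiting covariance \overline{f f'} - \bar{f}\,\overline{f'}
% is zero, and the correlation is zero whenever the limiting variances
% exist and are nonzero.
```

The subtlety alluded to (e.g. f = sin(exp(x))) lies in ensuring those limiting means and variances actually exist, which is why the theorem states them as hypotheses.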

It is a commonplace that correlation does not imply causality, however eyebrow-wagglingly suggestive it may be of causal hypotheses. It is less commonly noted that causality does not imply correlation either. It is quite possible for two variables to have zero correlation, and yet for one of them to be completely determined by the other.

Theorem: In the long run, a bounded, differentiable real function has zero correlation with its first derivative.