It is a commonplace that correlation does not imply causality, however eyebrow-wagglingly suggestive it may be of causal hypotheses. It is less commonly noted that causality does not imply correlation either. It is quite possible for two variables to have zero correlation, and yet for one of them to be completely determined by the other.

The causal analysis of statistical information is the subject of several major books, including Judea Pearl's *Causality* and *Probabilistic Reasoning in Intelligent Systems*, and Spirtes et al's *Causation, Prediction, and Search*. One of the axioms used in the last-mentioned is the Faithfulness Axiom. See the book for the precise formulation; informally put, it amounts to saying that if two variables are uncorrelated, then they are causally independent. As support for this, the book offers a theorem to the effect that while counterexamples are theoretically possible, they have measure zero in the space of causal systems, and anecdotal evidence that people find fault with causal explanations violating the axiom.

The purpose of this article is to argue that this axiom is mistaken: counterexamples are neither rare nor contrived.

The counterexample consists of just two variables A and B. The time series data can be found here, a text file in which each line contains a pair of values for A and B. Here is a scatter-plot:

The correlation is not significantly different from zero. Consider the possible causal relationships there might be between two variables, assuming there are no other variables involved. A causes B; B causes A; each causes the other; neither causes the other. Which of these describes the relationship between A and B for the above data?

The correct answer is that none of the four hypotheses can be rejected by these data alone. The actual relationship is: A causes B. Furthermore, there is no noise in the process. A is varying randomly, but B is deterministically caused by A and nothing else, and not by a complex process either. The process is robust: it is not by accident that the correlation is zero. Every physical process that is modelled by the very simple mathematical relation at work here (to be revealed below) has the same property.

Because I know the process that generated these data, I can confidently predict that it is not possible for anyone to discover from them the true dynamical relation between A and B. So I'll make it a little easier to guess what is going on before I tell you a few paragraphs down. Here (warning: large file) is another time series for the same variables, sampled at 1000 times the frequency (but only 1/10 the total time). Just by plotting these a certain regularity may become evident to the eyes, and it should be quite easy for anyone so inclined to discover the mathematical relationship between A and B.

So what are these variables, that are tightly causally connected yet completely uncorrelated?

Consider a signal generator. It generates a voltage that varies with time. Most signal generators can generate square waves or sine waves, sometimes sawtooths as well. This signal generator generates a random waveform. Not white noise -- it meanders slowly up and down without pattern, and in the long run the voltage is normally distributed.

Connect the output across a capacitor. The current through the capacitor is proportional to the rate of change of the voltage. Because the voltage is bounded and differentiable, the correlation with its first derivative is zero. That is what A and B are: a randomly wandering variable A and its rate of change B.

**Theorem:**

*In the long run, a bounded, differentiable real function has zero correlation with its first derivative.*

The proof is left as an exercise.
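A quick numerical check (my own sketch, not part of the original exercise): taking a sum of random-phase sinusoids as a stand-in for a bounded, smooth random waveform, the sample correlation between the waveform and its derivative comes out close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1000.0, 200_000)

# A bounded, smooth "random" waveform: a sum of random-phase sinusoids.
freqs = rng.uniform(0.1, 1.0, 30)
phases = rng.uniform(0.0, 2.0 * np.pi, 30)
A = np.sin(np.outer(t, freqs) + phases).sum(axis=1) / 30.0

# B is completely determined by A, with no noise at all.
B = np.gradient(A, t)

r = np.corrcoef(A, B)[0, 1]
print(f"corr(A, B) = {r:.4f}")   # close to zero despite B = dA/dt
```

Any other bounded smooth waveform would do; the particular frequencies and seed here are arbitrary.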

Notice that unlike the case that Spirtes considers, where the causal connections between two variables just happen to have multiple effects that exactly cancel, the lack of correlation between A and B is robust. It does not matter what smooth waveform the signal generator puts out, it will have zero correlation with the current that it is the sole cause of. I chose a random waveform because it allows any value and any rate of change of that value to exist simultaneously, rather than e.g. a sine wave, where each value implies at most two possible rates of change. But if your data formed neat sine waves you wouldn't be resorting to statistics. The problem here is that they form a cloud of the sort that people immediately start doing statistics on, but the statistics tells you nothing. I could have arranged that A and B had a modest positive correlation, by taking for B a linear combination of A and dA/dt, but the seductive exercise of drawing a regression line through the cloud would be meaningless.

In some analyses of causal networks (for example here, which tries, but I think unsuccessfully, to handle cyclic causal graphs), an assumption is made that the variables are at equilibrium, i.e. that observations are made at intervals long enough to ignore transient temporal effects. As can be seen by comparing the two time series for A and B, or by considering the actual relation between the variables, this procedure guarantees to hide, not reveal, the relationship between these variables.

If you tackled and solved the exercise of studying the detailed time series to discover the relationship before reading the answer, I doubt that you did it by any statistical method.

Some signal generators can be set to generate a current instead of a voltage. In that case the current through the capacitor would cause the voltage across it, reversing the mathematical relationship. So even detailed examination of the time series will not distinguish between the voltage causing the current and the current causing the voltage.

In a further article I will exhibit time series for three variables, A, B, and C, where the joint distribution is multivariate normal, the correlation of A with C is below -0.99, and each has zero correlation with B. Some causal information is also given: A is exogenous (i.e. is not causally influenced by either B or C), and there are no confounding variables (other variables correlating with more than one of A, B, or C). This means that there are four possible causal arrows you might draw between the variables: A to B, A to C, B to C, and C to B, giving 16 possible causal graphs. Which of these graphs are consistent with the distribution?

This equivocates the entire waveform A and the values of A at single points in time. The random value of the entire waveform A can be seen as the sole cause of the entire value of the waveform B, under one representation of the probability relations. But there is no representation under which the random value of A at a single point in time can be seen as the sole cause of the random value of B at that point in time. What could be a sole cause of the value of B at any point in time is the value of A at that time together with any one of three other variables: the value of a hidden low-pass-filtered white noise at that time, the value of A at an immediately preceding time in the continuum limit, or, if this is a second-order system, the value of B at an immediately preceding time in the continuum limit.

As entire waveforms, the random value of A is perfectly correlated with the random value of B (up to the rank of the covariance of B), because B is a deterministic linear transformation of A. As values at single points in time, the random value of A is uncorrelated with the random value of B.

So, marginalizing out the equivocation, either A is a sole deterministic cause of B, and A and B are perfectly correlated (but correlation is not logically necessary; see below), or A and B have zero correlation, and A is not a sole deterministic cause of B.

Emphasis added here and below.

Causation, Prediction, and Search, page 31:

Wikipedia on correlation:

Spirtes's example on page 71 looks like a linear Gaussian causal system. In a linear Gaussian causal system, uncorrelation is equivalent to simple marginal independence and can imply complete conditional independence.

Yes, I think this is true for values of a function and its derivative sampled at single uniformly random times (for some limit sense of "uniform" and "a function").

Type error! Causal relationships between *booleans* imply correlation between them. Causal relationships between *numbers* imply correlation only if the relation is monotonic. Other types, such as strings, need not even have a meaningful definition of correlation, but they can nevertheless be causally related to each other.

That is true, but not relevant. These numbers have, I think, as strong an independence as bit strings can have: no bit that can be extracted from A is correlated with any bit that can be extracted from B.

There is a really interesting discussion/debate about Pearl's and Rubin's approaches to causal inference going on at Andrew Gelman's Blog. Part One. Part two. Part three.

Pearl is contributing in the comments.

Thanks! This stuff looks interesting. I would appreciate a short summary; I could write it myself if I have the time.

The output of a pseudorandom number generator is determined by the seed value, but good luck finding a correlation between them! ;)

As per my comment here, there is no *statistical correlation* between the PRNG and the seed value, but there is mutual information.

When someone says "no correlation" to mean "no statistical correlation", people hear "no correlation", which invokes that clump of conceptspace in their minds which implies "no mutual information". But that isn't true. There are other ways for variables to be *related* than statistical correlation, and mutual information is one way, and this is an important distinction to make before you get all giddy!

Ah, yes, this clears things up... correlation and mutual information are the kind of things that can get confused easily.

That is another way that correlations can fail to detect what is happening.

Maybe I'm not quite understanding, but it seems to me that your argument relies on a rather broad definition of "causality". B may be dependent on A, but to say that A "causes" B seems to ignore some important connotations of the concept.

I think what bugs me about it is that "causality" implies a directness of the dependency between the two events. At first glance, this example *seems* like a direct relationship. But I would argue that B is not caused by A alone, but by both A's current and previous states. If you were to transform A so that a given B depended directly on a given A', I think you would indeed see a correlation.

I realize that I'm kind of arguing in a circle here; what I'm ultimately saying is that the term "cause" *ought* to imply correlation, because that is more useful to us than a synonym for "determine", and because that is more in line (to my mind, at least) with the generally accepted connotations of the word.

Very true. Once again, I'm going to have to recommend, in the context of a Richard Kennaway post, the use of more precise concepts. Instead of "correlation", we should be talking about "mutual information", and it would be helpful if we used Judea Pearl's definition of causality.

Mutual information between two variables means (among many equivalent definitions) how much you learn about one variable by learning the other. Statistical correlation is one *way* that there can be mutual information between two variables, but not the only way.

So, like what JGWeissman said, there can be mutual information between the two series even in the absence of a statistical correlation that directly compares time t in one to time t in the other. For example, there is mutual information between sin(t) and cos(t), even though d(sin(t))/dt = cos(t), and even though they're simultaneously uncorrelated (i.e. uncorrelated when comparing time t to time t). The reason there is mutual information is that if you know sin(t), a simple time-shift tells you cos(t).
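This distinction is easy to exhibit numerically (my own illustration, not from the comment): sin(t) and cos(t) sampled over whole periods have zero Pearson correlation, yet a crude histogram estimate of their mutual information is clearly positive.

```python
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 100_000, endpoint=False)
x, y = np.sin(t), np.cos(t)

# Pearson correlation over whole periods is zero...
r = np.corrcoef(x, y)[0, 1]

# ...but a histogram estimate of the mutual information (in nats) is not,
# because the joint samples lie on a circle, not a featureless cloud.
H, _, _ = np.histogram2d(x, y, bins=20)
p = H / H.sum()
px, py = p.sum(axis=1), p.sum(axis=0)
indep = np.outer(px, py)
mask = p > 0
mi = np.sum(p[mask] * np.log(p[mask] / indep[mask]))
```

The bin count of 20 is arbitrary; any moderate value shows the same qualitative gap between zero correlation and large mutual information.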

As for causation, the Pearl definition is (and my apologies I may not get this right) that:

"A causes B iff, after learning A, nothing else at the time of A or B gives you information about B. (and A is the minimal such set for which this is true)"

In other words, A causes B iff A is the minimal set conditional on which B is independent of everything else.

So, anyone want to rephrase Kennaway's post with those definitions?

This is the right idea. For small epsilon, B(t) should have a weak negative correlation with A(t - epsilon), a weak positive correlation with A(t + epsilon), and a strong positive correlation with the difference A(t + epsilon) - A(t - epsilon).
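These predictions are easy to check numerically. The sketch below is my own construction: it assumes a sum of random-phase sinusoids as the randomly wandering A, and estimates the three lagged correlations.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0.0, 2000.0, 400_000)
freqs = rng.uniform(0.1, 1.0, 20)
phases = rng.uniform(0.0, 2.0 * np.pi, 20)
A = np.sin(np.outer(t, freqs) + phases).sum(axis=1) / 20.0
B = np.gradient(A, t)                 # B = dA/dt

k = 100                               # lag of 100 samples = epsilon of 0.5 time units
r_minus = np.corrcoef(B[k:-k], A[:-2 * k])[0, 1]             # B(t) vs A(t - eps): weakly negative
r_plus = np.corrcoef(B[k:-k], A[2 * k:])[0, 1]               # B(t) vs A(t + eps): weakly positive
r_diff = np.corrcoef(B[k:-k], A[2 * k:] - A[:-2 * k])[0, 1]  # vs central difference: strong
```

The signs follow from A(t + eps) ≈ A(t) + eps·B(t): since B is uncorrelated with A(t), it inherits a positive covariance with A(t + eps) and a negative one with A(t - eps).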

The function A causes the function B, but the value of A at time t does not cause the value of B at time t. Therefore the lack of correlation between A(t) and B(t) does not contradict causation implying correlation.

Only trivially. Since B = dA/dt, the correlation between B and dA/dt is perfect. Likewise for any other relationship B = F(A): B correlates perfectly with F(A). But you would only compare B and F(A) if you already had some reason to guess they were related, and having done so would observe they were the same and not trouble with correlations at all.

If you do not know that B = dA/dt and have no reason to guess this hypothesis, correlations will tell you nothing, especially if your time series data has too large a time step -- as positively recommended in the linked paper -- to see dA/dt at all.

I don't think you are arguing in a circle. B is caused by current and previous As. Obviously we're not going to see a correlation unless we control for the previous state of A. Properly controlled, the relationship between the two variables will be one-to-one, won't it?

Consider not the abstract situation of B = dA/dt, but the concrete example of the signal generator. It would be a perverse reading of the word "cause" to say that the voltage does not cause the current. You can make the current be anything you like by suitably manipulating the voltage.

But let this not degenerate into an argument about the "real" meaning of "cause". Consider instead what is being said about the systems studied by the authors referenced in the post.

Lacerda, Spirtes, et al. do not use your usage. They talk about time series equations in which the current state of each variable depends on the previous states of some variables, but still they draw causal graphs which do not have a node for every time instant of every variable, but a node for every variable. When x(i+1) = b y(i) + c z(i), they talk about y and z causing x.

The reason that none of their theorems apply to the system B = dA/dt is that when I discretise time and put this in the form of a difference equation, it violates the precondition they state in section 1.2.2. This will be true of the discretisation of any system of ordinary differential equations. It appears to me that that is a rather significant limitation of their approach to causal analysis.

But you can make a similar statement for just about *any* situation where B = dA/dt, so I think it's useful to talk about the abstract case.

For example, you can make a car's velocity anything you like by suitably manipulating its position. Would you then say that the car's position "causes" its velocity? That seems awkward at best. You can control the car's acceleration by manipulating its velocity, but to say "velocity causes acceleration" actually sounds *backwards*.

But isn't this really the whole argument? If the authors implied that every relationship between two functions implies correlation between their raw values, then that is, I think, self-evidently wrong. The question then is, *do* we imply correlation when we refer to causation? I think the answer is generally "yes".

I think intervention is the key idea missing from the above discussion of which of the derivative function and the integrated function is the cause and which is the effect. In the signal generator example, voltage is a cause of current because we can intervene directly on the voltage. In the car example, acceleration is a cause of velocity because we can intervene directly on acceleration. This is not too helpful on its own, but maybe it will point the discussion in a useful direction.

And in the current comment section, I'm going to give away the answer, since I've run through the PCT demos. (Sorry, I don't know how to format for spoilers, will edit once I figure out or someone tells me.)

You sure you didn't want to figure out on your own? Okay, here goes. Kennaway is describing a feedback control system: a system that observes a variable's current value and outputs a signal that attempts to bring it back towards a reference value. A is an external disturbance. B is the deviation of the system from the reference value (the error). C is the output of the controller.

The controller C will push in the opposite direction of the disturbance A, so A and C will be about anti-correlated. Their combined effect is to keep B very close to zero with random deviations, so B is uncorrelated with both.
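Here is a toy simulation consistent with that description (my own sketch; the disturbance model, gain, and time step are all made up). A is a smooth random disturbance, C is an integral controller's output, and B = A + C is the error the controller drives toward zero.

```python
import numpy as np

rng = np.random.default_rng(1)
dt, n = 0.01, 200_000
t = np.arange(n) * dt

# A: smooth, bounded random disturbance (sum of random-phase sinusoids).
freqs = rng.uniform(0.1, 1.0, 20)
phases = rng.uniform(0.0, 2.0 * np.pi, 20)
A = np.sin(np.outer(t, freqs) + phases).sum(axis=1) / 20.0

# Integral feedback: the controller output C moves to cancel the error B.
B, C = np.empty(n), np.empty(n)
c, gain = 0.0, 50.0
for i in range(n):
    b = A[i] + c          # error = disturbance + controller output
    c -= gain * b * dt    # controller integrates the error away
    B[i], C[i] = b, c

r = np.corrcoef([A, B, C])
# r[0, 2]: corr(A, C), strongly negative.
# r[0, 1], r[1, 2]: corr(A, B) and corr(B, C), both near zero.
```

With a high gain, C tracks -A almost perfectly (hence the strong anticorrelation), and B is left proportional to dA/dt, which by the theorem above is uncorrelated with A.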

The disturbance and the controller jointly cause the error. So, we have A->B and C->B. The error also causes the controller to output what it does, so B->C. (I assume directed cycles are allowed since there are four possible connections and you said there are 16 possible graphs.)

Together, that's A-->B<-->C

(In other news, Kennaway or pjeby will suggest I'm not giving due attention to Perceptual Control Theory.)

(Edit: some goofs)

You have read my mind perfectly and understood the demos! But I'll go ahead and make the post anyway, when I have time, because there are some general implications to draw from the disconnect between causality and correlation. Such as, for example, the impossibility of arriving at A-->B<-->C for this example from any existing algorithms for deriving causal structure from statistical information.

Correct me if I'm wrong, but I think I already know the insight behind what you're going to say.

It's this: there is no fully general way to detect all mutual information between variables, because that would be equivalent to being able to compute Kolmogorov complexity (minimum length to output a string), which would in turn be equivalent to solving the Halting problem.

You're wrong. :-)

Kolmogorov complexity will play no part in the exposition.

Check my comment: I was only guessing the underlying insight behind your future post, not its content.

I obviously leave room for the possibility that you'll present a more limited or more poorly-defended version of what I just stated. ;-)

Do you actually have a proof?

Yes. But it's not deep; I recommend trying yourself before consulting the answer. It follows straightforwardly from the fact that the integral of x(dx/dt) is (x^2)/2. The rest is bookkeeping to eliminate edge cases.
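The bookkeeping can be compressed into a few lines (my paraphrase of the argument; assume |x(t)| ≤ M for all t and that the averages and variances in question converge):

```latex
% Both ingredients of the covariance vanish in the long run:
\[
\left| \frac{1}{b-a}\int_a^b x \,\frac{dx}{dt}\,dt \right|
   \;=\; \frac{\left| x(b)^2 - x(a)^2 \right|}{2(b-a)}
   \;\le\; \frac{M^2}{b-a} \;\longrightarrow\; 0,
\]
\[
\left| \frac{1}{b-a}\int_a^b \frac{dx}{dt}\,dt \right|
   \;=\; \frac{\left| x(b) - x(a) \right|}{b-a}
   \;\le\; \frac{2M}{b-a} \;\longrightarrow\; 0.
\]
% Hence cov(x, dx/dt) -> 0; if the variances of x and dx/dt tend to
% nonzero limits, the correlation tends to zero as well.
```

The edge cases mentioned above concern what happens when the averages or variances fail to converge, as for x = sin(exp(t)).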

I didn't trouble to state the result with complete precision in the OP. For reference, here is an exact formulation (Theorem 2 of the linked note):

Let x be a differentiable real function. If the averages of x and dx/dt over the whole real line exist, and the correlation of x and dx/dt over the whole real line exists, then the correlation is zero.

I think precision would require you to state this in terms of a variable x and the function f(x). (EDIT: Sorry; please ignore this.)

This is a pretty harsh requirement! It will be true for constant functions, cyclic functions, symmetric functions, and maybe asymptotically-bounded functions. I don't think you can say it's true for y=x.

He's actually working with a variable t and the function x, whose value at a particular t is x(t). I don't see anything wrong there.

Yep, sorry.

gjm has read the note I linked; I suggest you do the same. That is what a link is for.

Not particularly. The speed of a car, the temperature of a room, the height of an aircraft: such things are all around you. Stating the property of the whole real line is an idealisation, but Theorem 1 of the note treats of finite intervals also, and there is a version of the theorems for time series.

In keeping with the terminology established in the note I linked, I take this to mean x = t. Yes, it is not true of x = t. This does not have an average over the whole real line.

Full disclosure: actually I didn't, I just inferred what the notation had to mean :-).

I wish I hadn't made my comment about precision, which was too nitpicking and unhelpful. But as long as we're being snippy with each other:

To be excruciatingly precise: You just said you were being precise, then said "Let x be a differentiable real function." That isn't precise; you need to specify right there that it's a function of t. If you'd said the *link* stated it precisely, that would be different.

I admit that I would have interpreted it correctly by making the most-favorable, most-reasonable interpretation and assuming x was a function of t. But, because of the sorts of things I usually see done with x and t, I assumed that x was a function of time, and the function of interest was some function of x(t), and I jumped to the conclusion that you meant to say "Let f(x) be a differentiable real function." Which I would not have done had you in fact been precise and said "Let x(t) be a differentiable real function."

Sorry that I sounded dismissive. It's a nice proof, and it wasn't obvious to me.

I am uncomfortable with using Pearson correlation to mean correlation. Consider y = sin(x), dy/dx = cos(x). These are "uncorrelated" according to Pearson correlation, but given one, there are at most 2 possibilities for the other. So knowing one gives you almost complete info about the other. So calling them "independent" seems wrong.


Correlation only looks for linear relationships. For example, suppose we have a random variable X that takes values -2, -1, 1, or 2 each with probability 1/4. Define the random variable Y=X^2. The correlation is 0. Despite a functional relationship (causality if I've ever seen it), the two variables are uncorrelated.
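The arithmetic can be checked directly; this small sketch computes the covariance exactly from the four equally likely values.

```python
import numpy as np

x = np.array([-2.0, -1.0, 1.0, 2.0])   # values of X, each with probability 1/4
y = x ** 2                              # Y = X^2, fully determined by X

cov = np.mean(x * y) - np.mean(x) * np.mean(y)   # E[XY] - E[X]E[Y]
print(cov)   # 0.0: perfect functional dependence, zero correlation
```

The symmetry does the work: E[XY] = E[X^3] = 0 and E[X] = 0, so the covariance vanishes exactly.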

Note - images and links are broken.



This is a really spectacular post.

One quibble: in the case being discussed, one variable is actually a property of the other variable, rather than another *thing* that is affected by something else. Is it really appropriate to say that A causes B when B is just a property of A?

I was thinking this as well, but you could construct a situation that doesn't have this problem - like a mechanical system that relies on the derivative to perform some action deterministically.

That's actually an interesting issue in control systems. IIRC, if you set up a system so that some variable B is a function of the time-derivative of A, B=f( dA(t)/dt ), and it requires you to know dA(T)/dt to compute B(T), such a system is called "acausal". I believe this is because you can't know dA(T)/dt until you know A(t) after time T.

So any physically-realizable system that depends on the time-derivative of some other value is actually depending on the time-derivative at a *previous* point in time.

In contrast, there is no such problem for the integral. If I only know the time series of A(t) up to time T, then I know the integral of A up to time T, and such a relationship is not acausal.

In the general case, for a relationship between two systems where B is a function of A, the transfer function from A to B, num(s)/den(s), must be such that deg(num) <= deg(den), where deg() denotes the degree of a polynomial.

(The transfer function is the ratio of B to A in the Laplace domain, usually given the variable s to replace t. Multiplying by s in the Laplace domain corresponds to differentiation in the time domain, and dividing by s is integration.)
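A small sketch of the realizability point (my own illustration): a discrete difference-quotient "derivative" assigned to sample i needs the next sample, while a running integral up to sample i needs only past ones.

```python
import numpy as np

t = np.linspace(0.0, 10.0, 1001)
dt = t[1] - t[0]
A = np.sin(t)

# Running integral up to time t[i] uses only samples at times <= t[i]: causal.
integral = np.cumsum(A) * dt                 # approximates 1 - cos(t)

# A difference-quotient "derivative" assigned to index i uses A[i + 1], a
# future sample; assigned to index i + 1 it uses only the past, but then it
# estimates dA/dt at a slightly earlier moment (the interval midpoint).
diff_quot = (A[1:] - A[:-1]) / dt            # approximates cos at the midpoints
```

This is the discrete shadow of the degree condition above: integration (dividing by s) is realizable sample-by-sample, while exact differentiation (multiplying by s) is not.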

(edit to clarify, then again to clarify some more)

You mean, like the mechanical (well, electronic) one I described?

B = dA/dt doesn't imply that B is the cause of A. As I pointed out, a current generator attached to a capacitor causes the voltage, the reverse of the first example, but the mathematical relation between voltage and current is the same.

"Cause" is an everyday concept that tends to dissolve when looked at too closely. The research on causal analysis of statistical data quite sensibly does not try to define it.

Except for everyone following Pearl.

Ah, ok. Applying his definition ("variable *X* is a probabilistic-cause of variable *Y* if *P*(*y*|do(*x*)) != *P*(*y*) for some values *x* and *y*") to the signal generator, it says that the voltage causes the current; in the current-source version, that the current causes the voltage. That's exactly what I would say as well.

Of course, his limitation to acyclic relations excludes from his analysis systems that are only slightly more complicated, such as .

That's what dynamic Bayesian networks are for. The current values of state variables of a system near stable equilibrium are not caused by each other; they are caused by past values. Dynamic Bayesian networks express this distinction with edges that pass forward in time.

The continuous-time limit of a dynamic Bayesian network can be a differential equation such as this.

(ETA) A dynamic Bayesian network is syntactic sugar for an ordinary Bayesian network that has the same structure in each of a series of time slices, with edges from nodes in each time slice to nodes in the next time slice. The Bayesian network that is made by unrolling a dynamic Bayesian network is still completely acyclic. Therefore, Bayesian networks have at least the representation power of finitely iterated systems of explicit recurrence relations and are acyclic, and continuum limits of Bayesian networks have at least the representation power of systems of differential equations and are acyclic. (Some representation powers that these Bayesian networks do not have are the representation powers of systems of implicit recurrence relations, systems of differential algebraic equations without index reduction, and differential games. Something like hybrid Bayesian-Markovian networks would have some of these representation powers, but they would have unphysical semantics (if physics is causal) and would be hard to use safely.)

(Dynamic Bayesian networks at the University of Michigan Chemical Engineering Process Dynamics and Controls Open Textbook ("ControlWiki"))

Is encryption another example, or do you have to take into account the full system including the key?

As with CronoDAS's suggestion of a pseudorandom generator, this can easily yield variables possessing a strong causal connection but no correlation.

Correlations -- product-moment or any other statistical calculation -- are machines to detect relationships between variables that are obscured by passive fog. Random generators and cryptosystems are machines to defeat detection even by an adversary. It is not a surprise that crypto beats correlation.

More surprising is the existence of systems as simple as B = dA/dt which also defeat correlation. The scatter-plot looks like pure fog, yet there are no extraneous noise sources and no adversarial concealment. The relationship between the variables is simply invisible to the statistical tools used in causal analysis.

Whoa, what happened there? Someone made a comment here that made a lot of good points (others had made them, but this poster did it much better), and I voted it up, as did someone else, but now it's gone, and there's no "comment deleted" indication.

Is LW not always giving the most up-to-date site?

My use of the word "cause" was not technically accurate. I needed to find a way to word my comment that didn't use it that way.

(this comment)

Okay ... and now that comment shows up again in "recent comments". Weird.

ETA: Okay, I guess that comment is actually a different one.

*is confused*

A comment that has no follow-ups can be deleted by the author without leaving a "comment deleted" placeholder.


Theorem: In the long run, a bounded, differentiable real function has zero correlation with its first derivative.

I don't understand the theorem. What does "in the long run" mean? Is it that in the limit a, b -> \infty

\frac{1}{b-a}\int_a^b f(x)f'(x)\,dx = \frac{\left(\int_a^b f(x)\,dx\right)\left(\int_a^b f'(y)\,dy\right)}{(b-a)^2} ?

Sorry for the quasi-TEX notation, even the underscore doesn't appear here. Is there any elegant way to write formulae on LW?

Not quite; it's that as a and b go to infinity,

\frac{1}{b-a}\int_a^b f(x)f'(x)\,dx

goes to zero. \int_a^b f(x)f'(x)\,dx = \left[ f(x)^2/2 \right]_a^b, which is bounded, while b-a is unbounded, QED.

LaTeX to Wiki might work, but LaTeX to LW comment doesn't.

\frac{\int_a^b f(x)f'(x)\,dx}{b-a}

Formatting tutorial on the Wiki

I tried, but it didn't work for me. I could make a codecogs URL to exhibit the image in my browser, but it got munged when I tried the ![](...) embedding.

The problem must be in escape character (see the last section of the wiki article). Try copy-pasting the code I gave above in your comment, and notice the placement of backslashes.

The standard form for the correlation coefficient is

cov(x,y) = N(⟨xy⟩ − ⟨x⟩⟨y⟩)

where N is normalisation; it seems that you suppose that ⟨ff′⟩ = 0 and that ⟨f⟩ and ⟨f′⟩ are finite. ⟨ff′⟩ = 0 follows from boundedness, but for the derivative it's not clear. If ⟨f′⟩ on (a,b) grows more rapidly than (b−a), anything can happen.

This cannot happen. f is assumed bounded. Therefore the average of f' over the interval [a,b] tends to zero as the bounds go to infinity.

The precise, complete mathematical statement and proof of the theorem does involve some subtlety of argument (consider what happens if f = sin(exp(x))) but the theorem is correct.

See the description on the Wiki of how to include LaTeX in comments.