Causality: a chapter by chapter review

Vaniver

This is a chapter by chapter review of Causality (2nd ed.) by Judea Pearl (UCLA, blog). Like my previous review, the intention is not to summarize but to help readers determine whether or not they should read the book (and if they do, what parts to read). Reading the review is in no way a substitute for reading the book.

I'll state my basic impression of the book up front, with detailed comments after the chapter discussions: this book is monumentally important to anyone interested in procuring knowledge (especially causal knowledge) from statistical data, but it is a heavily technical book primarily suitable for experts. The mathematics involved is not particularly difficult, but its presentation requires dedicated reading and clarity of thought. Only the epilogue, this lecture, is suitable for the general audience, and that will be the highest value portion for most readers of LW.

1. Introduction to Probabilities, Graphs, and Causal Models

While the descriptions are complete, this chapter may be more useful as a refresher than as an introduction. The three sections are detailed in inverse proportion to the expected reader's familiarity.

For the reader who's seen probability calculus before, Pearl's description of it in 12 pages is short, sweet, and complete. For the reader who hasn't seen it, that's just enough space to list the definitions and give a few examples. Compare Eliezer's explanation of Bayes' Rule (almost 50 pages) to Pearl's (around 2).

The section on graphs moves a little less quickly, but even so don't be afraid to find an online tutorial on d-separation if Pearl's explanation is too fast. For some reason, he does not mention here that section 11.1.2 (p 335-337 in my copy) is a gentler introduction to d-separation. [edit] His blog also linked to this presentation, which is an even gentler introduction to graphs, causal networks, and d-separation.

The section on causal models is the most detailed, as it will be new to most readers, and closely follows the section on graphs. Pearl uses an example to demonstrate the use of counterfactuals, which is a potent first glance at the usefulness of causal models.

He also draws an important distinction between probabilistic, statistical, and causal parameters. Probabilistic parameters are quantities defined in terms of a joint distribution. Statistical parameters are quantities defined in terms of observed variables drawn from a joint distribution. Causal parameters are quantities defined in terms of a causal model, and are not statistical. (I'll leave the explanation of the full implications of the distinctions to the chapter.)

2. A Theory of Inferred Causation

Philosophers have long grappled with the challenge of identifying causal information from data, especially non-experimental data. This chapter details an algorithm to attack that problem.

The key conceptual leap is the use of a third variable in the model as a control. Suppose X and Y are correlated; if there is a third variable Z that is correlated with Y but not with X, the natural interpretation is that X and Z both cause Y. That is not the only interpretation consistent with the data, which causes quite a bit of philosophical trouble; Pearl addresses it with stability, since only one of the multiple consistent interpretations is stable.

Pearl gives the example of a photo of a chair. There are two primary hypotheses: first, that the underlying scenario was a single chair, and second, that the underlying scenario was two chairs, placed so that the first chair hides the second. While both scenarios predict the observed data, the first scenario is not just simpler, but more stable. If the camera position moved slightly, the second chair might not be hidden anymore- and so we should expect two chairs to be visible in most photos of two chair scenarios.

Pearl also calls on Occam's Razor, in the form of preferring candidate models and distributions which cannot be overfit to those which can be overfit. With those two reasonable criteria, we can move from an infinite set of possible causal models that could explain the data to a single equivalency class of causal models which most frugally explain the data.

The chapter describes the algorithm, its functionality, and some implementation details, which I won't discuss here.

Pearl also discusses how to differentiate between potential causes, genuine causes, and spurious associations given the output of the causal inference algorithm.

The chapter concludes with some philosophical discussion of the influence of time and variable choice, as well as defending the three core assumptions (of minimality, the Markovian structure of causal models, and stability).

3. Causal Diagrams and the Identification of Causal Effects

While we can infer causal relationships from data, that task is far easier when we allow ourselves to assume some sensible causal relationships. This step is necessary, and desirable even though making it transparent is sometimes controversial.

Interesting situations are often very complex, and Pearl shows how causal graphs- even ones where not all nodes may be measured- make it possible to navigate the complexity of those situations. The chapter focuses primarily on identifiability- that is, if we fix X to be some value x, can we determine p(y|do(x))? For an arbitrarily large and complex graph, the answer is non-obvious.

The answer is non-obvious enough that there is massive controversy between statisticians and econometricians, which Pearl attempted to defuse (and describes how well that went at the end of the chapter), because there is a subtle difference between observing that X=x and setting X=x. If we see that the price of corn is $1 a bushel, that implies a very different world than one where we set the price of corn at $1 a bushel. In applications where we want to control a system, we're interested in the second- but normal updating based on Bayes' Rule will give us the first. That is, from a statistical perspective, we can always determine the joint probability distribution, and condition on X=x to get p(y|X=x); but from a causal perspective this generally won't give us the information we want. Different causal models can have the same joint probability distribution- and thus look statistically indistinguishable- but give very different results when X is fixed to a particular value.
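A minimal sketch of the observing-versus-setting distinction, under made-up numbers: two hypothetical models, "X->Y" and "Y->X", in which the cause is a fair coin and the effect copies it 90% of the time. The names `COPY`, `joint`, `p_y_given_x`, and `p_y_do_x` are illustrative and do not come from the book.

```python
# Hypothetical numbers: the cause is a fair coin, the effect copies it 90% of the time.
COPY = 0.9

def joint(x, y):
    """P(X=x, Y=y); identical under "X causes Y" and "Y causes X"."""
    return 0.5 * (COPY if x == y else 1 - COPY)

def p_y_given_x(y, x):
    """Observing X=x: ordinary conditioning on the joint distribution."""
    return joint(x, y) / sum(joint(x, v) for v in (0, 1))

def p_y_do_x(y, x, model):
    """Setting X=x: cut whatever normally determines X, then read off Y."""
    if model == "X->Y":
        # Y still listens to X, so forcing X moves Y.
        return COPY if y == x else 1 - COPY
    if model == "Y->X":
        # Y has no causal parent; forcing X leaves Y at its marginal.
        return sum(joint(u, y) for u in (0, 1))
    raise ValueError(model)

print(p_y_given_x(1, 1))        # same for both models
print(p_y_do_x(1, 1, "X->Y"))   # intervention matches observation here
print(p_y_do_x(1, 1, "Y->X"))   # intervention does nothing to Y here
```

Both models give about 0.9 for p(y|X=x), but p(y|do(x)) is about 0.9 in one and 0.5 in the other, even though no amount of observational data could tell the two apart.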

When everything is observable (and thus deterministic), there's no challenge in figuring out what will happen when X is fixed to a particular value. When there is uncertainty- that is, only some variables are observable- then we need to determine if we know enough to still be able to determine the effects of fixing X.

His 'intervention calculus' describes how to fix a variable by modifying the graph, and then what you can get out of the new, modified graph. It takes a necessary detour through the impacts of confounding variables (or, more precisely, what graph structure that represents and how to determine identifiability in the light of that graph structure). This is what lets us describe and calculate p(y|do(x)).

I should comment that I feel bad reviewing a technical book like this; my summary of forty pages of math is half a page long, because I leave all of the math to the book itself, and just describe the motivation for the math.

4. Actions, Plans, and Direct Effects

This chapter begins with a distinction between acts and actions, very similar to the distinction discussed in the previous chapter. He treats acts as events or reactions to stimuli; because they are caused by the environment, they give evidence about the environment. Actions are treated as deliberative- they can't be used as evidence because they haven't happened yet, and are the result of deliberation. They become acts once performed- they're actions from the inside, but acts from the outside. (Compare "How an Algorithm Feels from Inside.") Pearl describes the controversy over Newcomb's Problem as a confusion over the distinction between acts and actions. Evidential Decision Theory, often called EDT, is discussed and dismissed; because it doesn't respect this distinction (or, really, any causal information), it gives nonsensical results. For example, EDT would advise that commuters shouldn't rush to work, because if they did, that would increase the probability that they've overslept.

Pearl gives a brief description of the relationship between influence diagrams, used in decision analysis, and the causal diagrams he describes here; basically, they're very similar, although the ID literature purposefully sidesteps causal implications which are at the forefront here.

Much of the chapter is spent describing the math that determines when an action's or plan's effects are identifiable.

Of particular interest is the section on direct effects, which walks through the famous Berkeley Admissions example of Simpson's Paradox. Pearl presents a modified version in which the school admits students solely based on qualifications, but appears to discriminate on a department-by-department basis, to demonstrate the necessity of using a full causal model, rather than simple adjusting.

5. Causality and Structural Models in Social Science and Economics

This chapter will be much more significant to readers with experience doing economic or social science modeling, but is still worthwhile to other readers as a demonstration of the power of causal graphs as a language.

The part of the chapter that is interesting outside of the context of structural models is the part that discusses testing of models. Every missing link in a causal graph is the strong prediction that those two variables are independent (if properly conditioned). This presents a ready test of a causal graph- compute the covariance for every missing link (after proper conditioning), and confirm that those links are not necessary. As a statistical practice, this significantly aids in the debugging of models because it makes local errors obvious, even when they might be obscured in global error tests.
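As a rough sketch of such a local test, assume a linear-Gaussian chain X -> Z -> Y, so the graph is missing a direct X -> Y link. The simulated data, the sample size, and the helper names `pearson` and `partial_corr` are all assumptions made for illustration; partial correlation stands in for "covariance after proper conditioning" and only works as a test in roughly linear settings.

```python
import math
import random

def pearson(a, b):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def partial_corr(a, b, c):
    """Correlation of a and b after conditioning on a single variable c."""
    r_ab, r_ac, r_bc = pearson(a, b), pearson(a, c), pearson(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Hypothetical chain X -> Z -> Y: the graph has no direct X -> Y link.
rng = random.Random(0)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
z = [xi + rng.gauss(0, 1) for xi in x]
y = [zi + rng.gauss(0, 1) for zi in z]

marginal = pearson(x, y)              # clearly nonzero: X and Y are associated
conditional = partial_corr(x, y, z)   # near zero: the missing link is supported
print(round(marginal, 2), round(conditional, 2))
```

If the conditional value came out far from zero, that would point at this specific missing link as the place where the model is wrong.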

That said, I found it mildly disconcerting that Pearl did not mention there the rationale for using global tests. That is, if there are twenty missing links in your causal diagram, and you collect real data and calculate covariances, on average you should expect the covariance of one missing link to be statistically significantly different from zero (at the usual 5% level) if you're using a local test for each link independently. A global test will look at one statistically significant red flag and ignore it as expected given the number of coefficients.
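The arithmetic behind that worry, as a small sketch. The 5% significance level and the independence of the twenty tests are assumptions, not something the book states.

```python
alpha = 0.05   # assumed per-test significance level
links = 20     # missing links, each tested on its own

# Expected number of links flagged even when the graph is entirely correct.
expected_false_alarms = links * alpha

# Chance that at least one link is flagged, assuming independent tests.
p_at_least_one = 1 - (1 - alpha) ** links

print(expected_false_alarms, round(p_at_least_one, 3))
```

So a correct twenty-link model would raise at least one local red flag roughly two times out of three, which is the case a global test is designed to shrug off.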

In the context of structural models, most of the interesting parts of the chapter deal with determining the identifiability of parameters in the structural models, and then how to interpret those parameters. Pearl's approach is clear, easily understandable, and soundly superior to alternatives that he quotes (primarily to demonstrate his superiority to them).

6. Simpson's Paradox, Confounding, and Collapsibility

This chapter begins by dissolving Simpson's Paradox, which is more precisely called a reversal effect. Pearl gives a simple example: suppose 80 subjects have a disease, and 40 of them take a drug to treat it. 50% (20) of those who take the drug recover, and 40% (16) of those who do not take the drug recover. By itself, this seems to suggest that the drug increases the recovery rate.

The effect is reversed, though, when you take gender into account. Of the men, 30 decided to take the drug- and only 60% (18) of them recovered, compared to 70% (7) of the 10 that decided to not take the drug. Of the women, 10 decided to take the drug- and only 20% (2) of them recovered, compared to 30% (9) of the 30 who did not decide to take the drug.

Depicting the issue causally, the effect is clear: sex impacts both the proportion of subjects who take the drug and the base recovery rate, and the positive impact of sex on recovery is masking the negative impact of the drug on recovery. Simply calculating p(recovery|drug)-p(recovery|~drug) does not tell us if the drug is helpful. The parameter we need for that is p(recovery|do(drug))-p(recovery|do(~drug)). With causal diagrams and a clear conceptual difference between observing and fixing events, that's not a mistake one would make, and so there's no paradox to avoid and nothing interesting to see.
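The two parameters can be computed directly from the counts in the example. This is a sketch that assumes sex is the only confounder, so that averaging the sex-specific recovery rates over the population's sex distribution gives p(recovery|do(drug)); the function names are illustrative.

```python
# (recovered, total) counts from the example, keyed by (sex, took_drug).
counts = {
    ("male", True): (18, 30),
    ("male", False): (7, 10),
    ("female", True): (2, 10),
    ("female", False): (9, 30),
}

def observed_rate(drug):
    """p(recovery | drug): pool both sexes, as naive conditioning does."""
    rec = sum(r for (s, d), (r, n) in counts.items() if d == drug)
    tot = sum(n for (s, d), (r, n) in counts.items() if d == drug)
    return rec / tot

def adjusted_rate(drug):
    """p(recovery | do(drug)), assuming sex is the only confounder."""
    total = sum(n for r, n in counts.values())
    rate = 0.0
    for sex in ("male", "female"):
        p_sex = sum(n for (s, d), (r, n) in counts.items() if s == sex) / total
        rec, n = counts[(sex, drug)]
        rate += (rec / n) * p_sex
    return rate

print(observed_rate(True) - observed_rate(False))   # naive difference
print(adjusted_rate(True) - adjusted_rate(False))   # causal difference
```

The naive difference is roughly +0.1 and the adjusted one roughly -0.1: the same data, read through the causal diagram, say the drug hurts.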

The rest of the chapter discusses confounding, presenting a definition of stable no-confounding between variables and showing why other definitions are less useful or rigorous. For readers who haven't heard of those alternatives before, the comparisons will not be particularly interesting or enlightening (as compared to the previous chapter, where the discussion of structural models seems readily intelligible to someone with little experience with them), though they do provide some insight into the issue of confounding.

7. The Logic of Structure-Based Counterfactuals

Pearl returns to the topic of counterfactuals, briefly introduced before, and gives them a firm mathematical foundation and linguistic interpretation, then makes their usefulness clear. Counterfactuals are the basis of interventions in complex systems- they encode the knowledge of what consequences a particular change would have. They also represent a convenient way to store and test causal information.

This power to predict the consequences of changes is what makes causal models superior to non-causal models. Pearl gives a great example of a basic econometric situation:

q=b₁p+d₁i+u₁

p=b₂q+d₂w+u₂

where q is the quantity demanded, i is the household income, p is the price level, w is the wage rate, and u₁ and u₂ are uncorrelated error terms. The equilibrium level of price and quantity demanded is determined by the feedback between those two equations.

Pearl identifies three quantities of interest:

What is the expected value of the demand Q if the price is controlled at p=p₀?
What is the expected value of the demand Q if the price is reported to be p=p₀?
Given that the price is currently p=p₀, what is the expected value of the demand Q if we were to control the price at p=p₁?

The second is the only quantity available from standard econometric analysis; the causal analysis Pearl describes easily calculates all three quantities. Again, I leave all of the actual math to the book, but this example was vivid enough that I had to reprint it.
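A hedged sketch of how the controlled and counterfactual quantities differ mechanically from the equilibrium. The coefficient values are invented, and the counterfactual here is a simplified single-market version in which the quantity is observed along with the price, so the demand shock u₁ can be recovered exactly; the book's third query conditions on the price alone and needs an expectation over u₁. The observational query (the second one) depends on the error distributions and is left out.

```python
# Hypothetical coefficients; any values with B1 * B2 != 1 would do.
B1, D1 = -0.5, 0.3   # demand: q = B1*p + D1*i + u1
B2, D2 = 0.4, 0.2    # price:  p = B2*q + D2*w + u2

def equilibrium(i, w, u1=0.0, u2=0.0):
    """Solve both equations simultaneously (no intervention)."""
    det = 1 - B1 * B2
    q = (D1 * i + B1 * D2 * w + u1 + B1 * u2) / det
    p = (B2 * D1 * i + D2 * w + B2 * u1 + u2) / det
    return q, p

def demand_do_price(p0, i, u1=0.0):
    """Control the price: the price equation is deleted, demand is kept."""
    return B1 * p0 + D1 * i + u1

def demand_counterfactual(q_obs, p_obs, i, p1):
    """Abduction, action, prediction for one observed market."""
    u1 = q_obs - B1 * p_obs - D1 * i      # abduction: recover the demand shock
    return demand_do_price(p1, i, u1)     # action and prediction

q0, p0 = equilibrium(i=10.0, w=5.0, u1=0.7, u2=-0.2)
print(demand_do_price(2.0, 10.0))
print(demand_counterfactual(q0, p0, i=10.0, p1=2.0))
```

The counterfactual answer differs from the plain intervention because it carries along the shock that this particular market actually experienced.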

The chapter continues with a set of axioms that describe structural counterfactuals, which then allows Pearl to compare structural counterfactuals with formulations attempted by others. Again, for the reader only interested in Pearl's approach, the comparisons are more tedious than enlightening. There are enough possibly non-obvious implications to reward the dedicated reader, and the reader familiar with the object of the comparison will find the comparison far more meaningful, but the hurried reader would be forgiven for skipping a few sections.

The discussion of exogeneity is valuable for all readers, though, as it elucidates a hierarchy between graphical criteria, error-based criteria, and counterfactual criteria. Each of those criteria implies the one that follows it, but the implications do not flow in the reverse direction; another example of how the language of graphs is more powerful than alternative languages.

8. Imperfect Experiments: Bounding Effects and Counterfactuals

This chapter describes how to extract useful information (through bounds) from imperfect experiments. For experiments where all observed variables are binary (and, if they aren't, they can be binarized through partitioning), but unobserved variables are free to be monstrously complicated, that complexity can be partitioned into four classes of responses, to match the four possible functional forms between binary variables.

Pearl uses the example of medical drug testing- patients are encouraged to take the drug (experimental group) or not (control group), but compliance may be imperfect, as patients may not take medication given to them or patients not given medication may procure it by other means. Patients can be classed as either never taking the drug, complying with instructions, defying instructions, or always taking the drug. Similarly, the drug's effect on patient recovery can be classified as never recovering, helping, hurting, or always recovering. The two could obviously be related, and so the full joint distribution has 15 degrees of freedom- but we can pin down enough of those degrees of freedom with the observations that we make (of encouragement, treatment, and then recovery) to establish an upper and lower bound for the effect that the treatment has on recovery.
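The bookkeeping behind the four-by-four partition can be sketched as follows. The encoding of each type as a pair (output when the input is 0, output when the input is 1) and the labels are choices made here for illustration, not Pearl's notation.

```python
from itertools import product

# Each type is a function from a binary input to a binary output,
# written as (output when input is 0, output when input is 1).
COMPLIANCE = {(0, 0): "never-taker", (0, 1): "complier",
              (1, 0): "defier", (1, 1): "always-taker"}
RESPONSE = {(0, 0): "never-recover", (0, 1): "helped",
            (1, 0): "hurt", (1, 1): "always-recover"}

def outcome(z, compliance, response):
    """Treatment taken and recovery for assignment z, given a patient's types."""
    x = compliance[z]   # does the patient take the drug when assigned z?
    y = response[x]     # does the patient recover given that treatment?
    return x, y

patient_types = list(product(COMPLIANCE, RESPONSE))
degrees_of_freedom = len(patient_types) - 1   # probabilities must sum to 1
print(len(patient_types), degrees_of_freedom)  # 16 15
```

However complicated the unobserved background factors are, all they can do is place a patient in one of these sixteen cells, which is what makes the bounding problem finite.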

The examples in this chapter are much more detailed and numerical; it also includes a section on Bayesian estimation of the parameters as a complement to or substitute for bounding.

9. Probability of Causation: Interpretation and Identification

This chapter defines and differentiates between three types of causation: necessary causes, sufficient causes, and necessary and sufficient causes. When starting a fire, oxygen is a necessary cause, but not a sufficient cause (given the lack of spontaneous combustion). Striking a match is both a necessary cause and a sufficient cause, as the fire would not occur without the match and striking a match is likely to start a fire. These intuitive terms are given formal mathematical definitions using counterfactuals, and much of the chapter is devoted to determining when those counterfactuals can be uniquely measured (i.e. when they're identifiable). Simply knowing the joint probability distribution is insufficient, but is sufficient to establish lower and upper bounds for those quantities. In the presence of certain assumptions or causal graphs, those quantities are identifiable.
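As an illustration of what such bounds look like, here is a sketch for the probability that a cause is both necessary and sufficient (PNS). It assumes exogeneity (no confounding between cause and effect), in which case the bounds can be written in terms of the two conditional probabilities alone; the function name and the example numbers are hypothetical.

```python
def pns_bounds(p_y_given_x, p_y_given_not_x):
    """Bounds on the probability of necessity and sufficiency.

    Assumes exogeneity, so p(y | x) and p(y | not x) equal the
    corresponding interventional probabilities.
    """
    lower = max(0.0, p_y_given_x - p_y_given_not_x)
    upper = min(p_y_given_x, 1.0 - p_y_given_not_x)
    return lower, upper

# Hypothetical data: the effect follows the cause 90% of the time,
# and appears without it 20% of the time.
print(pns_bounds(0.9, 0.2))
```

With those numbers the bounds are roughly 0.7 and 0.8; extra assumptions are what collapse the interval to a single identifiable value.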

10. The Actual Cause

This chapter provides a formal definition of the concept of an “actual cause,” useful primarily for determining legal liability. For other contexts, the concept of a sufficient cause may be more natural. Pearl introduces the concepts of "sustenance," which is (informally) that a variable's current setting is enough to cause the outcome, regardless of other configurations of the system, and "causal beams," which are structures used to determine sustenance from causal graphs. The chapter provides more examples of causal diagrams, and a bit more intuition about the various kinds of causation, but is primarily useful for retrospective rather than predictive analysis.

11. Reflections, Elaborations, and Discussions with Readers

This chapter bounces from topic to topic, and (perhaps unsurprisingly, given the title) elaborates on many sections of the book. It may be worthwhile to read through chapter 11 in parallel with the rest of the book, as many responses to letters are Pearl clearing up a (presumably common) confusion with a concept. Indeed, the section numbers of this chapter match the chapter numbers of the rest of the book, and so 11.3 is a companion to 3.

The first response, 11.1.1 is worth reading in full. One paragraph in particular stands out:

These considerations imply that the slogan "correlation does not imply causation" can be translated into a useful principle: behind every causal conclusion there must lie some causal assumption that is not discernible from the distribution function.

Epilogue. The Art and Science of Cause and Effect

The book concludes with a public lecture given in 1996. The lecture swiftly introduces concepts and their informal relationships, as well as some of the historical context of the scientific understanding of causality. The lecture moves swiftly but focuses on the narrative and motivation over the mathematics.

The preface to the second edition states that Pearl's "main audience is the students" but the book is actually well-suited to be a reference text for experts. There are no exercises, and axioms, theorems, and definitions outweigh the examples. (As a side note, if you find yourself stating that the proof of a theorem can be found in a paper you reference, you are not targeting introductory statistics students.)

I would recommend reading the lecture first, then the rest of the book (reading the corresponding section of chapter 11 after each chapter), then the lecture again. The first reading of the lecture will motivate many of the concepts involved in the book, then the book will formalize those concepts, and then a second reading of the epilogue will be useful as a comparative exercise. Indeed, the lay reader is likely to find the lecture engaging and informative and the rest of the book impenetrable, so they should read only the lecture (again, it's available online). When finding links for this post, I discovered that the two most helpful Amazon reviews also suggested reading the epilogue first.

There are many sections of the book that compare Pearl's approach to other approaches. For readers familiar with those other approaches, I imagine those sections are well worth reading as they provide a clearer picture of what Pearl's approach actually is, why it's necessary, and also make misunderstandings less likely. For the reader who is not familiar with those other approaches, reading the comparisons will sometimes provide deeper intuition, but often just provides historical context. For the reader who has already bought into Pearl's approach, this can get frustrating- particularly when his treatment of alternatives grows combative.

Chapter 5 is where this becomes significantly noticeable, although I found the comparisons in that chapter helpful; they seemed directly informative about Pearl's approach. In subsequent chapters, though, the convinced reader may skip entire sections with little loss. Unfortunately, the separation is not always clean. For example, in section 6.1: 6.1.1 is definitely worth reading, 6.1.2 probably not, but 6.1.3 is a mixture of relevant and irrelevant; figure 6.2 (around a third of the way through that section) is helpful for understanding the causal graph approach, but is introduced solely to poke holes in competing approaches! 6.1.4 begins comparatively, with the first bit repeating that other approaches have problems with this situation, but then rapidly shifts to the mechanics of navigating the situation where other approaches founder.

In my previous review of Thinking and Deciding, it seemed natural to recommend different sections to different readers, as the book served many purposes. Here, the mathematical development builds upon itself, so attempting to read chapter 4 without reading chapter 3 seems like a bad idea. Later chapters may be irrelevant to some readers- chapters 9 and 10 are primarily useful for making retrospective and not predictive statements, though they still provide some intuition about and experience with manipulating causal graphs.

All in all, the book seems deeply important. Causal graphs, interventions, and counterfactuals are all very significant concepts, and the book serves well as a reference for them but perhaps not as an introduction to them. It is probably best at explaining counterfactuals, both what they are and why they are powerful, but I would feel far more confident recommending a less defensive volume which focused on the motivations, basics, and practice for those concepts, rather than their mathematical and theoretical underpinnings.

On a more parochial note, much of the more recent work referenced in the book was done by one of Pearl's former graduate students whose name LWers may recognize, and a question by EY prompts an example in 11.3.7.

The first edition of the book is online for free here.

Many thanks to majus, who lent me his copy of Causality, without which this review would have occurred much later.

[-]Curiouskid12y70

So what are the pre-requisites for reading the book and not the lecture?

[-]IlyaShpitser12y50

Any motivated undergrad in a technical field has enough background. If you want to read other folks in the field, they tend to be more statistical than Pearl (and thus in addition it's good to have basic stat). It helps to have taken at least one class with proofs in it.

[-]Vaniver12y40

As in, you've read the lecture and are now interested in tackling the book, or you're just trying to get a hold on what the book requires?

If you understand probability calculus and have used mathematical graphs before, I think you could make it with sufficient dedication and patience. Familiarity with statistics will make much of the book more meaningful. The concepts in this book are mostly orthogonal to other mathematical formalisms (like standard statistical ones), but are in many ways more intuitive than those formalisms, because people like thinking in terms of cause and effect. (The slogan "correlation doesn't imply causation!" exists because people want correlation to imply causation.)

I think a class on machine learning or Bayesian networks may be a gentler introduction to several of the core concepts- but the benefit of the formal approach taken by this book is that it's very good at dissolving confusions, which more practical classes may not focus on as much.

[-]Curiouskid12y50

Sorry, my original comment was worded poorly. I don't have to read the lecture to know I'm interested in reading the whole book. I'm just trying to figure out how I can go through the book without getting frustrated and confused. I suppose I'll just try to read the book and ask questions and look stuff up when I need to. It'll probably be less enjoyable, but more useful to get used to thinking that way.

I took AP stats in high school which was pretty boring and repetitive except for when we talked about Simpson's paradox briefly.

I think this course on Coursera would be good to take concurrently if anybody else is interested: (https://www.coursera.org/course/pgm). It's taught by the founder of Coursera and uses her textbook (which has been recommended before). In fact, if anybody's interested, I'd love to take this class with other LWers.

[-]Matt_Simpson12y30

Vaniver's post finally convinced me to get off my ass and take a look at the book - mainly because of the mention of the controversy with econometricians (I'm a Stat/Econ joint Ph.D. student... so it caught my eye). I've glanced through the first few chapters of the first edition (the digital copy linked above) and through chapter 5 of the second edition (yay libraries!).

I'd recommend at least having taken a course in probability & mathematical statistics - this course might be called different things in different places, but it's essentially a course in probability theory that uses some multivariate calculus, and might go on to a more rigorous introduction to statistics (that might be a second semester course). I'm not sure how much of the calculus is necessary for Pearl, but the more rigorous treatment of probability theory will be helpful in addition to being accustomed to a higher level of rigor in general. Unless you're a savant of course, then you might be fine anyway.

[-]Ronny Fernandez12y40

I have a question: is d-separation implied by the Kolmogorov axioms?

I've proven that it is in some cases:

Premises:

1)A = A|B :. A|BC ≤ A|C
2)C < C|A
3)C < C|B
4) C|AB < C

proof starts:
1)B|C > B {via premise 3
2)A|BC = A B C|AB / (C B|C) {via premise 1
3)A|BC C = A B C|AB / B|C
4)A|BC C / A = B C|AB / B|C
5)B C|AB / B|C < C|AB {via line 1
6)B C|AB / B|C < C {via line 5 and premise 4
7)A|BC C / A < C {via lines 6 and 4
8)A|C = A C|A / C
9)A|C C = A C|A
10)A|C C / A = C|A
11)C < A|C C / A {via line 10 and premise 2
12)A|BC C / A < A|C C / A {via lines 11 and 7
13)A|BC < A|C
Q.E.D.

1) A = A|B :. A|BC ≤ A|C
2) C < C|A
3) C < C|B
4) C|AB = C

proof starts:

1)A|C = A C|A / C
2)A|BC = A B C / (B C|B) {via premises 1 and 4
3)A|BC = A C / C|B
4)A C < A C|A {via premise 2
5)A C / C|B < A * C|A / C {via line 4 and premise 3
6)A|BC < A|C {via lines 1, 3, and 5
Q.E.D.

If it is implied by classical probability theory, could someone please refer me to a proof?

[-]IlyaShpitser12y30

I don't understand your question, or your notation.

d-separation is just a way of talking about separating sets of vertices in a graph by "blocking" paths. It can't be implied by anything because it is not a statement in a logical language. For "certain" graph/joint distribution pairs, if a d-separation statement holds in the graph, then a corresponding conditional independence statement holds in the joint distribution. This is a statement, and it is proven in Verma and Pearl 1988, as paper-machine below says. Is that the statement you mean? There are lots of interesting true and hard to prove statements one could make involving d-separation.

I guess from a model theorist point of view, it's a proof in ZF, but it's high level and "elementary" by model theory standards.

[-]Ronny Fernandez12y00

Looking it over, I could have been much clearer (sorry). Specifically, I want to know: given a DAG of the form:

A -> C <- B

Is it true that (in all prior joint distributions where A is independent of B, but A is evidence of C, and B is evidence of C) A is non-independent of B, given C is held constant?

I proved that when A & B is evidence against C, this is so, and also when A & B are independent of C, this is so, the only case I am missing is when A & B is evidence for C.

It's clear enough to me that when you have one non-colliding path between any two variables, they must not be independent; and that if we were to hold any of the variables along that path constant, those variables would be independent. This can all be shown given standard probability theory and correlation alone. It can also be shown that if there are only colliding paths between two variables, those two variables are independent. If I have understood the theory of d-separation correctly, if we hold the collision variable (assuming there is only one) on one of these paths constant, the two variables should become non-independent (either evidence for or against one another). I have proven that this is so in two of the (at least) three cases that fit the given DAG using standard probability theory.

Those are the proofs I gave above.
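For concreteness, here is one small instance of the collider A -> C <- B, worked by brute-force enumeration. The distribution is an assumption chosen for illustration (A and B are independent fair coins and C = A or B); it shows one case of the pattern, not a proof of the general claim.

```python
from fractions import Fraction
from itertools import product

# Hypothetical collider A -> C <- B: independent fair coins, C = A or B.
worlds = [(a, b, int(a or b)) for a, b in product((0, 1), repeat=2)]
weight = Fraction(1, 4)  # each (a, b) pair is equally likely

def prob(event, given=lambda w: True):
    """Conditional probability of event given a condition, by enumeration."""
    num = sum(weight for w in worlds if given(w) and event(w))
    den = sum(weight for w in worlds if given(w))
    return num / den

a1 = lambda w: w[0] == 1
p_a = prob(a1)
p_a_given_b = prob(a1, lambda w: w[1] == 1)
p_a_given_c = prob(a1, lambda w: w[2] == 1)
p_a_given_bc = prob(a1, lambda w: w[1] == 1 and w[2] == 1)
print(p_a, p_a_given_b, p_a_given_c, p_a_given_bc)  # 1/2 1/2 2/3 1/2
```

A and B are independent marginally (1/2 either way), but once C is held fixed, learning B lowers the probability of A from 2/3 to 1/2.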

[-]IlyaShpitser12y10

No, but I think it's true if A,B,C are binary. In general, if a distribution p is Markov relative to a graph G, then if something is d-separated in G, then there is a corresponding independence in p. But, importantly, the implication does not always go the other way. Distributions in which the implication always goes the other way are very special and are called faithful.

[-]Ronny Fernandez9y00

What is Markov relative?

[-]Anders_H9y00

"Markov" is used in the standard memoryless sense. By definition, the graph G represents any distribution p where each variable on the graph is independent of its past given its parents. This is the Markov property.

Ilya is discussing probability distributions p that may or may not be represented by graph G. If every variable in p is independent of its past given its parents in G, then you can use d-separation in G to reason about independences in p.

[-][anonymous]12y00

Theorem 1.2.4: If sets X and Y are d-separated by Z, in a DAG G, then X is independent of Y conditional on Z in every distribution compatible with G....

Pearl's textbook cites Verma and Pearl, 1988, but I don't have access to it.
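The graph side of that theorem can be checked mechanically. The sketch below does not follow paths the way Pearl's definition does; it uses an equivalent test from the graphical-models literature (restrict to the ancestors of X, Y, and Z, "marry" co-parents, drop directions, then check ordinary separation). The representation of a DAG as a dict from each node to its parents, and the function names, are assumptions made here.

```python
def ancestors(parents, nodes):
    """The given nodes together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, xs, ys, zs):
    """True if zs d-separates xs from ys in the DAG described by parents."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(parents, xs | ys | zs)
    # Moralize: link each child to its parents and the parents to each other.
    nbrs = {n: set() for n in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
            nbrs[p].update(q for q in ps if q != p)
    # Ordinary graph search from xs that is not allowed to pass through zs.
    frontier = [x for x in xs if x not in zs]
    seen = set(frontier)
    while frontier:
        n = frontier.pop()
        if n in ys:
            return False
        for m in nbrs[n] - zs - seen:
            seen.add(m)
            frontier.append(m)
    return True

chain = {"B": ["A"], "C": ["B"]}          # A -> B -> C
collider = {"C": ["A", "B"], "D": ["C"]}  # A -> C <- B, C -> D
print(d_separated(chain, {"A"}, {"C"}, {"B"}))     # True
print(d_separated(collider, {"A"}, {"B"}, set()))  # True
print(d_separated(collider, {"A"}, {"B"}, {"D"}))  # False
```

The last line shows that conditioning on a descendant of a collider opens the path just as conditioning on the collider itself does.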

[-]pragmatist12y20

Thanks for the post! Do you (or does anybody else) know how different the second edition is from the first edition? Is there a significant amount of important new material? I've been intending to read the book (and your review has strengthened the intention) but my library only has the first edition. Should I hold out for the newer edition or will it not make much of a difference?

[-]Vaniver12y60

In general, imperfect action is better than inaction. From a brief glance at the table of contents in the first edition, it looks like they're fairly similar- the second edition has an extra chapter (of which a few sections are great and several are skippable). Many of the sections have extensions from work done after 2000, but the basic concepts underwent little change. I would imagine that the second edition has fewer errors, but without a detailed reading I can't tell how significant that is.

The main trouble you could get into with reading just the first edition that I can think of is a few of the definitions which underwent significant revision to exclude some corner cases where they misbehaved.

I think I would ask the library if they can order the second edition while starting to read the first edition.
