Applied Causal Inference for Observational Research
This sequence is an introduction to basic causal inference. It was originally written as auxiliary notes for a course in Epidemiology, but it is relevant to almost any kind of applied statistical research, including econometrics, sociology, psychology, political science etc. I would not be surprised if you guys find a lot of errors, and I would be very grateful if you point them out in the comments. This will help me improve my course notes and potentially help me improve my understanding of the material.
For mathematically inclined readers, I recommend skipping this sequence and instead reading Pearl's book on Causality. There is also a lot of good material on causal graphs on Less Wrong itself. Also, note that my thesis advisor is writing a book that covers the same material in more detail; the first two parts are available for free at his website.
Pearl's book, Miguel's book and Eliezer's writings are all more rigorous and precise than my sequence. This is partly because I have a different goal: Pearl and Eliezer are writing for mathematicians and theorists who may be interested in contributing to the theory. Instead, I am writing for consumers of science who want to understand correlation studies from the perspective of a more rigorous epistemology.
I will use Epidemiological/Counterfactual notation rather than Pearl's notation. I apologize if this is confusing. The two approaches refer to the same mathematical objects; only the notation differs. Whereas Pearl would use the "Do-Operator" E[Y|do(a)], I use counterfactual variables E[Ya]. Instead of using Pearl's "Do-Calculus" for identification, I use Robins' G-Formula, which will give the same results.
For all applications, I will use the letter "A" to represent "treatment" or "exposure" (the thing we want to estimate the effect of), Y to represent the outcome, L to represent any measured confounders, and U to represent any unmeasured confounders.
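To make the notation concrete, here is a minimal sketch of what the G-Formula does in the simplest case: a single binary treatment A, a single binary measured confounder L, and no unmeasured confounder U. Under those assumptions, E[Ya] can be computed by standardization, as the sum over l of E[Y | A=a, L=l] × P(L=l). Everything in the block is hypothetical and chosen for illustration only: the simulated data-generating process, the probabilities, and the function names are not taken from the course notes. By construction, the true effect of A on the risk of Y is 0.1, so the standardized estimate should land close to 0.1, while the crude comparison is pulled away from it by confounding.

```python
import random


def simulate(n, seed=0):
    """Simulate rows of (L, A, Y). Hypothetical data-generating process."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # L: measured binary confounder
        l = 1 if rng.random() < 0.4 else 0
        # A: treatment, more likely when L = 1 (this creates confounding)
        a = 1 if rng.random() < (0.8 if l else 0.2) else 0
        # Y: outcome; A raises the risk by 0.1, L raises it by 0.3
        y = 1 if rng.random() < 0.1 + 0.1 * a + 0.3 * l else 0
        rows.append((l, a, y))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def crude_difference(rows):
    """E[Y | A=1] - E[Y | A=0], ignoring L entirely."""
    treated = mean(y for (l, a, y) in rows if a == 1)
    untreated = mean(y for (l, a, y) in rows if a == 0)
    return treated - untreated


def g_formula_mean(rows, a):
    """Standardized mean: sum over l of E[Y | A=a, L=l] * P(L=l)."""
    n = len(rows)
    total = 0.0
    for l in (0, 1):
        p_l = sum(1 for row in rows if row[0] == l) / n
        stratum_mean = mean(y for (li, ai, y) in rows if li == l and ai == a)
        total += stratum_mean * p_l
    return total


rows = simulate(200_000)
crude = crude_difference(rows)
standardized = g_formula_mean(rows, 1) - g_formula_mean(rows, 0)
print("crude difference:       ", round(crude, 3))
print("standardized difference:", round(standardized, 3))
```

The standardized difference only has a causal interpretation here because the simulation guarantees that L is the only confounder. In real data that is an assumption, not something the calculation can verify.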
Outline of Sequence:
I hope to publish one post every week. I have rough drafts for the following eight sections, and will keep updating this outline with links as the sequence develops:
Part 0: Sequence Announcement / Introduction (This post)
Part 1: Basic Terminology and the Assumptions of Causal Inference
Part 2: Graphical Models
Part 3: Using Causal Graphs to Understand Bias
Part 4: Time-Dependent Exposures
Part 5: The G-Formula
Part 6: Inverse Probability Weighting
Part 7: G-Estimation of Structural Nested Models and Instrumental Variables
Part 8: Single World Intervention Graphs, Cross-World Counterfactuals and Mediation Analysis
Introduction: Why Causal Inference?
The goal of applied statistical research is almost always to learn about causal effects. However, causal inference from observational data is hard, to the extent that it is usually not even possible without strong, almost heroic assumptions. Because of the inherent difficulty of the task, many old-school investigators were trained to avoid making causal claims. Words like “cause” and “effect” were banished from polite company, and the slogan “correlation does not imply causation” became an article of faith which, when said loudly enough, seemingly absolved the investigators from the sin of making causal claims.
However, readers were not fooled: They always understood that epidemiologic papers were making causal claims. Of course they were making causal claims; why else would anybody be interested in a paper about the correlation between two variables? For example, why would anybody want to know about the correlation between eating nuts and longevity, unless they were wondering if eating nuts would cause them to live longer?
When readers interpreted these papers causally, were they simply ignoring the caveats, drawing conclusions that were not intended by the authors? Of course they weren’t. The discussion sections of epidemiologic articles are full of “policy implications” and speculations about biological pathways that are completely contingent on interpreting the findings causally. Quite clearly, no matter how hard the investigators tried to deny it, they were making causal claims. However, they were using methodology that was not designed for causal questions, and did not have a clear language for reasoning about where the uncertainty about causal claims comes from.
This was not sustainable, and inevitably led to a crisis of confidence, which culminated when some high-profile randomized trials showed completely different results from the preceding observational studies. In one particular case, when the Women’s Health Initiative trial showed that post-menopausal hormone replacement therapy increases the risk of cardiovascular disease, the difference was so dramatic that many thought-leaders in clinical medicine completely abandoned the idea of inferring causal relationships from observational data.
It is important to recognize that the problem was not that the results were wrong. The problem was that there was uncertainty that was not taken seriously by the investigators. A rational person who wants to learn about the world will be willing to accept that studies have margins of error, but only as long as the investigators make a good-faith effort to examine what the sources of error are, and communicate clearly about this uncertainty to their readers. Old-school epidemiology failed at this. We are not going to make the same mistake. Instead, we are going to develop a clear, precise language for reasoning about uncertainty and bias.
In this context, we are going to talk about two sources of uncertainty – “statistical” uncertainty and “epidemiological” uncertainty.
We are going to use the word “Statistics” to refer to the theory of how we can learn about correlations from limited samples. For statisticians, the primary source of uncertainty is sampling variability. Statisticians are very good at accounting for this type of uncertainty: Concepts such as “standard errors”, “p-values” and “confidence intervals” are all attempts at quantifying and communicating the extent of uncertainty that results from sampling variability.
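As a small illustration of the "statistical" kind of uncertainty, the sketch below computes a standard error and a Wald-type 95% confidence interval for a difference in risk between two groups, at two sample sizes. The group risks (0.30 and 0.20), the sample sizes, and the function names are all hypothetical and chosen only for illustration. Note what the interval does and does not describe: it quantifies sampling variability in the observed association, and says nothing about whether that association reflects a causal effect.

```python
import math
import random


def draw_sample(n, risk, rng):
    """Draw n binary outcomes with the given risk (hypothetical data)."""
    return [1 if rng.random() < risk else 0 for _ in range(n)]


def risk_difference_ci(y_exposed, y_unexposed, z=1.96):
    """Return (difference, standard error, (lower, upper)) using a Wald interval."""
    n1, n0 = len(y_exposed), len(y_unexposed)
    p1 = sum(y_exposed) / n1
    p0 = sum(y_unexposed) / n0
    diff = p1 - p0
    # Standard error of a difference between two independent proportions
    se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    return diff, se, (diff - z * se, diff + z * se)


rng = random.Random(1)
small = risk_difference_ci(draw_sample(100, 0.30, rng), draw_sample(100, 0.20, rng))
large = risk_difference_ci(draw_sample(10_000, 0.30, rng), draw_sample(10_000, 0.20, rng))

for label, (diff, se, (lower, upper)) in (("n = 100 per group", small), ("n = 10,000 per group", large)):
    print(f"{label}: difference = {diff:.3f}, SE = {se:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

A larger sample shrinks the standard error and narrows the interval, but it does nothing to remove bias from confounding or selection.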
The old school of epidemiology would tell you to stop after you had found the correlations and accounted for the sampling variability. They believed going further was impossible. However, correlations are simply not interesting. If you truly believed that correlations tell you nothing about causation, there would be no point in doing the study.
Therefore, we are going to use the terms “Epidemiology” or “Causal Inference” to refer to the next stage in the process: Learning about causation from correlations. This is a much harder problem, with many additional sources of uncertainty, including confounding and selection bias. However, recognizing that the problem is hard does not mean that you shouldn't try; it just means that you have to be careful. As we will see, it is possible to reason rigorously about whether correlation really does imply causation in your particular study: You will just need a precise language. The goal of this sequence is simply to give you such a language.
In order to teach you the logic of this language, we are going to make several controversial statements such as “The only way to estimate a causal effect is to run a randomized controlled trial”. You may not be willing to believe this at first, but in order to understand the logic of causal inference, it is necessary that you are at least willing to suspend your disbelief and accept it as true within the course.
It is important to note that we are not just saying this to try to convince you to give up on observational studies in favor of randomized controlled trials. We are making this point because understanding it is necessary in order to appreciate what it means to control for confounding: It is not possible to give a coherent meaning to the word “confounding” unless one is trying to determine whether it is reasonable to model the data as if it came from a complex randomized trial run by nature.
When we say that causal inference is hard, we do not mean that it is difficult to learn the basic concepts of the theory. What we mean is that even if you fully understand everything that has ever been written about causal inference, it is going to be very hard to infer a causal relationship from observational data, and there will always be uncertainty about the results. This is why this sequence is not going to be a workshop that teaches you how to apply magic causal methodology. What we are interested in is developing your ability to reason honestly about where uncertainty and bias come from, so that you can communicate this to the readers of your studies. What we want to teach you is the epistemology that underlies epidemiological and statistical research with observational data.
While insisting on only using randomized trials may seem attractive to a purist, it does not take much imagination to see that there are situations where it is important to predict the consequences of an action, but where it is not possible to run a trial. In such situations, there may be Bayesian evidence to be found in nature. This evidence comes in the form of correlations in observational data. When we are stuck with this type of evidence, it is important that we have a clear framework for assessing the strength of the evidence.
I am publishing Part 1 of the sequence at the same time as this introduction. I would be very interested in hearing feedback, particularly about whether people feel this has already been covered in sufficient detail on Less Wrong. If there is no demand, there won't really be any point in transforming the rest of my course notes to a Less Wrong format.
Thanks to everyone who had a look at this before I published, including paper-machine and Vika, Janos, Eloise and Sam from the Boston Meetup group.