Epistemic Status: Entirely ripped from Chapter 3 of “Causal Inference: The Mixtape” by Scott Cunningham, which is graciously provided for free here and is generally a very good textbook. I hope that by writing this I provide a slightly shorter and easier-to-read explanation of the highlight (imo) of this chapter. All mistakes are my own and not the textbook’s.
I. Simple DAG
Today I woke up and decided I wanted to explain backdoor paths in an intuitive way both to improve my own understanding and because I think it's a useful idea for rationalists working with causation (which is everyone all the time?). Also I find it really interesting, especially the second half.
To explain backdoor paths, we are going to use Directed Acyclic Graphs, or DAGs for short. These sound really scary but they are actually just maps of causality.
A simple DAG looks like the one below.
The letters in a DAG represent the variables we are studying: in this case, A, B, and C. The arrows represent causal paths running in the direction they point. So here, changes in A cause changes in both B and C. The absence of an arrow between two variables means there is no causal pathway between them, so changing B will not change C or vice versa. The arrows also never run backwards, so changing either B or C exogenously will not change A.
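If it helps to see the same thing as data rather than a picture, here is a tiny sketch (mine, not the textbook's) of that DAG written as a plain Python mapping from each variable to the variables it directly causes.

```python
# The simple DAG above as an adjacency mapping: each key lists the variables
# it has a direct causal arrow into.
simple_dag = {
    "A": ["B", "C"],  # A causes both B and C
    "B": [],          # B has no arrows out: changing B moves neither A nor C
    "C": [],          # likewise for C
}
```

The absence of any entry pointing between B and C, or back into A, is exactly what the missing arrows in the picture encode.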
Because we are only explaining backdoor paths this is all you need to know for now. If you want a better explanation of DAGs you can check out the textbook here. Now let’s get started.
II. Two Confounders
Imagine we suspect that changes in A cause changes in B but we want to estimate the exact relationship. We could maybe regress B onto A to find this relationship. We could even do a bunch of functional form specification stuff if we suspect the relationship is nonlinear. Imagine we do this and get a simple coefficient of 0.7. That is, a one-unit increase in A leads to a 0.7-unit increase in B. Let’s write that like this.
But now imagine that our model was wrong and there was actually a third variable C that was affected by A and in turn affected B. Let’s say that the true relationship between the 3 variables is as follows.
Now, if we regressed B onto both A and C, we would get a coefficient of 1 for A and a coefficient of -0.3 for C, the true relationship. But regressing B onto A alone gives us a coefficient of 0.7 for A, because a one-unit increase in A causes a one-unit increase in B, but it also causes a one-unit increase in C, which decreases B by 0.3, netting us 0.7. So our estimate of the true relationship between A and B is biased by omitting C. This is Omitted Variable Bias.
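To make this concrete, here is a minimal simulation sketch (my own, not the textbook's, using the coefficients from the example: A -> B of 1, A -> C of 1, and C -> B of -0.3). Regressing B on A alone lands near 0.7, while also controlling for C recovers the true coefficients.

```python
# Omitted variable bias: A -> B = 1, A -> C = 1, C -> B = -0.3.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

A = rng.normal(size=n)
C = 1.0 * A + rng.normal(size=n)             # A causes C
B = 1.0 * A - 0.3 * C + rng.normal(size=n)   # A and C both cause B

def ols(y, *regressors):
    """OLS coefficients of y on the regressors (intercept dropped from output)."""
    X = np.column_stack([np.ones_like(y), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

print(ols(B, A))     # ~[0.7]: omitting C biases the estimate
print(ols(B, A, C))  # ~[1.0, -0.3]: the true relationships
```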
This still happens even if the causality between A and C runs in the other direction, as shown below.
The reasoning here is less intuitive, but it goes like this. To measure the causal effect of changes in A on changes in B, we need A to, well, change. So A needs to take on different values. Part of the variation in A will be due to factors other than C, but some of it will be due to changes in C. That means the changes in A that we use to estimate the coefficient when we regress B onto A are really partly just changes in C. And C, in reality, is also changing B at the same time, biasing our estimate of the relationship between A and B. So in the example above, to the extent that the one-unit increases in A (which by themselves cause one-unit increases in B) are driven by one-unit increases in C, which also cause 0.3-unit decreases in B, our biased estimate of the effect of A on B comes out around 0.7.
As a classic example, swimming pool deaths and ice cream consumption are correlated in real life, but both are mainly caused by the heat of summertime. Variation in ice cream consumption is mainly driven by variation in temperature, so most of the variation in ice cream consumption we could use to test for an effect on swimming pool deaths has temperature lurking behind it, and it is the temperature that is causing people to die in swimming pools.
When either of these things occurs, we say that A has a “backdoor path” to B. Controlling for C slams this backdoor closed, because we isolate the variation in A that is not caused by C. Yay! We can now determine the true relationship between A and B.
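The same kind of sketch works for the reversed-arrow case (again mine, with assumed coefficients: C -> A of 1, C -> B of -0.3, A -> B of 1, and most of A's variation coming from C). The naive regression of B on A is pulled toward 0.7, and controlling for C closes the backdoor and recovers the true effect of 1.

```python
# A backdoor path through the confounder C: C -> A = 1, C -> B = -0.3, A -> B = 1.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

C = rng.normal(size=n)
A = 1.0 * C + 0.1 * rng.normal(size=n)       # C drives most of the variation in A
B = 1.0 * A - 0.3 * C + rng.normal(size=n)   # both A and C cause B

def ols(y, *regressors):
    X = np.column_stack([np.ones_like(y), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

print(ols(B, A))     # ~[0.7]: the backdoor through C biases the estimate
print(ols(B, A, C))  # ~[1.0, -0.3]: controlling for C closes the backdoor
```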
III. Longer Paths
This works the same way even if the backdoor path from A to B is longer, except that we have more options for what to control for. Notice that the example above is probably better described with four variables instead of three: C (temperature) causes A (ice cream consumption) and D (people going into swimming pools), and D then causes B (people dying in swimming pools). A has no effect on B.
(If A didn’t have any effect on B we generally would represent this as having no arrow between A and B, but this graph is just showing a general example of an extended backdoor path).
But now we can control for either C or D. Controlling for people swimming more often (D) would isolate the variation in ice cream consumption that is not correlated with people swimming more often, allowing us to determine the real effect of ice cream consumption on swimming pool deaths.
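Here is a sketch of this longer path (coefficients of 1 on every arrow are my assumption, and A has no effect on B at all). Regressing B on A alone produces a spurious positive coefficient, and controlling for either C or D drives the coefficient on A back to zero.

```python
# Longer backdoor path: C -> A, C -> D, D -> B, and no effect of A on B.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

C = rng.normal(size=n)                   # temperature
A = 1.0 * C + rng.normal(size=n)         # ice cream consumption
D = 1.0 * C + rng.normal(size=n)         # people going into swimming pools
B = 1.0 * D + rng.normal(size=n)         # swimming pool deaths

def ols(y, *regressors):
    X = np.column_stack([np.ones_like(y), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

print(ols(B, A))     # ~[0.5]: spurious, since A has no causal effect on B
print(ols(B, A, C))  # coefficient on A ~0: controlling for C closes the backdoor
print(ols(B, A, D))  # coefficient on A ~0: controlling for D works too
```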
IV. Colliders
Wait! Before you leave this post grumbling that this is just an explanation of why you need to control for confounders: that is only half of backdoor paths. The other half tells you that there are cases in which it’s actually a bad idea to control for certain variables. This is where the concept of “colliders” comes in.
Take a look at this graph.
In this case we don’t need to control for C when figuring out the true relationship between A and B. Changes in A cause changes in both B and C, but the changes in C don’t cause changes in B at all. The backdoor is already closed. But what if we make a mistake and control for C anyway? What will our coefficient estimate of A -> B be?
It will be 1.3. Why? Because when you control for C, you are (knowingly or unknowingly) assuming that part of the effect you see when you just regress B onto A is actually the effect of C on B. In other words, you assume that the arrow of causality runs C -> B rather than B -> C like it does in reality, and that through some relationship with A (it doesn’t matter which way that arrow points, as we saw in the two examples above) this biases the estimate of A -> B when you just regress B onto A. You “correct” for this bias by controlling for C, but there is no bias to correct, because there is no effect of C on B.
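Here is a sketch of the collider case, with edge coefficients I have assumed for illustration (the exact biased value depends on the numbers in the graph, so don't expect 1.3 from this particular parameterization). The naive regression of B on A is already unbiased, and it is adding C as a control that pushes the estimate away from the true value of 1.

```python
# Collider bias: A -> B = 1, A -> C = 1, B -> C = -0.3 (assumed numbers).
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

A = rng.normal(size=n)
B = 1.0 * A + rng.normal(size=n)             # the true effect of A on B is 1
C = 1.0 * A - 0.3 * B + rng.normal(size=n)   # C is caused by both A and B

def ols(y, *regressors):
    X = np.column_stack([np.ones_like(y), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

print(ols(B, A))     # ~[1.0]: unbiased without any controls
print(ols(B, A, C))  # coefficient on A is pushed above 1: collider bias
```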
Once again, this also works if we add extra variables along the backdoor path. One collider along the path is enough to keep the backdoor closed, but controlling for that one collider opens it again.
V. An Example
Let’s try a more complicated example. This graph is taken from the textbook.
In this graph, F is a dummy variable for being female, D is a variable representing the amount of discrimination one faces, O is a variable for occupation, A represents ability (unobserved in the dataset, hence the dotted lines), and finally Y represents income, which is what we want to measure. This is a graph trying to analyze the gender wage gap.
What is this graph telling us? Well, there is a backdoor path between occupation and income through ability, which is currently not controlled for. It would be great if we could control for this. There is also a backdoor path between discrimination and income that runs first through O and then through A, IF we control for O. Left uncontrolled, O serves as a collider and keeps this path closed, but controlling for O opens it again, and since we can’t control for A, we can’t close it back up.[1] But also, if we don’t control for O, there is a path from D to Y that runs straight through O, so the estimate mixes discrimination’s direct effect with its effect through occupational sorting. So this is basically hopeless without being able to control for A.
How much does this bias things? Well, the author runs some simulations with the following assumptions:
The author finds that with plausible values under these assumptions, controlling for occupation but not for ability can produce results that make it look like the causal relationship from D -> Y is weakly positive, i.e. that discrimination improves a person’s income. The intuitive explanation is that if the assumptions are true, a woman in a given occupation has higher ability than a man in that occupation, because that is what it takes to offset the discrimination she faces. And since higher ability also causes higher income within a given occupation, and we don’t control for ability, this makes it look like discrimination is causing higher incomes for women. [2]
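For anyone who wants to poke at this, here is a rough Python sketch of the kind of simulation described (the coefficients are illustrative assumptions of mine; the textbook's own R and Stata code is linked at the end of the post). Discrimination imposes a direct wage penalty and also pushes women out of higher-paying occupations, while unobserved ability raises both occupation and wages.

```python
# Gender wage gap simulation in the spirit of the textbook's example.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

female = rng.integers(0, 2, size=n).astype(float)
ability = rng.normal(size=n)        # unobserved in the "data"
discrimination = female             # in this toy model, all and only women face it
occupation = 1 + 2 * ability - 2 * discrimination + rng.normal(size=n)
wage = 1 - 1 * discrimination + 1 * occupation + 2 * ability + rng.normal(size=n)

def ols(y, *regressors):
    X = np.column_stack([np.ones_like(y), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

print(ols(wage, female))                       # strongly negative: the total gap
print(ols(wage, female, occupation))           # female coefficient flips positive
print(ols(wage, female, occupation, ability))  # ~[-1, 1, 2]: needs unobserved ability
```

Under these assumptions, the regression that controls for occupation but not ability is the one that makes discrimination look good for women's wages, which is exactly the pattern described above.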
This shows why DAGs are so powerful. Would you have seen this problem without the DAG? I certainly didn’t. The textbook includes some other examples, including the court case this example was based on and a real-live-peer-reviewed-journal-approved-famous-author economics study that failed to see something similar. So it’s clearly useful if you can spot these mistakes consistently or avoid making them yourself.
[1] Note that this also means there is a backdoor path from F to Y, but this is less important because D is screening off F.
[2] The author includes code in R and Stata for the simulation, which can be found here: https://mixtape.scunning.com/03-directed_acyclical_graphs