(Part 1 of the Sequence on Applied Causal Inference)
In this sequence, I am going to present a theory on how we can learn about causal effects using observational data. As an example, we will imagine that you have collected information on a large number of Swedes; let us call them Sven, Olof, Göran, Gustaf, Annica, Lill-Babs, Elsa and Astrid. For every Swede, you have recorded data on their gender, whether they smoked or not, and on whether they got cancer during the 10 years of follow-up. Your goal is to use this dataset to figure out whether smoking causes cancer.
We are going to use the letter A as a random variable to represent whether they smoked. A can take the value 0 (did not smoke) or 1 (smoked). When we need to talk about the specific values that A can take, we sometimes use lower case a as a placeholder for 0 or 1. We use the letter Y as a random variable that represents whether they got cancer, and L to represent their gender.
The data-generating mechanism and the joint distribution of variables
Imagine you are looking at this data set:
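| ID (Name) | L (Sex) | A (Did they smoke?) | Y (Did they get cancer?) |
|---|---|---|---|
| Sven | Male | Yes | Yes |
| Olof | Male | No | Yes |
| Göran | Male | Yes | Yes |
| Gustaf | Male | No | No |
| Annica | Female | Yes | Yes |
| Lill-Babs | Female | Yes | No |
| Elsa | Female | Yes | No |
| Astrid | Female | No | No |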
This table records information about the joint distribution of the variables L, A and Y. By looking at it, you can tell that 1/4 of the Swedes were men who smoked and got cancer, 1/8 were men who did not smoke and got cancer, 1/8 were men who did not smoke and did not get cancer, and so on.
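To make this concrete, here is a minimal sketch that tabulates the joint distribution from the eight rows. The names `swedes`, `counts` and `joint` are my own choices for illustration, and coding "Yes" as 1 and "No" as 0 follows the convention introduced for A above.

```python
from collections import Counter
from fractions import Fraction

# The eight rows of the table: (name, L = sex, A = smoked, Y = cancer).
swedes = [
    ("Sven",      "Male",   1, 1),
    ("Olof",      "Male",   0, 1),
    ("Göran",     "Male",   1, 1),
    ("Gustaf",    "Male",   0, 0),
    ("Annica",    "Female", 1, 1),
    ("Lill-Babs", "Female", 1, 0),
    ("Elsa",      "Female", 1, 0),
    ("Astrid",    "Female", 0, 0),
]

# Count how often each combination (L, A, Y) occurs, then divide by n.
counts = Counter((l, a, y) for _, l, a, y in swedes)
joint = {combo: Fraction(k, len(swedes)) for combo, k in counts.items()}

for combo, p in sorted(joint.items()):
    print(combo, p)
```

The entry for `("Male", 1, 1)` is the "men who smoked and got cancer" cell mentioned above.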
You can make all sorts of statistics that summarize aspects of the joint distribution. One such statistic is the correlation between two variables. If "sex" is correlated with "smoking", it means that if you know somebody's sex, this gives you information that makes it easier to predict whether they smoke. If knowing about an individual's sex gives no information about whether they smoked, we say that sex and smoking are independent. We use the symbol ∐ to mean independence.
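As a small illustration, we can check whether sex and smoking are independent in the empirical distribution of the eight Swedes. This is only a sketch: the helper `pr_smoke` is a hypothetical name, and with eight people it describes the sample, not a statistical test about any larger population.

```python
from fractions import Fraction

# (L = sex, A = smoked) for the eight Swedes in the table above.
rows = [("Male", 1), ("Male", 0), ("Male", 1), ("Male", 0),
        ("Female", 1), ("Female", 1), ("Female", 1), ("Female", 0)]

def pr_smoke(rows, sex=None):
    """Pr(A=1), optionally restricted to one value of L."""
    subset = [a for l, a in rows if sex is None or l == sex]
    return Fraction(sum(subset), len(subset))

marginal = pr_smoke(rows)            # Pr(A=1)
in_men = pr_smoke(rows, "Male")      # Pr(A=1 | L=Male)
in_women = pr_smoke(rows, "Female")  # Pr(A=1 | L=Female)

# Independence would require all three to be equal.
independent = (in_men == marginal == in_women)
print(marginal, in_men, in_women, independent)
```

In this table, knowing somebody's sex changes the proportion of smokers, so L and A are not independent in the sample.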
When we are interested in causal effects, we are asking what would happen to the joint distribution if we intervened to change the value of a variable. For example, how many Swedes would get cancer in a hypothetical world where you intervened to make sure they all quit smoking?
In order to answer this, we have to ask questions about the data generating mechanism. The data generating mechanism is the algorithm that assigns value to the variables, and therefore creates the joint distribution. We will think of the data as being generated by three different algorithms: One for L, one for A and one for Y. Each of these algorithms takes the previously assigned variables as input, and then outputs a value.
Questions about the data generating mechanism include “Which variable has its value assigned first?”, “Which variables from the past (observed or unobserved) are used as inputs?” and “If I change whether someone smokes, how will that change propagate to other variables that have their value assigned later?”. The last of these questions can be rephrased as “What is the causal effect of smoking?”.
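One way to picture this is to write the three algorithms as functions. The sketch below is purely hypothetical: every probability in it is invented for illustration, and the names `assign_L`, `assign_A`, `assign_Y` and `generate_person` are mine. The point is only the structure: each algorithm takes previously assigned variables as input, and an intervention bypasses the algorithm for A.

```python
import random

# Hypothetical mechanism: every probability below is invented for illustration.
def assign_L(rng):
    return "Male" if rng.random() < 0.5 else "Female"

def assign_A(L, rng):
    p = 0.5 if L == "Male" else 0.75
    return 1 if rng.random() < p else 0

def assign_Y(L, A, rng):
    p = 0.2 + 0.3 * A + (0.1 if L == "Male" else 0.0)
    return 1 if rng.random() < p else 0

def generate_person(rng, set_A=None):
    """Run the three algorithms in order. If set_A is given, we intervene:
    the algorithm for A is bypassed and A is set to that value."""
    L = assign_L(rng)
    A = assign_A(L, rng) if set_A is None else set_A
    Y = assign_Y(L, A, rng)
    return L, A, Y

rng = random.Random(0)
observed = [generate_person(rng) for _ in range(10_000)]
nobody_smokes = [generate_person(rng, set_A=0) for _ in range(10_000)]

cancer_observed = sum(y for _, _, y in observed) / len(observed)
cancer_if_all_quit = sum(y for _, _, y in nobody_smokes) / len(nobody_smokes)
print(cancer_observed, cancer_if_all_quit)
```

Here we can answer the "what if everyone quit" question only because we wrote the mechanism ourselves. With real data we see only the output of the first call, never the functions.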
The basic problem of causal inference is that the relationship between the set of possible data generating mechanisms, and the joint distribution of variables, is many-to-one: For any correlation you observe in the dataset, there are many possible sets of algorithms for L, A and Y that could all account for the observed patterns. For example, if you are looking at a correlation between cancer and smoking, you can tell a story about cancer causing people to take up smoking, or a story about smoking causing people to get cancer, or a story about smoking and cancer sharing a common cause.
An important thing to note is that even if you have data on absolutely everyone, you still would not be able to distinguish between the possible data generating mechanisms. The problem is not that you have a limited sample. This is therefore not a statistical problem. What you need in order to answer the question is not more people in your study, but a priori causal information. The purpose of this sequence is to show you how to reason about what prior causal information is necessary, and how to analyze the data if you have measured all the necessary variables.
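A toy example makes the many-to-one problem visible. The two mechanisms below are deliberately extreme and entirely hypothetical (in one, Y simply copies A; in the other, A simply copies Y), and the function names are mine. They produce exactly the same joint distribution of (A, Y), yet disagree about what an intervention on A would do.

```python
from fractions import Fraction

half = Fraction(1, 2)

def smoking_causes_cancer(u, set_A=None):
    """A is assigned first from the coin u; Y then copies A."""
    A = u if set_A is None else set_A
    Y = A
    return A, Y

def cancer_causes_smoking(u, set_A=None):
    """Y is assigned first from the coin u; A then copies Y."""
    Y = u
    A = Y if set_A is None else set_A
    return A, Y

def joint(mechanism, set_A=None):
    """Exact joint distribution of (A, Y) when u is a fair coin."""
    dist = {}
    for u in (0, 1):
        outcome = mechanism(u, set_A)
        dist[outcome] = dist.get(outcome, 0) + half
    return dist

# Without intervention, the two mechanisms are indistinguishable ...
same_observed = joint(smoking_causes_cancer) == joint(cancer_causes_smoking)
# ... but they disagree about what happens if everyone is made to quit.
risk_1 = sum(p for (a, y), p in joint(smoking_causes_cancer, set_A=0).items() if y == 1)
risk_2 = sum(p for (a, y), p in joint(cancer_causes_smoking, set_A=0).items() if y == 1)
print(same_observed, risk_1, risk_2)
```

No amount of additional data drawn from either mechanism would tell them apart, because the distributions being sampled are identical.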
Counterfactual Variables and "God's Table":
The first step of causal inference is to translate the English language research question «What is the causal effect of smoking» into a precise, mathematical language. One possible such language is based on counterfactual variables. These counterfactual variables allow us to encode the concept of “what would have happened if, possibly contrary to fact, the person smoked”.
We define one counterfactual variable called Y^{a=1} which represents the outcome in the person if he smoked, and another counterfactual variable called Y^{a=0} which represents the outcome if he did not smoke. Counterfactual variables such as Y^{a=0} are mathematical objects that represent part of the data generating mechanism: The variable tells us what value the mechanism would assign to Y, if we intervened to make sure the person did not smoke. These variables are columns in an imagined dataset that we sometimes call “God’s Table”:
| ID | A (Smoking) | Y (Cancer) | Y^{a=1} (Whether they would have got cancer if they smoked) | Y^{a=0} (Whether they would have got cancer if they didn't smoke) |
|---|---|---|---|---|
| Sven | 1 | 1 | 1 | 1 |
| Olof | 0 | 1 | 0 | 1 |
| Göran | 1 | 1 | 1 | 0 |
| Gustaf | 0 | 0 | 0 | 0 |
Let us start by making some points about this dataset. First, note that the counterfactual variables are variables just like any other column in the spreadsheet. Therefore, we can use the same type of logic that we use for any other variables. Second, note that in our framework, counterfactual variables are pre-treatment variables: They are determined long before treatment is assigned. The effect of treatment is simply to determine whether we see Y^{a=0} or Y^{a=1} in this individual.
If you had access to God's Table, you would immediately be able to look up the average causal effect, by comparing the column Y^{a=1} to the column Y^{a=0}. However, the most important point about God’s Table is that we cannot observe Y^{a=1} and Y^{a=0}. We only observe the joint distribution of observed variables, which we can call the “Observed Table”:
| ID | A | Y |
|---|---|---|
| Sven | 1 | 1 |
| Olof | 0 | 1 |
| Göran | 1 | 1 |
| Gustaf | 0 | 0 |
The goal of causal inference is to learn about God’s Table using information from the observed table (in combination with a priori causal knowledge). In particular, we are going to be interested in learning about the distributions of Y^{a=1} and Y^{a=0}, and in how they relate to each other.
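To see the gap between the two tables, here is a short sketch that computes the average causal effect from the four rows of God's Table, and then the difference you would see in the Observed Table alone. The dictionary `gods_table` and the helper `mean` are my own names; the numbers are copied from the tables above.

```python
from fractions import Fraction

# God's Table from above: name -> (A, Y, Y^{a=1}, Y^{a=0}).
gods_table = {
    "Sven":   (1, 1, 1, 1),
    "Olof":   (0, 1, 0, 1),
    "Göran":  (1, 1, 1, 0),
    "Gustaf": (0, 0, 0, 0),
}

def mean(values):
    values = list(values)
    return Fraction(sum(values), len(values))

rows = gods_table.values()

# With God's Table: compare the two counterfactual columns directly.
risk_if_all_smoke = mean(y1 for _, _, y1, _ in rows)
risk_if_none_smoke = mean(y0 for _, _, _, y0 in rows)
average_causal_effect = risk_if_all_smoke - risk_if_none_smoke

# With only the Observed Table: compare Y between smokers and non-smokers.
risk_in_smokers = mean(y for a, y, _, _ in rows if a == 1)
risk_in_nonsmokers = mean(y for a, y, _, _ in rows if a == 0)
observed_difference = risk_in_smokers - risk_in_nonsmokers

print(average_causal_effect, observed_difference)
```

In these four rows the two numbers differ, which is exactly why the Observed Table cannot simply be read as if it were God's Table.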
Randomized Trials
The “Gold Standard” for estimating the causal effect is to run a randomized controlled trial where we randomly assign the value of A. This study design works because you select one random subset of the study population where you observe Y^{a=0}, and another random subset where you observe Y^{a=1}. You therefore have unbiased information about the distribution of both Y^{a=0} and Y^{a=1}.
An important thing to point out at this stage is that it is not necessary to use an unbiased coin to assign treatment, as long as you use the same coin for everyone. For instance, the probability of being randomized to A=1 can be 2/3. You will still see randomly selected subsets of the distribution of both Y^{a=0} and Y^{a=1}; you will just have a larger number of people where you see Y^{a=1}. Usually, randomized trials use unbiased coins, but this is simply done because it increases the statistical power.
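The claim about biased coins can be checked with a small simulation. This is a sketch under invented assumptions: the population and its counterfactual risks (0.5 if treated, 0.25 if untreated) are hypothetical, and the seed is arbitrary. Everyone is randomized with the same coin, Pr(A=1) = 2/3.

```python
import random

rng = random.Random(42)

# A hypothetical population with known counterfactuals (y1, y0). The true
# risks are Pr(Y^{a=1}=1) = 0.5 and Pr(Y^{a=0}=1) = 0.25, invented for this sketch.
population = [(1, 1), (1, 0), (0, 0), (0, 0)] * 25_000

p_treat = 2 / 3   # same biased coin for everyone
treated, untreated = [], []
for y1, y0 in population:
    if rng.random() < p_treat:
        treated.append(y1)      # we get to see Y^{a=1}
    else:
        untreated.append(y0)    # we get to see Y^{a=0}

est_y1 = sum(treated) / len(treated)
est_y0 = sum(untreated) / len(untreated)
print(len(treated), len(untreated), round(est_y1, 3), round(est_y0, 3))
```

Both arm-specific estimates land close to the true counterfactual risks; the only consequence of the biased coin is that the untreated arm is smaller and therefore noisier.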
Also note that it is possible to run two different randomized controlled trials: One in men, and another in women. The first trial will give you an unbiased estimate of the effect in men, and the second trial will give you an unbiased estimate of the effect in women. If both trials used the same coin, you could think of them as really being one trial. However, if the two trials used different coins, and you pooled them into the same database, your analysis would have to account for the fact that in reality, there were two trials. If you don’t account for this, the results will be biased. This is called “confounding”. As long as you account for the fact that there really were two trials, you can still recover an estimate of the population average causal effect. This is called “Controlling for Confounding”.
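Here is an exact arithmetic sketch of two pooled trials with different coins. All numbers are hypothetical and chosen so that treatment has no effect in either sex: men have a higher baseline risk and a coin that favors treatment. The function names `pooled_risk` and `standardized_risk` are mine.

```python
from fractions import Fraction as F

# Hypothetical numbers, chosen so that smoking has NO effect in either sex.
# Per trial: share of population, coin Pr(A=1), risk if treated, risk if untreated.
trials = {
    "men":   {"share": F(1, 2), "coin": F(4, 5), "risk1": F(1, 2),  "risk0": F(1, 2)},
    "women": {"share": F(1, 2), "coin": F(1, 5), "risk1": F(1, 10), "risk0": F(1, 10)},
}

def pooled_risk(arm):
    """E[Y | A=arm] in the pooled database, ignoring which trial people came from."""
    cases = people = F(0)
    for t in trials.values():
        p_arm = t["coin"] if arm == 1 else 1 - t["coin"]
        risk = t["risk1"] if arm == 1 else t["risk0"]
        people += t["share"] * p_arm
        cases += t["share"] * p_arm * risk
    return cases / people

def standardized_risk(arm):
    """Weight each trial's own arm-specific risk by that trial's share of the population."""
    key = "risk1" if arm == 1 else "risk0"
    return sum(t["share"] * t[key] for t in trials.values())

naive = pooled_risk(1) - pooled_risk(0)
adjusted = standardized_risk(1) - standardized_risk(0)
print(naive, adjusted)
```

The naive pooled comparison shows a risk difference even though nobody is affected by treatment, because the treated arm is mostly men. Accounting for the two trials removes it.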
In general, causal inference works by specifying a model that says the data came from a complex trial, i.e., one where nature assigned a biased coin depending on the observed past. For such a trial, there will exist a valid way to recover the overall causal results, but it will require us to think carefully about what the correct analysis is.
Assumptions of Causal Inference
We will now go through in more detail why randomized trials work, i.e., which aspects of this study design allow us to infer causal relationships, or facts about God’s Table, using information about the joint distribution of observed variables.
We will start with an “observed table” and build towards “reconstructing” parts of God’s Table. To do this, we will need three assumptions: positivity, consistency and (conditional) exchangeability. This is the observed table we start from:
| ID | A | Y |
|---|---|---|
| Sven | 1 | 1 |
| Olof | 0 | 1 |
| Göran | 1 | 1 |
| Gustaf | 0 | 0 |
Positivity
Positivity is the assumption that every individual has a positive probability of receiving each value of the treatment variable: Pr(A=a) > 0 for all values of a. In other words, you need to have both people who smoke, and people who don't smoke. If positivity does not hold, you will not have any information about the distribution of Y^{a} for that value of a, and will therefore not be able to make inferences about it.
We can check whether this assumption holds in the sample, by checking whether there are people who are treated and people who are untreated. If you observe that in every stratum, there are individuals who are treated and individuals who are untreated, you know that positivity holds.
If we observe a stratum where no individuals are treated (or no individuals are untreated), this can be either for statistical reasons (by chance, you did not sample them) or for structural reasons (individuals with these covariates are deterministically never treated). As we will see later, our models can handle random violations, but not structural violations.
In a randomized controlled trial, positivity holds because you will use a coin that has a positive probability of assigning people to either arm of the trial.
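Checking positivity in a sample amounts to looking for empty cells. The sketch below does this for the eight Swedes from the first table, stratified by sex; `positivity_violations` is a hypothetical helper name. Note that an empty cell found this way cannot tell you whether the violation is random or structural.

```python
# (L = sex, A = smoked) for the eight Swedes.
rows = [("Male", 1), ("Male", 0), ("Male", 1), ("Male", 0),
        ("Female", 1), ("Female", 1), ("Female", 1), ("Female", 0)]

def positivity_violations(rows, treatment_values=(0, 1)):
    """Return the (stratum, a) pairs for which nobody in the stratum has A=a."""
    seen = {}
    for stratum, a in rows:
        seen.setdefault(stratum, set()).add(a)
    return [(stratum, a)
            for stratum, values in seen.items()
            for a in treatment_values
            if a not in values]

print(positivity_violations(rows))        # no empty cells in the sample
print(positivity_violations(rows[:-1]))   # drop Astrid: no untreated women left
```

With all eight people, both sexes contain smokers and non-smokers. Removing Astrid, the only woman who did not smoke, produces an empty cell.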
Consistency
The next assumption we are going to make is that if an individual happens to have treatment (A=1), we will observe the counterfactual variable Y^{a=1} in this individual; likewise, if the individual is untreated (A=0), we will observe Y^{a=0}. In symbols: if A=a, then Y = Y^{a}. This is the observed table after we make the consistency assumption:
| ID | A | Y | Y^{a=1} | Y^{a=0} |
|---|---|---|---|---|
| Sven | 1 | 1 | 1 | * |
| Olof | 0 | 1 | * | 1 |
| Göran | 1 | 1 | 1 | * |
| Gustaf | 0 | 0 | * | 0 |
Making the consistency assumption got us half the way to our goal. We now have a lot of information about Y^{a=1} and Y^{a=0}. However, half of the data is still missing.
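The fill-in step is mechanical, as this small sketch shows. The function name `apply_consistency` and the use of `None` for the missing cells (the `*` entries in the table) are my own choices.

```python
# Observed Table: name -> (A, Y).
observed = {"Sven": (1, 1), "Olof": (0, 1), "Göran": (1, 1), "Gustaf": (0, 0)}

def apply_consistency(observed):
    """Fill in Y^{a} = Y for the treatment value a each person actually received.
    The other counterfactual stays unknown (None)."""
    table = {}
    for name, (a, y) in observed.items():
        y1 = y if a == 1 else None
        y0 = y if a == 0 else None
        table[name] = {"A": a, "Y": y, "Y^{a=1}": y1, "Y^{a=0}": y0}
    return table

for name, row in apply_consistency(observed).items():
    print(name, row)
```

Every person ends up with exactly one known counterfactual and one missing one.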
Although consistency seems obvious, it is an assumption, not something that is true by definition. We can expect the consistency assumption to hold if we have a well-defined intervention (i.e., the intervention is a well-defined choice, not an attribute of the individual), and there is no causal interference (one individual’s outcome is not affected by whether another individual was treated).
Consistency may not hold if you have an intervention that is not well-defined: For example, there may be multiple types of cigarettes. When you measure Y^{a=1} in people who smoked, it will actually be a composite of multiple counterfactual variables: One for people who smoked regular cigarettes (let us call that Y^{a=1*}) and another for people who smoked e-cigarettes (let us call that Y^{a=1#}). Since you failed to specify whether you are interested in the effect of regular cigarettes or e-cigarettes, the construct Y^{a=1} is a composite without any meaning, and people will be unable to use your results to predict the consequences of their actions.
Exchangeability
To complete the table, we require an additional assumption on the nature of the data. We call this assumption “Exchangeability”. One possible exchangeability assumption is “Y^{a=0} ∐ A and Y^{a=1} ∐ A”. This is the assumption that says “The data came from a randomized controlled trial”. If this assumption is true, you will observe a random subset of the distribution of Y^{a=0} in the group where A=0, and a random subset of the distribution of Y^{a=1} in the group where A=1.
Exchangeability is a statement about two variables being independent from each other. This means that having information about either one of the variables will not help you predict the value of the other. Sometimes, variables which are not independent are "conditionally independent". For example, it is possible that knowing somebody's race helps you predict whether they enjoy eating Hakarl, an Icelandic form of rotting fish. However, it is also possible that this is just a marker for whether they were born in the ethnically homogeneous Iceland. In such a situation, it is possible that once you already know whether somebody is from Iceland, also knowing their race gives you no additional clues as to whether they will enjoy Hakarl. In this case, the variables "race" and "enjoying Hakarl" are conditionally independent, given nationality.
The reason we care about conditional independence is that sometimes you may be unwilling to assume that marginal exchangeability Y^{a=1} ∐ A holds, but you are willing to assume conditional exchangeability Y^{a=1} ∐ A | L. In this example, let L be sex. The assumption then says that you can interpret the data as if it came from two different randomized controlled trials: One in men, and one in women. If that is the case, sex is a "confounder". (We will give a definition of confounding in Part 2 of this sequence.)
If the data came from two different randomized controlled trials, one possible approach is to analyze these trials separately. This is called “stratification”. Stratification gives you effect measures that are conditional on the confounders: You get one measure of the effect in men, and another in women. Unfortunately, in more complicated settings, stratification-based methods (including regression) are always biased. In those situations, it is necessary to focus the inference on the marginal distribution of Y^{a}.
Identification
If marginal exchangeability holds (i.e., if the data came from a marginally randomized trial), making inferences about the marginal distribution of Y^{a} is easy: You can just estimate E[Y^{a}] as E[Y | A=a].
However, if the data came from a conditionally randomized trial, we will need to think a little bit harder about how to say anything meaningful about E[Y^{a}]. This process is the central idea of causal inference. We call it “identification”: The idea is to write an expression for the distribution of a counterfactual variable, purely in terms of observed variables. If we are able to do this, we have sufficient information to estimate causal effects just by looking at the relevant parts of the joint distribution of observed variables.
The simplest example of identification is standardization. As an example, we will show a simple proof:
Begin by using the law of total probability to factor out the confounder, in this case L:
· E(Y^{a}) = Σ_l E(Y^{a} | L=l) * Pr(L=l) (the summation is over the values l of L)
We do this because we know we need to introduce L behind the conditioning sign, in order to be able to use our exchangeability assumption in the next step. Then, because Y^{a} ∐ A | L, we are allowed to introduce A=a behind the conditioning sign:
· E(Y^{a}) = Σ_l E(Y^{a} | A=a, L=l) * Pr(L=l)
Finally, use the consistency assumption: Because we are in the stratum where A=a in all individuals, we can replace Y^{a} by Y:
· E(Y^{a}) = Σ_l E(Y | A=a, L=l) * Pr(L=l)
We now have an expression for the counterfactual in terms of quantities that can be observed in the real world, i.e., in terms of the joint distribution of A, Y and L. In other words, we have linked the data generating mechanism with the joint distribution: we have “identified” E(Y^{a}). We can therefore estimate E(Y^{a}) from the observed data.
This identifying expression is valid if and only if L was the only confounder. If we had not observed sufficient variables to obtain conditional exchangeability, it would not be possible to identify the distribution of Y^{a}: there would be intractable confounding.
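As a worked illustration, the standardization formula can be applied to the eight Swedes from the first table, with L as sex. This is a sketch: the function names are mine, and the result is an estimate of E(Y^{a}) only under the assumption that conditional exchangeability given sex holds. With eight people it is arithmetic, not evidence.

```python
from fractions import Fraction

# (L = sex, A = smoked, Y = cancer) for the eight Swedes.
data = [("Male", 1, 1), ("Male", 0, 1), ("Male", 1, 1), ("Male", 0, 0),
        ("Female", 1, 1), ("Female", 1, 0), ("Female", 1, 0), ("Female", 0, 0)]

def standardized_mean(data, a):
    """Sum over l of E(Y | A=a, L=l) * Pr(L=l), from the empirical distribution."""
    total = Fraction(0)
    for l in sorted({row[0] for row in data}):
        stratum = [row for row in data if row[0] == l]
        outcomes = [y for _, a_i, y in stratum if a_i == a]
        # Positivity is required here: outcomes must not be empty.
        e_y_given_a_l = Fraction(sum(outcomes), len(outcomes))
        pr_l = Fraction(len(stratum), len(data))
        total += e_y_given_a_l * pr_l
    return total

def crude_mean(data, a):
    """E(Y | A=a), ignoring L."""
    outcomes = [y for _, a_i, y in data if a_i == a]
    return Fraction(sum(outcomes), len(outcomes))

print(standardized_mean(data, 1), standardized_mean(data, 0))  # 2/3 1/4
print(crude_mean(data, 1), crude_mean(data, 0))                # 3/5 1/3
```

The standardized and crude numbers differ because the proportion of smokers differs between men and women in this table, so the crude comparison mixes the effect of smoking with the difference between the sexes.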
Identification is the core concept of causal inference: It is what allows us to link the data generating mechanism to the joint distribution, to something that can be observed in the real world.
The difference between epidemiology and biostatistics
Many people see Epidemiology as «Applied Biostatistics». This is a misconception. In reality, epidemiology and biostatistics are completely different parts of the problem. To illustrate what is going on, consider this figure:
The data generating mechanism first creates a joint distribution of observed variables. Then, we sample from the joint distribution to obtain data. Biostatistics asks: If we have a sample, what can we learn about the joint distribution? Epidemiology asks: If we have all the information about the joint distribution, what can we learn about the data generating mechanism? This is a much harder problem, but it can still be analyzed with some rigor.
Epidemiology without Biostatistics is always impossible: It would not be possible to learn about the data generating mechanism without asking questions about the joint distribution. This usually involves sampling. Therefore, we will need good statistical estimators of the joint distribution.
Biostatistics without Epidemiology is usually pointless: The joint distribution of observed variables is simply not interesting in itself. You can make the claim that randomized trials are an example of biostatistics without epidemiology. However, the epidemiology is still there. It is just not necessary to think about it, because the epidemiologic part of the analysis is trivial.
Note that the word “bias” means different things in Epidemiology and Biostatistics. In Biostatistics, “bias” is a property of a statistical estimator: We talk about whether ŷ is a biased estimator of E(Y | A). If an estimator is biased, it means that when you use data from a sample to make inferences about the joint distribution in the population the sample came from, there will be a systematic source of error.
In Epidemiology, “bias” means that you are estimating the wrong thing: Epidemiological bias is a question about whether E(Y | A=a) is a valid identification of E(Y^{a}). If there is epidemiologic bias, it means that you estimated something in the joint distribution, but that this something does not answer the question you were interested in.
These are completely different concepts. Both are important and can lead to your estimates being wrong. It is possible for a statistically valid estimator to be biased in the epidemiologic sense, and vice versa. For your results to be valid, your estimator must be unbiased in both senses.