Michael Nielsen has posted a long essay explaining his understanding of the Pearlean causal DAG model. I don't understand more than half, but that's much more than I got out of a few other papers. Strongly recommended for anyone interested in the topic.

I'm really interested in these questions. (You might remember my agent with n sensors and m timesteps learning Huygens' principle from observing the evolution of the wave equation with various initial conditions.) I have some half-written ideas, but nothing in a state worth talking about.

Have you seen the following dissertations: Dash (2003), Caveats for Causal Reasoning with Equilibrium Systems, and Voortman (2009), Causal Discovery of Dynamic Systems?

My dissertation was initially going to be an extension of this stuff. I ended up writing something very different, but I still think about these problems off and on. If you have thoughts that you would like to share and discuss, I would be glad to chat.

The paper links to Causality, by Pearl, but it links to the first edition from 2000. He came out with a second edition in 2009.

http://www.amazon.com/Causality-Reasoning-Inference-Judea-Pearl/dp/052189560X/ref=ntt_at_ep_dpt_1

And the third edition is forthcoming. For those who have the second edition, here is a list of all changes in the third.

I wish there were an "engineering" version, outlining the steps necessary to get a practical calculation done, with some motivation along the way, an outline of applicability, lots of examples, and less emphasis on proofs. For example, various statistical tests (z, t, chi-squared, ...) are a part of the standard toolkit of any practitioner. What tools are needed to resolve Simpson's paradox in the examples given? Is it even the right question to ask?
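To make the question concrete, here is a toy sketch of the sort of recipe I have in mind: the back-door adjustment formula P(Y=1 | do(X=x)) = Σ_z P(Y=1 | X=x, Z=z) P(Z=z). The numbers are invented for illustration, and the example assumes the causal graph tells us that Z is the only confounder.

```python
# Toy Simpson's-paradox data (invented numbers): X = treatment,
# Y = recovery, Z = a confounder such as case severity.
# counts[(z, x)] = (recoveries, total) in stratum Z=z, treatment X=x.
counts = {
    (0, 0): (81, 87),   (0, 1): (234, 270),
    (1, 0): (192, 263), (1, 1): (55, 80),
}

n_total = sum(tot for _, tot in counts.values())

def p_y_given_xz(x, z):
    rec, tot = counts[(z, x)]
    return rec / tot

def p_z(z):
    return sum(tot for (zz, _), (_, tot) in counts.items() if zz == z) / n_total

def p_y_do_x(x):
    # Back-door adjustment: average the stratum-specific rates over P(Z).
    return sum(p_y_given_xz(x, z) * p_z(z) for z in (0, 1))

# Unadjusted (aggregate) recovery rates, which point the other way:
agg = {x: sum(counts[(z, x)][0] for z in (0, 1)) /
          sum(counts[(z, x)][1] for z in (0, 1)) for x in (0, 1)}

print(agg[1] > agg[0], p_y_do_x(0) > p_y_do_x(1))  # True True: the reversal
```

Here the raw rates favor X=1, but after adjusting for Z the causal effect favors X=0; and it is the graph, not the data alone, that licenses adjusting for Z rather than, say, conditioning on a mediator.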

2013 resubmission: http://lesswrong.com/r/discussion/lw/hz1/link_if_correlation_doesnt_imply_causation_then/

Thanks. I hadn't read much on the subject before. Being able to go from that observational data and those causal assumptions to calculating precisely the expected results of an experiment you haven't carried out is amazing! I second your recommendation.

A stupid question:

In Nielsen's first "problem for the author", he writes,

But I can't tell the difference between the "alternate" approach and the original one. For example, Nielsen says that the alternate approach "introduces the overhead of dealing with the augmented graph". Wasn't this already necessary in the original approach?

Okay, suppose you have a two node system. One node is whether someone smokes, one node is whether there's tar in their lungs. The smoking node has a causal influence on the tar node, but there's also a random factor. Say if someone smokes they'll have tar in their lungs, otherwise there's only a 5% chance they do. This is only two nodes. To express this in the alternate approach, the tar node needs to be a deterministic function of other nodes, and an extra random node is required for this. It could be a "polluted atmosphere" node, value "yes" 5% of the time, and the tar node could be "yes" if either the smoking or polluted atmosphere nodes are "yes" (a deterministic function of its input nodes).
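A quick sketch of that alternate-approach construction in code (the 50% smoking prior and the node names are my own assumptions):

```python
import random

def sample_alternate(rng):
    # Root nodes are random; every other node is a deterministic
    # function of its parents, per the alternate approach.
    smokes = rng.random() < 0.5      # assumed 50% prior, just for the sketch
    polluted = rng.random() < 0.05   # the 5% noise, promoted to its own node
    tar = smokes or polluted         # deterministic in its two parents
    return smokes, polluted, tar

rng = random.Random(0)
samples = [sample_alternate(rng) for _ in range(100_000)]
nonsmoker_tar = [t for s, _, t in samples if not s]
print(sum(nonsmoker_tar) / len(nonsmoker_tar))  # close to 0.05
```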

I don't see how this is true of either approach.

Let X_smokes and X_tar be the random variables associated with your nodes. Under the first approach, if there are no other "exogenous" Y-nodes, then there is a function f_tar such that X_tar = f_tar(X_smokes). Doesn't that mean that whether you have tar is entirely a function of whether you smoke?

Maybe I'm mistaken about what it means for one random variable to be a function of another. We can understand X_smokes and X_tar formally as functions from the sample space Ω of people* to the state space {0,1} of Boolean values, right? Usually, to say that one function f is a function of another function g is to say that, for some function F, f(x) = F(g(x)) for each element x of the domain. That is, the value of f at x is entirely determined by the value of g at x.

If this convention applies when the functions are random variables, then to say that X_tar = f_tar(X_smokes) is to say that, for each person ω, X_tar(ω) = f_tar(X_smokes(ω)). Thus, for every smoker ω, X_tar(ω) has the same value, namely f_tar(1). That is, the answer to whether a smoker has tar in their lungs is always the same. Similarly, among all nonsmokers, the answer f_tar(0) to whether they have tar in their lungs is always the same. Therefore, whether or not you smoke determines whether or not you have tar in your lungs.

Do people mean something different when they say that one random variable is a function of another? If so, what do they mean? If not, where is there room for a "random factor" when there are no exogenous Y-variables, even under the first approach described by Nielsen?

* ETA: I originally had the sample space Ω being the set of all possible worlds, which seems wrong on reflection.
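To make the worry concrete, here is a small check (the toy sample space and helper function are mine) of what "pointwise function of" means for finite random variables:

```python
# Ω is a toy sample space of people; X_smokes and X_tar assign each
# person a value in {0, 1}.
omega = ["alice", "bob", "carol", "dave"]
X_smokes = {"alice": 1, "bob": 1, "carol": 0, "dave": 0}
X_tar    = {"alice": 1, "bob": 0, "carol": 0, "dave": 0}  # smokers disagree

def is_function_of(f, g, domain):
    """True iff f(w) is determined by g(w) for every w in the domain."""
    seen = {}
    for w in domain:
        if g[w] in seen and seen[g[w]] != f[w]:
            return False
        seen[g[w]] = f[w]
    return True

# Alice and Bob both smoke but differ on tar, so no f_tar can exist:
print(is_function_of(X_tar, X_smokes, omega))  # False

# If all smokers (and all nonsmokers) agreed, it would be a function:
X_tar_det = {w: X_smokes[w] for w in omega}
print(is_function_of(X_tar_det, X_smokes, omega))  # True
```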

Mostly the essay is careful not to flatly say that a node value X_1 is a function of a node value X_2. Sometimes it is a random function of X_2 (note the qualifier "random"), sometimes it is a function of X_2 and a random value Y_1, where Y_1 does not have its own node (so does not increase the size of the graph). And of course there is an exception when proposing the alternate approach, where the nodes are divided into the random ones, and those which are a deterministic function of other node values.

In my example, I would not say the tar node value was a function of the smokes node value; it would be a function of the smokes node value and an extra random variable. In the alternate approach the extra random variable needs its own node; in the main approach it doesn't: you can just assume every node may have extra random variables influencing it without representing them on the graph.
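The contrast can be sketched as follows (a rough sketch under my reading, not Nielsen's notation): both mechanisms give tar the same distribution, and they differ only in whether the 5% noise is an explicit node.

```python
import random

def tar_main(smokes, rng):
    # Main approach: the random factor is sampled inside the node's
    # mechanism and never appears on the graph.
    return smokes or (rng.random() < 0.05)

def tar_alternate(smokes, polluted):
    # Alternate approach: the same noise is an explicit "polluted
    # atmosphere" node, and tar is deterministic in its parents.
    return smokes or polluted

rng = random.Random(1)
n = 100_000
main = sum(tar_main(False, rng) for _ in range(n)) / n
alt = sum(tar_alternate(False, rng.random() < 0.05) for _ in range(n)) / n
print(abs(main - alt) < 0.01)  # True: same distribution either way
```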

I'm not sure where you're getting that. Here is how he describes the dependence among nodes under the "first approach":

So, unless you have some exogenous Y-nodes, X_j will be a deterministic function of its parent X-nodes. The only way he introduces randomness is by introducing the Y-nodes. My question is, how is that any different from the "alternative approach" that he discusses later?

That is not the same as there being Y-nodes. Nodes would be part of the graph structure, and so be more visible when you look at the graph.

The only difference is whether the Y-values require their own nodes.

I see. Thanks. I was thrown off because he'd already said that he would "overload" the notation for random variables, using it also to represent nodes or sets of nodes. But what you say makes sense.

I'm not sure what the real difference is, though. The graph is just a way to depict dependencies among random variables. If you're already working with a collection of random variables with given dependencies, the graph is just, well, a graphical way to represent what you're already dealing with. Am I right, then, in thinking that the only difference between the two "approaches" is whether you bother to create this auxiliary graphical structure to represent what you're doing, instead of just working directly with the random variables X_i, Y_ij, and their dependency functions f_i?

It's easier for humans to think in terms of pictures. But if you were programming a computer to reason causally this way, wouldn't you implement the two "approaches" in essentially the same way?

If you have a specified causal system you could represent it either way, yes.

Speculating on another reason he may have made the distinction: often he posed problems with specified causal graphs but unspecified functions. So he may have meant that, in problems like these, with one approach you can easily specify some node values as deterministic functions of other node values, whereas with the other approach you can't (since a specified graph rules out further random influences in one approach but not the other).