What this is: an attempt to examine how causal knowledge gets turned into probabilistic predictions.

I'm not really a fan of any view of probability that involves black boxes. I want my probabilities (or more practically, the probabilities of toy agents in toy problems I consider) to be derivable from what I know in a nice clear way, following some desideratum of probability theory at every step.

Causal knowledge sometimes looks like a black box, when it comes to assigning probabilities, and I would like to crack open that box and distribute the candy inside to smiling children.

What this is not: an attempt to get causal diagrams from constraints on probabilities.

That would be silly - see Pearl's article that was recently up here. Our reasonable desire is the reverse: getting the constraints on probabilities from the causal diagrams.


The Marble Game

Consider marbles. First, I use some coin-related process to get either Heads or Tails. If Tails, I give you a black marble. If Heads, I use some other process to choose between giving you a black marble or a white marble.

Causality is an important part of the marble game. If I manually interfere with the process that gives Heads or Tails, this can change the probability you should assign of getting a black marble. But if I manually interfere with the process that gives you white or black marbles, this won't change your probability of seeing Heads or Tails.
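The asymmetry between the two kinds of interference can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are made up for this example, and both the coin and the "other process" are taken to be 50/50, which is the assignment the post arrives at later.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def joint(p_heads=HALF, p_black_given_heads=HALF, forced_marble=None):
    """Joint distribution over (coin, marble) in the marble game.

    forced_marble ("B" or "W") models manually overriding which marble is
    handed over, regardless of what the coin did.
    """
    dist = {}
    for coin, p_coin in (("H", p_heads), ("T", 1 - p_heads)):
        for marble in ("B", "W"):
            if forced_marble is not None:
                p_marble = Fraction(int(marble == forced_marble))
            elif coin == "T":
                p_marble = Fraction(int(marble == "B"))  # Tails always gives black
            elif marble == "B":
                p_marble = p_black_given_heads
            else:
                p_marble = 1 - p_black_given_heads
            dist[(coin, marble)] = p_coin * p_marble
    return dist

def marginal(dist, index, value):
    """Probability that position `index` of the outcome equals `value`."""
    return sum(p for outcome, p in dist.items() if outcome[index] == value)

baseline = joint()
rigged_coin = joint(p_heads=Fraction(1))   # interfere with the coin: always Heads
forced_white = joint(forced_marble="W")    # interfere with the marble: always white

print(marginal(baseline, 1, "B"))      # 3/4
print(marginal(rigged_coin, 1, "B"))   # 1/2 - changing the coin changes P(black)
print(marginal(forced_white, 0, "H"))  # 1/2 - changing the marble leaves P(Heads) alone
```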


What I'd like versus what is

The fundamental principle of putting numbers to beliefs, that always applies, is to not make up information. If I don't know of any functional differences between two events, I shouldn't give them different probabilities. But going even further - if I learn a little information, it should only change my probabilities a little.

The general formulation of this is to make your probability distribution consistent with what you know, in the way that contains the very least information possible (or equivalently, the maximum entropy). This is how to not make up information.

I like this procedure; if we write down pieces of knowledge as mathematical constraints, we can find the correct distribution by solving a single optimization problem. Very elegant. Which is why it's a shame that this isn't at all what we do for causal problems.

Take the marble game. To get our probabilities, we start with the first causal node, figure out the probability of Heads without thinking about marbles at all (that's easy, it's 1/2), and then move on to the marbles while taking the coin as given (3/4 for black and 1/4 for white).

One cannot do this problem without using causal information. If we neglect the causal diagram, our information is the following: A: We know that Heads and Tails are mutually exclusive and exhaustive (MEE), B: we know that getting a black marble and getting a white marble are MEE, and C: we know that if the coin is Tails, you'll get a black marble.

This leaves three MEE options: Tails and Black (TB), HB, and HW. Maximizing entropy, they all get probability 1/3.
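As a quick numerical check (a sketch, with variable names chosen just for this example), the uniform assignment over TB, HB, HW has entropy ln 3, which is higher than the entropy of the causally derived assignment (1/2, 1/4, 1/4):

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

no_causal_info = [1/3, 1/3, 1/3]   # TB, HB, HW with only constraints A, B, C
causal_answer = [1/2, 1/4, 1/4]    # TB, HB, HW from the usual causal reasoning

print(entropy(no_causal_info))  # ln 3, about 1.0986
print(entropy(causal_answer))   # 1.5 * ln 2, about 1.0397
```

So the causal answer is not the maximum-entropy distribution under A, B, and C alone; the causal diagram has to be contributing an extra constraint.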

One could alternately think of it like this: if we don't have the causal part of the problem statement (the causal diagram D), we don't know whether the coin causes the marble choice, or the marble causes the coin choice - why not pick a marble first, and if it's W we give you an H coin, but if it's B we flip the coin? Heck, why have one cause the other at all? Indeed, you should recover the 1/3 result if you average over all the consistent causal diagrams.

So my question is - what causal constraints is our distribution subject to, and what is it optimizing? Not piece by piece, but all at once?


Rephrasing the usual process

One method is to just do the same steps as usual, but to think of the rationale in terms of knowledge / constraints and maximum entropy.

We start with the coin, and we say "because the coin's result isn't caused by the marbles, no information pertaining to marbles matters here. Therefore, P(H|ABCD) is just P(H|A) = 1/2" (first application of maximum entropy). Then we move on to the marbles, and applying information B and C, plus maximum entropy a second time, we learn that P(Black|ABCD) = 3/4 (writing Black for the black marble, to keep it distinct from the piece of information B). All that our causal knowledge really meant for our probabilities was the equation P(H|ABCD)=P(H|A).

Alternatively, what if we only wanted to maximize something once, but let causal knowledge change the thing we were maximizing? We can say something like "we want to minimize the amount of information about the state of the coin, since that's the first causal node, and then minimize the amount of information about its descendant node, the marble." Although this could be represented as one equation using linear multipliers, it's clearly the same process just with different labels.


Is it even possible to be more elegant?

Both of these approaches are... functional. I like the first one a lot better, because I don't want to even come close to messing with the principle of maximum entropy / minimal information. But I don't like that we never get to apply this principle all at once. Can we break our knowledge down more so that everything happens nicely and elegantly?

The way we stated our knowledge above was as P(H|ABCD) = P(H|A). But this is equivalent to the statement that there's a symmetry between the left and right branches coming out of the causal node. We can express this symmetry using the equivalence principle as P(H)=P(T), or as P(HB)+P(HW)=P(TB).

But note that this is just hiding what's going on, because the equivalence principle is just a special case of the maximum entropy principle - we might as well just require that P(H)=1/2 but still say that at the end we're "maximizing entropy subject to this constraint."
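To see "maximizing entropy subject to P(H)=1/2" in action, here is a rough sketch that does the constrained maximization by brute-force grid search. The grid resolution and the helper names are arbitrary choices for this example, not anything from the argument above.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

STEPS = 1000  # grid resolution for P(HB); an arbitrary choice

def constrained_entropy(k):
    """Entropy of (TB, HB, HW) with P(TB) fixed at 1/2 and P(HB) = k / STEPS."""
    p_hb = k / STEPS
    return entropy([0.5, p_hb, 0.5 - p_hb])

# P(H) = 1/2 forces P(HB) + P(HW) = 1/2, so only P(HB) is left free.
best_k = max(range(1, STEPS // 2), key=constrained_entropy)
p_hb = best_k / STEPS
p_hw = 0.5 - p_hb
p_black = 0.5 + p_hb

print(p_hb, p_hw, p_black)  # 0.25 0.25 0.75
```

The single constrained optimization lands on the same numbers as the step-by-step procedure, which is the sense in which the constraint P(H)=1/2 is doing all the causal work here.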


Answer: Probably not

The general algorithm followed above is, for each causal node, to insert the condition that the probabilities of outputs of that node, given the starting information including the causal diagram, are equal to the probabilities given only the starting information related to that node or its parents - information about the descendants does not help determine probabilities of the parents.
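For concreteness, here is one way the node-by-node procedure might be written down. This is a sketch rather than a definitive implementation: the `joint_from_diagram` function and the table format are invented for this example, and the per-node probabilities are supplied by hand (they are the values that maximum entropy gives for each node of the marble game).

```python
from fractions import Fraction
from itertools import product

def joint_from_diagram(nodes):
    """Build a joint distribution by walking the nodes in causal order.

    nodes: list of (name, values, parents, table), parents listed before children.
    table maps a tuple of parent values to {value: probability}, so each node's
    probabilities depend only on its parents, never on its descendants.
    """
    names = [name for name, _, _, _ in nodes]
    joint = {}
    for outcome in product(*(values for _, values, _, _ in nodes)):
        assignment = dict(zip(names, outcome))
        p = Fraction(1)
        for name, _, parents, table in nodes:
            parent_values = tuple(assignment[q] for q in parents)
            p *= table[parent_values][assignment[name]]
        if p:  # drop impossible outcomes such as Tails-and-White
            joint[outcome] = p
    return joint

half = Fraction(1, 2)
marble_game = [
    ("coin", ("H", "T"), (), {(): {"H": half, "T": half}}),
    ("marble", ("B", "W"), ("coin",), {
        ("H",): {"B": half, "W": half},
        ("T",): {"B": Fraction(1), "W": Fraction(0)},
    }),
]

result = joint_from_diagram(marble_game)
for outcome, p in result.items():
    print(outcome, p)
# ('H', 'B') 1/4
# ('H', 'W') 1/4
# ('T', 'B') 1/2
```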

Comments

What do you do if you aren't sure which way the arrow points? Like, the more phosphates in the soil, the bigger the plant grows (and the lower the likelihood of it having a symbiosis with a fungus), but the longer the plant grows (the bigger it is), the fewer phosphates there are in the soil (and so the higher the likelihood of fungus, which accelerates the depletion)? How do you treat an arrow that is a sum of two fluxes going in opposite directions? Thank you for the post, it sent me on a fruitful self-education binge:))

I think the correct way to represent this is as a time series - the past states of the plant cause future states of the plant, and also have a causal effect on future states of the soil. The past state of the soil affects the future state of the soil, and also the future state of the plant.

Things that affect each other over time like this have a causal diagram that looks like a braid. The structure is kept somewhat simple by the fact that time steps only cause the very next time step - when predicting the future, knowing the present state of the world is enough (if you're Laplace's demon).

I think the correct way to represent this is as a time series - the past states of the plant cause future states of the plant, and also have a causal effect on future states of the soil.

That presumes discrete time. But time is continuous. (Speculations about discreteness on the scale of Planck time are irrelevant to the timescale of plant growth.) Any discretisation involves an arbitrary choice of time step. How do you make that choice? What can you do with a causal diagram constructed in this way, with millions or billions of nodes? With an assumption about the invariance of causal influences over time, it can be represented in a compressed form in which only two time points appear, but it's not clear to me that that offers any advantage over cyclic diagrams and continuous time.

The structure is kept somewhat simple by the fact that time steps only cause the very next time step - when predicting the future, knowing the present state of the world is enough (if you're Laplace's demon).

Only if the "present state" is defined to include all derivatives of the variables you're interested in (or as many as are causally relevant). Computing (a discrete approximation to) the nth derivative of a variable in discretised time requires knowing the value of the variable at n+1 consecutive time points.
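The n+1 point count can be illustrated with a plain forward-difference estimate. This is a minimal sketch; the function name and the choice of forward differences (rather than central or backward ones) are assumptions made for the example.

```python
def nth_difference(samples, dt):
    """Forward-difference estimate of the n-th derivative from n+1 samples.

    samples: values of the variable at n+1 consecutive, equally spaced times.
    dt: spacing between consecutive time points.
    """
    diffs = list(samples)
    n = len(diffs) - 1
    for _ in range(n):
        # Each pass takes differences of neighbours and shortens the list by one.
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
    return diffs[0] / dt ** n

# x(t) = t**2 sampled at t = 0, 1, 2: three points give the second derivative.
print(nth_difference([0, 1, 4], dt=1))  # 2.0
```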

That presumes discrete time. But time is continuous.

Yup - any discrete causal model is an approximation. As with any approximation, you choose it based on what you can solve exactly, what you have the resources to calculate, and what kind of things you need to calculate.

Only if the "present state" is defined to include all derivatives of the variables you're interested in

Indeed - the classical world actually lives in phase space. Quantum mechanics is actually somewhat simpler that way.


Thank you, I will try this model.

Problem 1

Take my wet lawn - it could either be wet because it's raining, or because I'm watering it - and suppose for simplicity that both of these have a base rate of P=1/3. The causal diagram is Rain -> Wetness <- Watering.

Then our non-causal information looks like: rain and noRain are MEE and have been observed in 1:2 ratio in the past, watering and notWatering are MEE and have been observed in 1:2 ratio in the past, wet and notWet are MEE, wet and noRain and notWatering are inconsistent, notWet and rain are inconsistent, notWet and watering are inconsistent.

If we call rain / noRain by the letters R / r, watering / not as Wa / wa, and wetness as We / we, we have the following possible MEE events: RWaWe, rWaWe, RwaWe, rwawe.

Following the recipe in the post, our causal information then fixes P(Wa)=P(R)=1/3, and nothing else.

Given these constraints, we have P(RWaWe)=1/6, P(RwaWe)=1/6, P(rWaWe)=1/6, P(rwawe)=1/2.

Uh oh, we have a problem! Our causal information should also be telling us that rain and watering are independent - P(RWa) = P(R)P(Wa). What have I done wrong, and how can I do it right rather than just patching a hole?
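The mismatch is easy to verify directly from the four stated probabilities. A small check, with dictionary keys chosen to match the R / r, Wa / wa, We / we labels above:

```python
from fractions import Fraction

# The distribution stated above, keyed by (rain, watering, wetness).
dist = {
    ("R", "Wa", "We"): Fraction(1, 6),
    ("R", "wa", "We"): Fraction(1, 6),
    ("r", "Wa", "We"): Fraction(1, 6),
    ("r", "wa", "we"): Fraction(1, 2),
}

p_rain = sum(p for (rain, _, _), p in dist.items() if rain == "R")
p_watering = sum(p for (_, watering, _), p in dist.items() if watering == "Wa")
p_both = sum(p for (rain, watering, _), p in dist.items()
             if rain == "R" and watering == "Wa")

print(p_rain, p_watering)         # 1/3 1/3
print(p_both)                     # 1/6
print(p_rain * p_watering)        # 1/9, so rain and watering are not independent here
```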

The obvious patch is just to say that everything that is d-separated is independent. And if we have the correct prior probability distribution, then conditionalization works properly at handling changing d-separation.

But it's not clear that there are no more problems - and I'm not even sure of a good foothold to attack this more abstractly.

I'm pretty sure the solution here is just to assume that the usual iterative procedure is correct. If something can be proven equivalent to that, it works. Even if the d-separation independence thing is just a patch, the correct solution probably won't need many patches because the iterative procedure is simple.

Problem 2

If I find a causal prior and then make observations, my updates can change my probabilities for various parent nodes. E.g. in the marble game, if I condition on a white marble, my probability of Heads changes. But shouldn't conditioning be equivalent to just adding the information you're conditioning on to your pool of information, and rederiving from scratch? And yet if we follow the procedure above, the parent node's probability is totally fixed. What gives?
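To make the update concrete, here is a sketch of ordinary conditioning on the causal prior for the marble game (TB = 1/2, HB = 1/4, HW = 1/4). The helper name is made up for this example.

```python
from fractions import Fraction

# Causal prior over (coin, marble) for the marble game.
prior = {
    ("H", "B"): Fraction(1, 4),
    ("H", "W"): Fraction(1, 4),
    ("T", "B"): Fraction(1, 2),
}

def p_heads_given_marble(joint, marble):
    """P(Heads | observed marble colour) by direct conditioning."""
    evidence = sum(p for (_, m), p in joint.items() if m == marble)
    heads_and_marble = sum(p for (c, m), p in joint.items()
                           if c == "H" and m == marble)
    return heads_and_marble / evidence

print(p_heads_given_marble(prior, "W"))  # 1
print(p_heads_given_marble(prior, "B"))  # 1/3
```

Both posteriors differ from the prior P(Heads) = 1/2, which is exactly the tension with a procedure that pins the parent node's probability in advance.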

This actually works if you condition every probability, including the probability of the parent nodes, on the observed information. For example, option one is to start with all options possible in the marble game and then observe that the result was not Heads and White. Option two is to determine the marble color causally, in a way that never even has the possibility of White when Heads. These two options result in different probabilities.

This really reinforces how the information about how a node's value is causally generated is different from observed information about that node.