Causality Adds Up to Normality

johnswentworth

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

For definitions and notational conventions used here, see What is Abstraction?.

The world turns out to run on quantum mechanics, rather than classical. But classical mechanics still predicts the world exactly as well as it did all along. In particular, it still works in all the situations where we already knew it worked before figuring out quantum, i.e. high-level (“macroscopic”) objects. It all adds up to normality.

Formalizing this idea somewhat in the context of abstraction suggests an interesting class of questions.

We learn some high-level model $M^{H}$ by looking only at high-level data - e.g. we learned the laws of classical physics by looking at the behavior of macroscopic objects. Eventually, we learn a lower-level model $M^{L}$ of the same system. What properties of $M^{H}$ must carry over to $M^{L}$? When we learn high-level properties of a system, what does that tell us about the low-level properties?

This post is about one particular class of properties: causal structure, i.e. arrows in graphical models. Basic results:

Conditional independence relations between far-apart high-level components must also hold between the corresponding low-level components
Interventional dependence relations between far-apart high-level components must also hold between the corresponding low-level components

Together, these imply that the causal structure of far-apart low-level components is compatible with the causal structure of the corresponding high-level components, assuming that both levels follow a causal DAG at all. Visually:

Conditional Independence

We’ll start with unconditional independence, as an easy warmup. We’ll assume that far-apart high-level variables $X_{i}^{H}$ and $X_{j}^{H}$ are independent:

$P [X_{i}^{H}, X_{j}^{H}] = P [X_{i}^{H}] P [X_{j}^{H}]$

Working directly from the systems view of abstraction, we then have:

$P [(X_{i}^{L}, X_{i}^{H}), (X_{j}^{L}, X_{j}^{H})] = P [X_{i}^{H}, X_{j}^{H}] P [X_{i}^{L} | X_{i}^{H}] P [X_{j}^{L} | X_{j}^{H}]$

$= P [X_{i}^{H}] P [X_{j}^{H}] P [X_{i}^{L} | X_{i}^{H}] P [X_{j}^{L} | X_{j}^{H}]$ (by independence assumption)

We can then integrate out the high-level variables to obtain:

$P [X_{i}^{L}, X_{j}^{L}] = P [X_{i}^{L}] P [X_{j}^{L}]$

This is fairly obvious from a visual representation of the abstraction condition:

Abstraction validity means that far-apart low-level $X_{i}$’s are independent given their corresponding high-level summaries $f_{i}(X_{i})$. In other words, we can swap each low-level variable with its summary, and just keep track of dependence relations between the summaries rather than the low-level variables themselves, as long as all the variables are far apart.

Far-apart low-level variables are independent given their summaries, so if the summaries are themselves unconditionally independent, then the low-level variables must be unconditionally independent.
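As a quick sanity check on the warmup argument, here is a minimal numerical sketch. It assumes small discrete variables and deterministic summaries $X^{H} = f(X^{L})$; the summary maps `f_i` and `f_j`, the helper names, and the random tables are all made up for illustration. It builds the joint distribution from independent high-level marginals plus the $P[X^{L} | X^{H}]$ factors, integrates out the high-level variables, and checks that the low-level joint factorizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dist(n):
    """Return a random probability vector of length n."""
    p = rng.random(n)
    return p / p.sum()

def low_given_high(f, n_high):
    """P[X^L | X^H] as an (n_high, n_low) table supported on {x : f[x] == h}."""
    f = np.asarray(f)
    table = np.zeros((n_high, f.size))
    for h in range(n_high):
        mask = f == h
        table[h, mask] = random_dist(int(mask.sum()))
    return table

# Hypothetical summary maps: index = low-level value, entry = high-level value.
f_i = [0, 0, 1, 1, 1]
f_j = [0, 1, 1, 0]

# Independent high-level marginals P[X_i^H] and P[X_j^H].
p_hi, p_hj = random_dist(2), random_dist(2)
L_i, L_j = low_given_high(f_i, 2), low_given_high(f_j, 2)

# Joint over the high-level pair, then sum out both high-level variables.
p_high = np.outer(p_hi, p_hj)
p_low = np.einsum("ab,ax,by->xy", p_high, L_i, L_j)

# Compare the low-level joint against the product of its marginals.
p_li, p_lj = p_low.sum(axis=1), p_low.sum(axis=0)
independent = np.allclose(p_low, np.outer(p_li, p_lj))
print(independent)
```

This only checks one randomly generated instance, so it illustrates the claim rather than proving it.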

The proof for conditional independence follows a similar pattern, but with a couple of extra steps in the middle, and uses the lemma from the appendix here. We start with high-level conditional independence:

$P [X_{i}^{H}, X_{j}^{H}, X_{S}^{H}] = P [X_{i}^{H} | X_{S}^{H}] P [X_{j}^{H} | X_{S}^{H}] P [X_{S}^{H}]$

… where $i$, $j$, and $S$ are all far apart. Again, we factor via the systems view of abstraction:

$P [(X_{i}^{L}, X_{i}^{H}), (X_{j}^{L}, X_{j}^{H}), (X_{S}^{L}, X_{S}^{H})] = P [X_{i}^{H}, X_{j}^{H}, X_{S}^{H}] P [X_{i}^{L} | X_{i}^{H}] P [X_{j}^{L} | X_{j}^{H}] P [X_{S}^{L} | X_{S}^{H}]$

$= P [X_{i}^{H} | X_{S}^{H}] P [X_{j}^{H} | X_{S}^{H}] P [X_{S}^{H}] P [X_{i}^{L} | X_{i}^{H}] P [X_{j}^{L} | X_{j}^{H}] P [X_{S}^{L} | X_{S}^{H}]$ (by high-level conditional independence)

$= P [X_{i}^{L}, X_{i}^{H} | X_{S}^{H}] P [X_{j}^{L}, X_{j}^{H} | X_{S}^{H}] P [X_{S}^{L}, X_{S}^{H}]$ (by abstraction conditions)

$= P [X_{i}^{L}, X_{i}^{H} | X_{S}^{L}] P [X_{j}^{L}, X_{j}^{H} | X_{S}^{L}] P [X_{S}^{L}, X_{S}^{H}]$ (by lemma)

Then, we just integrate out all the high-level variables, and find

$P [X_{i}^{L}, X_{j}^{L}, X_{S}^{L}] = P [X_{i}^{L} | X_{S}^{L}] P [X_{j}^{L} | X_{S}^{L}] P [X_{S}^{L}]$

Also note that this entire proof goes through exactly the same if we replace the individual variables $i$ and $j$ with sets of far-apart variables $I$ and $J$.

In summary: if high-level variables $X_{I}^{H}$ and $X_{J}^{H}$ are independent given $X_{S}^{H}$ (all far apart), then the corresponding low-level variables $X_{I}^{L}$ and $X_{J}^{L}$ are independent given $X_{S}^{L}$.
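The conditional version can be checked numerically in the same way. The sketch below again assumes small discrete variables and deterministic summaries, so that each low-level value of $X_{S}^{L}$ pins down the value of $X_{S}^{H}$; the maps `f_i`, `f_j`, `f_s` and all the tables are hypothetical. It builds a high-level joint in which $X_{i}^{H}$ and $X_{j}^{H}$ are independent given $X_{S}^{H}$, attaches the $P[X^{L} | X^{H}]$ factors, sums out the high-level variables, and tests whether $P[X_{i}^{L}, X_{j}^{L}, X_{S}^{L}] \, P[X_{S}^{L}] = P[X_{i}^{L}, X_{S}^{L}] \, P[X_{j}^{L}, X_{S}^{L}]$, which is the conditional independence statement with the denominators cleared.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_cond(n_rows, n_cols):
    """Random conditional table: each row is a distribution over n_cols values."""
    t = rng.random((n_rows, n_cols))
    return t / t.sum(axis=1, keepdims=True)

def low_given_high(f, n_high):
    """P[X^L | X^H], zero wherever the low-level value is incompatible with h."""
    f = np.asarray(f)
    compatible = f[None, :] == np.arange(n_high)[:, None]
    table = rng.random((n_high, f.size)) * compatible
    return table / table.sum(axis=1, keepdims=True)

# Hypothetical summary maps (low-level value -> high-level value).
f_i, f_j, f_s = [0, 0, 1], [0, 1, 1, 1], [0, 1, 0, 1, 1]

p_hs = random_cond(1, 2)[0]          # P[X_S^H]
p_hi_given_hs = random_cond(2, 2)    # P[X_i^H | X_S^H], indexed [s, i]
p_hj_given_hs = random_cond(2, 2)    # P[X_j^H | X_S^H], indexed [s, j]

# High-level joint with X_i^H independent of X_j^H given X_S^H, indexed [i, j, s].
p_high = np.einsum("s,si,sj->ijs", p_hs, p_hi_given_hs, p_hj_given_hs)

# Attach P[X^L | X^H] for each variable and sum out the high-level variables.
L_i, L_j, L_s = (low_given_high(f, 2) for f in (f_i, f_j, f_s))
p_low = np.einsum("ijs,ix,jy,sz->xyz", p_high, L_i, L_j, L_s)

p_ls = p_low.sum(axis=(0, 1))    # P[X_S^L]
p_li_ls = p_low.sum(axis=1)      # P[X_i^L, X_S^L], indexed [x, z]
p_lj_ls = p_low.sum(axis=0)      # P[X_j^L, X_S^L], indexed [y, z]

lhs = p_low * p_ls[None, None, :]
rhs = p_li_ls[:, None, :] * p_lj_ls[None, :, :]
cond_independent = np.allclose(lhs, rhs)
print(cond_independent)
```

The deterministic-summary assumption is doing real work here: it is what lets conditioning on the low-level $X_{S}^{L}$ stand in for conditioning on $X_{S}^{H}$ in this toy setup.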

Interventions

What is Abstraction? gave a few possible forms of the interventional query constraints for causal abstraction. These different possibilities correspond to stronger/weaker assumptions about the methods used to perform high-level interventions and/or the data from which the interventional behavior is learned. For purposes of this post, we won’t dissect the particulars of the different possibilities; we’ll just use a general constraint form which matches any of them:

$P [X_{S}^{L}, X_{S}^{H} | do(X_{T}^{L} = \text{whatever})] = P [X_{S}^{H} | do(X_{T}^{H} = f_{T}(\text{whatever}))] \prod_{i \in S} P [X_{i}^{L} | X_{i}^{H}]$

… where $S$ and $T$ are far apart. Here “whatever” could range over all possible values of $X_{T}^{L}$ , or it could randomly sample from possible values given some parameter, or it could sample the possible values from some entirely different distribution, or ….

What matters is that there are two ways to expand $P [X_{S}^{L}, X_{S}^{H} | do(\ldots)]$. One of them is the abstraction condition, but the other is to just use the factorization associated with the original construction of $X_{S}^{H}$ from $X_{S}^{L}$ (i.e. the “true” underlying causal structure):

$P [X_{S}^{L}, X_{S}^{H} | do(X_{T}^{L} = \text{whatever})] = P [X_{S}^{L} | do(X_{T}^{L} = \text{whatever})] \prod_{i \in S} P [X_{i}^{H} | X_{i}^{L}]$

Equating these two, we find

$P [X_{S}^{L} | do(X_{T}^{L} = \text{whatever})] = P [X_{S}^{H} | do(X_{T}^{H} = f_{T}(\text{whatever}))] \prod_{i \in S} \frac{P [X_{i}^{L} | X_{i}^{H}]}{P [X_{i}^{H} | X_{i}^{L}]}$

$= P [X_{S}^{H} | do(X_{T}^{H} = f_{T}(\text{whatever}))] \prod_{i \in S} \frac{P [X_{i}^{L}]}{P [X_{i}^{H}]}$ (by Bayes)

Our low-level interventional distribution is proportional to our high-level interventional distribution. Key thing to notice: the proportionality factor $\prod_{i \in S} \frac{P [X_{i}^{L}]}{P [X_{i}^{H}]}$ does not depend on our interventional parameters (i.e. “whatever”)¹.

Why is that so important? Well, it means that varying the intervention value $\text{whatever}_{i}$ changes the far-away low-level distribution $X_{S}^{L}$ if and only if it changes the far-away high-level distribution $X_{S}^{H}$. In particular, if interventions on some subset of variables “screen off” interventions on others at the high level, then interventions on the corresponding low-level variables follow the same pattern.

Also note that our general abstraction constraint implies that any variables not intervened upon still follow the usual abstraction constraints - i.e. far-apart low-level variables are still independent given their high-level summaries, even after intervening on other far-away variables.
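To make the "changes one if and only if it changes the other" claim concrete, here is a small numerical sketch. It assumes a single far-away variable $S$, discrete values, and deterministic summaries, so that the proportionality factor reduces to $P[X_{S}^{L} | X_{S}^{H}]$ on compatible values. The maps `f_T` and `f_S` and the tables are invented for illustration. The low-level interventional family is built from the high-level one via the abstraction constraint, and the script then compares how far apart two interventions are at each level, using total variation distance (a choice made for this sketch).

```python
import numpy as np

rng = np.random.default_rng(2)

f_T = np.array([0, 0, 1, 1])          # hypothetical summary of the intervened variable
f_S = np.array([0, 0, 1, 1, 2, 2])    # hypothetical summary of the far-away variable

# High-level interventional family P[X_S^H | do(X_T^H = t)], one row per t.
p_high_do = rng.random((2, 3))
p_high_do /= p_high_do.sum(axis=1, keepdims=True)

# P[X_S^L | X_S^H], supported only on low-level values compatible with each summary.
low_given_high = rng.random((3, 6)) * (f_S[None, :] == np.arange(3)[:, None])
low_given_high /= low_given_high.sum(axis=1, keepdims=True)

# weight[x] = P[X_S^L = x | X_S^H = f_S(x)]; it never looks at the intervention.
weight = low_given_high[f_S, np.arange(6)]

# p_low_do[w, x] = P[X_S^H = f_S(x) | do(X_T^H = f_T(w))] * weight[x]
p_low_do = p_high_do[f_T][:, f_S] * weight[None, :]

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

# Compare every pair of low-level interventions with the matching high-level pair.
pairs = [(w, v) for w in range(4) for v in range(4)]
low_tv = np.array([tv(p_low_do[w], p_low_do[v]) for w, v in pairs])
high_tv = np.array([tv(p_high_do[f_T[w]], p_high_do[f_T[v]]) for w, v in pairs])
same_sensitivity = np.allclose(low_tv, high_tv)
print(same_sensitivity)
```

In this toy setup the two distances agree exactly, so two low-level interventions with the same summary (e.g. `w = 0` and `w = 1`) produce identical far-away low-level distributions, and interventions with different summaries move the low-level distribution exactly when they move the high-level one.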

In combination with our conditional independence results from the previous section, this trivially implies things like:

If $X_{I}^{H}$ and $X_{J}^{H}$ are conditionally independent given $X_{S}^{H}$ under an intervention on $X_{T}^{H}$ (with everything far apart), then $X_{I}^{L}$ and $X_{J}^{L}$ are conditionally independent given $X_{S}^{L}$ under a corresponding intervention on $X_{T}^{L}$
If the distribution of $X_{I}^{H}$ given $X_{S}^{H}$ under an intervention on $X_{T}^{H}$ does not depend on the intervention value, then the distribution of $X_{I}^{L}$ given $X_{S}^{L}$ under an intervention on $X_{T}^{L}$ also does not depend on the intervention value.

These match the conditions Pearl uses (on pg 24 of Causality, 2nd ed) to define what it means for a DAG $G$ to model a family of interventional distributions². In particular, this means that, if a DAG $G$ models the family of high-level interventional distributions on far-apart variables, then it also models the family of low-level interventional distributions on the corresponding variables.

Or, in English: the low-level causal structure is compatible with the high-level causal structure.

Fine Print

Lastly, I’ll highlight the major loopholes in the argument above.

First, this is saying that if the relationship between high-level and low-level model matches the definitions we’ve been using, then causality adds up to normality. But we still haven’t talked about why this would actually happen in practice - i.e. why humans would learn high-level models which are related to the universe’s lower-level structure in this way. For now, my main argument is “after looking at lots of examples that’s what seems to happen in practice”, but eventually I expect to prove that this is a general property of Bayesian learning in causal universes.

Second, we’ve only mentioned DAG-shaped causality, not causality with loops. If we want to deal with e.g. equilibrium processes, then we need to expand out the whole trajectory-over-time. I expect the arguments above do extend to loopy graphs, but it’s not something I’ve put any thought into; I usually just ignore non-DAG notions of causality.

Third, our main conclusion says “if a DAG $G$ models the family of high-level interventional distributions on far-apart variables, then it also models the family of low-level interventional distributions on the corresponding variables”. Notably, this does not say that $G$ is the only DAG which models the family of low-level interventional distributions. We don’t have a uniqueness result here.

Footnotes

¹ This assumes that $P [X_{i}^{H} | X_{i}^{L}] \neq 0$ , which is a big assumption; $P [X_{i}^{H} | X_{i}^{L}]$ is zero for most $(X_{i}^{H}, X_{i}^{L})$ value pairs. So this formula only applies for values of $X_{i}^{L}$ compatible with the $X_{i}^{H}$ value. For purposes of causality, that’s not an issue; we’re mostly worried about varying the intervention parameters, not the query values.

² Along with one other condition, which is trivially satisfied as long as both the high- and low-level models are causal DAG models by themselves.

Mateusz Bagiński (2y)

“… this would actually happen in practice - i.e. why humans would learn high-level models which are related to the universe’s lower-level structure in this way.”

My intuition is that this makes it possible to model long-range dependencies with minimum cognitive costs.

Also, do you think there's some utility in modelling abstracting into $X^{H}$ as a forgetful functor?

johnswentworth (2y)

Plausibly, it does have that flavor to it.
