I’ve been thinking lately about formalizations of the Correspondence Principle - the idea that new theories should reproduce old theories, at least in the places where the old theories work. Special relativity reduces to Galilean relativity at low speed/energy, general relativity reduces to Newtonian gravity when the fields are weak, quantum mechanics should reproduce classical mechanics at large scale, etc. More conceptually, it’s the idea that flowers are “real”: any model which does a sufficiently-good job of predicting the world around me should have some kind of structure in it corresponding to my notion of a flower (though it may not be ontologically basic).
I want theorems telling me when my models are of the right kind, and sufficiently-well-nailed-down, that I can expect some kind of correspondence along these lines to apply for any future theory (assuming the future theory has sufficiently strong predictive power). Ideally, this would allow us to take a model and extract the “generalizable” information from it: construct some universal representation of certain components of the model which lets us find the “corresponding” components in any other model with sufficiently good predictive performance on the same data. (This problem is one example of a potential application.)
This post introduces one such theorem. We’ll show that a (particular type of) correspondence theorem applies in exactly those cases where our model is able to make deterministic predictions about the relationship between some observable variables.
I've tried to keep the hairiest of the math in the appendices and the "key ideas from the math" section. If you want the key ideas with minimal math, then skip those sections.
Here’s the setup. We have two random variables, X1 and X2, and we have enough data that we can perfectly figure out their joint distribution P[X1,X2]. We don’t necessarily know what physical process generates these variables - i.e. we don’t know the true underlying generative model. You can imagine some scientific experiment where we can collect lots of independent samples of X1,X2, then brute-force count how often each (X1,X2) pair occurs in order to estimate P[X1,X2], but we don’t have any way to experimentally probe the “internals” of the experimental setup. In other words, it's the traditional setup from frequentist probability.
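As a concrete toy version of this setup (the generative process here is entirely hypothetical, invented for illustration), here's a sketch in Python: some hidden process produces samples of X1, X2, and we estimate P[X1,X2] by brute-force counting, without ever looking at the process' internals.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical generative process, unknown to the experimenter: a hidden
# bit u picks a "block", then X1 and X2 vary independently within it.
def sample():
    u = random.randrange(2)
    return 2 * u + random.randrange(2), 2 * u + random.randrange(2)

# Brute-force estimate of P[X1, X2]: just count how often each pair occurs.
n = 100_000
counts = Counter(sample() for _ in range(n))
joint = {pair: c / n for pair, c in counts.items()}
```

The experimenter only ever sees `joint`; the hidden bit `u` inside `sample` stands in for the unknown internals of the experiment.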
We’ll assume that the class of “theories” we’re interested in are generative models - i.e. programs which take in some independent uniform variables and spit out X1,X2. For our purposes, we’ll assume these programs are represented as systems of circuits/causal DAGs/Bayes nets - see here for what that sort of representation looks like. (Note that this representation is Turing complete, so we’re not imposing any significant restriction.)
As long as X1 and X2 are not independent, any generative model which reproduces the distribution P[X1,X2] must have some variable(s) upstream of both X1 and X2 - some common upstream cause(s) which account for the relationship between the variables. The claim I’d like to make is: any upstream variable(s) capable of accounting for the relationship between X1 and X2 must contain some “minimal” information. Formally: we can construct some variable U∗ (defined by the distributions P[U∗],P[X1|U∗],P[X2|U∗]) such that X1 and X2 are independent given U∗:

P[U∗,X1,X2] = P[U∗] P[X1|U∗] P[X2|U∗]
… and, for any other U which induces conditional independence of X1 and X2, U∗ is a (possibly stochastic) function of U: for all U such that

P[U,X1,X2] = P[U] P[X1|U] P[X2|U]

... we have

P[U∗|X1,X2,U] = P[U∗|U]
Conceptually: U∗ is the “minimal” information about the relationship between X1 and X2. It must be “included in” any variable U capable of accounting for the relationship between those two variables.
That’s the claim I’d like to make. However, that claim is too strong - not all distributions P[X1,X2] have such a U∗. Indeed, in some sense “most” do not, though we can work around that to a large extent. The next section will explain exactly when we can construct such a U∗, and the appendix gives the proof. Specifically, we can construct U∗ in exactly those cases where the relationship between X1 and X2 is independent variation subject to a deterministic constraint. Formally: we can construct U∗ exactly when P[X1,X2] can be represented as

P[X1,X2] = I[f1(X1)=f2(X2)] P[f1(X1)] P[X1|f1(X1)] P[X2|f2(X2)]
… for some deterministic functions f1,f2 (here I[.] is the indicator function). In that case, we can choose U∗=f1(X1)=f2(X2).
We have to be able to calculate our “universal” upstream U∗ from any U which induces independence of X1,X2. In particular, X1 and X2 are independent given either X1 or X2 itself - so we can choose either of those to be U. Combine that with the original requirements, and we have:

X1 ⫫ X2 | U∗,   U∗ ⫫ X2 | X1,   U∗ ⫫ X1 | X2
This three-way conditional independence requirement is the main constraint which determines which distributions P[X1,X2] have a U∗.
One easy way to satisfy this three-way conditional independence is when X1, X2, and U∗ are all just completely independent. Another is when all three are deterministically equal - i.e. U∗=0 implies X1=X2=0, U∗=1 implies X1=X2=1, etc. And of course we can rename variable values while maintaining three-way conditional independence - e.g. we could rename X1 values from [0,1,…] to [“fish”,12.38,…]. So all three variables being isomorphic - rather than equal - also works: U∗=f1(X1)=f2(X2), with f1 and f2 invertible.
We could also combine these two possibilities: U∗, X1, and X2 could each have two components, where one component is independent and the other component is fully determined by either of the other variables. For instance, we could have U∗=(U∗ᵃ,U∗ᵇ), X1=(X1ᵃ,X1ᵇ), X2=(X2ᵃ,X2ᵇ), with the “a” components U∗ᵃ, X1ᵃ, X2ᵃ all completely independent, but the “b” components U∗ᵇ, X1ᵇ, X2ᵇ all deterministically isomorphic. Again, we can also rename variable values, so our variables don’t need to literally be written with two components.
… and that turns out to be the most general possibility.
Visually: if we lay out the distribution P[X1,X2] in a matrix (with rows corresponding to X1 values and columns corresponding to X2 values), then we can rearrange the rows and columns so that the matrix looks like this:

[figure: a block-diagonal matrix - dark rectangular blocks of nonzero entries along the diagonal, zeros everywhere else]
Here the dark blocks are nonzero values; everything else is zero. Within the dark blocks, we have P[X1,X2]=P[X1]P[X2] - i.e. the dark blocks are each rank-1 submatrices. Intuitively, our distribution P[X1,X2] consists of a deterministic relationship (the choice of which block we’re in) plus independent variation (choice of values within the block). This generalizes in the obvious way to the full distribution P[U∗,X1,X2]: the full distribution consists of non-overlapping 3D blocks of nonzeros, with the three variables independent within each block.
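To make the block picture concrete, here's a small sketch (the numbers are toy values of my own): we assemble a joint distribution from two blocks, each of which is rank 1 because X1 and X2 vary independently within it.

```python
# Toy numbers for a distribution with the block structure described above:
# two blocks; within each block, X1 and X2 vary independently.
P_block = {"A": 0.4, "B": 0.6}                          # P[block]
P_x1 = {"A": {0: 0.5, 1: 0.5}, "B": {2: 0.2, 3: 0.8}}  # P[X1 | block]
P_x2 = {"A": {0: 0.3, 1: 0.7}, "B": {2: 0.9, 3: 0.1}}  # P[X2 | block]

# Assemble P[X1, X2]; off-block entries are simply absent (i.e. zero).
joint = {
    (x1, x2): P_block[b] * p1 * p2
    for b in P_block
    for x1, p1 in P_x1[b].items()
    for x2, p2 in P_x2[b].items()
}

# Each block is rank 1: every 2x2 cross-ratio within a block balances,
# i.e. P[x1,x2] * P[x1',x2'] == P[x1,x2'] * P[x1',x2].
assert abs(joint[(0, 0)] * joint[(1, 1)] - joint[(0, 1)] * joint[(1, 0)]) < 1e-12
```

The cross-ratio check at the end is exactly the 2×2-minor test for a rank-1 submatrix.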
For the full proof that these are the only distributions for which we can construct U∗, see the appendix.
Note that we do still have a degree of freedom in choosing U∗ - we can add or remove independent components. The obvious choice is to remove all the independent components from U∗, and just keep the deterministic component. Visually, we take U∗ to “choose a block” in the graphic above. (In fact, we could have just specified upfront that U∗ is a deterministic function of any possible U, rather than the more general stochastic function, but I usually prefer to use the more general version just in case the problem has some implicit embedded game with mixed equilibria - the stochastic function allows U∗ to “randomize its strategy”. In this case, it turns out that the generality doesn’t buy us anything, and that is itself useful to know; it means there’s probably no diagonalization shenanigans going on.)
The appendix proves that this choice of U∗ indeed satisfies all of our original conditions.
To wrap up this section, let’s write out the shape of P[X1,X2] mathematically. Each value of X1 has nonzero probability in only one “block”, so we can create a function f1(X1) which picks out the block for each X1 value. Likewise for f2(X2). P[X1,X2] can be nonzero only if X1 and X2 are in the same block: f1(X1)=f2(X2). The chance of landing in a particular block is P[f1(X1)] (which is equal to P[f2(X2)], since f1(X1) and f2(X2) must always be equal). Within the block (i.e. conditional on f1(X1)), X1 and X2 are independent. Put all that together, and we get

P[X1,X2] = I[f1(X1)=f2(X2)] P[f1(X1)] P[X1|f1(X1)] P[X2|f2(X2)]
Since f1(X1) picks out the block, and U∗ is the block choice, U∗=f1(X1)=f2(X2).
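One way to actually compute f1 and f2 from a tabulated joint distribution: group values which have nonzero joint probability together (“overlap”, in the appendix’s terminology) into connected components. Here's a sketch; the union-find details and the toy numbers are my own.

```python
# Sketch of extracting U* = f1(X1) = f2(X2) from a block-structured joint
# distribution: treat each nonzero entry as an edge between an X1 value and
# an X2 value, and let the blocks be the connected components.
def find_blocks(joint):
    # Union-find over X1 and X2 values (tagged so they can't collide).
    parent = {}
    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for (x1, x2), p in joint.items():
        if p > 0:
            parent[find(("x1", x1))] = find(("x2", x2))
    # f1 and f2 map each value to its block (a component representative).
    f1 = {x1: find(("x1", x1)) for (x1, _) in joint}
    f2 = {x2: find(("x2", x2)) for (_, x2) in joint}
    return f1, f2

# Toy block-structured distribution: values {0,1} form one block, {2,3} another.
joint = {(0, 0): 0.06, (0, 1): 0.14, (1, 0): 0.06, (1, 1): 0.14,
         (2, 2): 0.108, (2, 3): 0.012, (3, 2): 0.432, (3, 3): 0.048}
f1, f2 = find_blocks(joint)

# U* = f1(X1) = f2(X2) holds on every nonzero entry.
assert all(f1[x1] == f2[x2] for (x1, x2), p in joint.items() if p > 0)
```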
At first glance, it looks like the conditions required to use this theorem are pretty restrictive. But there’s an easy trick to generalize it somewhat.
Suppose we have a distribution P[X1,X2] which does not fully satisfy the requirements, but it does have a deterministic constraint: g1(X1)=g2(X2) with probability 1. Then we can’t apply our correspondence theorem to X1 and X2, but we can apply it to g1(X1) and g2(X2). We can construct U∗ satisfying this:

P[U∗,g1(X1),g2(X2)] = P[U∗] P[g1(X1)|U∗] P[g2(X2)|U∗]
In particular, we can choose U∗=g1(X1)=g2(X2).
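Here's a toy illustration of the trick (the setup and numbers are my own): this world violates the full block structure, because X1 and X2 remain correlated even within a block - but a deterministic constraint g1(X1)=g2(X2) still holds, so the theorem applies to g1(X1) and g2(X2).

```python
import random

random.seed(1)

# Hidden process: u picks the block, but X1 and X2 share noise, so they
# stay correlated within a block -- each block is NOT rank 1.
def sample():
    u = random.randrange(2)                       # hidden common cause
    n1 = random.randrange(2)
    n2 = n1 if random.random() < 0.9 else 1 - n1  # within-block correlation
    return 2 * u + n1, 2 * u + n2

g1 = g2 = lambda x: x // 2  # g picks out the block; U* = g1(X1) = g2(X2)

pairs = [sample() for _ in range(10_000)]
assert all(g1(x1) == g2(x2) for x1, x2 in pairs)  # constraint holds w.p. 1
```

The within-block correlation blocks the original theorem, but `g1(X1)` and `g2(X2)` are deterministically equal, so the theorem applies to them directly.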
So: whenever we can identify a deterministic constraint in the world, we can apply this correspondence theorem. In other words, deterministic constraints in one model should have corresponding structures in other models, to the extent that all models match the environment.
Note that “deterministic constraints” is exactly the content of most models in the hard sciences. For instance, equations in physics are almost always deterministic constraints on the trajectory of world-states. To the extent that these equations match the real world, we should see corresponding constraints in future theories/models which also match the real world.
We can push this further.
We haven’t talked about approximations here, but let’s assume that the theorem generalizes to approximations in the obvious way - i.e. approximately deterministic constraints give approximate correspondence, with independencies replaced by approximate independencies. Once we have approximations, we can construct deterministic constraints via statistical identification.
Here’s what that means. Consider something like the ideal gas law, PV=nRT. Observing a few specific particles will not tell us the pressure or temperature of the gas. Individual particles have only partial information about the high-level variables, so we don’t have a correspondence theorem. However, if we bucket together a whole bunch of particles, then we can get a very precise estimate of temperature and pressure - in stats terminology, P and T can be “identified” from our many independent particles. We can then have deterministic relationships between the identified variables, like PV=nRT. Then we can apply the correspondence theorem.
Similarly, we could identify variables using multiple “bunches of particles” - or, more generally, multiple measurements/multiple lines of evidence. For instance, maybe we can use a whole bunch of measurements to determine the gravitational constant. If we do this again with another set of measurements, we expect to get the same number. That’s a deterministic constraint: the gravitational constant calculated from one set of measurements should equal the gravitational constant calculated from another. To the extent that this constraint matches the real world, our correspondence theorem says the gravitational constant should correspond to something in any future theory.
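The identification idea can be sketched numerically (a toy model with made-up numbers, not real physics): one particle's energy carries only partial information about temperature, but averaging a large bunch pins it down, and two disjoint bunches then satisfy an approximately deterministic constraint - their estimated temperatures agree.

```python
import random

random.seed(2)

TRUE_TEMP = 3.7  # hidden "temperature" of the toy gas

def particle_energy():
    # toy stand-in for a single particle's kinetic energy: exponentially
    # distributed with mean equal to the temperature
    return random.expovariate(1 / TRUE_TEMP)

# Two disjoint bunches of particles, each large enough to "identify" the
# temperature precisely.
bunch1 = [particle_energy() for _ in range(100_000)]
bunch2 = [particle_energy() for _ in range(100_000)]

temp1 = sum(bunch1) / len(bunch1)  # identified temperature, bunch 1
temp2 = sum(bunch2) / len(bunch2)  # identified temperature, bunch 2
```

With 100k particles per bunch, `temp1` and `temp2` agree to within a fraction of a percent - an approximately deterministic constraint between the two identified variables.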
In some ways, this theorem leaves a lot to be desired. In particular, it assumes access to the “true distribution”, which I’d really like to get rid of. I expect there are other correspondence theorems to be found, with very different formulations, some of which would be better in that regard. For instance, for maximum entropy models, there’s an easy correspondence theorem which says “given an old model and a new model, either the constraints of the old model are satisfied by the new model, or we can construct a third model which strictly outperforms the new model”. (That theorem is unimpressive in other ways, however - it doesn’t really say anything about the “internal structure” of the models.)
Aside from direct use as a correspondence theorem, “independent variation subject to deterministic constraints” is an interesting thing to see pop out. It’s a common pattern in the sciences - it describes far-apart low-level variables in any abstraction where the high-level model is deterministic. I also think it’s equivalent to zero distributed information, which I previously predicted ought to show up quite often when looking at information relevant far away in a system. On top of that, humans often have a (mistaken) intuition that information works like sets - information comes in discrete chunks, and a variable either contains a chunk or it doesn’t. This leads to some common mistakes when studying information theory. For systems in which variation is independent subject to deterministic constraints, the “information is like sets” intuition is correct, and I suspect that it’s the only case where that picture works in general (for some formulation of “in general”).
It seems like there’s something fundamental about this pattern, and this correspondence theorem is another hint.
Even having guessed the answer, I found these proofs surprisingly difficult. There’s probably some way to set things up so it’s more obvious what to do, but I’m not sure what it is, other than that someone will tell me it has something to do with category theory.
The problem: we want U∗ for which

P[U∗,X1,X2] = P[U∗] P[X1|U∗] P[X2|U∗]
and for all U such that P[U,X1,X2]=P[U]P[X1|U]P[X2|U], we have

P[U∗|X1,X2,U] = P[U∗|U]
We’ll show that such a U∗ exists exactly when P[X1,X2] has the form

P[X1,X2] = I[f1(X1)=f2(X2)] P[f1(X1)] P[X1|f1(X1)] P[X2|f2(X2)]

… for some deterministic functions f1, f2, in which case U∗=f1(X1)=f2(X2) satisfies all the conditions.
We'll start off the same way as earlier: we want U∗ ⫫ (X1,X2) | U (in English: U∗ independent of (X1,X2) given U) for any U satisfying X1 ⫫ X2 | U. Well, there are two choices for U which definitely satisfy X1 ⫫ X2 | U: namely, X1 and X2 themselves. So:

U∗ ⫫ X2 | X1   and   U∗ ⫫ X1 | X2
… and by assumption, X1⫫X2|U∗. So, any two of the variables X1, X2, and U∗ are independent given the third.
Let’s write out P[U∗,X1,X2] in two different ways, using two of our conditional independencies:

P[U∗,X1,X2] = P[X1] P[X2|X1] P[U∗|X1] = P[X2] P[X1|X2] P[U∗|X2]
Note that P[X1]P[X2|X1]=P[X2]P[X1|X2]=P[X1,X2], so we can cancel those terms out and find

P[U∗|X1] = P[U∗|X2]
… if P[X1,X2]>0. Note that, since our three-way independence condition is symmetric in the variables, we can also switch around the variables to get:

P[X1|X2] = P[X1|U∗] (if P[X2,U∗]>0)   and   P[X2|X1] = P[X2|U∗] (if P[X1,U∗]>0)
These are quite strong. Pick any two X2 values, x2 and x′2, for which any x1 value has both P[X1=x1,X2=x2]>0 and P[X1=x1,X2=x′2]>0. (Terminology: we’ll say that these two X2 values “overlap” on x1, i.e. either value occurs with nonzero probability when X1=x1. We'll also say that values of two different variables overlap when they have nonzero joint probability, e.g. X1=x1 overlaps X2=x2 iff P[X1=x1,X2=x2]>0). Then

P[U∗|X2=x2] = P[U∗|X1=x1] = P[U∗|X2=x′2]
Furthermore, since P[U∗|X2=x2]=P[U∗|X2=x′2] for all U∗ values, any U∗ with P[U∗,X2=x2]>0 will also have P[U∗,X2=x′2]>0. In other words, if two X2 values overlap on any X1 value, then they also overlap on all U∗ values which overlap either X2 value. That, in turn, means

P[X1|X2=x2] = P[X1|U∗=u∗] = P[X1|X2=x′2]
… for any value u∗ on which x2,x′2 overlap. Finally, this also means that two X2 values which overlap on some X1 value overlap on all X1 values which overlap either X2 value: if P[X1=x1,X2=x2]>0 and P[X1=x1,X2=x′2]>0 for some x1, then P[X1=x′1,X2=x′2]>0 for all x′1 with P[X1=x′1,X2=x2]>0, and vice versa.
That last condition is especially useful, since it means that overlap is transitive: if x2 and x′2 overlap, and x′2 and x′′2 overlap, then x2 and x′′2 overlap, and all three overlap on the same X1 values.
That finally lets us construct our functions f1 and f2. Since we have transitivity of overlap, we can use overlap as an equivalence relation, and let f2(X2) choose the equivalence class into which X2 falls. f1(X1) chooses the equivalence class of X2 values with which X1 overlaps (note that there can only be one, since this class contains all the X2 values which overlap with this X1 value). This is the “choice of block” from our earlier visual.
The rest follows (relatively) easily. From the definition of overlap, P[X1,X2]=0 whenever f1(X1)≠f2(X2). Our independence conditions from earlier give P[X1|X2]=P[X1|f2(X2)] (since all values in an overlap equivalence class give the same conditional distributions), so we have conditional independence given f2(X2):

P[X1,X2|f2(X2)] = P[X1|f2(X2)] P[X2|f2(X2)]
Then, we just expand:

P[X1,X2] = P[f2(X2)] P[X1|f2(X2)] P[X2|f2(X2)]
Since f1(X1)=f2(X2) wherever P[X1,X2]>0, we have

P[X1|f2(X2)] = I[f1(X1)=f2(X2)] P[X1|f1(X1)]
… which gives us our final expression:

P[X1,X2] = I[f1(X1)=f2(X2)] P[f2(X2)] P[X1|f1(X1)] P[X2|f2(X2)]
(And note that we can freely switch P[f2(X2)] with P[f1(X1)] here.)
One side note: I’ve defined f1 and f2 here as selecting equivalence classes, which is kind of annoying - it doesn’t give us a clean explicit representation. If we want an explicit representation, we can choose f1(X1)=(x2→P[X2=x2|X1]), and likewise for f2. This leads to “fun” expressions like P[X1|(x2→P[X2=x2|X1])]. See the Minimal Map post for how to interpret expressions like that.
Now we assume that P[X1,X2] has the right form, declare that X1 and X2 are independent given some U, and show that f1(X1) (or equivalently, f2(X2)) is a function of U, implying that it’s independent of X1 and X2 given U.
Suppose that f1(X1) is not a function of U - i.e. there is some value u of U for which f1(X1) could have two different values f,f′ each with nonzero probability. Then, since X1 and X2 are independent given U:

P[f1(X1)=f, f2(X2)=f′ | U=u] = P[f1(X1)=f | U=u] P[f2(X2)=f′ | U=u]
Since f1(X1)=f2(X2) with probability 1, we must have P[f2(X2)=f′|U=u]=P[f1(X1)=f′|U=u]. Substituting:

P[f1(X1)=f, f2(X2)=f′ | U=u] = P[f1(X1)=f | U=u] P[f1(X1)=f′ | U=u] > 0
Now, by assumption:

P[U=u] > 0
… thus P[U=u,f1(X1)=f,f2(X2)=f′]>0.
… but that’s a contradiction, because that means there’s a nonzero probability that f1(X1)≠f2(X2).
Thus, f1(X1) is always a function of U, and is therefore independent of X1 and X2 (and anything else) given U.
Note to self: use infinitely many observable variables Xi instead of just two, and the condition for U∗ should probably be that no infinite subset of the X's are mutually dependent (or something along those lines). Intuitively: for any "piece of latent information", either we have infinite data on that piece and can precisely estimate it, or it only significantly impacts finitely many variables.