I’ve been thinking lately about formalizations of the Correspondence Principle - the idea that new theories should reproduce old theories, at least in the places where the old theories work. Special relativity reduces to Galilean relativity at low speed/energy, general relativity reduces to Newtonian gravity when the fields are weak, quantum mechanics should reproduce classical mechanics at large scale, etc. More conceptually, it’s the idea that flowers are “real”: any model which does a sufficiently-good job of predicting the world around me should have some kind of structure in it corresponding to my notion of a flower (though it may not be ontologically basic).
I want theorems telling me when my models are of the right kind, and sufficiently-well-nailed-down, that I can expect some kind of correspondence along these lines to apply for any future theory (assuming the future theory has sufficiently strong predictive power). Ideally, this would allow us to take a model and extract the “generalizable” information from it: construct some universal representation of certain components of the model which lets us find the “corresponding” components in any other model with sufficiently good predictive performance on the same data. (This problem is one example of a potential application.)
This post introduces one such theorem. We’ll show that a (particular type of) correspondence theorem applies in exactly those cases where our model is able to make deterministic predictions about the relationship between some observable variables.
I've tried to keep the hairiest of the math in the appendices and the "key ideas from the math" section. If you want the key ideas with minimal math, then skip those sections.
Here’s the setup. We have two random variables, X1 and X2, and we have enough data that we can perfectly figure out their joint distribution P[X1,X2]. We don’t necessarily know what physical process generates these variables - i.e. we don’t know the true underlying generative model. You can imagine some scientific experiment where we can collect lots of independent samples of X1,X2, then brute-force count how often each (X1,X2) pair occurs in order to estimate P[X1,X2], but we don’t have any way to experimentally probe the “internals” of the experimental setup. In other words, it's the traditional setup from frequentist probability.
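As a concrete toy version of this setup (the generative process here is entirely hypothetical, invented for illustration), here's a sketch in Python: some hidden process produces samples of X1, X2, and we estimate P[X1,X2] by brute-force counting, without ever looking at the process' internals.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical generative process, unknown to the experimenter: a hidden
# bit u picks a "block", then X1 and X2 vary independently within it.
def sample():
    u = random.randrange(2)
    return 2 * u + random.randrange(2), 2 * u + random.randrange(2)

# Brute-force estimate of P[X1, X2]: just count how often each pair occurs.
n = 100_000
counts = Counter(sample() for _ in range(n))
joint = {pair: c / n for pair, c in counts.items()}
```

The experimenter only ever sees `joint`; the hidden bit `u` inside `sample` stands in for the unknown internals of the experiment.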
We’ll assume that the class of “theories” we’re interested in are generative models - i.e. programs which take in some independent uniform variables and spit out X1,X2. For our purposes, we’ll assume these programs are represented as systems of circuits/causal DAGs/Bayes nets - see here for what that sort of representation looks like. (Note that this representation is Turing complete, so we’re not imposing any significant restriction.)
As long as X1 and X2 are not independent, any generative model which reproduces the distribution P[X1,X2] must have some variable(s) upstream of both X1 and X2 - some common upstream cause(s) which account for the relationship between the variables. The claim I’d like to make is: any upstream variable(s) capable of accounting for the relationship between X1 and X2 must contain some “minimal” information. Formally: we can construct some variable U∗ (defined by the distributions P[U∗],P[X1|U∗],P[X2|U∗]) such that X1 and X2 are independent given U∗:

P[U∗,X1,X2] = P[U∗] P[X1|U∗] P[X2|U∗]
… and, for any other U which induces conditional independence of X1 and X2, U∗ is a (possibly stochastic) function of U: for all U such that

P[U,X1,X2] = P[U] P[X1|U] P[X2|U]

... we have

P[U∗|X1,X2,U] = P[U∗|U]
Conceptually: U∗ is the “minimal” information about the relationship between X1 and X2. It must be “included in” any variable U capable of accounting for the relationship between those two variables.
That’s the claim I’d like to make. However, that claim is too strong - not all distributions P[X1,X2] have such a U∗. Indeed, in some sense “most” do not, though we can work around that to a large extent. The next section will explain exactly when we can construct such a U∗, and the appendix gives the proof. Specifically, we can construct U∗ in exactly those cases where the relationship between X1 and X2 is independent variation subject to a deterministic constraint. Formally: we can construct U∗ exactly when P[X1,X2] can be represented as

P[X1,X2] = I[f1(X1)=f2(X2)] P[f1(X1)] P[X1|f1(X1)] P[X2|f2(X2)]
… for some deterministic functions f1,f2 (here I[.] is the indicator function). In that case, we can choose U∗=f1(X1)=f2(X2).
We have to be able to calculate our “universal” upstream U∗ from any U which induces independence of X1,X2. In particular, X1 and X2 are independent given either X1 or X2 itself - so we can choose either of those to be U. Combine that with the original requirements, and we have:

X1 ⫫ X2 | U∗,   U∗ ⫫ X2 | X1,   U∗ ⫫ X1 | X2
This three-way conditional independence requirement is the main constraint which determines which distributions P[X1,X2] have a U∗.
One easy way to satisfy this three-way conditional independence is when X1, X2, and U∗ are all just completely independent. Another is when all three are deterministically equal - i.e. U∗=0 implies X1=X2=0, U∗=1 implies X1=X2=1, etc. And of course we can rename variable values while maintaining three-way conditional independence - e.g. we could rename X1 values from [0,1,…] to [“fish”,12.38,…]. So all three variables being isomorphic - rather than equal - also works: U∗=f1(X1)=f2(X2), with f1 and f2 invertible.
We could also combine these two possibilities: U∗, X1, and X2 could each have two components, where one component is independent and the other component is fully determined by either of the other variables. For instance, we could have U∗=(U∗ᵃ,U∗ᵇ), X1=(X1ᵃ,X1ᵇ), X2=(X2ᵃ,X2ᵇ), with the “a” components U∗ᵃ, X1ᵃ, X2ᵃ all completely independent, but the “b” components U∗ᵇ, X1ᵇ, X2ᵇ all deterministically isomorphic. Again, we can also rename variable values, so our variables don’t need to literally be written with two components.
… and that turns out to be the most general possibility.
Visually: if we lay out the distribution P[X1,X2] in a matrix (with rows corresponding to X1 values and columns corresponding to X2 values), then we can rearrange the rows and columns so that the matrix looks like this:

[figure: a block-diagonal matrix - dark rectangular blocks of nonzero entries along the diagonal, zeros everywhere else]
Here the dark blocks are nonzero values; everything else is zero. Within the dark blocks, we have P[X1,X2]=P[X1]P[X2] - i.e. the dark blocks are each rank-1 submatrices. Intuitively, our distribution P[X1,X2] consists of a deterministic relationship (the choice of which block we’re in) plus independent variation (choice of values within the block). This generalizes in the obvious way to the full distribution P[U∗,X1,X2]: the full distribution consists of non-overlapping 3D blocks of nonzeros, with the three variables independent within each block.
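To make the block picture concrete, here's a small sketch (the numbers are toy values of my own): we assemble a joint distribution from two blocks, each of which is rank 1 because X1 and X2 vary independently within it.

```python
# Toy numbers for a distribution with the block structure described above:
# two blocks; within each block, X1 and X2 vary independently.
P_block = {"A": 0.4, "B": 0.6}                          # P[block]
P_x1 = {"A": {0: 0.5, 1: 0.5}, "B": {2: 0.2, 3: 0.8}}  # P[X1 | block]
P_x2 = {"A": {0: 0.3, 1: 0.7}, "B": {2: 0.9, 3: 0.1}}  # P[X2 | block]

# Assemble P[X1, X2]; off-block entries are simply absent (i.e. zero).
joint = {
    (x1, x2): P_block[b] * p1 * p2
    for b in P_block
    for x1, p1 in P_x1[b].items()
    for x2, p2 in P_x2[b].items()
}

# Each block is rank 1: every 2x2 cross-ratio within a block balances,
# i.e. P[x1,x2] * P[x1',x2'] == P[x1,x2'] * P[x1',x2].
assert abs(joint[(0, 0)] * joint[(1, 1)] - joint[(0, 1)] * joint[(1, 0)]) < 1e-12
```

The cross-ratio check at the end is exactly the 2×2-minor test for a rank-1 submatrix.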
For the full proof that these are the only distributions for which we can construct U∗, see the appendix.
Note that we do still have a degree of freedom in choosing U∗ - we can add or remove independent components. The obvious choice is to remove all the independent components from U∗, and just keep the deterministic component. Visually, we take U∗ to “choose a block” in the graphic above. (In fact, we could have just specified upfront that U∗ is a deterministic function of any possible U, rather than the more general stochastic function, but I usually prefer to use the more general version just in case the problem has some implicit embedded game with mixed equilibria - the stochastic function allows U∗ to “randomize its strategy”. In this case, it turns out that the generality doesn’t buy us anything, and that is itself useful to know; it means there’s probably no diagonalization shenanigans going on.)
The appendix proves that this choice of U∗ indeed satisfies all of our original conditions.
To wrap up this section, let’s write out the shape of P[X1,X2] mathematically. Each value of X1 has nonzero probability in only one “block”, so we can create a function f1(X1) which picks out the block for each X1 value. Likewise for f2(X2). P[X1,X2] can be nonzero only if X1 and X2 are in the same block: f1(X1)=f2(X2). The chance of landing in a particular block is P[f1(X1)] (which is equal to P[f2(X2)], since f1(X1) and f2(X2) must always be equal). Within the block (i.e. conditional on f1(X1)), X1 and X2 are independent. Put all that together, and we get

P[X1,X2] = I[f1(X1)=f2(X2)] P[f1(X1)] P[X1|f1(X1)] P[X2|f2(X2)]
Since f1(X1) picks out the block, and U∗ is the block choice, U∗=f1(X1)=f2(X2).
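One way to actually compute f1 and f2 from a tabulated joint distribution: group values which have nonzero joint probability together (“overlap”, in the appendix’s terminology) into connected components. Here's a sketch; the union-find details and the toy numbers are my own.

```python
# Sketch of extracting U* = f1(X1) = f2(X2) from a block-structured joint
# distribution: treat each nonzero entry as an edge between an X1 value and
# an X2 value, and let the blocks be the connected components.
def find_blocks(joint):
    # Union-find over X1 and X2 values (tagged so they can't collide).
    parent = {}
    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for (x1, x2), p in joint.items():
        if p > 0:
            parent[find(("x1", x1))] = find(("x2", x2))
    # f1 and f2 map each value to its block (a component representative).
    f1 = {x1: find(("x1", x1)) for (x1, _) in joint}
    f2 = {x2: find(("x2", x2)) for (_, x2) in joint}
    return f1, f2

# Toy block-structured distribution: values {0,1} form one block, {2,3} another.
joint = {(0, 0): 0.06, (0, 1): 0.14, (1, 0): 0.06, (1, 1): 0.14,
         (2, 2): 0.108, (2, 3): 0.012, (3, 2): 0.432, (3, 3): 0.048}
f1, f2 = find_blocks(joint)

# U* = f1(X1) = f2(X2) holds on every nonzero entry.
assert all(f1[x1] == f2[x2] for (x1, x2), p in joint.items() if p > 0)
```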
At first glance, it looks like the conditions required to use this theorem are pretty restrictive. But there’s an easy trick to generalize it somewhat.
Suppose we have a distribution P[X1,X2] which does not fully satisfy the requirements, but it does have a deterministic constraint: g1(X1)=g2(X2) with probability 1. Then we can’t apply our correspondence theorem to X1 and X2, but we can apply it to g1(X1) and g2(X2). We can construct U∗ satisfying this:

P[U∗,g1(X1),g2(X2)] = P[U∗] P[g1(X1)|U∗] P[g2(X2)|U∗]
In particular, we can choose U∗=g1(X1)=g2(X2).
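Here's a toy illustration of the trick (the setup and numbers are my own): this world violates the full block structure, because X1 and X2 remain correlated even within a block - but a deterministic constraint g1(X1)=g2(X2) still holds, so the theorem applies to g1(X1) and g2(X2).

```python
import random

random.seed(1)

# Hidden process: u picks the block, but X1 and X2 share noise, so they
# stay correlated within a block -- each block is NOT rank 1.
def sample():
    u = random.randrange(2)                       # hidden common cause
    n1 = random.randrange(2)
    n2 = n1 if random.random() < 0.9 else 1 - n1  # within-block correlation
    return 2 * u + n1, 2 * u + n2

g1 = g2 = lambda x: x // 2  # g picks out the block; U* = g1(X1) = g2(X2)

pairs = [sample() for _ in range(10_000)]
assert all(g1(x1) == g2(x2) for x1, x2 in pairs)  # constraint holds w.p. 1
```

The within-block correlation blocks the original theorem, but `g1(X1)` and `g2(X2)` are deterministically equal, so the theorem applies to them directly.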
So: whenever we can identify a deterministic constraint in the world, we can apply this correspondence theorem. In other words, deterministic constraints in one model should have corresponding structures in other models, to the extent that all models match the environment.
Note that “deterministic constraints” is exactly the content of most models in the hard sciences. For instance, equations in physics are almost always deterministic constraints on the trajectory of world-states. To the extent that these equations match the real world, we should see corresponding constraints in future theories/models which also match the real world.
We can push this further.
We haven’t talked about approximations here, but let’s assume that the theorem generalizes to approximations in the obvious way - i.e. approximately deterministic constraints give approximate correspondence, with independencies replaced by approximate independencies. Once we have approximations, we can construct deterministic constraints via statistical identification.
Here’s what that means. Consider something like the ideal gas law, PV=nRT. Observing a few specific particles will not tell us the pressure or temperature of the gas. Individual particles have only partial information about the high-level variables, so we don’t have a correspondence theorem. However, if we bucket together a whole bunch of particles, then we can get a very precise estimate of temperature and pressure - in stats terminology, P and T can be “identified” from our many independent particles. We can then have deterministic relationships between the identified variables, like PV=nRT. Then we can apply the correspondence theorem.
Similarly, we could identify variables using multiple “bunches of particles” - or, more generally, multiple measurements/multiple lines of evidence. For instance, maybe we can use a whole bunch of measurements to determine the gravitational constant. If we do this again with another set of measurements, we expect to get the same number. That’s a deterministic constraint: the gravitational constant calculated from one set of measurements should equal the gravitational constant calculated from another. To the extent that this constraint matches the real world, our correspondence theorem says the gravitational constant should correspond to something in any future theory.
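The identification idea can be sketched numerically (a toy model with made-up numbers, not real physics): one particle's energy carries only partial information about temperature, but averaging a large bunch pins it down, and two disjoint bunches then satisfy an approximately deterministic constraint - their estimated temperatures agree.

```python
import random

random.seed(2)

TRUE_TEMP = 3.7  # hidden "temperature" of the toy gas

def particle_energy():
    # toy stand-in for a single particle's kinetic energy: exponentially
    # distributed with mean equal to the temperature
    return random.expovariate(1 / TRUE_TEMP)

# Two disjoint bunches of particles, each large enough to "identify" the
# temperature precisely.
bunch1 = [particle_energy() for _ in range(100_000)]
bunch2 = [particle_energy() for _ in range(100_000)]

temp1 = sum(bunch1) / len(bunch1)  # identified temperature, bunch 1
temp2 = sum(bunch2) / len(bunch2)  # identified temperature, bunch 2
```

With 100k particles per bunch, `temp1` and `temp2` agree to within a fraction of a percent - an approximately deterministic constraint between the two identified variables.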
In some ways, this theorem leaves a lot to be desired. In particular, it assumes access to the “true distribution”, which I’d really like to get rid of. I expect there are other correspondence theorems to be found, with very different formulations, some of which would be better in that regard. For instance, for maximum entropy models, there’s an easy correspondence theorem which says “given an old model and a new model, either the constraints of the old model are satisfied by the new model, or we can construct a third model which strictly outperforms the new model”. (That theorem is unimpressive in other ways, however - it doesn’t really say anything about the “internal structure” of the models.)
Aside from direct use as a correspondence theorem, “independent variation subject to deterministic constraints” is an interesting thing to see pop out. It’s a common pattern in the sciences - it describes far-apart low-level variables in any abstraction where the high-level model is deterministic. I also think it’s equivalent to zero distributed information, which I previously predicted ought to show up quite often when looking at information relevant far away in a system. On top of that, humans often have a (mistaken) intuition that information works like sets - information comes in discrete chunks, and a variable either contains a chunk or it doesn’t. This leads to some common mistakes when studying information theory. For systems in which variation is independent subject to deterministic constraints, the “information is like sets” intuition is correct, and I suspect that it’s the only case where that picture works in general (for some formulation of “in general”).
It seems like there’s something fundamental about this pattern, and this correspondence theorem is another hint.
Even having guessed the answer, I found these proofs surprisingly difficult. There’s probably some way to set things up so it’s more obvious what to do, but I’m not sure what it is, other than that someone will tell me it has something to do with category theory.
The problem: we want U∗ for which

P[U∗,X1,X2] = P[U∗] P[X1|U∗] P[X2|U∗]
and for all U such that P[U,X1,X2]=P[U]P[X1|U]P[X2|U], we have

P[U∗|X1,X2,U] = P[U∗|U]
We’ll show that such a U∗ exists exactly when P[X1,X2] has the form

P[X1,X2] = I[f1(X1)=f2(X2)] P[f1(X1)] P[X1|f1(X1)] P[X2|f2(X2)]

… for some deterministic functions f1, f2, in which case U∗=f1(X1)=f2(X2) satisfies all the conditions.
We'll start off the same way as earlier: we want U∗ ⫫ (X1,X2) | U (in English: U∗ independent of (X1,X2) given U) for any U satisfying X1 ⫫ X2 | U. Well, there are two choices for U which definitely satisfy X1 ⫫ X2 | U: namely, X1 and X2 themselves. So:

U∗ ⫫ X2 | X1   and   U∗ ⫫ X1 | X2
… and by assumption, X1⫫X2|U∗. So, any two of the variables X1, X2, and U∗ are independent given the third.
Let’s write out P[U∗,X1,X2] in two different ways, using two of our conditional independencies:

P[U∗,X1,X2] = P[X1] P[X2|X1] P[U∗|X1] = P[X2] P[X1|X2] P[U∗|X2]
Note that P[X1]P[X2|X1]=P[X2]P[X1|X2]=P[X1,X2], so we can cancel those terms out and find

P[U∗|X1] = P[U∗|X2]
… if P[X1,X2]>0. Note that, since our three-way independence condition is symmetric in the variables, we can also switch around the variables to get:

P[X1|X2] = P[X1|U∗] (if P[X2,U∗]>0)   and   P[X2|X1] = P[X2|U∗] (if P[X1,U∗]>0)
These are quite strong. Pick any two X2 values, x2 and x′2, for which any x1 value has both P[X1=x1,X2=x2]>0 and P[X1=x1,X2=x′2]>0. (Terminology: we’ll say that these two X2 values “overlap” on x1, i.e. either value occurs with nonzero probability when X1=x1. We'll also say that values of two different variables overlap when they have nonzero joint probability, e.g. X1=x1 overlaps X2=x2 iff P[X1=x1,X2=x2]>0). Then

P[U∗|X2=x2] = P[U∗|X1=x1] = P[U∗|X2=x′2]
Furthermore, since P[U∗|X2=x2]=P[U∗|X2=x′2] for all U∗ values, any U∗ with P[U∗,X2=x2]>0 will also have P[U∗,X2=x′2]>0. In other words, if two X2 values overlap on any X1 value, then they also overlap on all U∗ values which overlap either X2 value. That, in turn, means

P[X1|X2=x2] = P[X1|U∗=u∗] = P[X1|X2=x′2]
… for any value u∗ on which x2,x′2 overlap. Finally, this also means that two X2 values which overlap on some X1 value overlap on all X1 values which overlap either X2 value: if P[X1=x1,X2=x2]>0 and P[X1=x1,X2=x′2]>0 for some x1, then P[X1=x′1,X2=x′2]>0 for all x′1 with P[X1=x′1,X2=x2]>0, and vice versa.
That last condition is especially useful, since it means that overlap is transitive: if x2 and x′2 overlap, and x′2 and x′′2 overlap, then x2 and x′′2 overlap, and all three overlap on the same X1 values.
That finally lets us construct our functions f1 and f2. Since we have transitivity of overlap, we can use overlap as an equivalence relation, and let f2(X2) choose the equivalence class into which X2 falls. f1(X1) chooses the equivalence class of X2 values with which X1 overlaps (note that there can only be one, since this class contains all the X2 values which overlap with this X1 value). This is the “choice of block” from our earlier visual.
The rest follows (relatively) easily. From the definition of overlap, P[X1,X2]=0 whenever f1(X1)≠f2(X2). Our independence conditions from earlier give P[X1|X2]=P[X1|f2(X2)] (since all values in an overlap equivalence class give the same conditional distributions), so we have conditional independence given f2(X2):

P[X1,X2|f2(X2)] = P[X1|f2(X2)] P[X2|f2(X2)]
Then, we just expand:

P[X1,X2] = P[f2(X2)] P[X1|f2(X2)] P[X2|f2(X2)]
Since f1(X1)=f2(X2) wherever P[X1,X2]>0, we have

P[X1|f2(X2)] = I[f1(X1)=f2(X2)] P[X1|f1(X1)]
… which gives us our final expression:

P[X1,X2] = I[f1(X1)=f2(X2)] P[f2(X2)] P[X1|f1(X1)] P[X2|f2(X2)]
(And note that we can freely switch P[f2(X2)] with P[f1(X1)] here.)
One side note: I’ve defined f1 and f2 here as selecting equivalence classes, which is kind of annoying - it doesn’t give us a clean explicit representation. If we want an explicit representation, we can choose f1(X1)=(x2→P[X2=x2|X1]), and likewise for f2. This leads to “fun” expressions like P[X1|(x2→P[X2=x2|X1])]. See the Minimal Map post for how to interpret expressions like that.
Now we assume that P[X1,X2] has the right form, declare that X1 and X2 are independent given some U, and show that f1(X1) (or equivalently, f2(X2)) is a function of U, implying that it’s independent of X1 and X2 given U.
Suppose that f1(X1) is not a function of U - i.e. there is some value u of U for which f1(X1) could have two different values f,f′ each with nonzero probability. Then, since X1 and X2 are independent given U:

P[f1(X1)=f, f2(X2)=f′ | U=u] = P[f1(X1)=f | U=u] P[f2(X2)=f′ | U=u]
Since f1(X1)=f2(X2) with probability 1, we must have P[f2(X2)=f′|U=u]=P[f1(X1)=f′|U=u]. Substituting:

P[f1(X1)=f, f2(X2)=f′ | U=u] = P[f1(X1)=f | U=u] P[f1(X1)=f′ | U=u] > 0
Now, by assumption:

P[U=u] > 0
… thus P[U=u,f1(X1)=f,f2(X2)=f′]>0.
… but that’s a contradiction, because that means there’s a nonzero probability that f1(X1)≠f2(X2).
Thus, f1(X1) is always a function of U, and is therefore independent of X1 and X2 (and anything else) given U.
Note to self: use infinitely many observable variables Xi instead of just two, and the condition for U∗ should probably be that no infinite subset of the X's are mutually dependent (or something along those lines). Intuitively: for any "piece of latent information", either we have infinite data on that piece and can precisely estimate it, or it only significantly impacts finitely many variables.