Natural latents are a relatively elegant piece of math which we figured out over the past year, in our efforts to boil down and generalize various results and examples involving natural abstraction. In particular, this new framework handles approximation well, which was a major piece missing previously. This post will present the math of natural latents, with a handful of examples but otherwise minimal commentary. If you want the conceptual story, and a conceptual explanation for how this might connect to various problems, that will probably be in a future post.
While this post is not written as a "concepts" post, it is written in the hope that people who want to use this math will see how to do so.
2-Variable Theorems
This section will present a simplified but less general version of all the main theorems of this post, in order to emphasize the key ideas and steps.
Simplified Fundamental Theorem
Suppose we have:
A distribution P[X] over random variables X=(X1,X2)
A latent variable Λ which induces independence between X1 and X2 (first diagram below)
Another latent variable Λ′ about which X1 and X2 give the same information (second pair of diagrams below)
Further, assume that X mediates between Λ and Λ′ (third diagram below). This last assumption can typically be satisfied by construction in the minimality/maximality theorems below.
Then, claim: Λ mediates between Λ′ and X.
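In case the diagrams are hard to read, the conditions and the claim can be spelled out as conditional independence statements (one standard way to read such diagrams):

```latex
\begin{align*}
\text{(1) } & X_1 \perp X_2 \mid \Lambda
  && P[X_1, X_2 \mid \Lambda] = P[X_1 \mid \Lambda]\, P[X_2 \mid \Lambda] \\
\text{(2) } & \Lambda' \perp X_1 \mid X_2 \;\text{ and }\; \Lambda' \perp X_2 \mid X_1
  && P[\Lambda' \mid X_1, X_2] = P[\Lambda' \mid X_1] = P[\Lambda' \mid X_2] \\
\text{(3) } & \Lambda \perp \Lambda' \mid X
  && P[\Lambda, \Lambda' \mid X] = P[\Lambda \mid X]\, P[\Lambda' \mid X] \\
\text{Claim: } & \Lambda' \perp X \mid \Lambda
  && P[\Lambda' \mid \Lambda, X] = P[\Lambda' \mid \Lambda]
\end{align*}
```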
Intuition
Picture Λ as a pipe between X1 and X2. The only way any information can get from X1 to X2 is via that pipe (that’s the first diagram). Λ′ is a piece of information which is present in both X1 and X2 - something we can learn from either of them (that’s the second pair of diagrams). Intuitively, the only way that can happen is if the information Λ′ went through the pipe - meaning that we can also learn it from Λ.
The third diagram rules out three-variable interactions which could mess up that intuitive picture - for instance, the case where one bit of Λ′ is an xor of some independent random bit of Λ and a bit of X.
Qualitative Example
Let X1 and X2 be the low-level states of two spatially-separated macroscopic chunks of an ideal gas at equilibrium. By looking at either of the two chunks, I can tell whether the temperature is above 50°C; call that Λ′. More generally, the two chunks are independent given the pressure and temperature of the gas; call that Λ.
Notice that Λ′ is a function of Λ, i.e. I can compute whether the temperature is above 50°C from the pressure and temperature alone, so Λ mediates between Λ′ and X.
Some extensions of this example:
We could add more information to Λ. For instance, Λ could be the pressure and temperature and also the outcome of a die roll unrelated to the gas. Or, Λ could be the entire low-level state of one of the two chunks.
We could remove information from Λ′. For instance, Λ′ could be a bit indicating whether the temperature is above 100°C.
Intuitive mental picture: in general, Λ can’t have “too little” information; it needs to include all information shared (even partially) across X1 and X2. Λ′, on the other hand, can’t have “too much” information; it can only include information which is fully shared across X1 and X2.
Proof
This is a diagrammatic proof; see Some Rules For An Algebra Of Bayes Nets for how to read it. (That post also walks through a version of this same proof as an example, so you should also look there if you want a more detailed walkthrough.)
(Throughout this post, X¯i denotes all components of X except for Xi.)
Approximation
If the starting diagrams are satisfied approximately (i.e. they have small KL-divergence from the underlying distribution), then the final diagram is also approximately satisfied (i.e. has small KL-divergence from the underlying distribution). We can find quantitative bounds by propagating error through the proof:
Again, see the Algebra of Bayes Nets post for an unpacking of this notation and the proofs for each individual step.
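As a concrete illustration of what "small KL-divergence from a diagram" means in practice, here is a minimal numerical sketch (a toy of mine, not from the proof) which measures how far a joint distribution is from the factorization asserted by the first diagram, X1 ← Λ → X2. The ϵ for each starting diagram is this sort of divergence, computed against the corresponding factorization.

```python
import numpy as np

def dkl_from_diagram(joint):
    """KL divergence of a joint P[X1, X2, L] (indexed [x1, x2, l]) from its
    factorization according to the diagram X1 <- L -> X2,
    i.e. D_KL( P[X1, X2, L] || P[L] P[X1|L] P[X2|L] )."""
    p_l = joint.sum(axis=(0, 1))
    p_x1_l = joint.sum(axis=1) / p_l        # P[X1|L], shape (|X1|, |L|)
    p_x2_l = joint.sum(axis=0) / p_l        # P[X2|L], shape (|X2|, |L|)
    factored = p_x1_l[:, None, :] * p_x2_l[None, :, :] * p_l
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / factored[mask])).sum())

# Toy example: X1 and X2 are independent noisy copies of a fair bit L,
# so the diagram holds exactly and the divergence is ~0.
p_l = np.array([0.5, 0.5])
flip = np.array([[0.9, 0.1], [0.1, 0.9]])   # flip[x, l] = P[Xi = x | L = l]
joint = np.einsum('l,xl,yl->xyl', p_l, flip, flip)
print(dkl_from_diagram(joint))               # ~0.0
```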
Qualitative Example
Note that our previous example realistically involved approximation: two chunks of an ideal gas won’t be exactly independent given pressure and temperature. But they’ll be independent to a very (very) tight approximation, so our conclusion will also hold to a very tight approximation.
Quantitative Example
Suppose I have a biased coin, with bias θ. Alice flips the coin 1000 times, then takes the median of her flips. Bob also flips the coin 1000 times, then takes the median of his flips. We’ll need a prior on θ, so for simplicity let’s say it’s uniform on [0, 1].
Intuitively, so long as bias θ is unlikely to be very close to ½, Alice and Bob will find the same median with very high probability. So:
Let X1 be Alice's 1000 flips, and X2 be Bob's 1000 flips.
Let Λ be the bias θ. Note that the flips are independent given θ, satisfying our first condition exactly.
Let Λ′ be the median computed by either Bob or Alice (we're relying on the two medians being the same, which happens with high probability). Since the same median can be computed with high probability from either X1 or X2, our second condition is approximately satisfied.
Since the median is computed as a deterministic function of X, our third condition is satisfied exactly.
The fundamental theorem will then say that the bias approximately mediates between the median (either Alice's or Bob's) and the coinflips X.
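As a quick sanity check on the claim that Alice and Bob get the same median with very high probability, here's a small Monte Carlo sketch (mine, for illustration; it treats an exact 500/500 split as a "low" median):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 100_000

theta = rng.uniform(0.0, 1.0, size=trials)   # bias drawn from the uniform prior
alice = rng.binomial(n, theta)               # Alice's head-count
bob = rng.binomial(n, theta)                 # Bob's head-count

# "High" median iff more than 500 heads (an exact 500/500 tie counts as "low")
agree = (alice > n // 2) == (bob > n // 2)
print(f"fraction of trials where Alice's and Bob's medians agree: {agree.mean():.3f}")
```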
To quantify the approximation on the fundamental theorem, we first need to quantify the approximation on the second condition (the other two conditions hold exactly in this example, so their ϵ's are 0). Let's take Λ′ to be Alice's median. Alice's flips mediate between Bob's flips and the median exactly (i.e. X2→X1→Λ′), but Bob's flips mediate between Alice's flips and the median (i.e. X1→X2→Λ′) only approximately. Let's compute that DKL:
DKL( P[X1,X2,Λ′] || P[X1] P[X2|X1] P[Λ′|X2] )
= −∑_{X1,X2} P[X1,X2] ln P[Λ′(X1)|X2]
= E[H(Λ′(X1)|X2)]
This is a Dirichlet-multinomial distribution, so it will be cleaner if we rewrite in terms of N1 := ∑ X1, N2 := ∑ X2, and n := 1000. Λ′ is a function of N1, so the DKL is
= E[H(Λ′(N1)|N2)]
Assuming I simplified the gamma functions correctly, we then get:
P[N2] = (n choose N2) Γ(N2+1) Γ(n−N2+1) / Γ(n+2) = 1/(n+1)
P[N1|N2] = (n+1) (n choose N1) (n choose N2) / ((2n+1) (2n choose N1+N2))
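To put a number on ϵ, here is a short numerical sketch (mine, not from the original derivation) which builds the joint distribution over (N1, N2) from the formulas above and computes E[H(Λ′(N1)|N2)] directly, again treating "median high" as N1 > 500:

```python
import numpy as np
from scipy.special import gammaln

n = 1000  # flips each for Alice and Bob

# Head-counts N1, N2 under a uniform prior on theta:
# P[N1, N2] = C(n,N1) C(n,N2) * Gamma(N1+N2+1) Gamma(2n-N1-N2+1) / Gamma(2n+2)
N = np.arange(n + 1)
logC = gammaln(n + 1) - gammaln(N + 1) - gammaln(n - N + 1)    # log C(n, k)
S = N[:, None] + N[None, :]                                    # N1 + N2
log_joint = (logC[:, None] + logC[None, :]
             + gammaln(S + 1) + gammaln(2 * n - S + 1) - gammaln(2 * n + 2))
joint = np.exp(log_joint)                                      # indexed [N1, N2]

med = N > n // 2                           # Lambda'(N1): median is "high" iff N1 > 500
p_n2 = joint.sum(axis=0)                   # ~ 1/(n+1) for every N2
p_high_given_n2 = joint[med, :].sum(axis=0) / p_n2

# epsilon = -E[ln P[Lambda'(N1) | N2]] = E[H(Lambda'(N1) | N2)]
p_of_actual = np.where(med[:, None], p_high_given_n2[None, :], 1.0 - p_high_given_n2[None, :])
epsilon_nats = -(joint * np.log(p_of_actual)).sum()
print(f"epsilon ≈ {epsilon_nats:.4f} nats ({epsilon_nats / np.log(2):.4f} bits)")
```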