
Natural latents are a relatively elegant piece of math which we figured out over the past year, in our efforts to boil down and generalize various results and examples involving natural abstraction. In particular, this new framework handles approximation well, which was a major piece missing previously. This post will present the math of natural latents, with a handful of examples but otherwise minimal commentary. If you want the conceptual story, and a conceptual explanation for how this might connect to various problems, that will probably be in a future post.

While this post is not written as a "concepts" post, it is written in the hope that people who want to use this math will see how to do so.

2-Variable Theorems

This section will present a simplified but less general version of all the main theorems of this post, in order to emphasize the key ideas and steps.

Simplified Fundamental Theorem

Suppose we have:

  • A distribution P[X] over random variables X := (X₁, X₂)
  • A latent variable Λ which induces independence between X₁ and X₂ (first diagram below)
  • Another latent variable Λ′ about which X₁ and X₂ give the same information (second diagram below)

Further, assume that X mediates between Λ and Λ′ (third diagram below). This last assumption can typically be satisfied by construction in the minimality/maximality theorems below.

Λ induces independence between X₁ and X₂
X₁ and X₂ give the same information about Λ′
X mediates between Λ and Λ′

Then, claim: Λ mediates between X and Λ′.
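In conditional independence terms (my own paraphrase of the four diagrams, with X denoting the pair (X₁, X₂)), the assumptions and conclusion read:

```latex
% Conditional-independence reading of the simplified fundamental theorem.
% X denotes the pair (X_1, X_2); this restates the diagrams, not the proof.
\begin{align*}
  \text{(1) } \Lambda \text{ mediates between } X_1, X_2:\quad
      & X_1 \perp X_2 \mid \Lambda \\
  \text{(2) } X_1, X_2 \text{ redundantly specify } \Lambda':\quad
      & \Lambda' \perp X_2 \mid X_1
        \quad\text{and}\quad \Lambda' \perp X_1 \mid X_2 \\
  \text{(3) } X \text{ mediates between } \Lambda, \Lambda':\quad
      & \Lambda' \perp \Lambda \mid X \\
  \text{Conclusion: } \Lambda \text{ mediates between } X, \Lambda':\quad
      & \Lambda' \perp X \mid \Lambda
\end{align*}
```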

 

Intuition

Picture Λ as a pipe between X₁ and X₂. The only way any information can get from X₁ to X₂ is via that pipe (that’s the first diagram). Λ′ is a piece of information which is present in both X₁ and X₂ - something we can learn from either of them (that’s the second pair of diagrams). Intuitively, the only way that can happen is if the information Λ′ went through the pipe - meaning that we can also learn it from Λ.

The third diagram rules out three-variable interactions which could mess up that intuitive picture - for instance, the case where one bit of Λ′ is an xor of some random bit of Λ (independent of X) and a bit of X₁.
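To make that failure mode concrete, here is a minimal sketch (my own toy construction, not taken from the post): X₁, X₂, and G are independent fair bits, Λ = G, and Λ′ = X₁ xor G. Conditions 1 and 2 then hold exactly, but the third diagram fails, and so does the conclusion - knowing Λ does not screen off X from Λ′. The script checks each condition by brute-force conditional mutual information.

```python
import itertools
from collections import Counter
from math import log2

# Toy counterexample (my own construction, not from the post): X1, X2, G are
# independent fair bits, Lambda = G, and Lambda' = X1 xor G. Conditions 1 and 2
# hold exactly, the third diagram fails, and so does the conclusion.

# Joint distribution: each (x1, x2, g) outcome has probability 1/8.
# Each row is (X1, X2, G, Lambda, Lambda').
outcomes = [(x1, x2, g, g, x1 ^ g)
            for x1, x2, g in itertools.product([0, 1], repeat=3)]
p = 1 / len(outcomes)

def cond_mi(a_idx, b_idx, c_idx):
    """I(A; B | C) in bits, by brute-force enumeration of the joint."""
    def marginal(idxs):
        counts = Counter(tuple(o[i] for i in idxs) for o in outcomes)
        return {k: v * p for k, v in counts.items()}
    p_abc = marginal(a_idx + b_idx + c_idx)
    p_ac = marginal(a_idx + c_idx)
    p_bc = marginal(b_idx + c_idx)
    p_c = marginal(c_idx)
    total = 0.0
    for o in outcomes:
        a = tuple(o[i] for i in a_idx)
        b = tuple(o[i] for i in b_idx)
        c = tuple(o[i] for i in c_idx)
        total += p * log2(p_abc[a + b + c] * p_c[c] / (p_ac[a + c] * p_bc[b + c]))
    return total

X1, X2, LAM, LAMP = [0], [1], [3], [4]
print(cond_mi(X1, X2, LAM))          # condition 1 (Lambda mediates):       0.0
print(cond_mi(LAMP, X2, X1))         # condition 2 (redundancy, one way):   0.0
print(cond_mi(LAMP, X1, X2))         # condition 2 (redundancy, other way): 0.0
print(cond_mi(LAMP, LAM, X1 + X2))   # condition 3 (X mediates):            1.0, fails
print(cond_mi(LAMP, X1 + X2, LAM))   # conclusion (Lambda mediates):        1.0, fails
```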

Qualitative Example

Let X₁ and X₂ be the low-level states of two spatially-separated macroscopic chunks of an ideal gas at equilibrium. By looking at either of the two chunks, I can tell whether the temperature is above 50°C; call that Λ′. More generally, the two chunks are independent given the pressure and temperature of the gas; call that Λ.

Notice that Λ′ is a function of Λ, i.e. I can compute whether the temperature is above 50°C from the pressure and temperature itself, so Λ mediates between X and Λ′.

Some extensions of this example:

  • We could add more information to Λ. For instance, Λ could be the pressure and temperature and also the outcome of a die roll unrelated to the gas. Or, Λ could be the entire low-level state of one of the two chunks.
  • We could remove information from Λ′. For instance, Λ′ could be a bit indicating whether the temperature is above 100°C.

Intuitive mental picture: in general, Λ can’t have “too little” information; it needs to include all information shared (even partially) across X₁ and X₂. Λ′, on the other hand, can’t have “too much” information; it can only include information which is fully shared across X₁ and X₂.

Proof

This is a diagrammatic proof; see Some Rules For An Algebra Of Bayes Nets for how to read it. (That post also walks through a version of this same proof as an example, so you should also look there if you want a more detailed walkthrough.)

(Throughout this post, X̄ᵢ denotes all components of X except for Xᵢ.)

Approximation

If the starting diagrams are satisfied approximately (i.e. the underlying distribution is within small KL-divergence of the factorization each diagram implies), then the final diagram is also approximately satisfied (i.e. the distribution is within small KL-divergence of that diagram's factorization). We can find quantitative bounds by propagating error through the proof:

Again, see the Algebra of Bayes Nets post for an unpacking of this notation and the proofs for each individual step.
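For reference, here is how I'd unpack "diagram satisfied to within ε" (a standard restatement; the linked post gives the authors' exact notation). It is shown in LaTeX for the first diagram, where Λ induces independence between X₁ and X₂:

```latex
% "The first diagram is satisfied to within epsilon_1", unpacked as a KL bound:
% the joint distribution is close to the factorization the diagram implies.
D_{KL}\Big( P[X_1, X_2, \Lambda] \;\Big\|\; P[\Lambda]\, P[X_1 \mid \Lambda]\, P[X_2 \mid \Lambda] \Big)
  \;=\; \mathbb{E}_{\Lambda}\Big[ D_{KL}\big( P[X_1, X_2 \mid \Lambda] \;\big\|\; P[X_1 \mid \Lambda]\, P[X_2 \mid \Lambda] \big) \Big]
  \;\le\; \epsilon_1
```

The ε's for the other diagrams are defined the same way, with each diagram's own factorization on the right of the divergence.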

Qualitative Example

Note that our previous example realistically involved approximation: two chunks of an ideal gas won’t be exactly independent given pressure and temperature. But they’ll be independent to a very (very) tight approximation, so our conclusion will also hold to a very tight approximation.

Quantitative Example

Suppose I have a biased coin, with bias θ. Alice flips the coin 1000 times, then takes the median of her flips. Bob also flips the coin 1000 times, then takes the median of his flips. We’ll need a prior on θ, so for simplicity let’s say it’s uniform on [0, 1].

Intuitively, so long as the bias θ is unlikely to be very close to ½, Alice and Bob will find the same median with very high probability. So:

  • Let X₁ be Alice’s 1000 flips, and X₂ be Bob’s 1000 flips.
  • Let Λ be the bias θ. Note that the flips are independent given Λ, satisfying our first condition exactly.
  • Let Λ′ be the median computed by either Bob or Alice (we’re assuming they are the same with high probability). Since the same median can be computed with high probability from either X₁ or X₂, our second condition is approximately satisfied.
  • Since the median is computed as a deterministic function of X, our third condition is satisfied exactly.

The fundamental theorem will then say that the bias Λ approximately mediates between the median Λ′ (either Alice’s or Bob’s) and the coinflips X.
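As a quick empirical illustration of the “same median with high probability” claim, here is a simulation sketch (mine, not from the post; the ≥ 500 tie-breaking convention for the median of 1000 flips is my own choice):

```python
import numpy as np

# Simulation sketch: draw the bias from the uniform prior, have Alice and Bob
# each flip 1000 coins, and check how often their medians agree.
rng = np.random.default_rng(0)
n_flips, n_trials = 1000, 100_000

theta = rng.uniform(size=n_trials)            # bias ~ Uniform[0, 1]
alice_heads = rng.binomial(n_flips, theta)
bob_heads = rng.binomial(n_flips, theta)
alice_median = (alice_heads >= n_flips // 2)  # my convention for ties
bob_median = (bob_heads >= n_flips // 2)
print((alice_median == bob_median).mean())    # close to 1: the medians almost always agree
```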

To quantify the approximation on the fundamental theorem, we first need to quantify the approximation on the second condition (the other two conditions hold exactly in this example, so their ε's are 0). Let’s take Λ′ to be Alice’s median. Alice’s flips mediate between Bob’s flips and the median exactly (that ε is 0), but Bob’s flips mediate between Alice’s flips and the median only approximately. Let’s compute that ε:

This is a Dirichlet-multinomial distribution, so it will be cleaner if we rewrite things in terms of the head-counts N₁ and N₂ (the number of heads Alice and Bob each flip). Λ′ is a function of N₁, so the ε is

ε = E[D_KL(P[Λ′ | N₁] || P[Λ′ | N₂])]

with the expectation taken over the joint distribution of (N₁, N₂).
Assuming I simplified the gamma functions correctly, we then get:
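For readers who want to sanity-check that algebra numerically, here is a short sketch (my own reduction, with my Λ′ = 1[N₁ ≥ 500] convention for the median; result in nats). Since Λ′ is deterministic given N₁, the ε above reduces to the conditional entropy H(Λ′ | N₂), which can be computed directly from the beta-binomial distribution of N₁ given N₂ under the uniform prior:

```python
import numpy as np
from scipy.special import betaln, gammaln

# Numerical sketch of the second-condition epsilon: uniform prior on the bias,
# N1 = Alice's head-count, N2 = Bob's head-count, Lambda' = 1[N1 >= 500].
# epsilon = E[ D_KL( P[Lambda'|N1] || P[Lambda'|N2] ) ] = H(Lambda' | N2), in nats.

n = 1000
n1 = np.arange(n + 1)

def log_beta_binomial(k, trials, a, b):
    """log P[k heads in `trials` flips] when the bias has a Beta(a, b) prior."""
    return (gammaln(trials + 1) - gammaln(k + 1) - gammaln(trials - k + 1)
            + betaln(k + a, trials - k + b) - betaln(a, b))

eps = 0.0
for n2 in range(n + 1):
    p_n2 = 1.0 / (n + 1)  # under the uniform prior, N2 is uniform on {0, ..., 1000}
    # Given Bob's count, the bias has a Beta(n2 + 1, n - n2 + 1) posterior,
    # so Alice's count N1 is beta-binomial with those parameters.
    p_n1 = np.exp(log_beta_binomial(n1, n, n2 + 1, n - n2 + 1))
    p_hi = p_n1[500:].sum()  # P[Lambda' = 1 | N2 = n2]
    for q in (p_hi, 1.0 - p_hi):
        if q > 0:
            eps -= p_n2 * q * np.log(q)
print(eps)  # the approximation error on the second condition, in nats
```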