This post is a comment on Natural Latents: Latent Variables Stable Across Ontologies by John Wentworth and David Lorell. It assumes some familiarity with that work and does not attempt to explain it. Instead, I present an alternative proof that was developed as an exercise to aid my own understanding. While the original theorem and proof are written in the language of graphical models, mine instead uses the language of information theory. My proof has the advantage of being algebraically succinct, while theirs has the advantage of developing the machinery to work directly with causal structures. Very often, seeing multiple explanations of a fact helps us understand it, so I hope someone finds this post useful.
Specifically, we are concerned with their Theorem 1 (Mediator Determines Redund): both the older Iliad 1 version for stochastic latents, and the newer arXiv version for deterministic latents. I will translate each theorem into the language of information theory: Wentworth & Lorell's assumptions will imply mine, while their conclusions will be equivalent to mine. The equivalences follow from the d-separation criterion and the fact that independence is equivalent to zero mutual information.
In the restated new theorem, the latent variable $\Lambda$ is a mediator between subsets $X_A$ and $X_B$ of the data, meaning that it contains essentially all of the information in common between $X_A$ and $X_B$, whereas the latent variable $\Lambda'$ is a redund between $X_A$ and $X_B$, meaning it essentially only contains information that is in common between $X_A$ and $X_B$.[1]
Theorem (deterministic latents, newer version). Let $A, B$ be disjoint subsets of $\{1, \dots, n\}$, and write $X_A = (X_i)_{i \in A}$ and $X_B = (X_i)_{i \in B}$.

Suppose the random variables $X = (X_1, \dots, X_n)$, $\Lambda$, and $\Lambda'$ satisfy the following:

Mediation: $I(X_A; X_B \mid \Lambda) \le \varepsilon$,

Redundancy: $H(\Lambda' \mid X_A) \le \delta_A$ and $H(\Lambda' \mid X_B) \le \delta_B$.

Then, $H(\Lambda' \mid \Lambda) \le \varepsilon + \delta_A + \delta_B$.

Proof.
$$\begin{aligned}
H(\Lambda' \mid \Lambda) &= I(\Lambda'; X_B \mid \Lambda) + H(\Lambda' \mid X_B, \Lambda) && \text{by definition of conditional mutual information,} \\
&\le I(X_A; X_B \mid \Lambda) + H(\Lambda' \mid X_A) + H(\Lambda' \mid X_B) && \text{by information theory inequalities,} \\
&\le \varepsilon + \delta_A + \delta_B && \text{by Redundancy and Mediation.}
\end{aligned}$$
(The middle step unpacks as $I(\Lambda'; X_B \mid \Lambda) \le I(X_A, \Lambda'; X_B \mid \Lambda) = I(X_A; X_B \mid \Lambda) + I(\Lambda'; X_B \mid X_A, \Lambda) \le I(X_A; X_B \mid \Lambda) + H(\Lambda' \mid X_A)$, together with $H(\Lambda' \mid X_B, \Lambda) \le H(\Lambda' \mid X_B)$.) $\square$
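To make the bound concrete, here is a quick biased-coin sanity check in Python. This is my own toy example, not one from Wentworth & Lorell's post; the batch size, the prior over biases, and the choice of $\Lambda'$ as the majority vote of $X_A$ are all illustrative assumptions.

```python
# Toy check of H(L'|L) <= I(X_A;X_B|L) + H(L'|X_A) + H(L'|X_B).
# Lambda is a coin's bias, X_A and X_B are two batches of m flips of that coin,
# and Lambda' is the majority vote of X_A. By construction I(X_A;X_B|Lambda) = 0
# (flips are i.i.d. given the bias) and H(Lambda'|X_A) = 0 (Lambda' is a function
# of X_A), so the bound reduces to H(Lambda'|Lambda) <= H(Lambda'|X_B).
from math import comb, log2

m = 9                           # flips per batch (odd, so the majority is well defined)
prior = {0.2: 0.5, 0.8: 0.5}    # P(Lambda = theta)

def binom(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def h2(p):                      # binary entropy in bits
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def p_majority(theta):          # P(majority of m flips is heads | bias = theta)
    return sum(binom(k, m, theta) for k in range(m // 2 + 1, m + 1))

# H(Lambda' | Lambda): entropy of X_A's majority once the bias is known.
H_redund_given_mediator = sum(pr * h2(p_majority(theta)) for theta, pr in prior.items())

# H(Lambda' | X_B): X_B matters only through its head count, so condition on that.
H_redund_given_XB = 0.0
for k in range(m + 1):
    p_k = sum(pr * binom(k, m, theta) for theta, pr in prior.items())
    p_maj_and_k = sum(pr * binom(k, m, theta) * p_majority(theta) for theta, pr in prior.items())
    H_redund_given_XB += p_k * h2(p_maj_and_k / p_k)

print(f"H(L'|L)   = {H_redund_given_mediator:.4f} bits")
print(f"H(L'|X_B) = {H_redund_given_XB:.4f} bits")
assert H_redund_given_mediator <= H_redund_given_XB + 1e-12
```

Here the mediator (the bias) pins down the redund (the majority vote) up to a small entropy, exactly as the theorem predicts.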
The older theorem treats stochastic latents: the redund $\Lambda'$ need not be a function of the data. Instead, redundancy says that $\Lambda'$ is approximately independent of each $X_j$ given the rest of the data, mediation says that $\Lambda$ approximately screens each $X_j$ off from the rest, and Independent Latents says that the two latents carry no information about each other beyond what is already in the data. Write $X_{\bar{j}} = (X_1, \dots, X_{j-1}, X_{j+1}, \dots, X_n)$ for the data with $X_j$ removed.

Theorem (stochastic latents, older version). Suppose the random variables $X = (X_1, \dots, X_n)$, $\Lambda$, and $\Lambda'$ satisfy the following:

Independent Latents: $I(\Lambda; \Lambda' \mid X) = 0$,

Mediation: $I(X_j; X_{\bar{j}} \mid \Lambda) \le \varepsilon_j$ for all $j$,

Redundancy: $I(\Lambda'; X_j \mid X_{\bar{j}}) \le \delta_j$ for all $j$.

Then, $I(\Lambda'; X \mid \Lambda) \le \sum_{j=1}^{n} (\varepsilon_j + \delta_j)$.

Proof. First, we have, for each $j$,
$$\begin{aligned}
I(\Lambda'; X_j \mid X_{\bar{j}}, \Lambda) &= I(\Lambda'; X_j \mid X_{\bar{j}}) - I(\Lambda'; X_j; \Lambda \mid X_{\bar{j}}) && \text{by definition of 3-way interaction information,} \\
&= I(\Lambda'; X_j \mid X_{\bar{j}}) - I(\Lambda'; \Lambda \mid X_{\bar{j}}) + I(\Lambda'; \Lambda \mid X) && \text{by symmetry of 3-way interaction information,} \\
&\le I(\Lambda'; X_j \mid X_{\bar{j}}) && \text{by Independent Latents and nonnegativity.}
\end{aligned}$$
Therefore, writing $X_{<j} = (X_1, \dots, X_{j-1})$ and $X_{>j} = (X_{j+1}, \dots, X_n)$,
$$\begin{aligned}
I(\Lambda'; X_j \mid \Lambda, X_{<j}) &\le I(X_{>j}; X_j \mid \Lambda, X_{<j}) + I(\Lambda'; X_j \mid \Lambda, X_{\bar{j}}) && \text{by the mutual information chain rule, applied to } I(\Lambda', X_{>j}; X_j \mid \Lambda, X_{<j}), \\
&\le I(X_j; X_{\bar{j}} \mid \Lambda) + I(\Lambda'; X_j \mid X_{\bar{j}}) && \text{by the chain rule again and the above derivation,} \\
&\le \varepsilon_j + \delta_j && \text{by Mediation and Redundancy.}
\end{aligned}$$
The result now follows by summing over all $j = 1, \dots, n$, since $\sum_{j=1}^{n} I(\Lambda'; X_j \mid \Lambda, X_{<j}) = I(\Lambda'; X \mid \Lambda)$ by the chain rule.
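As a sanity check of the stochastic bound, the following Python snippet draws random joint distributions in which Independent Latents holds exactly by construction (here $\Lambda'$ is sampled from a conditional $P(\Lambda' \mid X)$), then verifies the inequality numerically. Again, this is my own check with made-up alphabet sizes, not anything from the original post.

```python
# Randomized check of I(L';X|L) <= sum_j [ I(X_j;X_{-j}|L) + I(L';X_j|X_{-j}) ],
# on joints where Lambda' depends on (X, Lambda) only through X, so that
# I(Lambda; Lambda' | X) = 0 exactly (the Independent Latents condition).
import numpy as np

rng = np.random.default_rng(0)

def H(p, axes):
    """Shannon entropy (bits) of the marginal of the joint array p on `axes`."""
    other = tuple(ax for ax in range(p.ndim) if ax not in axes)
    marg = p.sum(axis=other) if other else p
    marg = marg[marg > 0]
    return float(-(marg * np.log2(marg)).sum())

def I(p, a, b, c=()):
    """Conditional mutual information I(a; b | c) in bits, arguments are axis tuples."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return H(p, a + c) + H(p, b + c) - H(p, a + b + c) - H(p, c)

X_AXES, LAM, REDUND = (0, 1, 2), 3, 4      # axes: X_1, X_2, X_3, Lambda, Lambda'
sizes = (2, 2, 2, 3, 2)                    # small alphabets keep the joint tiny

for trial in range(5):
    p_x_lam = rng.random(sizes[:4])                      # random P(X, Lambda)
    p_x_lam /= p_x_lam.sum()
    p_red_given_x = rng.random(sizes[:3] + (sizes[4],))  # random P(Lambda' | X)
    p_red_given_x /= p_red_given_x.sum(axis=-1, keepdims=True)
    p = p_x_lam[..., None] * p_red_given_x[:, :, :, None, :]   # full joint

    lhs = I(p, (REDUND,), X_AXES, (LAM,))
    rhs = sum(I(p, (j,), tuple(k for k in X_AXES if k != j), (LAM,))       # mediation errors
              + I(p, (REDUND,), (j,), tuple(k for k in X_AXES if k != j))  # redundancy errors
              for j in X_AXES)
    print(f"trial {trial}: I(L';X|L) = {lhs:.4f}  <=  {rhs:.4f}")
    assert lhs <= rhs + 1e-9
```

Because the mediation and redundancy "errors" on the right-hand side are measured directly from the joint, the bound holds for arbitrary distributions so long as the Independent Latents condition is exact.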
Since probabilistic models are often only defined in terms of a latent structure, you might find it philosophically suspect to impose a joint distribution on all variables including the latents. If so, feel free to replace the random variables with their specific instantiations: the derivations go through almost identically with Kolmogorov complexity and algorithmic mutual information replacing the Shannon entropy and mutual information, respectively.
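For instance, here is my own sketch (not a claim from the original post) of how the deterministic chain would read algorithmically, writing $K(\cdot \mid \cdot)$ for conditional prefix complexity and $I(x : y \mid z) = K(x \mid z) + K(y \mid z) - K(x, y \mid z)$ for conditional algorithmic mutual information:
$$\begin{aligned}
K(\lambda' \mid \lambda) &= I(\lambda' : x_B \mid \lambda) + K(\lambda' \mid x_B, \lambda) + O(\log) \\
&\le I(x_A : x_B \mid \lambda) + K(\lambda' \mid x_A) + K(\lambda' \mid x_B) + O(\log),
\end{aligned}$$
where the $O(\log)$ terms, logarithmic in the complexities of the strings involved, absorb the slack from the chain rule for prefix complexity, which holds only up to such additive terms.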