Around two months ago, John and I published Resampling Conserves Redundancy (Approximately). Fortunately, about two weeks ago, Jeremy Gillen and Alfred Harwood showed us that we were wrong.
This proof achieves, using the Jensen-Shannon divergence ("JS"), what the previous one failed to show using the KL divergence ("$D_{KL}$"). In fact, while the previous attempt tried to show only that redundancy is conserved (in terms of $D_{KL}$) upon resampling latents, this proof shows that the redundancy and mediation conditions are conserved (in terms of JS).
In just about all of our previous work, we have used $D_{KL}$ as our factorization error. (The error meant to capture the extent to which a given distribution fails to factor according to some graphical structure.) In this post I use the Jensen-Shannon divergence.
The KL divergence is a pretty fundamental quantity in information theory, and is used all over the place. (JS is usually defined in terms of $D_{KL}$, as above.) We have pretty strong intuitions about what $D_{KL}$ means, and it has lots of nice properties which I won't go into detail about here; we have considered it a strong default when trying to quantify the extent to which two distributions differ.
The JS divergence looks somewhat ad-hoc by comparison. It also has some nice mathematical properties (its square root is a metric, a feature sorely lacking from $D_{KL}$), and there is some reason to like it intuitively: $D_{JS}(P_1 \| P_2)$ is equivalent to the mutual information between $X$, a variable randomly sampled from one of the two distributions, and $Z$, an indicator which determines which distribution $X$ gets sampled from. So in this sense it captures the extent to which a sample distinguishes between the two distributions.
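As a minimal numerical sketch of that equivalence (the distributions `p1` and `p2` and the helper functions below are arbitrary illustrations of mine, not anything load-bearing for the proof):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions, in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p1, p2):
    """Jensen-Shannon divergence: average KL from each distribution to their mixture."""
    m = 0.5 * (np.asarray(p1, float) + np.asarray(p2, float))
    return 0.5 * kl(p1, m) + 0.5 * kl(p2, m)

def mi_with_indicator(p1, p2):
    """I(X; Z) where Z ~ Bernoulli(1/2) selects which of p1, p2 the sample X is drawn from."""
    joint = 0.5 * np.vstack([p1, p2])   # joint over (Z, X)
    pz = joint.sum(axis=1)              # marginal over Z, i.e. [0.5, 0.5]
    px = joint.sum(axis=0)              # marginal over X, i.e. the mixture
    mi = 0.0
    for z in range(2):
        for x in range(joint.shape[1]):
            if joint[z, x] > 0:
                mi += joint[z, x] * np.log(joint[z, x] / (pz[z] * px[x]))
    return mi

p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.3, 0.6])
print(js(p1, p2), mi_with_indicator(p1, p2))  # the two numbers agree
```

Both quantities come out identical, since the mixture $\tfrac{1}{2}(P_1 + P_2)$ is exactly the marginal of $X$ when $Z$ is a fair coin.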
Ultimately, though, we want a more solid justification for our choice of error function going forward.
This proof works, but it uses JS rather than $D_{KL}$. Is that a problem? Can/Should we switch everything over to JS? We aren't sure. Some of our focus for immediate next steps is going to be on how to better determine the "right" error function for comparing distributions for the purpose of working with (natural) latents.
And now, for the proof:
Let $P$ be any distribution over $X := (X_1, X_2)$ and $\Lambda$.
I will omit the subscripts if the distribution at hand is the full joint distribution with all variables unbound. I.e. $P_{X_1 X_2 \Lambda}$ is the same as $P$. When variables are bound, they will be written as lower case in the subscript. When this is still ambiguous, the full bracket notation will be used.
First, define auxiliary distributions $Q$, $S$, $R$, and $M$:
$$Q[X_1, X_2, \Lambda] := P[X_1, X_2]\, P[\Lambda \mid X_1], \qquad S[X_1, X_2, \Lambda] := P[X_1, X_2]\, P[\Lambda \mid X_2],$$
$$R[X_1, X_2, \Lambda] := P[X_1, X_2] \sum_{x_1} P[x_1 \mid X_2]\, P[\Lambda \mid x_1], \qquad M[X_1, X_2, \Lambda] := P[\Lambda]\, P[X_1 \mid \Lambda]\, P[X_2 \mid \Lambda]$$
Q, S, and M each perfectly satisfy one of the (stochastic) Natural Latent conditions, with Q and S each satisfying one of the redundancy conditions ($\Lambda \perp X_2 \mid X_1$ and $\Lambda \perp X_1 \mid X_2$, respectively) and M satisfying the mediation condition ($X_1 \perp X_2 \mid \Lambda$).
R represents the distribution when both of the redundancy factorizations are applied in series to P.
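As a sanity check on these definitions (a sketch assuming the reconstructions of $Q$, $S$, and $M$ written above; the alphabet sizes and the random joint are arbitrary, and the series-composed distribution $R$ shows up in the next sketch), each auxiliary distribution can be built directly from a joint $P$ and verified to satisfy its naturality condition exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary joint P[X1, X2, Lambda] over small finite alphabets; axes are (X1, X2, Lambda).
P = rng.random((3, 4, 5))
P /= P.sum()

P_x1x2 = P.sum(axis=2)                                        # P[X1, X2]
P_l_given_x1 = P.sum(axis=1) / P.sum(axis=(1, 2))[:, None]    # P[Lambda | X1]
P_l_given_x2 = P.sum(axis=0) / P.sum(axis=(0, 2))[:, None]    # P[Lambda | X2]
P_l = P.sum(axis=(0, 1))                                      # P[Lambda]
P_x1_given_l = P.sum(axis=1) / P_l[None, :]                   # P[X1 | Lambda]
P_x2_given_l = P.sum(axis=0) / P_l[None, :]                   # P[X2 | Lambda]

# Auxiliary distributions, as written above:
Q = P_x1x2[:, :, None] * P_l_given_x1[:, None, :]             # P[X1,X2] P[Lambda|X1]
S = P_x1x2[:, :, None] * P_l_given_x2[None, :, :]             # P[X1,X2] P[Lambda|X2]
M = P_l[None, None, :] * P_x1_given_l[:, None, :] * P_x2_given_l[None, :, :]

# Q: Lambda is independent of X2 given X1 (redundancy via X1).
assert np.allclose(Q / Q.sum(axis=2, keepdims=True), P_l_given_x1[:, None, :])
# S: Lambda is independent of X1 given X2 (redundancy via X2).
assert np.allclose(S / S.sum(axis=2, keepdims=True), P_l_given_x2[None, :, :])
# M: X1 and X2 are independent given Lambda (mediation).
M_given_l = M / M.sum(axis=(0, 1), keepdims=True)
assert np.allclose(M_given_l, P_x1_given_l[:, None, :] * P_x2_given_l[None, :, :])
# Q and S leave the distribution over the observables untouched.
assert np.allclose(Q.sum(axis=2), P_x1x2) and np.allclose(S.sum(axis=2), P_x1x2)
```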
Let $\Lambda'$ be a latent variable defined by $\Lambda' \sim P[\Lambda \mid X_1]$, with

$$P[X_1, X_2, \Lambda'] := P[X_1, X_2]\, P[\Lambda' \mid X_1], \qquad P[\Lambda' = \lambda \mid X_1] := P[\Lambda = \lambda \mid X_1]$$

Now, define the auxiliary distributions $Q'$, $S'$, and $M'$, similarly as above, and show some useful relationships to P, Q, S, R, and M:

$$P_{X_1 X_2 \Lambda'} = Q, \qquad S' = R$$
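Continuing the numerical sketch, and still assuming the reconstructed definitions used here (in particular that $\Lambda'$ is resampled according to $P[\Lambda \mid X_1]$, and that $R$ applies the $X_1$-redundancy factorization followed by the $X_2$-redundancy factorization), those relationships can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((3, 4, 5))                                     # arbitrary joint P[X1, X2, Lambda]
P /= P.sum()

P_x1x2 = P.sum(axis=2)                                        # P[X1, X2]
P_l_given_x1 = P.sum(axis=1) / P.sum(axis=(1, 2))[:, None]    # P[Lambda | X1]
Q = P_x1x2[:, :, None] * P_l_given_x1[:, None, :]             # Q as defined above

# Resampling Lambda' according to P[Lambda | X1] gives a joint which is exactly Q
# (by construction, since P[Lambda' | X1, X2] := P[Lambda | X1]).
P_prime = P_x1x2[:, :, None] * P_l_given_x1[:, None, :]
assert np.allclose(P_prime, Q)

# S' applies the X2-redundancy factorization to the resampled joint ...
Pp_l_given_x2 = P_prime.sum(axis=0) / P_prime.sum(axis=(0, 2))[:, None]   # P'[Lambda | X2]
S_prime = P_x1x2[:, :, None] * Pp_l_given_x2[None, :, :]

# ... which equals R, read as "the X1-factorization followed by the X2-factorization".
P_x1_given_x2 = P_x1x2 / P_x1x2.sum(axis=0, keepdims=True)                # P[X1 | X2]
R = P_x1x2[:, :, None] * np.einsum('ij,il->jl', P_x1_given_x2, P_l_given_x1)[None, :, :]
assert np.allclose(S_prime, R)                                # S' = R under this reading
```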
Next, the error metric and the errors of interest:
Jensen-Shannon Divergence, and Jensen-Shannon Distance (a true metric):

$$D_{JS}(P_1 \| P_2) := \tfrac{1}{2} D_{KL}\!\left(P_1 \,\middle\|\, \tfrac{1}{2}(P_1 + P_2)\right) + \tfrac{1}{2} D_{KL}\!\left(P_2 \,\middle\|\, \tfrac{1}{2}(P_1 + P_2)\right)$$

$$d_{JS}(P_1, P_2) := \sqrt{D_{JS}(P_1 \| P_2)}$$
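The metric property is presumably what lets factorization errors be chained together via the triangle inequality. As a quick numerical illustration (arbitrary random distributions and alphabet size, with `js` and `d_js` as illustrative helper functions), the Jensen-Shannon distance does satisfy the triangle inequality:

```python
import numpy as np

def js(p, q):
    """Jensen-Shannon divergence of two discrete distributions (natural log)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def d_js(p, q):
    """Jensen-Shannon distance: the square root of the divergence, a true metric."""
    return float(np.sqrt(max(js(p, q), 0.0)))

rng = np.random.default_rng(1)
for _ in range(10_000):
    a, b, c = rng.random((3, 6))
    a, b, c = a / a.sum(), b / b.sum(), c / c.sum()
    # Triangle inequality holds for the distance (up to float tolerance).
    assert d_js(a, c) <= d_js(a, b) + d_js(b, c) + 1e-12
```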
Finally, the theorem:
For any distribution $P$ over $(X, \Lambda)$, the latent $\Lambda'$ has redundancy error of zero on one of its factorizations, while the other factorization errors are bounded by a small factor of the errors induced by the original latent $\Lambda$. More formally:
$\forall P[X, \Lambda]$, the latent $\Lambda'$ defined by $\Lambda' \sim P[\Lambda \mid X_1]$ has bounded factorization errors  and .
In fact, that is a simpler but looser bound than that proven below which achieves the more bespoke bounds of: , , and .
, since and
So, as shown above (using the Jensen-Shannon divergence as the error function), resampling any latent variable according to either one of its redundancy diagrams (just swap $X_1$ and $X_2$ for the bounds when resampling from $P[\Lambda \mid X_2]$) produces a new latent variable which satisfies the redundancy and mediation diagrams approximately as well as the original, and satisfies one of the redundancy diagrams perfectly.
The bounds are:
Where the epsilons without superscripts are the errors corresponding to factorization via the respective naturality conditions of the original latent and X.
For , by Cauchy-Schwarz with vectors . Thus the simpler, though looser, bound:
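For reference, the generic form of that Cauchy-Schwarz step (the specific vectors are not reproduced here; the standard move pairs a vector of square-rooted errors with the all-ones vector):

$$\left(\sum_{i=1}^{n} \sqrt{\epsilon_i}\right)^2 = \left(\sum_{i=1}^{n} \sqrt{\epsilon_i} \cdot 1\right)^2 \le n \sum_{i=1}^{n} \epsilon_i,$$

which is the usual way a sum-of-square-roots bound coming from the triangle inequality gets loosened into a plain sum of errors.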
The joint convexity of $D_{JS}$, which justifies this inequality, is inherited from the joint convexity of the KL divergence.
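That convexity claim can also be checked numerically; a small sketch (random distribution pairs and mixing weights, nothing specific to the proof):

```python
import numpy as np

def js(p, q):
    """Jensen-Shannon divergence for discrete distributions (natural log)."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(2)
for _ in range(10_000):
    p1, q1, p2, q2 = rng.random((4, 5))
    p1, q1, p2, q2 = (v / v.sum() for v in (p1, q1, p2, q2))
    lam = rng.random()
    # Joint convexity: D_JS(lam*p1 + (1-lam)*p2 || lam*q1 + (1-lam)*q2)
    #                  <= lam*D_JS(p1 || q1) + (1-lam)*D_JS(p2 || q2)
    lhs = js(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
    rhs = lam * js(p1, q1) + (1 - lam) * js(p2, q2)
    assert lhs <= rhs + 1e-12
```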