Around two months ago, John and I published Resampling Conserves Redundancy (Approximately). Fortunately, about two weeks ago, Jeremy Gillen and Alfred Harwood showed us that we were wrong.
This proof achieves, using the Jensen-Shannon divergence ("JS"), what the previous one failed to show using the KL divergence ("$D_{KL}$"). In fact, while the previous attempt tried to show only that redundancy is conserved (in terms of $D_{KL}$) upon resampling latents, this proof shows that the redundancy and mediation conditions are conserved (in terms of JS).
In just about all of our previous work, we have used $D_{KL}$ as our factorization error. (The error meant to capture the extent to which a given distribution fails to factor according to some graphical structure.) In this post I use the Jensen-Shannon divergence.
The KL divergence is a pretty fundamental quantity in information theory, and is used all over the place. (JS is usually defined in terms of $D_{KL}$, as above.) We have pretty strong intuitions about what $D_{KL}$ means, and it has lots of nice properties which I won't go into detail about here; we have considered it a strong default when trying to quantify the extent to which two distributions differ.
The JS divergence looks somewhat ad-hoc by comparison. It also has some nice mathematical properties (its square root is a metric, a feature sorely lacking from $D_{KL}$), and there is some reason to like it intuitively: $D_{JS}(P_1 \| P_2)$ is equivalent to the mutual information between $X$, a variable randomly sampled from one of the two distributions, and $Z$, an indicator which determines which distribution $X$ gets sampled from. So in this sense it captures the extent to which a sample distinguishes between the two distributions.
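As a minimal numerical sketch of that equivalence (the distributions `p1` and `p2` and the helper functions below are arbitrary illustrations of mine, not anything load-bearing for the proof):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions, in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p1, p2):
    """Jensen-Shannon divergence: average KL from each distribution to their mixture."""
    m = 0.5 * (np.asarray(p1, float) + np.asarray(p2, float))
    return 0.5 * kl(p1, m) + 0.5 * kl(p2, m)

def mi_with_indicator(p1, p2):
    """I(X; Z) where Z ~ Bernoulli(1/2) selects which of p1, p2 the sample X is drawn from."""
    joint = 0.5 * np.vstack([p1, p2])   # joint over (Z, X)
    pz = joint.sum(axis=1)              # marginal over Z, i.e. [0.5, 0.5]
    px = joint.sum(axis=0)              # marginal over X, i.e. the mixture
    mi = 0.0
    for z in range(2):
        for x in range(joint.shape[1]):
            if joint[z, x] > 0:
                mi += joint[z, x] * np.log(joint[z, x] / (pz[z] * px[x]))
    return mi

p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.3, 0.6])
print(js(p1, p2), mi_with_indicator(p1, p2))  # the two numbers agree
```

Both quantities come out identical, since the mixture $\tfrac{1}{2}(P_1 + P_2)$ is exactly the marginal of $X$ when $Z$ is a fair coin.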
Ultimately, though, we want a more solid justification for our choice of error function going forward.
This proof works, but it uses JS rather than $D_{KL}$. Is that a problem? Can/Should we switch everything over to JS? We aren't sure. Some of our focus for immediate next steps is going to be on how to better determine the "right" error function for comparing distributions for the purpose of working with (natural) latents.
And now, for the proof:
Let $P$ be any distribution over $X := (X_1, X_2)$ and $\Lambda$.
I will omit the subscripts if the distribution at hand is the full joint distribution with all variables unbound. I.e. $P_{X_1 X_2 \Lambda}$ is the same as $P$. When variables are bound, they will be written as lower case in the subscript. When this is still ambiguous, the full bracket notation will be used.
First, define auxiliary distributions $Q$, $S$, $R$, and $M$:
$$Q[X_1, X_2, \Lambda] := P[X_1, X_2]\, P[\Lambda \mid X_1], \qquad S[X_1, X_2, \Lambda] := P[X_1, X_2]\, P[\Lambda \mid X_2],$$
$$R[X_1, X_2, \Lambda] := P[X_1, X_2] \sum_{x_1} P[x_1 \mid X_2]\, P[\Lambda \mid x_1], \qquad M[X_1, X_2, \Lambda] := P[\Lambda]\, P[X_1 \mid \Lambda]\, P[X_2 \mid \Lambda]$$
Q, S, and M each perfectly satisfy one of the (stochastic) Natural Latent conditions, with Q and S each satisfying one of the redundancy conditions ($\Lambda \perp X_2 \mid X_1$ and $\Lambda \perp X_1 \mid X_2$, respectively) and M satisfying the mediation condition ($X_1 \perp X_2 \mid \Lambda$).
R represents the distribution when both of the redundancy factorizations are applied in series to P.
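As a sanity check on these definitions (a sketch assuming the reconstructions of $Q$, $S$, and $M$ written above; the alphabet sizes and the random joint are arbitrary, and the series-composed distribution $R$ shows up in the next sketch), each auxiliary distribution can be built directly from a joint $P$ and verified to satisfy its naturality condition exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary joint P[X1, X2, Lambda] over small finite alphabets; axes are (X1, X2, Lambda).
P = rng.random((3, 4, 5))
P /= P.sum()

P_x1x2 = P.sum(axis=2)                                        # P[X1, X2]
P_l_given_x1 = P.sum(axis=1) / P.sum(axis=(1, 2))[:, None]    # P[Lambda | X1]
P_l_given_x2 = P.sum(axis=0) / P.sum(axis=(0, 2))[:, None]    # P[Lambda | X2]
P_l = P.sum(axis=(0, 1))                                      # P[Lambda]
P_x1_given_l = P.sum(axis=1) / P_l[None, :]                   # P[X1 | Lambda]
P_x2_given_l = P.sum(axis=0) / P_l[None, :]                   # P[X2 | Lambda]

# Auxiliary distributions, as written above:
Q = P_x1x2[:, :, None] * P_l_given_x1[:, None, :]             # P[X1,X2] P[Lambda|X1]
S = P_x1x2[:, :, None] * P_l_given_x2[None, :, :]             # P[X1,X2] P[Lambda|X2]
M = P_l[None, None, :] * P_x1_given_l[:, None, :] * P_x2_given_l[None, :, :]

# Q: Lambda is independent of X2 given X1 (redundancy via X1).
assert np.allclose(Q / Q.sum(axis=2, keepdims=True), P_l_given_x1[:, None, :])
# S: Lambda is independent of X1 given X2 (redundancy via X2).
assert np.allclose(S / S.sum(axis=2, keepdims=True), P_l_given_x2[None, :, :])
# M: X1 and X2 are independent given Lambda (mediation).
M_given_l = M / M.sum(axis=(0, 1), keepdims=True)
assert np.allclose(M_given_l, P_x1_given_l[:, None, :] * P_x2_given_l[None, :, :])
# Q and S leave the distribution over the observables untouched.
assert np.allclose(Q.sum(axis=2), P_x1x2) and np.allclose(S.sum(axis=2), P_x1x2)
```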
Let $\Lambda'$ be a latent variable defined by $\Lambda' \sim P[\Lambda \mid X_1]$, with

$$P[X_1, X_2, \Lambda'] := P[X_1, X_2]\, P[\Lambda' \mid X_1], \qquad P[\Lambda' = \lambda \mid X_1] := P[\Lambda = \lambda \mid X_1]$$

Now, define the auxiliary distributions $Q'$, $S'$, and $M'$, similarly as above, and show some useful relationships to P, Q, S, R, and M:

$$P_{X_1 X_2 \Lambda'} = Q, \qquad S' = R$$
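Continuing the numerical sketch, and still assuming the reconstructed definitions used here (in particular that $\Lambda'$ is resampled according to $P[\Lambda \mid X_1]$, and that $R$ applies the $X_1$-redundancy factorization followed by the $X_2$-redundancy factorization), those relationships can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((3, 4, 5))                                     # arbitrary joint P[X1, X2, Lambda]
P /= P.sum()

P_x1x2 = P.sum(axis=2)                                        # P[X1, X2]
P_l_given_x1 = P.sum(axis=1) / P.sum(axis=(1, 2))[:, None]    # P[Lambda | X1]
Q = P_x1x2[:, :, None] * P_l_given_x1[:, None, :]             # Q as defined above

# Resampling Lambda' according to P[Lambda | X1] gives a joint which is exactly Q
# (by construction, since P[Lambda' | X1, X2] := P[Lambda | X1]).
P_prime = P_x1x2[:, :, None] * P_l_given_x1[:, None, :]
assert np.allclose(P_prime, Q)

# S' applies the X2-redundancy factorization to the resampled joint ...
Pp_l_given_x2 = P_prime.sum(axis=0) / P_prime.sum(axis=(0, 2))[:, None]   # P'[Lambda | X2]
S_prime = P_x1x2[:, :, None] * Pp_l_given_x2[None, :, :]

# ... which equals R, read as "the X1-factorization followed by the X2-factorization".
P_x1_given_x2 = P_x1x2 / P_x1x2.sum(axis=0, keepdims=True)                # P[X1 | X2]
R = P_x1x2[:, :, None] * np.einsum('ij,il->jl', P_x1_given_x2, P_l_given_x1)[None, :, :]
assert np.allclose(S_prime, R)                                # S' = R under this reading
```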
Next, the error metric and the errors of interest:
Jensen-Shannon Divergence, and Jensen-Shannon Distance (a true metric):

$$D_{JS}(P_1 \| P_2) := \tfrac{1}{2} D_{KL}\!\left(P_1 \,\middle\|\, \tfrac{1}{2}(P_1 + P_2)\right) + \tfrac{1}{2} D_{KL}\!\left(P_2 \,\middle\|\, \tfrac{1}{2}(P_1 + P_2)\right)$$

$$d_{JS}(P_1, P_2) := \sqrt{D_{JS}(P_1 \| P_2)}$$
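The metric property is presumably what lets factorization errors be chained together via the triangle inequality. As a quick numerical illustration (arbitrary random distributions and alphabet size, with `js` and `d_js` as illustrative helper functions), the Jensen-Shannon distance does satisfy the triangle inequality:

```python
import numpy as np

def js(p, q):
    """Jensen-Shannon divergence of two discrete distributions (natural log)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def d_js(p, q):
    """Jensen-Shannon distance: the square root of the divergence, a true metric."""
    return float(np.sqrt(max(js(p, q), 0.0)))

rng = np.random.default_rng(1)
for _ in range(10_000):
    a, b, c = rng.random((3, 6))
    a, b, c = a / a.sum(), b / b.sum(), c / c.sum()
    # Triangle inequality holds for the distance (up to float tolerance).
    assert d_js(a, c) <= d_js(a, b) + d_js(b, c) + 1e-12
```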
Finally, the theorem:
For any distribution $P$ over $(X, \Lambda)$, the latent $\Lambda'$ has redundancy error of zero on one of its factorizations, while the other factorization errors are bounded by a small factor of the errors induced by the original latent $\Lambda$. More formally:
$\forall P[X, \Lambda]$, the latent $\Lambda'$ defined by $\Lambda' \sim P[\Lambda \mid X_1]$ has bounded factorization errors  and .
In fact, that is a simpler but looser bound than that proven below which achieves the more bespoke bounds of: , , and .
, since and
So, as shown above (using the Jensen-Shannon divergence as the error function), resampling any latent variable according to either one of its redundancy diagrams (just swap $X_1$ and $X_2$ for the bounds when resampling from $P[\Lambda \mid X_2]$) produces a new latent variable which satisfies the redundancy and mediation diagrams approximately as well as the original, and satisfies one of the redundancy diagrams perfectly.
The bounds are:
Where the epsilons without superscripts are the errors corresponding to factorization via the respective naturality conditions of the original latent and X.
For , by Cauchy-Schwarz with vectors . Thus the simpler, though looser, bound:
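For reference, the generic form of that Cauchy-Schwarz step (the specific vectors are not reproduced here; the standard move pairs a vector of square-rooted errors with the all-ones vector):

$$\left(\sum_{i=1}^{n} \sqrt{\epsilon_i}\right)^2 = \left(\sum_{i=1}^{n} \sqrt{\epsilon_i} \cdot 1\right)^2 \le n \sum_{i=1}^{n} \epsilon_i,$$

which is the usual way a sum-of-square-roots bound coming from the triangle inequality gets loosened into a plain sum of errors.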
The joint convexity of $D_{JS}$, which justifies this inequality, is inherited from the joint convexity of the KL divergence.
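That convexity claim can also be checked numerically; a small sketch (random distribution pairs and mixing weights, nothing specific to the proof):

```python
import numpy as np

def js(p, q):
    """Jensen-Shannon divergence for discrete distributions (natural log)."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(2)
for _ in range(10_000):
    p1, q1, p2, q2 = rng.random((4, 5))
    p1, q1, p2, q2 = (v / v.sum() for v in (p1, q1, p2, q2))
    lam = rng.random()
    # Joint convexity: D_JS(lam*p1 + (1-lam)*p2 || lam*q1 + (1-lam)*q2)
    #                  <= lam*D_JS(p1 || q1) + (1-lam)*D_JS(p2 || q2)
    lhs = js(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
    rhs = lam * js(p1, q1) + (1 - lam) * js(p2, q2)
    assert lhs <= rhs + 1e-12
```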