A barrier: suppose X and Y are both bitstrings of length 2k. The first k bits of the two strings are equal (i.e. X[:k] == Y[:k] in Python notation); the rest are independent. Otherwise, all bits are maxentropic (i.e. IID 50/50 coinflips).
Then there's an exact (deterministic) natural latent: Λ := X[:k] = Y[:k]. But I(X;Y), H(X|Y), and H(Y|X) are all much larger than zero; each is k bits.
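This is easy to check numerically for small k (a quick Python sketch of our own; k = 2 below). The last three printed quantities are the three conditional mutual informations used as error measures later in the post, and should all come out to zero, while I(X;Y), H(X|Y) and H(Y|X) each come out to k bits.

```python
# Numerical check of the bitstring example, with k = 2.
from itertools import product
from collections import defaultdict
from math import log2

k = 2
joint = defaultdict(float)  # P[x, y, lam]
for s, a, b in product(product([0, 1], repeat=k), repeat=3):
    x, y, lam = s + a, s + b, s      # X = shared prefix + A, Y = shared prefix + B
    joint[(x, y, lam)] += 1 / 2 ** (3 * k)

def H(coords):
    """Entropy (bits) of the marginal over the given coordinates of (X, Y, Lam)."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in coords)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

X, Y, L = 0, 1, 2
print("H(X|Y) =", H([X, Y]) - H([Y]))                               # = k
print("H(Y|X) =", H([X, Y]) - H([X]))                               # = k
print("I(X;Y) =", H([X]) + H([Y]) - H([X, Y]))                      # = k
# Errors of the latent Lam = shared prefix: all three should be zero.
print("I(X;Lam|Y) =", H([X, Y]) + H([Y, L]) - H([Y]) - H([X, Y, L]))
print("I(Y;Lam|X) =", H([X, Y]) + H([X, L]) - H([X]) - H([X, Y, L]))
print("I(X;Y|Lam) =", H([X, L]) + H([Y, L]) - H([L]) - H([X, Y, L]))
```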
The story so far
We (Alfred and Jeremy) started a Dovetail project on Natural Latents in order to get some experience with the proofs. Originally we were going to take a crack at this bounty, but just before we got started, John and David published a proof, closing the bounty. The proof involves a series of transformations that take any stochastic latent and turn it into a deterministic latent, where each step can only increase the error by a small multiple. We decided to work through the proof, understand each step, and attempt to improve the bound. After 20 hours or so of working through it, we found a fatal flaw in one step, and spent the next several days understanding and verifying the flaw. With David’s help, the counterexamples we found were strengthened to show that this transformation step has unbounded error in some cases. We started using sympy for arbitrary-precision evaluation of counterexamples, and along the way found a bug in sympy that occasionally caused us problems, so we switched to Mathematica.
Since then, we’ve been working part time on trying to prove the bounty conjecture ourselves (or show that the conjecture is false). So far we’ve been unsuccessful, but it still feels tantalizingly within reach. Perhaps an automated theorem prover will crack it in the next year or two; it makes a good test case.
Empirical evidence suggests the conjecture is true
We still think the overall bounty conjecture [(∃ Stochastic Natural Latent) Implies (∃ Deterministic Natural Latent)] is true. In fact, the largest ratio of errors that we can find empirically is around 1.82 (i.e. optimal deterministic sum of errors < 1.82 × optimal stochastic sum of errors).
In our quest, we made lots of graphs like the one above. Each point on the orange and blue curves is the score of a numerically optimized latent distribution P[Λ|X,Y], for a fixed P[X,Y]. For simplicity, all three of Λ, X, Y are binary variables. In this graph, the maximal ratio between the blue and orange lines is ~1.82. On the x-axis, Delta represents the correlation between X and Y. For this graph, the marginal distributions are fixed at P[X]=P[Y]={0.5,0.5}. Changing the marginals has some effect on the graphs, but doesn’t seem to increase the ratio between the stochastic and deterministic scores.
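For concreteness, here is a minimal sketch of the kind of inner optimization involved (our own Python/scipy reimplementation, not the code used for the plots; the helper names, the L-BFGS-B optimizer and the random restarts are all our choices). It computes the three errors from the joint P[X,Y,Λ] and minimizes their sum over the four parameters of P[Λ=1|X,Y]; in this binary-Λ setup, a deterministic score can be obtained by instead enumerating the 16 assignments with each P[Λ=1|X,Y] in {0,1}.

```python
import numpy as np
from itertools import product
from scipy.optimize import minimize

def entropy(p):
    p = p[p > 1e-12]
    return -np.sum(p * np.log2(p))

def natural_latent_errors(p_xy, lam_given_xy):
    """Redundancy and mediation errors (bits) for a binary latent.
    p_xy: 2x2 joint over (X, Y); lam_given_xy: 2x2 array of P(Lam=1 | X, Y)."""
    # Joint P[x, y, lam] with axes (X, Y, Lam).
    joint = np.stack([p_xy * (1 - lam_given_xy), p_xy * lam_given_xy], axis=-1)
    H = lambda *axes: entropy(joint.sum(axis=tuple(a for a in (0, 1, 2) if a not in axes)).ravel())
    red1 = H(0, 1) + H(1, 2) - H(1) - H(0, 1, 2)   # I(X; Lam | Y)
    red2 = H(0, 1) + H(0, 2) - H(0) - H(0, 1, 2)   # I(Y; Lam | X)
    med  = H(0, 2) + H(1, 2) - H(2) - H(0, 1, 2)   # I(X; Y | Lam)
    return red1, red2, med

def best_stochastic_score(p_xy, n_restarts=20):
    """Minimise the summed error over the 4 free parameters of P(Lam=1 | X, Y)."""
    rng = np.random.default_rng(0)
    objective = lambda l: sum(natural_latent_errors(p_xy, l.reshape(2, 2)))
    results = [minimize(objective, rng.uniform(0.01, 0.99, size=4),
                        bounds=[(0.0, 1.0)] * 4, method="L-BFGS-B")
               for _ in range(n_restarts)]          # random restarts: many local optima
    return min(r.fun for r in results)

def best_deterministic_score(p_xy):
    """Enumerate all 16 deterministic maps (X, Y) -> {0, 1} and take the best."""
    return min(sum(natural_latent_errors(p_xy, np.array(bits, float).reshape(2, 2)))
               for bits in product([0, 1], repeat=4))

delta = 0.19
p_xy = np.array([[0.25 + delta, 0.25 - delta],
                 [0.25 - delta, 0.25 + delta]])
print(best_stochastic_score(p_xy), best_deterministic_score(p_xy))
```

At this value of Delta, the ratio of the two printed scores should come out close to the ~1.8 peak described above.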
Larger N
One way the conjecture could fail is if larger X and Y variables allow larger ratios. This is more difficult to check reliably numerically, because we’ve found there are a lot of local optima for our numerical optimizer to get stuck on, and this problem quickly gets much worse as N increases.
We ran a long-running Bayesian optimization outer loop over 3x3 P[X,Y] distributions, with an inner loop to find the optimal stochastic and optimal deterministic latents. The largest ratios found clustered around 1.79 (apart from a few larger ones that turned out to be bad local minima in the inner loop).
So we think there are several lines of evidence suggesting the conjecture is true: 1. we’ve failed to find a counterexample despite putting a lot of time into it, 2. the conjecture seems to hold when the errors are measured with Jensen-Shannon divergences instead, and 3. it’s true in the zero-error case.
We are impressed by how tantalizingly, frustratingly simple this problem is while being so difficult to prove.
The Problem
The problem that we are trying to solve can be stated as follows.
Consider two variables X and Y with a joint probability distribution P(X,Y). A latent variable Λ is defined by a conditional probability distribution P(Λ|X,Y).
A variable Λ is called a stochastic natural latent if the joint distribution P(X,Y,Λ) satisfies the three natural latent conditions to within an ‘error’ ϵ :
Redundancy 1: I(X;Λ|Y)≤ϵ
Redundancy 2: I(Y;Λ|X)≤ϵ
Mediation: I(X;Y|Λ)≤ϵ
Note: Here we are expressing the conditions as conditional mutual information, as Aram did in this post. These are equivalent to the conditions when stated in terms of KL divergences [1].
We want to show that if a stochastic natural latent exists, then there also exists a latent ΛD (defined by a conditional distribution P(ΛD|X,Y) ) whose joint distribution P(X,Y,ΛD) satisfies the ‘deterministic natural latent conditions’ to within nϵ. The deterministic natural latent conditions are:
Deterministic Redundancy 1: H(ΛD|X)≤nϵ
Deterministic Redundancy 2: H(ΛD|Y)≤nϵ
Mediation : I(X;Y|ΛD)≤nϵ
We would like n>1 to be a reasonably small number (for some definition of ‘reasonably small’).
Upper bounds on deterministic errors
One way in which the conjecture could be false is if there existed P(X,Y) distributions for which all deterministic natural latents had very large errors (compared to the optimal stochastic latents). It is therefore useful to identify some simple families of deterministic latents and find upper bounds on how large their errors can be.
Constant Latents
One of the simplest kinds of latents we can imagine is the constant latent. This is a latent which takes only one value, regardless of the X,Y values. Notice that in this case we have H(ΛD)=0, as well as H(ΛD|X)=0 and H(ΛD|Y)=0, so the constant latent perfectly satisfies the two deterministic redundancy conditions. Conditioning on the latent does not affect the P(X,Y) distribution, so the conditional mutual information I(X;Y|ΛD) simply equals the mutual information between X and Y. This means that the only nonzero error for this latent is the mediation error, which is I(X;Y).
Copy Latents
Another kind of latent we can consider is the ‘copy latent’, where ΛD simply deterministically ‘copies’ one of the variables (X or Y). This will always perfectly satisfy one of the deterministic redundancy conditions. For example, if ΛD copies X then ΛD=X and we have:
Deterministic Redundancy 1: H(ΛD|X)=H(X|X)=0
The other deterministic redundancy condition will be satisfied to an error equal to the conditional entropy H(X|Y):
Deterministic Redundancy 2: H(ΛD|Y)=H(X|Y)
The mediation condition will also be satisfied perfectly:
Mediation: I(X;Y|ΛD)=H(X|ΛD)−H(X|ΛD,Y)=H(X|X)−H(X|X,Y)=0
If instead we choose a latent which copies Y so ΛD=Y then the mediation and Deterministic Redundancy 2 conditions will be perfectly satisfied and the Deterministic Redundancy 1 error will be H(Y|X).
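These identities are easy to check numerically. Here is a small Python sketch on an arbitrary example distribution (the numbers in p_xy are just an illustration we picked):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# An arbitrary example P(X, Y).
p_xy = np.array([[0.40, 0.10],
                 [0.05, 0.45]])
H_xy = entropy(p_xy.ravel())
H_x, H_y = entropy(p_xy.sum(axis=1)), entropy(p_xy.sum(axis=0))

# Constant latent: conditioning on it changes nothing, so the only error is I(X;Y).
print("constant latent, mediation error:", H_x + H_y - H_xy)        # I(X;Y)

# Copy-X latent: Lam = X, i.e. P(x, y, lam) = P(x, y) * [lam == x].
joint = np.zeros((2, 2, 2))
for x in range(2):
    joint[x, :, x] = p_xy[x, :]
H_y_lam = entropy(joint.sum(axis=0).ravel())     # H(Y, Lam)
H_x_lam = entropy(joint.sum(axis=1).ravel())     # H(X, Lam) = H(X)
H_xy_lam = entropy(joint.ravel())                # H(X, Y, Lam) = H(X, Y)
print("copy-X, H(Lam|Y):", H_y_lam - H_y, "= H(X|Y):", H_xy - H_y)
print("copy-X, I(X;Y|Lam):", H_x_lam + H_y_lam - H_x - H_xy_lam)    # = 0, since H(Lam) = H(X)
```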
Deterministic Functions of X,Y
We can also consider the family of latents where Λf=f(X,Y) is a deterministic function of the X,Y pair (though not necessarily a deterministic function of X or Y in isolation). All latents in this family satisfy the deterministic redundancy conditions with errors bounded by the conditional entropies of X and Y:
Deterministic Redundancy 1: H(Λf|X)≤H(Y|X)
Deterministic Redundancy 2: H(Λf|Y)≤H(X|Y)
A proof of this can be found in this shortform.
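For completeness, here is one way to see the bound (our own sketch of the standard argument), using the chain rule and the fact that H(Λf|X,Y)=0 for a deterministic function of the pair:

$$
H(\Lambda_f\mid Y)\;\le\;H(\Lambda_f,X\mid Y)\;=\;H(X\mid Y)+H(\Lambda_f\mid X,Y)\;=\;H(X\mid Y),
$$

and symmetrically H(Λf|X) ≤ H(Y|X).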
The worst-case upper bound on the mediation error is the same as that of the constant latent, I(X;Y).
General Upper Bound
Notice that three of the latents considered above (‘constant’, ‘copy X’ and ‘copy Y’) can be applied to any initial P(X,Y) distribution. This means that we can always find a deterministic natural latent whose largest error is at most min[I(X;Y),H(X|Y),H(Y|X)]. In general, one of the copy latents will be best for P(X,Y) distributions where X and Y are highly correlated (so that H(X|Y) or H(Y|X) is low), and the constant latent will be best for distributions where X and Y are independent or have low correlation (so that I(X;Y) is low).
We have some numerical evidence that, in many cases, the optimal deterministic latent has an error very close to min[I(X;Y),H(X|Y),H(Y|X)], suggesting that this bound is quite tight (for example, in the first graph in this post, the optimal deterministic latent hugs the min[I(X;Y),H(Y|X)] line perfectly). We know that this bound isn’t tight when the marginal P(X) or P(Y) is far from 0.5 (but it isn’t extremely loose there).
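In code, this bound is a small helper (the function name is ours; the entropy conventions match the sketch earlier in the post):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def simple_det_latent_bound(p_xy):
    """min[I(X;Y), H(X|Y), H(Y|X)]: the error of the best of the
    constant, copy-X and copy-Y latents, for any joint array p_xy."""
    H_xy = entropy(p_xy.ravel())
    H_x, H_y = entropy(p_xy.sum(axis=1)), entropy(p_xy.sum(axis=0))
    return min(H_x + H_y - H_xy,    # I(X;Y):  constant latent
               H_xy - H_y,          # H(X|Y):  copy-X latent
               H_xy - H_x)          # H(Y|X):  copy-Y latent
```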
Unfortunately, proving these upper bounds on the deterministic error is not sufficient to prove the conjecture. In order to prove the conjecture, we need to relate upper bounds on the deterministic error to the error achievable using a stochastic latent.
To complete the proof this way, an upper bound on the deterministic error needs to be combined with a lower bound on the stochastic error. Numerical evidence suggests that distributions for which ϵD = min[I(X;Y),H(X|Y),H(Y|X)] is high are also distributions with higher stochastic natural latent errors, so this approach seems like a reasonable thing to attempt. For example, take a look at the first graph in this post. The stochastic error is lower than the deterministic error, but they both follow similar shapes. In the middle of the graph, the stochastic error and deterministic error both peak at approximately (but not exactly) the same point (around δ=0.19) and follow similar decreasing paths as δ moves away from this point.
If we could prove a lower bound on the error of the stochastic latents of the form ϵS > (1/n)·min[I(X;Y),H(X|Y),H(Y|X)], then we would have proved the conjecture. This is because, for any distribution, we could find a deterministic latent with error ϵD ≤ min[I(X;Y),H(X|Y),H(Y|X)] ≤ nϵS. So if a latent with stochastic error ϵS existed, we could always find a deterministic latent (‘constant’, ‘copy X’ or ‘copy Y’) with an error only n times greater.
Lower Bounding the Stochastic Latent error
We tried a couple of ways of lower bounding the stochastic latent error, but were not successful. First, we tried assuming that there was a stochastic latent which achieved an error of less than (1/n)·min[I(X;Y),H(X|Y),H(Y|X)] and seeing if this implied a contradiction. We tried to derive a contradiction from these three inequalities:
Redundancy 1: I(X;Λ|Y) ≤ (1/n)·min[I(X;Y),H(X|Y),H(Y|X)]
Redundancy 2: I(Y;Λ|X) ≤ (1/n)·min[I(X;Y),H(X|Y),H(Y|X)]
Mediation: I(X;Y|Λ) ≤ (1/n)·min[I(X;Y),H(X|Y),H(Y|X)]
i.e. we wanted to show that satisfying two of these inequalities necessarily meant violating the third (for some distribution over X and Y).
If we could do this, we would have proved a lower bound on the stochastic error. Unfortunately, we failed, but this does feel like an approach someone could take if they were more comfortable than us at manipulating information-theoretic inequalities.
Mixing Latents and Reverse Jensen inequalities
Here is another attempt we made at lower bounding the stochastic error.
Suppose we took two latents, defined by conditional distributions R(Λ|X,Y) and S(Λ|X,Y) and created a new latent using a probabilistic mixture of the two:
P(Λ|X,Y)=aR(Λ|X,Y)+(1−a)S(Λ|X,Y)
with 0≤a≤1.
If we applied the mixed latent to a distribution P(X,Y), the joint P(X,Y,Λ) distribution would also be a mixture of R(X,Y,Λ) and S(X,Y,Λ).
We can also consider the refactored distribution Pred(X,Y,Λ), used to calculate the redundancy error (here, the Redundancy 2 condition; the Redundancy 1 case is analogous):
Pred=P(Y)P(X|Y)P(Λ|X)=P(X,Y)[aR(Λ|X)+(1−a)S(Λ|X)]=aRred+(1−a)Sred
Where Rred and Sred indicate the refactored versions of R and S respectively.
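The step asserting that P(Λ|X) is a mixture of R(Λ|X) and S(Λ|X) with the same weight a follows by averaging the mixed conditional over Y (our own spelled-out check):

$$
P(\Lambda\mid X)=\sum_{y}P(y\mid X)\,P(\Lambda\mid X,y)=\sum_{y}P(y\mid X)\big[a\,R(\Lambda\mid X,y)+(1-a)\,S(\Lambda\mid X,y)\big]=a\,R(\Lambda\mid X)+(1-a)\,S(\Lambda\mid X),
$$

where R(Λ|X) and S(Λ|X) are both computed using the same P(X,Y).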
This means that the KL divergence for the redundancy error of P is given by:
DKL(P(X,Y,Λ)||Pred(X,Y,Λ))=DKL(aR+(1−a)S||aRred+(1−a)Sred)
(Note: writing the KL as a mixture in both of its arguments works for both redundancy conditions, but not the mediation condition. So even if we got this proof strategy to work, we would still need to tie up the loose end of the mediation condition.)
Next, since the KL divergence is jointly convex in its two arguments, we can use this convexity to write a Jensen inequality:
DKL(P||Pred)≤aDKL(R||Rred)+(1−a)DKL(S||Sred)
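This is the standard joint convexity of the KL divergence. A quick numerical spot check (a Python sketch of ours; the distribution size and number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return np.sum(p * np.log2(p / q))

def random_dist(n):
    p = rng.random(n) + 1e-9
    return p / p.sum()

# Joint convexity: KL(aR + (1-a)S || aR' + (1-a)S') <= a KL(R || R') + (1-a) KL(S || S')
for _ in range(1000):
    a = rng.random()
    R, R_red, S, S_red = (random_dist(8) for _ in range(4))
    lhs = kl(a * R + (1 - a) * S, a * R_red + (1 - a) * S_red)
    rhs = a * kl(R, R_red) + (1 - a) * kl(S, S_red)
    assert lhs <= rhs + 1e-12
print("joint convexity held on all random trials")
```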
This is potentially interesting because we can often (always?) write stochastic latents as probabilistic mixtures of deterministic latents. So if R and S are deterministic latents, then this expression links the stochastic error DKL(P||Pred) with the two deterministic errors DKL(R||Rred) and DKL(S||Sred), which is almost the kind of relationship we are trying to establish.
Unfortunately, this inequality is the wrong way round. We are looking for a lower bound on the stochastic error in terms of the deterministic error, of the form ϵDet/ϵStoch ≤ n. But the inequality from the convexity of the KL divergence gives us an upper bound on the stochastic error instead.
We hoped that ‘Reverse Jensen Inequalities’ such as the ones found here might be helpful. These are ways of bounding the ‘Jensen gap’ between the KL divergence of the mixtures and the mixture of the KL divergences. However, when we attempted this, we got results of the form:
ϵStoch≥kϵDet−c
with k,c>0. This gives a ratio ϵDet/ϵStoch ≤ (1/k)(1 + c/ϵStoch), which unfortunately diverges as ϵStoch tends to zero.
Characterizing the optimal stochastic latent
If we define P[X,Y] using four parameters:
p11,p12,p21,p22
and P[Λ|X,Y] using another four parameters:
{{{l11,1-l11},{l12,1-l12}},{{l21,1-l21},{l22,1-l22}}}
where lij is P(Λ=1|X=i,Y=j).
Then we can write out the full error equations in terms of these parameters:
Mediation error (writing p_ij = P(X=i, Y=j), with all indices running over {1,2}):

$$
\sum_{i,j}\left[(1-l_{ij})\,p_{ij}\,\log_2\frac{(1-l_{ij})\,p_{ij}\,\sum_{k,m}(1-l_{km})\,p_{km}}{\left(\sum_{m}(1-l_{im})\,p_{im}\right)\left(\sum_{k}(1-l_{kj})\,p_{kj}\right)}\;+\;l_{ij}\,p_{ij}\,\log_2\frac{l_{ij}\,p_{ij}\,\sum_{k,m}l_{km}\,p_{km}}{\left(\sum_{m}l_{im}\,p_{im}\right)\left(\sum_{k}l_{kj}\,p_{kj}\right)}\right]
$$
Redundancy 1 error (same conventions):

$$
\sum_{i,j}\left[(1-l_{ij})\,p_{ij}\,\log_2\frac{(1-l_{ij})\sum_{k}p_{kj}}{\sum_{k}(1-l_{kj})\,p_{kj}}\;+\;l_{ij}\,p_{ij}\,\log_2\frac{l_{ij}\sum_{k}p_{kj}}{\sum_{k}l_{kj}\,p_{kj}}\right]
$$
One thing we want here is to characterize the optimal latent (the one that minimizes the sum of the Mediation, Redundancy 1 and Redundancy 2 errors) at any given setting of the P[X,Y] parameters. Mathematica can’t handle this analytically, so we tried restricting the P[X,Y] distribution such that both marginals are always 50%. Then the distribution can be parameterized by one variable, δ∈[−1/4, 1/4], like so:
$$
P[X,Y]=\begin{pmatrix}\tfrac14+\delta & \tfrac14-\delta\\[2pt] \tfrac14-\delta & \tfrac14+\delta\end{pmatrix}
$$
We can plot the numerically optimal latents for each value of δ (same as the image at the top of the post, but with each error plotted):
By looking at these numerically optimal latents, we can also hardcode some of the latent parameters to values that we’ve empirically observed to be optimal. When δ<1/8, the latent is 0.5 everywhere (independent of X and Y, so equivalent to the constant latent). When δ>1/8, we observed that l12 and l21 are always 0.5, and l11 = 1−l22. This leaves only one parameter free.
Hardcoding these simplifies things, so we can write each error as a function of δ and l22 alone. For example, the mediation error:
$$
\begin{aligned}
&\frac{2\left(d+\tfrac14\right)(1-l_{22})}{\log(2)}\,\log\!\left(\frac{\left(d+\tfrac14\right)(1-l_{22})\left(\left(d+\tfrac14\right)(1-l_{22})+\left(d+\tfrac14\right)l_{22}+\tfrac14-d\right)}{\left(\left(d+\tfrac14\right)(1-l_{22})+\tfrac12\left(\tfrac14-d\right)\right)^{2}}\right)\\[4pt]
+\;&\frac{2\left(d+\tfrac14\right)l_{22}}{\log(2)}\,\log\!\left(\frac{\left(d+\tfrac14\right)l_{22}\left(\left(d+\tfrac14\right)(1-l_{22})+\left(d+\tfrac14\right)l_{22}+\tfrac14-d\right)}{\left(\left(d+\tfrac14\right)l_{22}+\tfrac12\left(\tfrac14-d\right)\right)^{2}}\right)\\[4pt]
+\;&\frac{2\left(\tfrac14-d\right)}{\log(2)}\,\log\!\left(\frac{\left(\tfrac14-d\right)\left(\left(d+\tfrac14\right)(1-l_{22})+\left(d+\tfrac14\right)l_{22}+\tfrac14-d\right)}{2\left(\left(d+\tfrac14\right)(1-l_{22})+\tfrac12\left(\tfrac14-d\right)\right)\left(\left(d+\tfrac14\right)l_{22}+\tfrac12\left(\tfrac14-d\right)\right)}\right)
\end{aligned}
$$
Note that in this equation and the following one, we use d in place of δ.
Now it’s simple enough that Mathematica (with some cajoling) can analytically find the l22 (as a function of d) such that the total error is minimized:
Root[#1^6 (4096 d^4 + 4096 d^3 + 1536 d^2 + 256 d + 16) + #1^5 (-12288 d^4 - 12288 d^3 - 4608 d^2 - 768 d - 48) + #1^4 (18432 d^4 + 12288 d^3 + 3840 d^2 + 768 d + 72) + #1^3 (-16384 d^4 - 4096 d^3 - 256 d - 64) + #1^2 (8448 d^4 - 1280 d^3 - 672 d^2 + 48 d + 17) + #1 (-2304 d^4 + 1280 d^3 - 96 d^2 - 48 d + 7) + 256 d^4 - 256 d^3 + 96 d^2 - 16 d + 1 &, 2]
It’s the second root of a degree 6 polynomial, whose coefficients are each degree 4 polynomials in d. It looks like this:
This is for δ∈[1/8, 1/4]. For δ∈[0, 1/8], l22 simply equals 0.5.
By analytically calculating the gradient and Hessian of the total error with respect to the latent parameters, and showing that the eigenvalues of the Hessian are positive for all δ, we know that every local movement of the latent increases the error, so the latent we’ve described is definitely a local minimum. We think it is also the global minimum, but haven’t shown this. Presumably it won’t be too hard to prove this by going through every other zero-gradient solution and showing that it has a larger error.
Mathematica can plot the total stochastic error (the plot is missing the first half, but we know that on the first half it equals the optimal deterministic error, because we checked the eigenvalues of the Hessian):
And using our known values for the optimal deterministic latents, we can plot the optimal deterministic error:
And we can plot the ratio of these:
And (numerically) find that the peak ratio is 1.81792, at δ=0.194986.
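This peak can also be reproduced without Mathematica. Below is a rough Python sketch of our own: it assumes the hardcoded one-parameter latent family described above contains the optimum (we only showed above that it is locally optimal), uses scipy's minimize_scalar over l22, and compares against the min[I(X;Y),H(X|Y)] deterministic error.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def entropy(p):
    p = p[p > 1e-15]
    return -np.sum(p * np.log2(p))

def total_error(delta, l22):
    """Sum of the two redundancy errors and the mediation error (bits) for the
    hardcoded latent family l12 = l21 = 1/2, l11 = 1 - l22."""
    p_xy = np.array([[0.25 + delta, 0.25 - delta],
                     [0.25 - delta, 0.25 + delta]])
    lam1 = np.array([[1.0 - l22, 0.5],
                     [0.5,       l22]])                           # P(Lam = 1 | X, Y)
    joint = np.stack([p_xy * (1 - lam1), p_xy * lam1], axis=-1)   # axes (X, Y, Lam)
    H = lambda *axes: entropy(joint.sum(axis=tuple(a for a in (0, 1, 2) if a not in axes)).ravel())
    return ((H(0, 1) + H(1, 2) - H(1) - H(0, 1, 2))        # I(X; Lam | Y)
            + (H(0, 1) + H(0, 2) - H(0) - H(0, 1, 2))      # I(Y; Lam | X)
            + (H(0, 2) + H(1, 2) - H(2) - H(0, 1, 2)))     # I(X; Y | Lam)

def det_error(delta):
    """Best simple deterministic latent: min[I(X;Y), H(X|Y)] (H(Y|X) = H(X|Y) here)."""
    h = entropy(np.array([0.5 + 2 * delta, 0.5 - 2 * delta]))   # H(X|Y)
    return min(1.0 - h, h)                                      # I(X;Y) = 1 - H(X|Y)

best_ratio, best_delta = 0.0, None
for delta in np.linspace(0.126, 0.249, 400):
    stoch = minimize_scalar(lambda l: total_error(delta, l),
                            bounds=(0.0, 1.0), method="bounded").fun
    if det_error(delta) / stoch > best_ratio:
        best_ratio, best_delta = det_error(delta) / stoch, delta
print(best_ratio, best_delta)   # should land near 1.818 and 0.195
```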
Summary
This isn't the most satisfying conclusion, but we have achieved a partial success. We have (almost) a proof that the conjecture is true for the special case of 2x2 distributions whose X and Y marginals are both 0.5, and in this case the ratio n is ~1.82. We failed to find a worse ratio empirically in the general 2x2 case and the general 3x3 case (despite extensive manual effort and tens of hours of automated searches). We think this is moderately strong evidence that the conjecture is true in general.
But we don't have any good leads on methods to prove this in general.
[1] When stated in terms of KL divergences, these conditions are:
Redundancy 1: DKL(P(X,Y,Λ)||P(X,Y)P(Λ|Y))≤ϵ
Redundancy 2: DKL(P(X,Y,Λ)||P(X,Y)P(Λ|X))≤ϵ
Mediation: DKL(P(X,Y,Λ)||P(Λ)P(Y|Λ)P(X|Λ))≤ϵ
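For instance, for the first condition the equivalence is a one-line calculation (our own derivation; the other two conditions are analogous):

$$
D_{KL}\big(P(X,Y,\Lambda)\,\big\|\,P(X,Y)P(\Lambda|Y)\big)=\mathbb{E}_{P(X,Y,\Lambda)}\!\left[\log\frac{P(X,Y,\Lambda)}{P(X,Y)\,P(\Lambda|Y)}\right]=\mathbb{E}\!\left[\log\frac{P(\Lambda|X,Y)}{P(\Lambda|Y)}\right]=I(X;\Lambda\mid Y).
$$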