Here are some updates on the work that @Jeremy Gillen and I have been doing on Natural Latents, following on from Jeremy's comment on Resampling Conserves Redundancy (Approximately). We are still trying to prove that a stochastic natural latent implies the existence of a deterministic latent which is almost as good.
First, if a latent $\Lambda$ is a deterministic function of $X$ and $Y$ (i.e. $\Lambda = f(X,Y)$), the deterministic redundancy conditions will be satisfied up to errors of $H(Y|X)$ and $H(X|Y)$ respectively. (We don't actually use this in anything that follows, but we think it's quite nice.)
Proof:

First, write out $H(X,Y,\Lambda)$ using the entropy chain rule:

$H(X,Y,\Lambda) = H(\Lambda|X,Y) + H(Y|X) + H(X)$

Since $\Lambda$ is a deterministic function of $X$ and $Y$, we can set $H(\Lambda|X,Y) = 0$.

We can expand $H(X,Y,\Lambda)$ differently using the entropy chain rule:

$H(X,Y,\Lambda) = H(Y|\Lambda,X) + H(\Lambda|X) + H(X)$

The two expressions for $H(X,Y,\Lambda)$ must be equal, so:

$H(Y|X) + H(X) = H(Y|\Lambda,X) + H(\Lambda|X) + H(X)$.

Re-arranging gives:

$H(\Lambda|X) = H(Y|X) - H(Y|\Lambda,X) \leq H(Y|X)$

We can repeat this proof with $X$ and $Y$ swapped to get the bound $H(\Lambda|Y) \leq H(X|Y)$.
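As a quick sanity check of these bounds (purely illustrative, not part of the argument above), one can draw a random joint distribution, pick an arbitrary deterministic $f(X,Y)$, and confirm the two inequalities numerically:

```python
# Illustrative check: H(Lambda|X) <= H(Y|X) and H(Lambda|Y) <= H(X|Y)
# for an arbitrary deterministic Lambda = f(X, Y) and a random P(X, Y).
import numpy as np

rng = np.random.default_rng(0)

def H(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) distribution."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

n = 4                                    # alphabet sizes for X and Y
p_xy = rng.random((n, n))
p_xy /= p_xy.sum()                       # random joint distribution P(X, Y)
f = rng.integers(0, 3, size=(n, n))      # arbitrary deterministic Lambda = f(X, Y)

# Build the joint P(X, Y, Lambda).
p_xyl = np.zeros((n, n, 3))
for x in range(n):
    for y in range(n):
        p_xyl[x, y, f[x, y]] = p_xy[x, y]

H_Y_given_X = H(p_xy) - H(p_xy.sum(axis=1))                 # H(X,Y) - H(X)
H_X_given_Y = H(p_xy) - H(p_xy.sum(axis=0))                 # H(X,Y) - H(Y)
H_L_given_X = H(p_xyl.sum(axis=1)) - H(p_xy.sum(axis=1))    # H(X,L) - H(X)
H_L_given_Y = H(p_xyl.sum(axis=0)) - H(p_xy.sum(axis=0))    # H(Y,L) - H(Y)

assert H_L_given_X <= H_Y_given_X + 1e-9
assert H_L_given_Y <= H_X_given_Y + 1e-9
```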
Furthermore, we can always consider a latent which just copies one of the variables (for example $\Lambda = X$). This kind of latent will always perfectly satisfy one of the deterministic redundancy conditions (if $\Lambda = X$, then $H(\Lambda|X) = 0$). The other deterministic redundancy condition will be satisfied with error $H(\Lambda|Y) = H(X|Y)$. The mediation condition will also have zero error, since $I(X;Y|\Lambda) = I(X;Y|X) = 0$ (since $\Lambda = X$).
We can choose $\Lambda$ to copy either $X$ or $Y$, meaning that using this method we can always construct a deterministic natural latent with total error $\min(H(X|Y), H(Y|X))$.
Another deterministic latent that can always be constructed is the constant latent, e.g. $\Lambda = 0$. This will always perfectly satisfy the two deterministic redundancy conditions, since $H(\Lambda|X) = H(\Lambda|Y) = 0$. The mediation error for the constant latent is just the mutual information between $X$ and $Y$: $I(X;Y|\Lambda) = I(X;Y)$.
Between these two types of latent, a deterministic natural latent can always be constructed with error bounded by $\min(I(X;Y), H(X|Y), H(Y|X))$. Loosely: if $X$ and $Y$ are highly correlated you can use the copy latent, and if $X$ and $Y$ are close to being independent, you can use the constant latent.
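To make the numbers concrete, here is a small illustrative computation (a sketch, taking the score of a deterministic latent to be its mediation error plus its two deterministic redundancy errors) of both candidate latents on a toy joint distribution:

```python
# Toy comparison of the copy latent and the constant latent.
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Toy joint distribution in which X and Y are strongly correlated.
p_xy = np.array([[0.48, 0.02],
                 [0.02, 0.48]])

H_XY = H(p_xy)
H_X = H(p_xy.sum(axis=1))
H_Y = H(p_xy.sum(axis=0))
H_X_given_Y = H_XY - H_Y
H_Y_given_X = H_XY - H_X
I_XY = H_X + H_Y - H_XY

# Copy latent Lambda = X: redundancy errors (H(L|X), H(L|Y)) = (0, H(X|Y)),
# mediation error I(X;Y|L) = 0, so its score is H(X|Y).
copy_score = H_X_given_Y

# Constant latent: both redundancy errors are 0, mediation error is I(X;Y).
const_score = I_XY

print(f"copy latent score  = {copy_score:.3f} bits")
print(f"const latent score = {const_score:.3f} bits")
print(f"combined bound     = {min(I_XY, H_X_given_Y, H_Y_given_X):.3f} bits")
```

For this strongly correlated toy distribution the copy latent has the lower score (about 0.24 bits versus 0.76 bits for the constant latent), matching the loose intuition above.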
To test how tight this bound is, Jeremy has numerically found the optimal deterministic and stochastic natural latents for a family of $(X, Y)$ distributions. The family is parametrised by a single parameter which captures how correlated $X$ and $Y$ are (one end of the parameter range gives perfect correlation and the other gives independence). A nice graph of these results is below:
The ‘score’ of the latents is the sum of the errors on the mediation and redundancy conditions. The fact that the best deterministic score closely hugs the $\min(I(X;Y), H(X|Y), H(Y|X))$ line suggests that the bound we found above is pretty tight. When we look at the latents found by the numerical optimizer, we find that it uses the copy latent when $X$ and $Y$ are strongly correlated and the constant latent when they are close to independent.
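For concreteness, here is a brute-force sketch of the deterministic side of this search (illustrative only, not the optimiser used to produce the graphs; it scores each candidate labelling $f : X \times Y \to \{0,\dots,k-1\}$ by $I(X;Y|\Lambda) + H(\Lambda|X) + H(\Lambda|Y)$):

```python
# Brute-force search for the best deterministic latent on tiny alphabets.
import itertools
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def det_score(p_xy, f, k):
    """Mediation error plus deterministic redundancy errors for Lambda = f(X, Y)."""
    n, m = p_xy.shape
    p_xyl = np.zeros((n, m, k))
    for x in range(n):
        for y in range(m):
            p_xyl[x, y, f[x, y]] = p_xy[x, y]
    p_x, p_y, p_l = p_xyl.sum((1, 2)), p_xyl.sum((0, 2)), p_xyl.sum((0, 1))
    p_xl, p_yl = p_xyl.sum(1), p_xyl.sum(0)
    H_L_given_X = H(p_xl) - H(p_x)
    H_L_given_Y = H(p_yl) - H(p_y)
    I_XY_given_L = H(p_xl) + H(p_yl) - H(p_xyl) - H(p_l)   # I(X;Y|L)
    return I_XY_given_L + H_L_given_X + H_L_given_Y

def best_deterministic_latent(p_xy, k=2):
    """Try every labelling f : X x Y -> {0, ..., k-1} and keep the lowest score."""
    n, m = p_xy.shape
    best_score, best_f = np.inf, None
    for labels in itertools.product(range(k), repeat=n * m):
        f = np.array(labels).reshape(n, m)
        score = det_score(p_xy, f, k)
        if score < best_score:
            best_score, best_f = score, f
    return best_score, best_f

p_xy = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
score, f = best_deterministic_latent(p_xy, k=2)
print(f"best deterministic score = {score:.3f} bits, labelling =\n{f}")
```

This obviously only scales to tiny alphabets; it is just meant to pin down what is being optimised.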
We also have a graph which breaks down the individual errors for optimal stochastic and deterministic latents:
It shows quite clearly the point where the optimal deterministic latent switches from the ‘copy’ latent to the constant latent. Furthermore, on the near-independent side of this switching point, we have shown that the constant latent is at a local minimum in the space of possible stochastic latents. We did this by differentiating the score with respect to the conditional probabilities defining the latent and finding the Hessian. This was done using a symbolic program so we don’t have a compact proof to put here. The switching point is also where the errors for the best stochastic and deterministic latents become the same. This seems like good news for the general NxN conjecture, since it shows that when two states are roughly independent, there’s zero cost (in the difference between stochastic and deterministic scores) to ‘merging’ those states from the perspective of the latent.
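As a rough illustration of the kind of check involved (not the symbolic computation itself), one can do the same thing numerically with finite differences: parametrise a binary stochastic latent by $q_{xy} = P(\Lambda = 1 | X = x, Y = y)$, take the stochastic score to be $I(X;Y|\Lambda) + I(\Lambda;X|Y) + I(\Lambda;Y|X)$, and inspect the gradient and Hessian eigenvalues at the constant latent $q_{xy} = 0.5$:

```python
# Finite-difference check of whether the constant latent is a stationary point /
# local minimum of the stochastic score, for a nearly independent toy P(X, Y).
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def stochastic_score(q_flat, p_xy):
    """I(X;Y|L) + I(L;X|Y) + I(L;Y|X) for a binary latent with P(L=1|x,y) = q[x,y]."""
    q = q_flat.reshape(p_xy.shape)
    p_xyl = np.stack([p_xy * (1 - q), p_xy * q], axis=-1)   # joint P(X, Y, L)
    H_XY, H_XYL = H(p_xy), H(p_xyl)
    H_XL, H_YL, H_L = H(p_xyl.sum(1)), H(p_xyl.sum(0)), H(p_xyl.sum((0, 1)))
    H_X, H_Y = H(p_xy.sum(1)), H(p_xy.sum(0))
    I_XY_given_L = H_XL + H_YL - H_XYL - H_L
    I_LX_given_Y = H_YL + H_XY - H_XYL - H_Y
    I_LY_given_X = H_XL + H_XY - H_XYL - H_X
    return I_XY_given_L + I_LX_given_Y + I_LY_given_X

def grad_and_hessian(fn, x0, eps=1e-4):
    """Central-difference gradient and (symmetrised) Hessian of fn at x0."""
    d = len(x0)
    g, Hm = np.zeros(d), np.zeros((d, d))
    for i in range(d):
        ei = np.eye(d)[i] * eps
        g[i] = (fn(x0 + ei) - fn(x0 - ei)) / (2 * eps)
        for j in range(d):
            ej = np.eye(d)[j] * eps
            Hm[i, j] = (fn(x0 + ei + ej) - fn(x0 + ei - ej)
                        - fn(x0 - ei + ej) + fn(x0 - ei - ej)) / (4 * eps ** 2)
    return g, (Hm + Hm.T) / 2

p_xy = np.array([[0.26, 0.24],
                 [0.24, 0.26]])              # X and Y are nearly independent
q0 = np.full(p_xy.size, 0.5)                 # the constant latent
g, Hm = grad_and_hessian(lambda q: stochastic_score(q, p_xy), q0)
print("gradient:           ", np.round(g, 6))
print("Hessian eigenvalues:", np.round(np.linalg.eigvalsh(Hm), 6))
```

A vanishing gradient together with non-negative Hessian eigenvalues is what ‘local minimum’ means here.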
Eyeballing the first graph, it seems like the best deterministic score is never more than twice the best stochastic score for any given distribution. It is interesting to note that the optimal stochastic latent roughly follows the same pattern as the deterministic latent, but it notably performs better than the deterministic latent in the intermediate regime, where both $I(X;Y)$ and the conditional entropies are high. Bear in mind that these graphs are just for one family of distributions (but we have tested others and got similar results).
At the moment we are thinking about ways to lower bound the stochastic latent error in terms of $I(X;Y)$, $H(X|Y)$ and $H(Y|X)$. For example, if we could show that it is impossible for a stochastic latent to have a total error less than $\frac{1}{2}\min(I(X;Y), H(X|Y), H(Y|X))$, then this would prove that the existence of a stochastic natural latent with error $\epsilon$ implies the existence of a deterministic natural latent with error $2\epsilon$ or less. We are also thinking about ways to show that certain latent types have errors which are global minima (as opposed to just local minima) in the space of possible latents.