Congratulations!
I'm very glad to see some serious progress on this agenda. If I'm being honest, I was less than excited by the latest long stretch of (published) results on it. It felt like no progress was being made on the core ideas, only some shoring-up of the foundations and a few minor peripheral insights. It looked concerningly plausible that this path routed through regions of math-space so thorny they were effectively impassable. (This might be an unfair characterization, particularly if those results looked more important in the context of unpublished research/considerations. Sorry if so.)
This result I am very much excited about. It seems to be a meaningful step "depthwards", and serves as an existence proof that depthwards progress is possible at all. Great job!
I wrote this tl;dr for a friend, and thought it worth sharing. I'm not sure it's accurate; I've only read the "Recap" section.
Here is how I understand it.
Suppose that, depending on the temperature, your mirror might be foggy and you might have goose pimples. As in, the temperature helps you predict those variables. But once you know the temperature, there's (approximately) nothing you learn about the state of your mirror from your skin, and vice versa. And! Once you know whether your mirror is foggy, there's basically nothing left to learn about the temperature by observing your skin (and vice versa).
But you still don't know the temperature once you observe those things.
This is a stochastic (approximate) natural latent. The stochasticity is that you don't know the temperature once you know the mirror and skin states.
Their theorem, iiuc, says that there does exist a variable where you (approximately) know its exact state after you've observed either the mirror or your skin.
(I don't currently understand exactly what coarse-graining process they're using to construct the exact natural latent).
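For concreteness, here's a toy numerical sketch of the example above (my own made-up numbers and labels T/M/S, not anything from the post): it checks the mediation condition, the two redundancy conditions, and the leftover uncertainty about temperature directly.

```python
import numpy as np

def H(p):
    """Shannon entropy (bits) of a probability array (any shape)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cond_mi(joint, i, j, k):
    """I(A_i ; A_j | A_k) in bits, for a 3-axis joint distribution."""
    p = np.moveaxis(joint, [i, j, k], [0, 1, 2])
    return H(p.sum(axis=1)) + H(p.sum(axis=0)) - H(p) - H(p.sum(axis=(0, 1)))

# Made-up numbers: T = temperature bucket (0-3, cold to hot),
# M = mirror foggy?, S = goosebumps?  Both are noisy indicators of "warm vs cold".
nT, noise = 4, 0.02
P_T = np.full(nT, 1 / nT)
P_M_given_T = np.array([[1 - noise, noise] if t < 2 else [noise, 1 - noise] for t in range(nT)])
P_S_given_T = np.array([[noise, 1 - noise] if t < 2 else [1 - noise, noise] for t in range(nT)])

# Joint P[T, M, S] = P[T] P[M|T] P[S|T]: M and S are independent given T by construction.
joint = P_T[:, None, None] * P_M_given_T[:, :, None] * P_S_given_T[:, None, :]

print("mediation  I(M;S|T) =", cond_mi(joint, 1, 2, 0))  # 0: temperature mediates
print("redundancy I(T;S|M) =", cond_mi(joint, 0, 2, 1))  # ~0: skin adds ~nothing after mirror
print("redundancy I(T;M|S) =", cond_mi(joint, 0, 1, 2))  # ~0: mirror adds ~nothing after skin
print("H(T|M,S)            =", H(joint) - H(joint.sum(axis=0)))  # ~1 bit: T still unknown
```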
Yup, good example!
The theorem doesn't actually specify a coarse-graining process. The proof would say:
Because the middle bullet is not constructive, we don't technically specify a process. That said, one could specify a process straightforwardly by just starting from T' and pareto-improving the latent in a specific direction until one hits an optimum.
In this case, the coarse-graining would probably just be roughly (temperatures at which the mirror fogs) and (temperatures at which it doesn't), since that's the only nontrivial coarse-graining allowed by the setup (because the coarse-grained value must be approximately determined by the mirror-state).
Noting that this only works as an example if the two signals are (approximately) the same partition of temperatures, i.e. (temperatures at which the mirror fogs) is approximately the same as (temperatures at which you have goosebumps).
Yeah I think
And! Once you know whether your mirror is foggy, there's basically nothing left to learn about the temperature by observing your skin (and vice versa).
is supposed to be scoped under the "Suppose that" from the beginning of the paragraph
This is fantastic. I was a bit annoyed by the pareto optimality section and felt that surely there must be a way to skip that part of the proof. I tried a number of simple transformations that I intuitively thought would make the relevant quantities equal for the appropriate values. None worked. Lesson learned (again): test out ideas first before trying to prove them correct.
How did you work out you could use pareto-optimality? I'm guessing you got it from looking at properties of optimized and unoptimized empirical latents?
stochastic natural latents are relatively easy to test for in datasets
Why is it that stochastic natural latents are easier to test for than deterministic ones? Is it that you can just use the variables themselves as the latent and quickly compute the mediation error?
How did you work out you could use pareto-optimality? I'm guessing you got it from looking at properties of optimized and unoptimized empirical latents?
No, actually. The magic property doesn't always hold for pareto optimal latents; the resampling step is load bearing. So when we numerically experimented with optimizing the latents, we often got latents which didn't have the structure leveraged in the proof (though we did sometimes get the right structure, but we didn't notice that until we knew to look for it).
We figured it out by algebraically playing around with the first-order conditions for pareto optimality, generally trying to simplify them, and noticed that if we assumed zero error on one resampling condition (which at the time we incorrectly thought we had already proven was a free move), then it simplified down a bunch and gave the nice form.
Is it that you can just use the variables themselves as the latent and quickly compute the mediation error?
Yup.
Congrats!
Some interesting directions I think this opens up: intuitively, given a set of variables, we want natural latents to be approximately deterministic across a wide variety of (collections of) those variables, and if a natural latent is approximately deterministic w.r.t. a subset of the variables, then we want that subset to be as small as possible (e.g. strong redundancy is better than weak redundancy when the former is attainable).
The redundancy lattice seems natural for representing this: given an element of the redundancy lattice (a collection of subsets of the variables), we say a latent is a redund over that element if it's approximately deterministic w.r.t. each subset in it. E.g. a latent is weakly redundant if it's a redund over the all-but-one subsets (an approximately deterministic function of each such subset), and strongly redundant if it's a redund over the individual variables. If a latent is a redund over some element, our intuitive desiderata for natural latents correspond to that element containing more subsets (more redundancy) and each subset being small (less "synergy"). Combining this with the mediation condition can probably give us a notion of pareto-optimality for natural latents.
Another thing we could do: when we construct pareto-optimal natural latents over a set of variables, we add them to the original set to augment the redundancy lattice, so that new natural latents can be approximately deterministic functions of (collections of) existing natural latents. This naturally allows us to represent the "hierarchical nature of abstractions", where lower-level abstractions make it easier to compute higher-level ones.
A concrete setting where this can be useful is one where a bunch of agents receive different but partially overlapping sets of observations and aim to predict partially overlapping domains. Having a fine-grained collection of natural latents, redundant across different elements of the redundancy lattice, means we can easily zoom in on the smaller subset of latent variables that's (maximally) redundantly represented by all of the agents (& tell which domains of prediction these latents actually mediate).
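To make the "redund over a lattice element" idea concrete, here's a rough sketch (my own hypothetical code, with a made-up tolerance and a naive plug-in entropy estimator, not anything from the comment or the post) of how one might check, from discrete samples, which subsets of variables approximately determine a candidate latent, and read off the minimal such subsets:

```python
import numpy as np
from itertools import combinations
from collections import Counter

def H_rows(rows):
    """Empirical entropy (bits) of a list of hashable items."""
    counts = np.array(list(Counter(rows).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def cond_entropy(lam, X, subset):
    """Empirical H(lam | X_subset) in bits (naive plug-in estimator)."""
    xs = [tuple(row) for row in X[:, list(subset)]]
    return H_rows(list(zip(xs, lam))) - H_rows(xs)

def minimal_redund_subsets(lam, X, tol=0.05):
    """Subsets S with H(lam | X_S) <= tol, keeping only the minimal ones.
    Smaller subsets in the result = 'stronger' redundancy (less synergy)."""
    n = X.shape[1]
    good = [S for r in range(1, n + 1) for S in combinations(range(n), r)
            if cond_entropy(lam, X, S) <= tol]
    return [S for S in good if not any(set(T) < set(S) for T in good)]

# Demo: the latent is an exact copy of X_0 (and of X_1), so the minimal subsets
# are the singletons {X_0} and {X_1} -- "strong redundancy" over those variables.
rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=2000)
X = np.stack([x, x, rng.integers(0, 3, size=2000)], axis=1)
print(minimal_redund_subsets(x, X))   # -> [(0,), (1,)]
```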
Our posts on natural latents have involved two distinct definitions, which we call "stochastic" and "deterministic" natural latents. We conjectured that, whenever there exists a stochastic natural latent (to within some approximation), there also exists a deterministic natural latent (to within a comparable approximation). Four months ago, we put up a bounty to prove this conjecture.
We've been bottlenecked pretty hard on this problem, and spent most of the last four months attacking it. At long last, we have a proof. As hoped, the proof comes with some qualitative new insights about natural latents, and we expect it will unbottleneck a bunch of future work. The main purpose of this post is to present the proof.
This post officially closes the corresponding bounty.
Recap: What Was The Problem Again?
(This section is mostly copied from the bounty post.)
Some Intuition From The Exact Case
In the exact case, in order for a natural latent to exist over random variables X1,X2, the distribution has to look roughly like this:
Each value of X1 and each value of X2 occurs in only one "block", and within the "blocks", X1 and X2 are independent. In that case, we can take the (exact) natural latent to be a block label.
Notably, that block label is a deterministic function of X.
However, we can also construct other natural latents for this system: we simply append some independent random noise to the block label. That natural latent is not a deterministic function of X; it's a "stochastic" natural latent.
In the exact case, if a stochastic natural latent exists, then the distribution must have the form pictured above, and therefore the block label is a deterministic natural latent. In other words: in the exact case, if a stochastic natural latent exists, then a deterministic natural latent also exists.
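Here's a small sketch of that exact-case picture (a made-up block-structured distribution, not anything from the post): build a joint P[X1,X2] with two blocks, recover the block label as connected components of the support graph, and check that X1 and X2 are independent within each block.

```python
import numpy as np

# A block-structured joint distribution over X1 (4 values) and X2 (3 values):
# each X1 value and each X2 value touches exactly one block, and within a block
# X1 and X2 are independent (each block is a rank-one sub-matrix).
P = np.zeros((4, 3))
P[0:2, 0:2] = 0.6 * np.outer([0.5, 0.5], [0.3, 0.7])   # block A
P[2:4, 2:3] = 0.4 * np.outer([0.2, 0.8], [1.0])        # block B

# Recover the block label: connected components of the bipartite support graph,
# via a tiny union-find over the X1-values and X2-values.
n1, n2 = P.shape
parent = list(range(n1 + n2))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i
for i, j in zip(*np.nonzero(P)):
    parent[find(i)] = find(n1 + j)
label_x1 = np.array([find(i) for i in range(n1)])
label_x2 = np.array([find(n1 + j) for j in range(n2)])
print("block label as a function of X1:", label_x1)
print("block label as a function of X2:", label_x2)

# Within each block, X1 and X2 are independent, so the block label is an exact
# (and deterministic) natural latent for this distribution.
for b in set(label_x1):
    block = P[np.ix_(label_x1 == b, label_x2 == b)]
    block = block / block.sum()
    assert np.allclose(block, np.outer(block.sum(axis=1), block.sum(axis=0)))
```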
Our goal here is to prove that this still holds in the approximate case, using the same information theoretic approximation methods used in our other posts on natural latents (and explained here).
The Problem
"Stochastic" Natural Latents
Stochastic natural latents were introduced in the original Natural Latents post. Any latent Λ over random variables X1,X2 is defined to be a stochastic natural latent when it satisfies these diagrams:
... and Λ is an approximate stochastic natural latent (with error ϵ) when it satisfies the approximate versions of those diagrams to within ϵ, i.e.
ϵ≥DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ])
ϵ≥DKL(P[X,Λ]||P[X2]P[X1|X2]P[Λ|X1])
ϵ≥DKL(P[X,Λ]||P[X1]P[X2|X1]P[Λ|X2])
Key thing to note: if Λ satisfies these conditions, then we can create another stochastic natural latent Λ′ by simply appending some random noise to Λ, independent of X. This shows that Λ can, in general, contain arbitrary amounts of irrelevant noise while still satisfying the stochastic natural latent conditions.
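A minimal sketch of these three errors as code (my own toy representation, a 3-axis array P[x1, x2, λ], not anything from the post), plus a numerical check of the "appending noise" remark:

```python
import numpy as np

def H(p):
    """Shannon entropy (bits)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def stochastic_nl_errors(P):
    """P[x1, x2, lam] -> (mediation, redundancy_1, redundancy_2) in bits.
    Each KL divergence above reduces to a conditional mutual information:
    I(X1;X2|Λ), I(Λ;X2|X1), and I(Λ;X1|X2) respectively."""
    P_x1_lam, P_x2_lam, P_x = P.sum(axis=1), P.sum(axis=0), P.sum(axis=2)
    P_lam, P_x1, P_x2 = P.sum(axis=(0, 1)), P.sum(axis=(1, 2)), P.sum(axis=(0, 2))
    mediation = H(P_x1_lam) + H(P_x2_lam) - H(P) - H(P_lam)
    redund_1  = H(P_x1_lam) + H(P_x) - H(P) - H(P_x1)
    redund_2  = H(P_x2_lam) + H(P_x) - H(P) - H(P_x2)
    return mediation, redund_1, redund_2

# "Key thing to note" check: appending an independent fair coin to Λ doubles its
# cardinality but leaves all three errors exactly unchanged.
rng = np.random.default_rng(0)
P = rng.random((3, 3, 2)); P /= P.sum()              # generic toy joint over (X1, X2, Λ)
P_noisy = np.stack([0.5 * P, 0.5 * P], axis=-1).reshape(3, 3, 4)
print(stochastic_nl_errors(P))
print(stochastic_nl_errors(P_noisy))                 # same three numbers
```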
"Deterministic" Natural Latents
Deterministic natural latents were introduced in a post by the same name. Any latent Λ over random variables X1,X2 is defined to be a deterministic natural latent when it satisfies these diagrams:
... and Λ is an approximate deterministic natural latent (with error ϵ) when it satisfies the approximate versions of those diagrams to within ϵ, i.e.
ϵ≥DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ])
ϵ≥H(Λ|X1)
ϵ≥H(Λ|X2)
See the linked post for an explanation of a variable appearing multiple times in a diagram, and of how the approximation conditions for those diagrams simplify to entropy bounds.
Note that the deterministic natural latent conditions, either with or without approximation, imply the stochastic natural latent conditions; a deterministic natural latent is also a stochastic natural latent.
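Same toy representation as the sketch above (again mine, not the post's): the deterministic-latent errors replace the two redundancy KLs with conditional entropies, and since I(Λ;X2|X1) ≤ H(Λ|X1) (and likewise with X1, X2 swapped), approximate determinism does imply approximate redundancy.

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def deterministic_nl_errors(P):
    """P[x1, x2, lam] -> (mediation, H(Λ|X1), H(Λ|X2)) in bits."""
    P_x1_lam, P_x2_lam, P_lam = P.sum(axis=1), P.sum(axis=0), P.sum(axis=(0, 1))
    P_x1, P_x2 = P.sum(axis=(1, 2)), P.sum(axis=(0, 2))
    mediation = H(P_x1_lam) + H(P_x2_lam) - H(P) - H(P_lam)
    return mediation, H(P_x1_lam) - H(P_x1), H(P_x2_lam) - H(P_x2)

# Check the implication on a generic joint: the redundancy error I(Λ;X2|X1) is
# always upper-bounded by H(Λ|X1), so small entropy error => small redundancy error.
rng = np.random.default_rng(1)
P = rng.random((3, 4, 2)); P /= P.sum()
redund_1 = H(P.sum(axis=1)) + H(P.sum(axis=2)) - H(P) - H(P.sum(axis=(1, 2)))  # I(Λ;X2|X1)
_, H_lam_x1, _ = deterministic_nl_errors(P)
assert redund_1 <= H_lam_x1 + 1e-9
print(redund_1, "<=", H_lam_x1)
```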
The Goal
We'd like a proof that, if a stochastic natural latent exists over two variables X1,X2 to within approximation ϵ, then a deterministic natural latent exists over those two variables to within an approximation of roughly ϵ (specifically, 9ϵ).
The Proof
Key Ideas
There are two key ideas to the proof.
The first key idea is to use resampling to obtain a latent which satisfies one of the natural latent conditions exactly, and the others approximately.
The second key idea is to consider pareto optimal stochastic natural latents - i.e. latents with pareto minimal error on the three natural latent conditions.
It turns out that stochastic natural latents which exactly satisfy one of the natural latent conditions and are pareto optimal work like the exact case, even when no exact natural latent exists.
Specifically: pareto optimal stochastic natural latents Λ′ over X1,X2 which satisfy one redundancy condition exactly can always be coarse grained into a deterministic natural latent - i.e.
So, fΛ(Λ′) is itself a natural latent with the same errors as Λ′, and it's exactly a deterministic function of X.
This was a big and very welcome surprise to us!
Math
Assumptions & Preconditions
We will assume P[X,Λ]>0 for the original stochastic natural latent Λ. This implies that there are no nontrivial exact natural latents.[1] We make this assumption mainly for mathematical convenience; since we're interested in practical approximation anyway, it is a reasonable assumption to use. But we do expect the assumption can be dropped, at the cost of more details to handle in the proof.
The main preconditions for our proof are that three random variables X1, X2, Λ approximately satisfy the three natural latent conditions, i.e.
or, written out (and simplified a little),
First redundancy condition: ϵ≥DKL(P[X,Λ]||P[X]P[Λ|X1])
Second redundancy condition: ϵ≥DKL(P[X,Λ]||P[X]P[Λ|X2])
Mediation condition: ϵ≥DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ])
Resampling Conserves Naturality
A previous post showed that resampling conserves redundancy. Specifically, we can construct a new latent Λ′ by sampling from P[Λ|X2]; the new joint distribution is then
P[Λ′=λ,X=x] = P[X=x] P[Λ=λ|X2=x2]
Given the two redundancy conditions and P[X,Λ]>0, the new latent Λ′ satisfies the second redundancy condition perfectly (by construction), and satisfies the first redundancy condition to within 9ϵ. (Leveraging all three natural latent conditions, empirical tests also strongly suggest that that bound can be improved from 9ϵ to 3ϵ, but that is not yet proven.)
Now, imagine that instead of constructing Λ′ by sampling from P[Λ|X2], we instead construct X′1 by sampling from P[X1|X2]. Key insight: this results in exactly the same joint distribution as sampling Λ′ from P[Λ|X2], i.e.
P[X′1=x1, X2=x2, Λ=λ] = P[X1=x1, X2=x2, Λ′=λ] = P[X2=x2] P[X1=x1|X2=x2] P[Λ=λ|X2=x2]
By the same "resampling conserves redundancy" theorem, the X′1 construction satisfies the mediation condition to within 9ϵ (rather than the first redundancy condition). But since the X′1 construction and the Λ′ construction yield the same joint distribution, with one of them implying the first redundancy condition is satisfied to within 9ϵ and the other implying the mediation condition is satisfied to within 9ϵ, that same joint distribution must satisfy both conditions to within 9ϵ.
Putting that all together: starting from a latent Λ over X1,X2 which satisfies all three natural latent conditions to within ϵ, we can construct a new latent Λ′ which satisfies the second redundancy condition perfectly, and satisfies the other two conditions to within 9ϵ.
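Here's a sketch of that resampling construction in the same toy array representation used above (mine, not the authors' code): build Λ′ via P[Λ′=λ, X] = P[X]·P[Λ=λ|X2] and confirm the second redundancy error is exactly zero. (The 9ϵ bound on the other two conditions is the theorem's content, not something this toy check establishes.)

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def errors(P):
    """(mediation, redundancy_1, redundancy_2) in bits for P[x1, x2, lam]."""
    P_x1_lam, P_x2_lam, P_x = P.sum(axis=1), P.sum(axis=0), P.sum(axis=2)
    return (H(P_x1_lam) + H(P_x2_lam) - H(P) - H(P.sum(axis=(0, 1))),
            H(P_x1_lam) + H(P_x) - H(P) - H(P.sum(axis=(1, 2))),
            H(P_x2_lam) + H(P_x) - H(P) - H(P.sum(axis=(0, 2))))

def resample_from_x2(P):
    """New latent Λ′ with P[Λ′=λ, X] = P[X] · P[Λ=λ | X2]."""
    P_x = P.sum(axis=2)                                      # P[X1, X2]
    P_x2_lam = P.sum(axis=0)                                 # P[X2, Λ]
    P_lam_given_x2 = P_x2_lam / P_x2_lam.sum(axis=1, keepdims=True)
    return P_x[:, :, None] * P_lam_given_x2[None, :, :]

rng = np.random.default_rng(2)
P = rng.random((3, 3, 3)); P /= P.sum()                      # strictly positive toy joint
P_prime = resample_from_x2(P)
print("before:", errors(P))
print("after: ", errors(P_prime))                            # second redundancy error = 0
assert abs(errors(P_prime)[2]) < 1e-9
```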
We'll use that new latent as our starting point for the second half of the proof, in which we look for pareto improvements upon Λ′.
Pareto Minimization -> Single Objective Minimization
In the second half of the proof, we'll consider taking pareto improvements upon Λ′, i.e. looking for a latent Λ∗ with pareto-optimal error on the three natural latent conditions. Since we're pareto improving on Λ′, Λ∗ will have zero error on the second redundancy condition (i.e. P[X,Λ∗]=P[X]P[Λ∗|X2]), and at most 9ϵ error on the other two conditions.
First, we convert our pareto minimization problem into a single objective minimization problem in the standard way, in order to use the standard optimization toolset.
A latent Λ∗ is defined by P[Λ∗|X]. We want to characterize latents for which the errors on the three natural latent conditions are pareto minimal, holding P[X1,X2] constant. The three errors are:
(Note that we've written these slightly differently from the previous section. They are equivalent, and these expressions will save some minor rearrangement in the proof.)
To use the usual optimization toolset, we convert the pareto minimization problem into a single objective minimization problem by assigning weights α to each error. Our single objective is
obj:=αmedDKL(X1←Λ∗→X2)+α1DKL(Λ∗→X1→X2)+α2DKL(Λ∗→X2→X1)
Any pareto minimum for the original problem must be a minimum of obj for some (αmed,α1,α2) with αi≥0 for all indices (including med, which is an ordinary index). Without loss of generality, we assume ∑iαi=1. In general, different α's in the single objective problem pick out different pareto minima in the original problem.
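A rough sketch of that scalarization (hypothetical code, not the numerical experiments mentioned in the comments): parameterize P[Λ∗|X] with a softmax, hold P[X] fixed, and minimize the α-weighted sum of the three errors. A generic optimizer will only find local optima, so this is illustrative rather than a faithful way to enumerate the pareto frontier.

```python
import numpy as np
from scipy.optimize import minimize

def H(p):
    p = p[p > 1e-300]
    return -np.sum(p * np.log2(p))

def errors(P):
    """(mediation, redundancy_1, redundancy_2) in bits for P[x1, x2, lam]."""
    P_x1_lam, P_x2_lam, P_x = P.sum(axis=1), P.sum(axis=0), P.sum(axis=2)
    return np.array([H(P_x1_lam) + H(P_x2_lam) - H(P) - H(P.sum(axis=(0, 1))),
                     H(P_x1_lam) + H(P_x) - H(P) - H(P.sum(axis=(1, 2))),
                     H(P_x2_lam) + H(P_x) - H(P) - H(P.sum(axis=(0, 2)))])

def joint_from_theta(theta, P_x, n_lam):
    """P[X, Λ∗] = P[X] · softmax(theta[x]); theta parameterizes P[Λ∗|X]."""
    logits = theta.reshape(*P_x.shape, n_lam)
    P_lam_given_x = np.exp(logits - logits.max(axis=-1, keepdims=True))
    P_lam_given_x /= P_lam_given_x.sum(axis=-1, keepdims=True)
    return P_x[:, :, None] * P_lam_given_x

def obj(theta, P_x, n_lam, alpha):
    return float(alpha @ errors(joint_from_theta(theta, P_x, n_lam)))

rng = np.random.default_rng(3)
P_x = rng.random((3, 3)); P_x /= P_x.sum()                   # fixed observed P[X1, X2]
alpha = np.array([0.4, 0.3, 0.3])                            # (α_med, α_1, α_2)
n_lam = 3
res = minimize(obj, rng.normal(size=P_x.size * n_lam), args=(P_x, n_lam, alpha),
               method="L-BFGS-B")
print("errors at a local optimum:", errors(joint_from_theta(res.x, P_x, n_lam)))
```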
Lagrangian & First Order Conditions
Now we turn the crank.
We (implicitly so far) have two constraints on our optimization problem: the conditional distribution P[Λ∗|X] must be normalized (∑Λ∗ P[Λ∗|X] = 1 for each X), and each probability must be nonnegative (P[Λ∗|X] ≥ 0).
We introduce Lagrange multipliers ν[X], μ[Λ∗,X], respectively, for these two constraints. That gives the Lagrangian
L := obj − ∑X ν[X](∑Λ∗ P[Λ∗|X] − 1) − ∑Λ∗,X μ[Λ∗,X] P[Λ∗|X]
Differentiating L with respect to P[Λ∗|X] (at constant P[X]), simplifying, and absorbing some terms into the Lagrange multipliers yields our first order condition:
0=lnP[X|Λ∗]−(α1+αmed)lnP[X1|Λ∗]−(α2+αmed)lnP[X2|Λ∗]−ν′[X]−μ[X,Λ∗]
where the Lagrange multiplier ν′ has absorbed some terms which depend only on X.
Note that, while the term μ[X,Λ∗] looks completely arbitrary, it is constrained by complementary slackness: μ[X,Λ∗] can be nonzero only if P[Λ∗|X] is zero, i.e. μ[X,Λ∗]P[Λ∗|X]=0 for all X,Λ∗.
Putting The Pieces Together & Solving The Equations
Earlier, we established that a latent exists which satisfies the second redundancy condition perfectly and has error at most 9ϵ on the other two conditions. So, let's use our first order condition to characterize specifically those pareto optimal latents which perfectly satisfy the second redundancy condition.
Perfect satisfaction of the second redundancy condition means P[X|Λ∗]=P[X1|X2]P[X2|Λ∗]. Substituting that into the first order condition and simplifying gives
0=−(α1+αmed)lnP[X1|Λ∗]+α1lnP[X2|Λ∗]−ν′′[X]−μ[X,Λ∗]
(Spelling out the substitution: lnP[X|Λ∗] = lnP[X1|X2] + lnP[X2|Λ∗]; the lnP[X1|X2] term depends only on X and is absorbed into ν′′[X], and the coefficient on lnP[X2|Λ∗] becomes 1−(α2+αmed) = α1, using ∑iαi=1.)
Now, pick values x,λ0,λ1 such that P[X=x|Λ∗=λ0]>0 and P[X=x|Λ∗=λ1]>0. Then μ[X=x,Λ∗=λ0]=μ[X=x,Λ∗=λ1]=0 by complementary slackness, and we can subtract the first order conditions at (x,λ1) and (x,λ0) to get
0=−(α1+αmed)(lnP[X1=x1|Λ∗=λ1]−lnP[X1=x1|Λ∗=λ0])+α1(lnP[X2=x2|Λ∗=λ1]−lnP[X2=x2|Λ∗=λ0])
Note that one of those terms depends on X1 (but not X2), and the other depends on X2 (but not X1), so the only way they can add to 0 for all X values for which P[X|Λ∗=λ0]>0 and P[X|Λ∗=λ1]>0 is if both are equal to some C(λ0,λ1) which does not depend on X:
C(λ0,λ1)=(α1+αmed)(lnP[X1|Λ∗=λ1]−lnP[X1|Λ∗=λ0])
C(λ0,λ1)=α1(lnP[X2|Λ∗=λ1]−lnP[X2|Λ∗=λ0])
Both of those equations must hold for all X such that P[X|Λ∗=λ0]>0 and P[X|Λ∗=λ1]>0.
Notably, our assumption P[X,Λ]>0[2] implies P[X]>0 implies P[X1|X2]>0. Combined with P[X|Λ∗]=P[X1|X2]P[X2|Λ∗] (the perfect second redundancy condition for Λ∗), that means P[X1|Λ∗]>0 for all X1,Λ∗ - or, put differently, for all X1, Λ∗, there exists some X2 such that P[X1,X2,Λ∗]>0. So,
C(λ0,λ1)=(α1+αmed)(lnP[X1|Λ∗=λ1]−lnP[X1|Λ∗=λ0])
must hold for all X1, whenever the supports of P[X2|Λ∗=λ0] and P[X2|Λ∗=λ1] overlap at all. Shuffling terms around, we get
P[X1|Λ∗=λ1] = P[X1|Λ∗=λ0] e^(C(λ0,λ1)/(α1+αmed))
Sum on X1 on both sides, and we get 1 = e^(C(λ0,λ1)/(α1+αmed)), implying C(λ0,λ1)=0 and therefore P[X1|Λ∗=λ1]=P[X1|Λ∗=λ0].
In short: given two Λ∗ values λ0,λ1, if there exists any X2 such that P[X2|Λ∗=λ0]>0 and P[X2|Λ∗=λ1]>0 (i.e. the support of P[X2|Λ∗] overlaps for the two Λ∗ values), then the two values yield exactly the same distribution P[X1|Λ∗].
Furthermore, since C(λ0,λ1)=0 whenever P[X2|Λ∗=λ0] and P[X2|Λ∗=λ1] have overlapping support, we also have
P[X2|Λ∗=λ0]=P[X2|Λ∗=λ1]
anywhere that both of those quantities are nonzero.
A (Non-Strict) Pareto Improvement Via Coarse Graining
A quick recap of where that last section leaves us. We've established that: (a) starting from Λ′, we can find a pareto optimal latent Λ∗ which still satisfies the second redundancy condition perfectly and has at most 9ϵ error on the other two natural latent conditions; and (b) for any such latent, whenever two values λ0,λ1 have overlapping support of P[X2|Λ∗], they yield exactly the same distribution P[X1|Λ∗], and P[X2|Λ∗=λ0]=P[X2|Λ∗=λ1] wherever both are nonzero.
Now, assume the supports of P[X2|Λ∗=λ0] and P[X2|Λ∗=λ1] overlap somewhere, and consider coarse graining those two values of Λ∗. Compared to Λ∗ itself, how does the coarse grained variable g(Λ∗) score on each of the natural latent conditions?
So, without making the errors on any of the three natural latent conditions any worse, we can coarse grain all Λ∗ values λ0,λ1 for which the supports of P[X2|Λ∗=λ0] and P[X2|Λ∗=λ1] overlap somewhere.
Once all such coarse graining is performed, we have a new coarse grained latent g(Λ∗) for which the supports of P[X2|g(Λ∗)] are nonoverlapping across all pairs of distinct g(Λ∗) values.
In other words: g(Λ∗) is exactly a deterministic function of X2 (and therefore still perfectly satisfies the second redundancy condition), and satisfies the first redundancy condition and mediation condition each to within 9ϵ.
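A sketch of that coarse-graining step (again my own toy code, not the authors'): merge latent values whose P[X2|Λ∗] supports overlap, using connected components over latent values. Note the guarantee that this doesn't worsen any errors is specific to the pareto-optimal latents characterized above; on a generic latent this merge can be lossy (and can collapse everything into one value).

```python
import numpy as np

def coarse_grain_by_x2_support(P, tol=1e-12):
    """P[x1, x2, lam] -> P[x1, x2, g]: merge latent values whose P[X2|Λ∗] supports overlap."""
    support = P.sum(axis=0) > tol           # support[x2, lam]: is x2 in the support of λ?
    n_lam = P.shape[2]
    parent = list(range(n_lam))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a in range(n_lam):
        for b in range(a + 1, n_lam):
            if np.any(support[:, a] & support[:, b]):
                parent[find(a)] = find(b)
    roots = sorted({find(i) for i in range(n_lam)})
    relabel = {r: g for g, r in enumerate(roots)}
    P_cg = np.zeros(P.shape[:2] + (len(roots),))
    for lam in range(n_lam):
        P_cg[:, :, relabel[find(lam)]] += P[:, :, lam]
    return P_cg

# Tiny usage example: three latent values; the first two have overlapping X2-support.
P = np.zeros((2, 3, 3))
P[:, 0, 0] = [0.10, 0.10]; P[:, 1, 0] = [0.05, 0.05]   # λ=0 supported on X2 in {0, 1}
P[:, 1, 1] = [0.10, 0.10]                              # λ=1 supported on X2 in {1}
P[:, 2, 2] = [0.20, 0.30]                              # λ=2 supported on X2 in {2}
P_cg = coarse_grain_by_x2_support(P)
print(P_cg.shape)   # (2, 3, 2): λ=0 and λ=1 merged, λ=2 kept separate
# Each X2 value now determines the coarse-grained value, i.e. H(g(Λ∗) | X2) = 0.
```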
Finally, A Deterministic Natural Latent
Lastly, note that H(g(Λ∗)|X2)=0 and X2→X1→g(Λ∗) to within 9ϵ together imply 9ϵ≥H(g(Λ∗)|X1). Combined with the mediation condition, that implies g(Λ∗) is approximately a deterministic natural latent, with errors at most 9ϵ on the mediation condition, 9ϵ on H(g(Λ∗)|X1), and 0 on H(g(Λ∗)|X2).
Can we do better?
The main room for improvement of the bounds in this proof is in the resampling step. The resampling conserves redundancy post notes where those bounds could be improved, and presents a little empirical evidence that they can be improved to 3ϵ (using all three natural latent conditions) or 4ϵ (using only redundancy).
What's Next?
We've been bottlenecked pretty hard on this theorem for the past 3-4 months.
Now that we finally have it, we expect to largely abandon stochastic natural latents in favor of deterministic natural latents. For instance, one immediate next step will be to rewrite our Illiad paper from last year to work with deterministic natural latents, which will eliminate the weakest parts of that paper and give a much more compelling case. (No, we're not linking to the old paper, because the new one is going to be a lot better.)
On another front: stochastic natural latents are relatively easy to test for in datasets, by looking for three variables each of which mediates between the other two. Now we have some idea of what to do with those triples when we find them: compute the deterministic constraint between them.
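A sketch of what such a test could look like on discrete data (the estimator, threshold, and synthetic demo are all my own choices, not the post's): for each triple of columns, estimate the three pairwise conditional mutual informations and flag triples where all of them are small.

```python
import numpy as np
from itertools import combinations
from collections import Counter

def H_rows(rows):
    counts = np.array(list(Counter(rows).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def cond_mi(a, b, c):
    """Empirical I(a; b | c) in bits from discrete sample columns (naive plug-in)."""
    return (H_rows(list(zip(a, c))) + H_rows(list(zip(b, c)))
            - H_rows(list(zip(a, b, c))) - H_rows(list(c)))

def mediating_triples(data, tol=0.1):
    """Triples of columns where each variable approximately mediates between the other two."""
    found = []
    for i, j, k in combinations(range(data.shape[1]), 3):
        errs = [cond_mi(data[:, j], data[:, k], data[:, i]),   # i mediates
                cond_mi(data[:, i], data[:, k], data[:, j]),   # j mediates
                cond_mi(data[:, i], data[:, j], data[:, k])]   # k mediates
        if max(errs) <= tol:
            found.append(((i, j, k), errs))
    return found

# Synthetic demo: column 2 is a shared bit; columns 0 and 1 are slightly noisy copies of it.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=5000)
flip = lambda x: np.where(rng.random(x.size) < 0.01, 1 - x, x)
data = np.stack([flip(z), flip(z), z], axis=1)
print(mediating_triples(data))   # the triple (0, 1, 2) should qualify
```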
Beyond those two immediate projects, we expect this result to be foundational for basically all of our work on natural latents going forward.
[1] This is because P[X, Λ] > 0 implies P[X] > 0, and nontrivial exact natural latents necessarily require some values of X to have probability 0.
[2] Note that that's Λ, i.e. the original latent, not Λ∗.