Congratulations!
I'm very glad to see some serious progress on this agenda. If I'm being honest, I was less than excited by the latest long stretch of (published) results on it. It felt like no progress was being made on the core ideas, only the shoring-up of the foundations and some minor peripheral insights. It looked concerningly plausible that this path routed through regions of math-space so thorny they were effectively impassable. (This might be an unfair characterization, particularly if those results looked more important in the context of unpublished research/considerations. Sorry if so.)
This result I am very much excited about. It seems to be a meaningful step "depthwards", and serves as an existence proof that depthwards progress is possible at all. Great job!
I wrote this tl;dr for a friend, and thought it worth sharing. I'm not sure it's accurate; I've only read the "Recap" section.
Here is how I understand it.
Suppose that, depending on the temperature, your mirror might be foggy and you might have goose pimples. As in, the temperature helps you predict those variables. But once you know the temperature, there's (approximately) nothing you learn about the state of your mirror from your skin, and vice versa. And! Once you know whether your mirror is foggy, there's basically nothing left to learn about the temperature by observing your skin (and vice versa).
But you still don't know the temperature once you observe those things.
This is a stochastic (approximate) natural latent. The stochasticity is that you don't know the temperature once you know the mirror and skin states.
Their theorem, iiuc, says that there does exist a variable whose exact state you (approximately) know after you've observed either the mirror or your skin.
(I don't currently understand exactly what coarse-graining process they're using to construct the exact natural latent).
Yup, good example!
The theorem doesn't actually specify a coarse-graining process. The proof would say:
- Construct a new latent $T'$ by resampling the temperature conditioned on just one of the observables (say, the mirror); $T'$ satisfies one of the redundancy conditions exactly, and the other conditions approximately.
- There exists a pareto optimal latent whose errors on the three conditions are no worse than $T'$'s (and which still satisfies that redundancy condition exactly).
- Any such pareto optimal latent can be coarse grained into a deterministic natural latent.
Because the middle bullet is not constructive, we don't technically specify a process. That said, one could specify a process straightforwardly by just starting from T' and pareto-improving the latent in a specific direction until one hits an optimum.
In this case, the coarse-graining would probably just be roughly (temperatures at which the mirror fogs) and (temperatures at which it doesn't), since that's the only nontrivial coarse-graining allowed by the setup (because the coarse-grained value must be approximately determined by the mirror-state).
This is fantastic. I was a bit annoyed by the pareto optimality section and felt that surely there must be a way to skip that part of the proof. I tried a number of simple transformations on the latent that I intuitively thought would make the relevant conditional distributions equal for the appropriate values. None worked. Lesson learned (again): test out ideas first before trying to prove them correct.
How did you work out you could use pareto-optimality? I'm guessing you got it from looking at properties of optimized and unoptimized empirical latents?
stochastic natural latents are relatively easy to test for in datasets
Why is it that stochastic natural latents are easier to test for than deterministic? Is it that you can just use the variables themselves as the latent and quickly compute mediation error?
How did you work out you could use pareto-optimality? I'm guessing you got it from looking at properties of optimized and unoptimized empirical latents?
No, actually. The magic property doesn't always hold for pareto optimal latents; the resampling step is load bearing. So when we numerically experimented with optimizing the latents, we often got latents which didn't have the structure leveraged in the proof (though we did sometimes get the right structure, but we didn't notice that until we knew to look for it).
We figured it out by algebraically playing around with the first-order conditions for pareto optimality, generally trying to simplify them, and noticed that if we assumed zero error on one resampling condition (which at the time we incorrectly thought we had already proven was a free move), then it simplified down a bunch and gave the nice form.
Is it that you can just use the variables themselves as the latent and quickly compute mediation error?
Yup.
Congrats!
Some interesting directions I think this opens up: Intuitively, given a set of variables $X_1, \dots, X_n$, we want natural latents to be approximately deterministic across a wide variety of (collections of) variables, and if a natural latent is approximately deterministic w.r.t. a subset of variables $S$, then we want $S$ to be as small as possible (e.g. strong redundancy is better than weak redundancy when the former is attainable).
The redundancy lattice seems natural for representing this: Given an element $\alpha$ of the redundancy lattice (a collection of subsets of the variables, none containing another), we say $\Lambda$ is a redund over $\alpha$ if it's approximately deterministic w.r.t. each subset in $\alpha$. E.g. $\Lambda$ is weakly redundant over $X_1, \dots, X_n$ if it's a redund over $\{X_{\bar{1}}, \dots, X_{\bar{n}}\}$ (approximately a deterministic function of each $X_{\bar{i}}$), and strongly redundant if it's a redund over $\{\{X_1\}, \dots, \{X_n\}\}$. If $\Lambda$ is a redund over $\alpha$, our intuitive desiderata for natural latents correspond to $\alpha$ containing more subsets (more redundancy), and each subset being small (less "synergy"). Combining this with the mediation condition can probably give us a notion of pareto-optimality for natural latents.
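To make the lattice concrete, here is a small brute-force sketch (purely illustrative; the representation of lattice elements as antichains of index sets is the standard construction, and the function name is made up):

```python
# Enumerate the redundancy lattice: its elements are antichains (under
# inclusion) of nonempty subsets of the variable indices {1, ..., n}.
# Brute force; fine for n <= 3.
from itertools import combinations

def redundancy_lattice(n):
    idx = range(1, n + 1)
    sources = [frozenset(c) for r in range(1, n + 1) for c in combinations(idx, r)]
    elements = []
    for r in range(1, len(sources) + 1):
        for coll in combinations(sources, r):
            # keep only antichains: no source contains another
            if all(not (a < b or b < a) for a, b in combinations(coll, 2)):
                elements.append(coll)
    return elements

for alpha in redundancy_lattice(2):
    print([sorted(s) for s in alpha])   # [[1]], [[2]], [[1, 2]], [[1], [2]]
print(len(redundancy_lattice(3)))       # 18 elements for three variables
```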
Another thing we could do: when we construct pareto-optimal natural latents over $X_1, \dots, X_n$, we add them to the original set of variables to augment the redundancy lattice, so that new natural latents can be approximately deterministic functions of (collections of) existing natural latents. This naturally allows us to represent the "hierarchical nature of abstractions", where lower-level abstractions make it easier to compute higher-level ones.
A concrete setting where this can be useful is where a bunch of agents receive different but partially overlapping sets of observations and aim to predict partially overlapping domains. Having a fine-grained collection of natural latents redundant across different elements of the redundancy lattice means we get to easily zoom in on the smaller subset of latent variables that's (maximally) redundantly represented by all of the agents (& be able to tell which domains of predictions these latents actually mediate).
Our posts on natural latents have involved two distinct definitions, which we call "stochastic" and "deterministic" natural latents. We conjectured that, whenever there exists a stochastic natural latent (to within some approximation), there also exists a deterministic natural latent (to within a comparable approximation). Four months ago, we put up a bounty to prove this conjecture.
We've been bottlenecked pretty hard on this problem, and spent most of the last four months attacking it. At long last, we have a proof. As hoped, the proof comes with some qualitative new insights about natural latents, and we expect it will unbottleneck a bunch of future work. The main purpose of this post is to present the proof.
This post officially closes the corresponding bounty.
(This section is mostly copied from the bounty post.)
In the exact case, in order for a natural latent to exist over random variables $X_1, X_2$, the distribution $P[X_1, X_2]$ has to look roughly like a block-diagonal table (after permuting the values of each variable):
Each value of $X_1$ and each value of $X_2$ occurs in only one "block", and within the "blocks", $X_1$ and $X_2$ are independent. In that case, we can take the (exact) natural latent to be a block label.
Notably, that block label is a deterministic function of X.
However, we can also construct other natural latents for this system: we simply append some independent random noise to the block label. That natural latent is not a deterministic function of X; it's a "stochastic" natural latent.
In the exact case, if a stochastic natural latent exists, then the distribution must have the form pictured above, and therefore the block label is a deterministic natural latent. In other words: in the exact case, if a stochastic natural latent exists, then a deterministic natural latent also exists.
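A tiny concrete instance of that picture (illustrative numbers only, not from the original post): take $X_1, X_2$ with four values each and two blocks,
$$P[X_1, X_2] \;=\; \begin{pmatrix} 0.125 & 0.125 & 0 & 0 \\ 0.125 & 0.125 & 0 & 0 \\ 0 & 0 & 0.18 & 0.12 \\ 0 & 0 & 0.12 & 0.08 \end{pmatrix}, \qquad \Lambda(X) \;=\; \begin{cases} A & X_1 \in \{1,2\} \\ B & X_1 \in \{3,4\} \end{cases}$$
Within each block $X_1$ and $X_2$ are independent (each block is a rank-one table), so the block label $\Lambda$ mediates between them; and $\Lambda$ is a deterministic function of $X_1$ alone and of $X_2$ alone, so it is redundantly encoded.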
Our goal here is to prove that this still holds in the approximate case, using the same information theoretic approximation methods used in our other posts on natural latents (and explained here).
Stochastic natural latents were introduced in the original Natural Latents post. Any latent $\Lambda$ over random variables $X_1, X_2$ is defined to be a stochastic natural latent when it satisfies these diagrams: the mediation diagram $X_1 \leftarrow \Lambda \rightarrow X_2$, and the two redundancy diagrams $X_1 \rightarrow X_2 \rightarrow \Lambda$ and $X_2 \rightarrow X_1 \rightarrow \Lambda$.
... and is an approximate stochastic natural latent (with error $\epsilon$) when it satisfies the approximate versions of those diagrams to within $\epsilon$, i.e. $I(X_1; X_2 \mid \Lambda) \le \epsilon$ (mediation), $I(\Lambda; X_1 \mid X_2) \le \epsilon$, and $I(\Lambda; X_2 \mid X_1) \le \epsilon$ (redundancy).
Key thing to note: if $\Lambda$ satisfies these conditions, then we can create another stochastic natural latent by simply appending some random noise to $\Lambda$, independent of $X$. This shows that $\Lambda$ can, in general, contain arbitrary amounts of irrelevant noise while still satisfying the stochastic natural latent conditions.
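To spell that out in the conditional mutual information form above (a quick check): if $N$ is noise independent of $(X_1, X_2, \Lambda)$, then
$$\begin{aligned} I(X_1; X_2 \mid \Lambda, N) &= I(X_1; X_2 \mid \Lambda) \\ I((\Lambda, N); X_1 \mid X_2) &= I(\Lambda; X_1 \mid X_2) + I(N; X_1 \mid X_2, \Lambda) = I(\Lambda; X_1 \mid X_2) \end{aligned}$$
and likewise for the other redundancy condition, so $(\Lambda, N)$ has exactly the same errors as $\Lambda$.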
Deterministic natural latents were introduced in a post by the same name. Any latent $\Lambda$ over random variables $X_1, X_2$ is defined to be a deterministic natural latent when it satisfies the diagrams given in that post (in which $\Lambda$ appears multiple times)...
... and is an approximate deterministic natural latent (with error $\epsilon$) when it satisfies the approximate versions of those diagrams to within $\epsilon$ - i.e., for two variables, the entropy bounds $H(\Lambda \mid X_1) \le \epsilon$ and $H(\Lambda \mid X_2) \le \epsilon$, together with the mediation condition.
See the linked post for explanation of a variable appearing multiple times in a diagram, and how the approximation conditions for those diagrams simplify to entropy bounds.
Note that the deterministic natural latent conditions, either with or without approximation, imply the stochastic natural latent conditions; a deterministic natural latent is also a stochastic natural latent.
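Spelled out in the approximate forms above (a one-line check):
$$\begin{aligned} I(\Lambda; X_1 \mid X_2) &= H(\Lambda \mid X_2) - H(\Lambda \mid X_1, X_2) \;\le\; H(\Lambda \mid X_2) \;\le\; \epsilon \\ I(\Lambda; X_2 \mid X_1) &= H(\Lambda \mid X_1) - H(\Lambda \mid X_1, X_2) \;\le\; H(\Lambda \mid X_1) \;\le\; \epsilon \end{aligned}$$
while the mediation condition is shared by both definitions.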
We'd like a proof that, if a stochastic natural latent exists over two variables to within approximation $\epsilon$, then a deterministic natural latent exists over those two variables to within approximation roughly comparable to $\epsilon$ - specifically, $9\epsilon$.
There are two key ideas to the proof.
The first key idea is to use resampling to obtain a latent which satisfies one of the natural latent conditions exactly, and the others approximately.
The second key idea is to consider pareto optimal stochastic natural latents - i.e. latents with pareto minimal error on the three natural latent conditions.
It turns out that stochastic natural latents which exactly satisfy one of the natural latent conditions and are pareto optimal work like the exact case, even when no exact natural latent exists.
Specifically: pareto optimal stochastic natural latents over $X_1, X_2$ which satisfy one redundancy condition exactly can always be coarse grained into a deterministic natural latent - i.e. there is some function $g$ such that $g(\Lambda)$ satisfies the deterministic natural latent conditions.
So, $g(\Lambda)$ is itself a natural latent with the same errors as $\Lambda$, and it's exactly a deterministic function of $X$.
This was a big and very welcome surprise to us!
We will assume full support, $P[X_1, X_2, \Lambda] > 0$, for the original stochastic natural latent $\Lambda$. This implies that there are no nontrivial exact natural latents.[1] We make this assumption mainly for mathematical convenience; because we're interested in practical approximation it is a reasonable assumption to use. But we do expect the assumption can be dropped, at the cost of more details to handle in the proof.
The main preconditions for our proof are that three random variables $X_1$, $X_2$, $\Lambda$ approximately satisfy the three natural latent conditions (the three diagrams above) - or, written out (and simplified a little):
First redundancy condition: $I(\Lambda; X_1 \mid X_2) \le \epsilon$
Second redundancy condition: $I(\Lambda; X_2 \mid X_1) \le \epsilon$
Mediation condition: $I(X_1; X_2 \mid \Lambda) \le \epsilon$
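As a purely illustrative numerical sketch (nothing here is load-bearing for the proof; the function names and random toy distribution are made up), these three errors can be computed directly from a joint distribution over $(X_1, X_2, \Lambda)$:

```python
# Compute the three natural latent errors from a joint distribution
# P[X1, X2, Lambda], given as a numpy array of shape (|X1|, |X2|, |Lambda|).
import numpy as np

def cond_mi(P, a, b, c):
    """I(axis a ; axis b | axis c) in nats, for a 3-way joint P."""
    P = np.moveaxis(P, (a, b, c), (0, 1, 2))
    P_c = P.sum(axis=(0, 1))    # P[C]
    P_ac = P.sum(axis=1)        # P[A, C]
    P_bc = P.sum(axis=0)        # P[B, C]
    i, j, k = np.nonzero(P)
    return float(np.sum(P[i, j, k] * np.log(
        P[i, j, k] * P_c[k] / (P_ac[i, k] * P_bc[j, k]))))

def natural_latent_errors(P):
    return {
        "first redundancy  I(L;X1|X2)": cond_mi(P, 2, 0, 1),
        "second redundancy I(L;X2|X1)": cond_mi(P, 2, 1, 0),
        "mediation         I(X1;X2|L)": cond_mi(P, 0, 1, 2),
    }

P = np.random.dirichlet(np.ones(3 * 3 * 2)).reshape(3, 3, 2)  # toy joint
print(natural_latent_errors(P))
```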
A previous post showed that resampling conserves redundancy. Specifically, we can construct a new latent $\Lambda'$ by sampling from $P[\Lambda \mid X_1]$; the new joint distribution is then $P[\Lambda' = \lambda, X_1, X_2] = P[X_1, X_2]\, P[\Lambda = \lambda \mid X_1]$.
Given the two redundancy conditions $I(\Lambda; X_1 \mid X_2) \le \epsilon$ and $I(\Lambda; X_2 \mid X_1) \le \epsilon$, the new latent $\Lambda'$ satisfies the second redundancy condition perfectly (by construction), and satisfies the first redundancy condition to within the bound proven in that post. (Leveraging all three natural latent conditions, empirical tests also strongly suggest that that bound can be improved, but that is not yet proven.)
Now, imagine that instead of constructing $\Lambda'$ by sampling from $P[\Lambda \mid X_1]$ (keeping $X_1, X_2$ fixed), we instead construct a new $X_2'$ by sampling from $P[X_2 \mid X_1]$ (keeping $X_1$ and $\Lambda$ fixed). Key insight: this results in exactly the same joint distribution over the triple (first observable, second observable, latent) as sampling from $P[\Lambda \mid X_1]$, i.e. both constructions yield the joint $P[X_1]\,P[X_2 \mid X_1]\,P[\Lambda \mid X_1]$.
By the same "resampling conserves redundancy" theorem, the $X_2$-resampling construction satisfies the mediation condition to within the same bound (rather than the first redundancy condition). But since the $\Lambda$-resampling construction and the $X_2$-resampling construction yield the same joint distribution, with one of them implying the first redundancy condition is satisfied to within the bound and the other implying the mediation condition is satisfied to within the bound, that same joint distribution must satisfy both conditions to within that bound.
Putting that all together: starting from a latent $\Lambda$ over $X_1, X_2$ which satisfies all three natural latent conditions to within $\epsilon$, we can construct a new latent $\Lambda'$ which satisfies the second redundancy condition perfectly, and satisfies the other two conditions to within the resampling bound.
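Here is a minimal sketch of the resampling construction itself (illustrative only; the toy distribution is random), including a check of the exact second redundancy property:

```python
# Given a joint P[X1, X2, L] (numpy array), build the joint for the resampled
# latent L' ~ P[L | X1], i.e. P'[x1, x2, l] = P[x1, x2] * P[l | x1].
import numpy as np

def resample_latent_given_x1(P):
    P_x1x2 = P.sum(axis=2)                        # P[X1, X2]
    P_x1 = P.sum(axis=(1, 2))                     # P[X1]
    P_l_given_x1 = P.sum(axis=1) / P_x1[:, None]  # P[L | X1]
    return P_x1x2[:, :, None] * P_l_given_x1[:, None, :]

P = np.random.dirichlet(np.ones(3 * 3 * 2)).reshape(3, 3, 2)
P_new = resample_latent_given_x1(P)
# Second redundancy holds exactly: P'[L | X1, X2] does not depend on X2.
P_l_given_x1x2 = P_new / P_new.sum(axis=2, keepdims=True)
assert np.allclose(P_l_given_x1x2, P_l_given_x1x2.mean(axis=1, keepdims=True))
```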
We'll use that new latent $\Lambda'$ as our starting point for the second half of the proof, in which we look for pareto improvements upon $\Lambda'$.
In the second half of the proof, we'll consider taking pareto improvements upon $\Lambda'$, i.e. looking for a latent with pareto-optimal error on the three natural latent conditions. Since we're pareto improving on $\Lambda'$, the resulting latent will have zero error on the second redundancy condition (i.e. it will satisfy that condition exactly), and at most $\Lambda'$'s error on the other two conditions.
First, we convert our pareto minimization problem into a single objective minimization problem in the standard way, in order to use the standard optimization toolset.
A latent $\Lambda$ is defined by the conditional distribution $P[\Lambda \mid X]$. We want to characterize latents for which the errors on the three natural latent conditions are pareto minimal, holding $P[X]$ constant. The three errors are:
- First redundancy error: $\epsilon_1 = \mathbb{E}\big[\log P[\Lambda \mid X] - \log P[\Lambda \mid X_2]\big]$
- Second redundancy error: $\epsilon_2 = \mathbb{E}\big[\log P[\Lambda \mid X] - \log P[\Lambda \mid X_1]\big]$
- Mediation error: $\epsilon_0 = \mathbb{E}\big[\log P[X \mid \Lambda] - \log P[X_1 \mid \Lambda] - \log P[X_2 \mid \Lambda]\big]$
(all expectations taken under the joint distribution of $X$ and $\Lambda$)
(Note that we've written these slightly differently from the previous section. They are equivalent, and these expressions will save some minor rearrangement in the proof.)
To use the usual optimization toolset, we convert the pareto minimization problem into a single objective minimization problem by assigning a weight $\alpha_j$ to each error. Our single objective is $\alpha_0 \epsilon_0 + \alpha_1 \epsilon_1 + \alpha_2 \epsilon_2$.
Any pareto minimum for the original problem must be a minimum of $\sum_j \alpha_j \epsilon_j$ for some $\alpha$ with $\alpha_j \ge 0$ for all indices $j$ (including $j = 0$, the mediation error, which is an ordinary index here). Without loss of generality, we assume the weights are normalized, $\sum_j \alpha_j = 1$. In general, different $\alpha$'s in the single objective problem pick out different pareto minima in the original problem.
Now we turn the crank.
We (implicitly so far) have two constraints on our optimization problem:
- Normalization: $\sum_\lambda P[\Lambda = \lambda \mid X] = 1$ for each value of $X$
- Nonnegativity: $P[\Lambda = \lambda \mid X] \ge 0$ for each $\lambda, X$
We introduce Lagrange multipliers $\mu_X$ and $\eta_{\lambda X} \ge 0$, respectively, for these two constraints. That gives the Lagrangian
$$\mathcal{L} = \alpha_0 \epsilon_0 + \alpha_1 \epsilon_1 + \alpha_2 \epsilon_2 + \sum_X \mu_X \Big(\sum_\lambda P[\lambda \mid X] - 1\Big) - \sum_{\lambda, X} \eta_{\lambda X}\, P[\lambda \mid X]$$
Differentiating with respect to $P[\lambda \mid X]$ (at constant $P[X]$), simplifying, and absorbing some terms into the Lagrange multipliers yields our first order condition:
$$\alpha_0 \log\frac{P[X \mid \lambda]}{P[X_1 \mid \lambda]\,P[X_2 \mid \lambda]} + \alpha_1 \log\frac{P[\lambda \mid X]}{P[\lambda \mid X_2]} + \alpha_2 \log\frac{P[\lambda \mid X]}{P[\lambda \mid X_1]} + \mu_X - \eta_{\lambda X} = 0$$
where the Lagrange multiplier $\mu_X$ has absorbed some terms which depend only on $X$.
Note that, while the $\eta_{\lambda X}$ term looks completely arbitrary, it is constrained by complementary slackness: $\eta_{\lambda X}$ can be nonzero only if $P[\lambda \mid X]$ is zero, i.e. $\eta_{\lambda X}\, P[\lambda \mid X] = 0$ for all $\lambda, X$.
Earlier, we established that a latent $\Lambda'$ exists which satisfies the second redundancy condition perfectly and has error at most the resampling bound on the other two conditions. So, let's use our first order condition to characterize specifically those pareto optimal latents (generically denoted $\Lambda$ below) which perfectly satisfy the second redundancy condition.
Perfect satisfaction of the second redundancy condition means $P[\lambda \mid X] = P[\lambda \mid X_1]$ for all $\lambda, X$. Substituting that into the first order condition and simplifying gives
$$\alpha_0 \log\frac{P[X_2 \mid X_1]}{P[X_2 \mid \lambda]} + \alpha_1 \log\frac{P[\lambda \mid X_1]}{P[\lambda \mid X_2]} + \mu_X - \eta_{\lambda X} = 0$$
Now, pick values $\lambda^1, \lambda^2, X$ such that $P[\lambda^1 \mid X] \ne 0$ and $P[\lambda^2 \mid X] \ne 0$. Then $\eta_{\lambda^1 X} = \eta_{\lambda^2 X} = 0$ by complementary slackness, and we can subtract the first order conditions at $(\lambda^1, X)$ and $(\lambda^2, X)$ to get
$$\alpha_1 \log\frac{P[\lambda^1 \mid X_1]}{P[\lambda^2 \mid X_1]} + \left(\alpha_0 \log\frac{P[X_2 \mid \lambda^2]}{P[X_2 \mid \lambda^1]} + \alpha_1 \log\frac{P[\lambda^2 \mid X_2]}{P[\lambda^1 \mid X_2]}\right) = 0$$
Note that one of those terms depends on $X_1$ (but not $X_2$), and the other depends on $X_2$ (but not $X_1$), so the only way they can add to 0 for all $X$ values for which $P[\lambda^1 \mid X] \ne 0$ and $P[\lambda^2 \mid X] \ne 0$ is if both are equal (up to sign) to some constant $c_{\lambda^1 \lambda^2}$ which does not depend on $X$:
$$\alpha_1 \log\frac{P[\lambda^1 \mid X_1]}{P[\lambda^2 \mid X_1]} = c_{\lambda^1 \lambda^2}, \qquad \alpha_0 \log\frac{P[X_2 \mid \lambda^2]}{P[X_2 \mid \lambda^1]} + \alpha_1 \log\frac{P[\lambda^2 \mid X_2]}{P[\lambda^1 \mid X_2]} = -c_{\lambda^1 \lambda^2}$$
Both of those equations must hold for all $X$ such that $P[\lambda^1 \mid X] \ne 0$ and $P[\lambda^2 \mid X] \ne 0$.
Notably, our assumption of full support[2] implies $P[X_1, X_2] \ne 0$ for all $X_1, X_2$. Combined with $P[\lambda \mid X_1, X_2] = P[\lambda \mid X_1]$ (the perfect second redundancy condition), that means the support of $P[\lambda \mid X_1, X_2]$ does not depend on $X_2$: for all $\lambda$ and $X_1$, if $P[\lambda \mid X_1, X_2] \ne 0$ for some $X_2$, then $P[\lambda \mid X_1, X_2] \ne 0$ for every $X_2$. So, the $X_2$-dependent equation must hold for all $X_2$, whenever the supports of $P[X \mid \lambda^1]$ and $P[X \mid \lambda^2]$ overlap at all.
Shuffling terms around (using Bayes' rule to rewrite $P[\lambda \mid X_2]$ in terms of $P[X_2 \mid \lambda]$), each equation says that a likelihood ratio between $\lambda^1$ and $\lambda^2$ is constant. Sum over values on both sides of that constant-ratio relation, and the constant is forced to equal 1, implying the ratios are all 1 and therefore the two values yield the same conditional distributions.
In short: given two values $\lambda^1, \lambda^2$, if there exists any $X$ such that $P[\lambda^1 \mid X] \ne 0$ and $P[\lambda^2 \mid X] \ne 0$ (i.e. the support of $P[X \mid \Lambda]$ overlaps for the two values), then the two values yield exactly the same distribution: $P[X \mid \Lambda = \lambda^1] = P[X \mid \Lambda = \lambda^2]$.
Furthermore, since $P[X \mid \lambda^1] = P[X \mid \lambda^2]$ whenever $\lambda^1$ and $\lambda^2$ have overlapping support, we also have
$$\frac{P[\lambda^1 \mid X]}{P[\lambda^1]} = \frac{P[\lambda^2 \mid X]}{P[\lambda^2]}$$
anywhere that both of those quantities are nonzero.
A quick recap of where that last section leaves us. We've established that:
- Starting from any stochastic natural latent $\Lambda$ with error at most $\epsilon$ on each condition, there exists a latent (the resampled latent $\Lambda'$, or any pareto improvement on it) which satisfies the second redundancy condition exactly and the other two conditions to within the resampling bound.
- For any pareto optimal latent which satisfies the second redundancy condition exactly: whenever two of its values $\lambda^1, \lambda^2$ have overlapping support over $X$, they yield exactly the same conditional distribution $P[X \mid \lambda]$.
Now, assume the supports of $P[X \mid \lambda^1]$ and $P[X \mid \lambda^2]$ overlap somewhere, and consider coarse graining those two values of $\Lambda$ together. Compared to $\Lambda$ itself, how does the coarse grained variable score on each of the natural latent conditions?
- Mediation: the two merged values have identical $P[X \mid \lambda]$, so the conditional distribution of $X$ given the coarse grained variable is unchanged, and the mediation error is exactly unchanged.
- Redundancy: the coarse grained variable is a deterministic function of $\Lambda$, so (by the data processing inequality, applied conditionally) neither redundancy error can increase; in particular the second redundancy condition remains exactly satisfied.
So, without making the errors on any of the three natural latent conditions any worse, we can coarse grain together all values whose supports over $X$ overlap somewhere.
Once all such coarse graining is performed, we have a new coarse grained latent $g(\Lambda)$ for which the support of $P[X \mid g(\Lambda)]$ is nonoverlapping for all pairs of values.
In other words: $g(\Lambda)$ is exactly a deterministic function of $X_1$ (nonoverlapping supports make it a deterministic function of $X$, and the exact second redundancy condition then makes it a function of $X_1$ alone, so it still perfectly satisfies the second redundancy condition), and it satisfies the first redundancy condition and mediation condition each to within the bounds carried over from the resampling step.
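To make the coarse graining concrete, here is an illustrative sketch (not part of the proof; the toy distribution and function name are made up) which merges latent values whose supports over $X$ overlap:

```python
# Merge latent values whose supports over (X1, X2) overlap, via union-find,
# given a joint distribution P[X1, X2, L] as a numpy array.
import numpy as np

def coarse_grain_overlapping_support(P, tol=0.0):
    n_l = P.shape[2]
    parent = list(range(n_l))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    support = [P[:, :, l] > tol for l in range(n_l)]
    for i in range(n_l):
        for j in range(i + 1, n_l):
            if np.any(support[i] & support[j]):  # supports overlap somewhere
                parent[find(i)] = find(j)        # merge the two latent values
    groups = sorted({find(l) for l in range(n_l)})
    # New joint over (X1, X2, g(L)): sum probability within each merged group.
    P_g = np.stack(
        [sum(P[:, :, l] for l in range(n_l) if find(l) == g) for g in groups],
        axis=2)
    return P_g, [groups.index(find(l)) for l in range(n_l)]

# Toy joint: latent values 0 and 1 live on the "low" block of X, values 2 and 3
# on the "high" block, so coarse graining merges them into two groups.
P = np.zeros((4, 4, 4))
P[:2, :2, 0] = 0.100
P[:2, :2, 1] = 0.025
P[2:, 2:, 2] = 0.050
P[2:, 2:, 3] = 0.075
P_g, g_of_l = coarse_grain_overlapping_support(P)
print("g maps latent values 0..3 to:", g_of_l)        # e.g. [0, 0, 1, 1]
print("coarse grained latent has", P_g.shape[2], "values")
```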
Lastly, note that $H(g(\Lambda) \mid X_1) = 0$, together with the first redundancy condition holding to within its bound, implies that $H(g(\Lambda) \mid X_2)$ is at most that same bound. Combined with the mediation condition, that implies $g(\Lambda)$ is approximately a deterministic natural latent, with errors at most: zero for $H(g(\Lambda) \mid X_1)$, the first redundancy bound for $H(g(\Lambda) \mid X_2)$, and the mediation bound for $I(X_1; X_2 \mid g(\Lambda))$.
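Unpacking that entropy step (a standard identity), using the fact that $g(\Lambda)$ is exactly a deterministic function of $X_1$:
$$\begin{aligned} H(g(\Lambda) \mid X_2) &= I(g(\Lambda); X_1 \mid X_2) + H(g(\Lambda) \mid X_1, X_2) \\ &= I(g(\Lambda); X_1 \mid X_2) + 0 \;\le\; \text{(first redundancy error of } g(\Lambda)\text{)} \end{aligned}$$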
The main room for improvement of the bounds in this proof is in the resampling step. The resampling conserves redundancy post notes where those bounds could be improved, and presents a little empirical evidence for tighter constants, both when leveraging all three natural latent conditions and when using only the redundancy conditions.
We've been bottlenecked pretty hard on this theorem for the past 3-4 months.
Now that we finally have it, we expect to largely abandon stochastic natural latents in favor of deterministic natural latents. For instance, one immediate next step will be to rewrite our Iliad paper from last year to work with deterministic natural latents, which will eliminate the weakest parts of that paper and give a much more compelling case. (No, we're not linking to the old paper, because the new one is going to be a lot better.)
On another front: stochastic natural latents are relatively easy to test for in datasets, by looking for three variables each of which mediates between the other two. Now we have some idea of what to do with those triples when we find them: compute the deterministic constraint between them.
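A rough sketch of what that search could look like in practice (illustrative only: the plug-in entropy estimator and the threshold are placeholders, and real datasets would want bias-corrected estimates):

```python
# For every triple of discrete columns, check whether each variable
# approximately mediates between the other two, i.e. the conditional mutual
# information I(A; B | C) is small for all three choices of mediator C.
from collections import Counter
from itertools import combinations
from math import log

def H(samples):
    """Empirical entropy (nats) of a list of hashable outcomes."""
    n = len(samples)
    return -sum(c / n * log(c / n) for c in Counter(samples).values())

def cond_mi(a, b, c):
    """Empirical I(A; B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (H(list(zip(a, c))) + H(list(zip(b, c)))
            - H(list(zip(a, b, c))) - H(c))

def mediating_triples(columns, threshold=0.05):
    """columns: dict mapping name -> list of discrete values (equal lengths)."""
    hits = []
    for x, y, z in combinations(columns, 3):
        errs = {z: cond_mi(columns[x], columns[y], columns[z]),
                y: cond_mi(columns[x], columns[z], columns[y]),
                x: cond_mi(columns[y], columns[z], columns[x])}
        if all(e < threshold for e in errs.values()):
            hits.append((x, y, z, errs))
    return hits

# Tiny demo: three noiseless copies of one bit form a mediating triple.
cols = {"a": [0, 0, 1, 1], "b": [0, 0, 1, 1], "c": [0, 0, 1, 1]}
print(mediating_triples(cols))
```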
Beyond those two immediate projects, we expect this result to be foundational for basically all of our work on natural latents going forward.