Second content post in a planned cluster on exact results for natural latents.
See the introduction and the previous post. In this post, I share some key theoretical results that this framing generates. This post has more math in it than the last one, most of which I've banished to collapsible sections.[1]
I assume you've read the previous posts, but I tried to make the pedagogical arc make sense even if you start with this one. It helps if you have some familiarity with natural latents, canonical correlation analysis[2], and information theory.
Consider an Elephant in a Room
Say Alice and Bob have access to two different camera feeds observing the same room.
Alice and Bob watch the same room through cameras in opposite corners, but they can only see their own feeds: $X$ for Alice and $Y$ for Bob. Their feeds are correlated because it's the same room, but neither shows what the other shows. A natural latent over the two feeds is a variable $\Lambda$ that is
redundant: each of them can pin it down from their own feed alone, and
mediating: given $\Lambda$, the feeds carry no further information about each other; it explains all of their agreement.
"What's in the room?" is the intended example.[3] The last post showed the two conditions correspond to the two classical notions of common information (Gács–Körner and Wyner), that an exact natural latent exists iff $C_{GK}(X;Y) = I(X;Y) = C_W(X;Y)$, and that this collapse is destroyed by one percent of noise. Exact natural latents are measure-zero objects.
So the natural latents framework runs on approximation, and each condition above gets an associated error term. The mediation error $I(X;Y\mid\Lambda)$ is the agreement $\Lambda$ fails to explain. The redundancy error $I(\Lambda;Y\mid X)$ is what Bob's feed still teaches you about the concept after you've seen Alice's: the part of $\Lambda$ stuck on Bob's side, unreachable from Alice's feed alone.
Let's call that stuck part the remainder. (Symmetrically for Bob.)
Then, an obvious question to ask: how small can the two errors be, together? How natural can natural latents be?
As I noted in the introductory post, this question is hard because the core objects have no generic closed form. This post answers the question in the jointly Gaussian case, where everything does have a closed form.
Exact Natural Latents
Claim. For jointly Gaussian views $X$, $Y$ with correlation $|\rho| < 1$, any latent $\Lambda$ with zero redundancy error from each view is independent of the entire system. Suppose $\Lambda$ is exactly recoverable from each feed alone. Then the best estimate of $\Lambda$ from all the data equals the best estimate from Alice's feed by itself, and also from Bob's by itself. So one quantity is simultaneously a function of $X$ and a function of $Y$.
If it varied at all, we'd have a feature of Alice's feed and a feature of Bob's that agree with correlation exactly 1, which two views with $|\rho| < 1$ do not possess (due to Witsenhausen, again). So it is a constant: a degenerate case which isn't useful for our purposes. I.e., non-degenerate exact natural latents do not exist.[4]
Top: No informative function can exactly agree, Bottom: Guaranteed agreement carries no information.
A Distribution-free Sum Rule
The two errors are connected by an identity we can obtain with some simple algebra. The derivation uses the chain rule for mutual information, see below.
Derivation of the Sum Rule
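The chain rule for mutual information is
$$I(A,B;C) = I(A;C) + I(B;C\mid A),$$
which says: what the pair $(A,B)$ tells you about $C$ is what $A$ tells you, plus what $B$ adds once you already have $A$. We use it three times, grouping differently each time. Throughout, $X$ is Alice's feed, $Y$ is Bob's feed, and $\Lambda$ is the latent.
Bob's remainder. Expand what the two feeds together say about $\Lambda$, taking Bob's feed first:
$$I(\Lambda;X,Y) = I(\Lambda;Y) + I(\Lambda;X\mid Y).$$
Rearranged: $I(\Lambda;X\mid Y) = I(\Lambda;X,Y) - I(\Lambda;Y)$. Bob's remainder is the total information in the concept, minus the part his feed reaches on its own.
Alice's remainder. Now expand what the pair (concept + Alice's feed) says about Bob's feed, in two different orders:
$$I(\Lambda,X;Y) = I(X;Y) + I(\Lambda;Y\mid X) = I(\Lambda;Y) + I(X;Y\mid\Lambda).$$
The middle expression takes Alice's feed first: her feed predicts Bob's by $I(X;Y)$, and then the concept adds exactly her remainder (the part of $\Lambda$ her feed didn't already contain, which is relevant to Bob's). The right expression takes the concept first: the concept predicts Bob's feed by $I(\Lambda;Y)$, and then Alice's feed adds exactly the agreement the concept failed to explain. They are the same quantity, so:
$$I(\Lambda;Y\mid X) = I(\Lambda;Y) + I(X;Y\mid\Lambda) - I(X;Y).$$
Now adding, the $I(\Lambda;Y)$ terms cancel (nice), to give:
$$I(\Lambda;X\mid Y) + I(\Lambda;Y\mid X) = I(\Lambda;X,Y) - I(X;Y) + I(X;Y\mid\Lambda).$$
So, for any latent over any pair of views, with any[5] distribution:
$$\underbrace{I(\Lambda;X\mid Y) + I(\Lambda;Y\mid X)}_{\text{total remainder}} = \underbrace{I(\Lambda;X,Y) - I(X;Y)}_{\text{complexity beyond the shared information}} + \underbrace{I(X;Y\mid\Lambda)}_{\text{mediation error}}.$$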
The interpretation: every bit of the concept that can't be read off one feed alone shows up as either bits the concept carries beyond the shared information, or bits of agreement it leaves unexplained.
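The identity is easy to sanity-check numerically. The sketch below is a minimal check, not the post's script: it draws an arbitrary random joint distribution over small discrete alphabets (the alphabet sizes, the seed, and the helper names `entropy`, `cond_mi`, `random_joint` are all illustrative choices) and compares the total remainder $I(\Lambda;X\mid Y) + I(\Lambda;Y\mid X)$ against $I(\Lambda;X,Y) - I(X;Y) + I(X;Y\mid\Lambda)$, writing $X$, $Y$ for the two feeds and $\Lambda$ for the latent.

```python
import itertools
import math
import random

def entropy(p, keep):
    """Entropy in bits of the marginal of the joint pmf `p` on the coordinates in `keep`."""
    marg = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in keep)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

def cond_mi(p, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C); a, b, c are tuples of coordinates."""
    return (entropy(p, a + c) + entropy(p, b + c)
            - entropy(p, a + b + c) - entropy(p, c))

def random_joint(sizes, rng):
    """A random strictly positive pmf over the product of the given alphabet sizes."""
    outcomes = list(itertools.product(*(range(n) for n in sizes)))
    weights = [rng.random() + 1e-3 for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}

# Coordinates: 0 = X (Alice's feed), 1 = Y (Bob's feed), 2 = Lambda (the latent).
X, Y, L = (0,), (1,), (2,)
p = random_joint((3, 4, 2), random.Random(0))

total_remainder = cond_mi(p, L, X, Y) + cond_mi(p, L, Y, X)
excess_complexity = cond_mi(p, L, X + Y) - cond_mi(p, X, Y)
mediation_error = cond_mi(p, X, Y, L)

print(total_remainder, excess_complexity + mediation_error)
```

The two printed numbers should agree to floating-point precision for any seed and any alphabet sizes, since nothing in the identity depends on the distribution.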
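This already gives one part of the answer we're looking for. If we demand zero mediation error, then the total remainder is $I(\Lambda;X,Y) - I(X;Y)$, minimized by the simplest exact mediator. But the complexity of the simplest exact mediator is the Wyner common information $C_W(X;Y)$, and for Gaussians with correlation $\rho \ge 0$ we have $C_W(X;Y) = \tfrac12\log_2\frac{1+\rho}{1-\rho}$ and $I(X;Y) = -\tfrac12\log_2(1-\rho^2)$, so:
$$I(\Lambda;X\mid Y) + I(\Lambda;Y\mid X) \;\ge\; C_W(X;Y) - I(X;Y) = \log_2(1+\rho).$$
The views can be almost identical, and the best fully-explanatory concept still can't be pinned down from one of them to better than half a bit: the larger of the two remainders is at least $\tfrac12\log_2(1+\rho)$, which approaches $\tfrac12$ as $\rho \to 1$.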
The Exact Tradeoff Curve
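Fix a mediation budget $\varepsilon$: the agreement you'll tolerate leaving unexplained, from 0 (explain everything) to $I(X;Y)$ (explain nothing).
Then, for each $\varepsilon$, there is a smallest achievable per-view redundancy error:
$$r^*(\varepsilon) = \tfrac12\log_2\frac{1+\rho}{1+\rho_\varepsilon},$$
where $\rho_\varepsilon = \sqrt{1 - 2^{-2\varepsilon}}$ is the correlation whose mutual information is exactly $\varepsilon$ bits. The floor is achieved by an explicit symmetric Gaussian latent, as below.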
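To get a feel for the curve, here is a minimal sketch that tabulates it. It assumes the floor has the form $\tfrac12\log_2\frac{1+\rho}{1+\rho_\varepsilon}$ with $\rho_\varepsilon = \sqrt{1-2^{-2\varepsilon}}$, which is what the sum rule combined with the Gaussian relaxed Wyner closed form gives. The function names and the value $\rho = 0.8$ are illustrative choices, not taken from the post's script.

```python
import math

def mutual_info_bits(rho):
    """I(X;Y) in bits for a unit-variance jointly Gaussian pair with correlation rho."""
    return -0.5 * math.log2(1.0 - rho * rho)

def rho_of_bits(eps):
    """Correlation whose Gaussian mutual information is exactly eps bits (eps >= 0)."""
    return math.sqrt(1.0 - 2.0 ** (-2.0 * eps))

def min_remainder(rho, eps):
    """Per-view redundancy floor at mediation budget eps, clamped at zero past eps = I(X;Y)."""
    return max(0.0, 0.5 * math.log2((1.0 + rho) / (1.0 + rho_of_bits(eps))))

rho = 0.8  # illustrative correlation between the two feeds
budget_max = mutual_info_bits(rho)
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    eps = frac * budget_max
    print(f"eps = {eps:.3f} bits -> remainder floor = {min_remainder(rho, eps):.3f} bits")
```

The left endpoint reproduces the exact-mediation floor $\tfrac12\log_2(1+\rho)$, and the right endpoint is zero: a latent that explains nothing costs nothing.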
Derivation of the Curve
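The latent that sits exactly on the tradeoff curve is the Wyner mediator from the previous post with one change. Take a standard normal $\Lambda$ and build the two feeds as
$$X = a\,\Lambda + \sqrt{1-a^2}\,N_A, \qquad Y = a\,\Lambda + \sqrt{1-a^2}\,N_B$$
for $a \in [0,1]$. Here the private noises $N_A, N_B$ are standard normals independent of $\Lambda$, and are themselves jointly Gaussian with correlation $\rho_\varepsilon$ (rather than independent, as in the exact mediator). Given $\Lambda$, the residual correlation between the feeds is exactly $\rho_\varepsilon$, so the mediation error is exactly $\varepsilon$ bits, by the definition of $\rho_\varepsilon$. To reproduce the total correlation $\rho = a^2 + (1-a^2)\rho_\varepsilon$ of the actual pair, the dial $a$ must be set to
$$a^2 = \frac{\rho - \rho_\varepsilon}{1 - \rho_\varepsilon}.$$
As a sanity check, at $\varepsilon = 0$, we get $a^2 = \rho$, the exact Wyner mediator. At $\varepsilon = I(X;Y)$, we get $a = 0$, the constant latent. The tradeoff curve sweeps between these two points.
Its redundancy errors come out of the sum rule. Everything is jointly Gaussian, so $I(\Lambda;X,Y)$ is a one-line covariance computation, and the sum rule turns it into
$$I(\Lambda;Y\mid X) = I(\Lambda;X\mid Y) = \tfrac12\log_2\frac{1+\rho}{1+\rho_\varepsilon},$$
with the two remainders equal by symmetry.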
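The construction can be checked directly from its covariance matrix. The sketch below assumes the parametrization $X = a\Lambda + \sqrt{1-a^2}\,N_A$, $Y = a\Lambda + \sqrt{1-a^2}\,N_B$ with noise correlation $\rho_\varepsilon = \sqrt{1-2^{-2\varepsilon}}$ and $a^2 = (\rho-\rho_\varepsilon)/(1-\rho_\varepsilon)$. The helper names and the test values $\rho = 0.8$, $\varepsilon = 0.1$ are illustrative, and Gaussian conditional mutual information is computed from log-determinants of covariance sub-blocks.

```python
import math

def det(m):
    """Determinant by cofactor expansion; fine for the tiny matrices used here."""
    if not m:
        return 1.0
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def logdet(cov, idx):
    """log2 of the determinant of the principal submatrix of cov on indices idx."""
    return math.log2(det([[cov[i][j] for j in idx] for i in idx]))

def gaussian_cmi(cov, a, b, c=()):
    """I(A;B|C) in bits for jointly Gaussian blocks, given as tuples of indices."""
    return 0.5 * (logdet(cov, a + c) + logdet(cov, b + c)
                  - logdet(cov, a + b + c) - logdet(cov, c))

def achieving_covariance(rho, eps):
    """Covariance of (Lambda, X, Y) for the symmetric latent at mediation budget eps."""
    rho_eps = math.sqrt(1.0 - 2.0 ** (-2.0 * eps))
    a = math.sqrt((rho - rho_eps) / (1.0 - rho_eps))  # loading of each feed on Lambda
    return [[1.0, a, a],
            [a, 1.0, rho],   # Cov(X, Y) = a^2 + (1 - a^2) * rho_eps = rho
            [a, rho, 1.0]]

LAM, X, Y = (0,), (1,), (2,)
rho, eps = 0.8, 0.1  # illustrative values
cov = achieving_covariance(rho, eps)
rho_eps = math.sqrt(1.0 - 2.0 ** (-2.0 * eps))

print("mediation error  :", gaussian_cmi(cov, X, Y, LAM))
print("Alice's remainder:", gaussian_cmi(cov, LAM, Y, X))
print("Bob's remainder  :", gaussian_cmi(cov, LAM, X, Y))
print("closed form      :", 0.5 * math.log2((1 + rho) / (1 + rho_eps)))
```

The mediation error should come out equal to the chosen budget, and both remainders should match the closed form.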
So this construction achieves the curve, but why does nothing beat it?
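Reading the sum rule backwards: at a fixed mediation budget $\varepsilon$, minimizing the total remainder is the same problem as minimizing the latent's complexity $I(\Lambda;X,Y)$. This is the relaxed Wyner problem (Wyner's question with the exact-mediation constraint loosened to a budget), and for Gaussian pairs it was solved by Sula and Gastpar, also published as Common Information Components Analysis (Entropy, 2021). Taking their result, we have the closed form
$$C_\varepsilon(X;Y) = \min_{\Lambda:\ I(X;Y\mid\Lambda)\le\varepsilon} I(\Lambda;X,Y) = \tfrac12\log_2\frac{(1+\rho)(1-\rho_\varepsilon)}{(1-\rho)(1+\rho_\varepsilon)}.$$
Importantly, this minimum is over any possible latent. Substituting their closed form back through the sum rule gives a floor of $\log_2\frac{1+\rho}{1+\rho_\varepsilon}$ on the total remainder, so the worse of the two errors is at least $\tfrac12\log_2\frac{1+\rho}{1+\rho_\varepsilon}$ for any latent, and the symmetric achiever above meets it with both errors equal.[6]
The derivation follows. Write everything as logs of the four factors $1\pm\rho$, $1\pm\rho_\varepsilon$:
$$\begin{aligned} C_\varepsilon(X;Y) &= \tfrac12\big[\log_2(1+\rho) - \log_2(1-\rho) + \log_2(1-\rho_\varepsilon) - \log_2(1+\rho_\varepsilon)\big],\\ -I(X;Y) &= \tfrac12\big[\log_2(1+\rho) + \log_2(1-\rho)\big],\\ \varepsilon &= -\tfrac12\big[\log_2(1+\rho_\varepsilon) + \log_2(1-\rho_\varepsilon)\big]. \end{aligned}$$
The sum rule at mediation budget $\varepsilon$ says the minimal total remainder is $C_\varepsilon(X;Y) - I(X;Y) + \varepsilon$. Add the three expressions and the factors cancel in pairs: the $(1-\rho)$'s cancel between $C_\varepsilon(X;Y)$ and $I(X;Y)$, the $(1-\rho_\varepsilon)$'s cancel between $C_\varepsilon(X;Y)$ and $\varepsilon$, and we have
$$C_\varepsilon(X;Y) - I(X;Y) + \varepsilon = \log_2(1+\rho) - \log_2(1+\rho_\varepsilon).$$
We rearrange to get the desired expression:
$$r^*(\varepsilon) = \tfrac12\big[C_\varepsilon(X;Y) - I(X;Y) + \varepsilon\big] = \tfrac12\log_2\frac{1+\rho}{1+\rho_\varepsilon}.$$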
The attached script checks the achieving latent against the closed form and runs a random search over valid latents; none crosses the floor set by the curve. That search is the gray cloud in the figure below.
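For readers without the script at hand, here is a small stand-in for that kind of search. It is a sketch under stated assumptions, not the post's script: it samples only scalar Gaussian latents of the hypothetical form $\Lambda = uX + vY + sN$ (with $N$ independent noise), takes $\rho = 0.8$ as an illustrative value, and compares each latent's worse remainder against the floor $\tfrac12\log_2\frac{1+\rho}{1+\rho_m}$ evaluated at that latent's own mediation error $m$.

```python
import math
import random

def det(m):
    """Determinant by cofactor expansion; fine for the tiny matrices used here."""
    if not m:
        return 1.0
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def gaussian_cmi(cov, a, b, c=()):
    """I(A;B|C) in bits for jointly Gaussian blocks, given as tuples of indices."""
    def logdet(idx):
        return math.log2(det([[cov[i][j] for j in idx] for i in idx]))
    return 0.5 * (logdet(a + c) + logdet(b + c) - logdet(a + b + c) - logdet(c))

def floor_per_view(rho, mediation_error):
    """Per-view floor at the given mediation error; zero once the error exceeds I(X;Y)."""
    rho_m = math.sqrt(max(0.0, 1.0 - 2.0 ** (-2.0 * mediation_error)))
    return max(0.0, 0.5 * math.log2((1.0 + rho) / (1.0 + rho_m)))

def random_latent_point(rho, rng):
    """Sample Lambda = u*X + v*Y + s*N and return (mediation error, worse remainder)."""
    u, v = rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)
    s = rng.uniform(0.1, 3.0)
    c_lx = u + v * rho                       # Cov(Lambda, X)
    c_ly = u * rho + v                       # Cov(Lambda, Y)
    var_l = u * u + v * v + 2 * u * v * rho + s * s
    cov = [[var_l, c_lx, c_ly], [c_lx, 1.0, rho], [c_ly, rho, 1.0]]
    L, X, Y = (0,), (1,), (2,)
    mediation = gaussian_cmi(cov, X, Y, L)
    worst = max(gaussian_cmi(cov, L, X, Y), gaussian_cmi(cov, L, Y, X))
    return mediation, worst

rng = random.Random(0)
rho = 0.8  # illustrative correlation
points = [random_latent_point(rho, rng) for _ in range(2000)]
violations = [(m, r) for m, r in points if r < floor_per_view(rho, m) - 1e-9]
print(f"{len(points)} latents sampled, {len(violations)} below the floor")
```

A search restricted to this one family is weaker evidence than the theorem, which covers every latent, but it is a cheap way to see the floor hold.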
And here is that curve in a plot:
Consider the left side of the curve. At exact mediation, the remainder is $\tfrac12\log_2(1+\rho)$, which increases as the feeds become more alike, approaching half a bit as $\rho \to 1$. At exactly $\rho = 1$ it drops to zero, since identical feeds can just take $\Lambda$ to be the feed itself.
The Cliff
The cost of a fully shared concept rises as the views converge, all the way until the instant they coincide, where it falls off a cliff. At $\rho = 1$, we have identical feeds, and the same informative function applied to both always agrees trivially.
So near-perfect agreement is the most expensive place to want a fully shared concept, and only exact agreement makes it free. If you want the free regime, you must make the views coincide exactly.[7]
The Exchange Rate
The curve is also steepest at the left: the first sliver of mediation tolerance is the most valuable. Tolerate a small amount of unexplained agreement and the remainder drops sharply: a 40% reduction, bought with a 7% discount on explanation. (The slope at $\varepsilon = 0$ is infinite.)
At the right end the trade flattens to 8 bits of agreement explained per bit of remainder taken on, at $\rho = 0.8$, and better still at higher $\rho$.[8] Strongly overlapping views get rich shared concepts, almost for free.
At , the feeds share bits, and explaining all of it forces bits of remainder per view, more than everything the feeds share.
My interpretation is that the existence of a worthwhile shared concept depends on a sharp threshold: high-overlap views share richly and cheaply, views with low-or-middling overlap have nothing that's "worth it" to share.
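The exchange rate can be made concrete with a short calculation. The sketch below is my own illustration under assumptions: the per-view curve is taken to be $\tfrac12\log_2\frac{1+\rho}{1+\rho_\varepsilon}$, differentiating it gives a slope of $-\frac{1-\rho_\varepsilon}{2\rho_\varepsilon}$, so the rate of agreement explained per bit of remainder is $\frac{2\rho_\varepsilon}{1-\rho_\varepsilon}$. The value $\rho = 0.8$ is assumed because that is where this corner rate equals the 8 bits quoted above, and the 7% budget is the figure from the text.

```python
import math

def rho_of_bits(eps):
    """Correlation whose Gaussian mutual information is exactly eps bits."""
    return math.sqrt(1.0 - 2.0 ** (-2.0 * eps))

def min_remainder(rho, eps):
    """Assumed per-view floor, 0.5 * log2((1 + rho) / (1 + rho_eps))."""
    return 0.5 * math.log2((1.0 + rho) / (1.0 + rho_of_bits(eps)))

def exchange_rate(residual_rho):
    """Bits of agreement explained per bit of per-view remainder: -d(eps)/d(r)."""
    return 2.0 * residual_rho / (1.0 - residual_rho)

rho = 0.8  # assumed: the correlation at which the corner rate is 8
eps_max = -0.5 * math.log2(1.0 - rho * rho)  # I(X;Y) in bits

# Central finite difference of the curve at an interior budget, compared with the formula.
eps, h = 0.3, 1e-6
slope = (min_remainder(rho, eps + h) - min_remainder(rho, eps - h)) / (2 * h)
print("rate from finite difference:", -1.0 / slope)
print("rate from closed form      :", exchange_rate(rho_of_bits(eps)))
print("corner rate at the right end:", exchange_rate(rho))

# The first sliver: spend 7% of the shared information as mediation budget.
eps_small = 0.07 * eps_max
drop = 1.0 - min_remainder(rho, eps_small) / min_remainder(rho, 0.0)
print(f"remainder reduction for a 7% budget: {100 * drop:.1f}%")
```

Under these assumptions the reduction comes out at roughly 40%, consistent with the figures in the text.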
The Floor
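Maybe you don't care about either error separately and just want the best balanced concept, minimizing the worse of the two errors. That's the point where the curve crosses the diagonal, with the per-view remainder equal to the mediation budget. Solving $\tfrac12\log_2\frac{1+\rho}{1+\rho_\varepsilon} = \varepsilon$ gives $\rho_\varepsilon = \frac{\rho}{1+\rho}$, so the crossing sits at $\tfrac12\log_2\frac{(1+\rho)^2}{1+2\rho}$ bits, rising to $\tfrac12\log_2\tfrac43 \approx 0.21$ bits as $\rho \to 1$.[9]
So, this is a floor under naturality itself. Below the crossing value of error tolerance, there are no natural latents over a noisy pair. Every shared concept over a noisy world has an error floor set by the amount of noise.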
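The crossing value is easy to compute. The sketch below assumes the per-view curve $\tfrac12\log_2\frac{1+\rho}{1+\rho_\varepsilon}$; setting it equal to $\varepsilon$ gives $(1+\rho)(1-\rho_\varepsilon) = 1$, hence a crossing at $\tfrac12\log_2\frac{(1+\rho)^2}{1+2\rho}$ bits. The code checks that closed form (my own derivation, not quoted from the post) against a plain bisection on the curve; the function names and sample correlations are illustrative.

```python
import math

def rho_of_bits(eps):
    """Correlation whose Gaussian mutual information is exactly eps bits."""
    return math.sqrt(1.0 - 2.0 ** (-2.0 * eps))

def min_remainder(rho, eps):
    """Assumed per-view floor at mediation budget eps."""
    return 0.5 * math.log2((1.0 + rho) / (1.0 + rho_of_bits(eps)))

def crossing_closed_form(rho):
    """Budget at which the floor equals the budget itself."""
    return 0.5 * math.log2((1.0 + rho) ** 2 / (1.0 + 2.0 * rho))

def crossing_bisect(rho, tol=1e-12):
    """Find eps with min_remainder(rho, eps) == eps by bisection on [0, I(X;Y)]."""
    lo, hi = 0.0, -0.5 * math.log2(1.0 - rho * rho)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if min_remainder(rho, mid) > mid:
            lo = mid  # floor still above the diagonal: move right
        else:
            hi = mid
    return 0.5 * (lo + hi)

for rho in (0.3, 0.8, 0.99):
    print(f"rho = {rho}: bisection {crossing_bisect(rho):.6f}, "
          f"closed form {crossing_closed_form(rho):.6f} bits")
print("limit as rho -> 1:", 0.5 * math.log2(4.0 / 3.0))
```

The bisection works because the floor is decreasing in the budget while the diagonal is increasing, so they cross exactly once.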
Example: A Biased Die
Of the above, the half-bit cliff in particular is a weird result. It's discontinuous and counterintuitive. Could it be an artifact of Gaussian algebra? Here it is showing up in a familiar example, the biased die.
Consider a die with an unknown bias. Alice gets one long run of rolls and Bob gets another independent run. The bias is the thing they share: it's the entire reason their runs look alike, so it's the exact mediator, and mediation error is zero by construction. The redundancy error is the question: how much of the bias is stuck outside Alice's run?
For estimating the bias, Bob's run doubles Alice's sample. Doubling the sample halves her posterior variance, and halving a variance is worth $\tfrac12\log_2 2$, half a bit, no matter how much data she already had.
Below is this experiment computed in the Beta-binomial model: the remainder is bits at rolls per side and at .[10]
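Here is one way such a computation can go, as a minimal sketch rather than the post's script. It assumes a two-outcome "die" with a uniform Beta(1,1) prior on the bias $\theta$ and $n$ rolls per side; the prior and the run lengths below are my choices. Because the bias is an exact mediator and the pooled count is a sufficient statistic, the remainder is $I(\theta;\text{both runs}) - I(\theta;\text{one run}) = I(\theta;K_{2n}) - I(\theta;K_n)$, where $K_n$ is the number of successes in $n$ rolls. The prior-averaged binomial entropy has a closed form in harmonic numbers via Beta integrals.

```python
import math

def info_about_bias(n):
    """I(theta; K_n) in bits, with theta ~ Uniform(0, 1) and K_n ~ Binomial(n, theta)."""
    harmonic = [0.0]
    for m in range(1, n + 2):
        harmonic.append(harmonic[-1] + 1.0 / m)
    # Sum over k of (n + 1) * E_theta[p(k|theta) * ln p(k|theta)], using
    # E[ln theta | k] = H_k - H_{n+1} under the Beta(k + 1, n - k + 1) posterior.
    total = 0.0
    for k in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        total += (log_binom
                  + k * (harmonic[k] - harmonic[n + 1])
                  + (n - k) * (harmonic[n - k] - harmonic[n + 1]))
    cond_entropy_nats = -total / (n + 1)  # H(K_n | theta), averaged over the prior
    # Under a uniform prior the count K_n is uniform on {0, ..., n}.
    return math.log2(n + 1) - cond_entropy_nats / math.log(2)

def remainder(n):
    """Bits about the bias that Bob's n rolls add once Alice's n rolls are known."""
    return info_about_bias(2 * n) - info_about_bias(n)

for n in (1, 10, 100, 1000):
    print(f"n = {n:5d} rolls per side: remainder = {remainder(n):.4f} bits")
```

In this model the remainder starts near 0.2 bits for a single roll per side and climbs toward half a bit as the runs get longer, which is the Fisher-information intuition from the previous paragraph.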
What this Means
Two observers looking at the same world from different views never hold exactly the same concept. Any concept available to both carries a remainder, a part stuck on the other side that only pooling views would resolve. The size of the remainder is set by how much the vantages overlap, reaching zero only when they coincide exactly.
Recall that the Natural Abstractions agenda is motivated by the hope that a capable AI, modeling the same world as we do, will form the concepts that we form, such that we could find ours in it or point at them.
The tradeoff curve says[11] that the AI's version of any shared concept holds a remainder relative to ours, and ours relative to its. The potentially useful question: how large is the remainder for the concepts we care about, and when is it small enough to rely on?[12]
What's Next
Agents track many features of an object at once (e.g. shape, color, position), with each overlapping a different amount. Next post: the optimal shared concept over many features keeps the strongly overlapping ones and discards the weakly overlapping ones. Combined with correlations decaying over distance, this says that shared concepts will drop modes (features) one at a time discretely rather than blurring them gradually. I think this could be tested on learned representations in neural networks.
Also, everything in this post allows the latent to be stochastic. Whether a deterministic concept can always do essentially as well is Wentworth & Lorell's $500 bounty question; the machinery here settles only the Gaussian case (with linear error transfer), which I plan to write up later in the sequence.[13]
Every number and plot in this post can be generated from this script.
Conventions: Logarithms are base 2, quantities are in bits. A Python script reproducing the numbers shown in this post is linked at the end.
For those who want to get a deep understanding of this post, I highly recommend watching this presentation introducing CCA and CICA by Prof. Michael Gastpar, whose work I build on.
See my comment on the last post for a more detailed breakdown of what each classical object corresponds to in the case of the diagram.
The claim holds for any number of views, and for the stronger "recoverable from each complement-of-a-view" variant of redundancy; that version needs one extra idea (the latent becomes an invariant of a Gibbs resampling chain, and ergodicity forces invariants to be constant).
The MI algebra does not assume any type of distribution, hence this identity applies in the general case. I think this is the nicest result of the post, even though it's quite simple.
A latent with mediation error $\varepsilon' \le \varepsilon$ has total remainder at least $\log_2\frac{1+\rho}{1+\rho_{\varepsilon'}} \ge \log_2\frac{1+\rho}{1+\rho_\varepsilon}$, since the floor is decreasing in the budget. So the floor applies anywhere within the budget, not just on it.
I suspect this is part of why minds and systems function with discrete (digital) schemas to represent information: symbolic language, error correction, the genetic code, etc.
The slope of the curve has a closed form: $\frac{dr^*}{d\varepsilon} = -\frac{1-\rho_\varepsilon}{2\rho_\varepsilon}$, so the marginal exchange rate at any point is $\frac{2\rho_\varepsilon}{1-\rho_\varepsilon}$, where $\rho_\varepsilon$ is the correlation not yet explained. The going rate is always the corner rate of the residual. At the right end the residual is all of $\rho$, giving $\frac{2\rho}{1-\rho}$; at the left end the residual goes to zero and the rate goes to zero with it, which is the infinite slope at $\varepsilon = 0$. The agreement a marginal shared latent explains scales with the residual correlation, while the remainder it incurs scales with how different the views still are.
The runs are exchangeable rather than Gaussian, so this is reassurance that the half-bit is not an artifact of our Gaussian setting. It turns out to be a Fisher information statement (doubling data halves variance) that works asymptotically for any regular parametric family.
The curve exists for every distribution, not just Gaussians. The sum rule is distribution-free, so the minimal total remainder at mediation budget $\varepsilon$ is $C_\varepsilon(X;Y) - I(X;Y) + \varepsilon$, where $C_\varepsilon(X;Y)$ is the relaxed Wyner common information of the pair. This is defined for any two views, but there is no closed form in the general case. Its left endpoint is already informative in general: at exact mediation the total remainder is $C_W(X;Y) - I(X;Y)$, which is positive for generic pairs. This is the general-case analog of the half-bit.
This is speculative: I'm not too sure how useful it is to actually compute remainders for practical purposes. I hope to write more on this later.
The remaining hard case for the general bounty lives near decomposable distributions (tiny-mixtures territory), which Gaussian methods provably can't reach. Again, maybe more on this later.