Natural Latents Are Not Robust To Tiny Mixtures

johnswentworth; David Lorell

In our previous natural latent posts, our core theorem typically says something like:

Assume two agents have the same predictive distribution over variables $X$ , but model that distribution using potentially-different latent variables. If the latents both satisfy some simple “naturality” conditions (mediation and redundancy) then the two agents’ latents contain approximately the same information about $X$ . So, insofar as the two agents both use natural latents internally, we have reason to expect that the internal latents of one can be faithfully translated into the internal latents of the other.

This post is about one potential weakness in that claim: what happens when the two agents’ predictive distributions are only approximately the same?

Following the pattern of our previous theorems, we’d ideally say something like

If the two agents’ distributions are within $ϵ$ of each other (as measured by some KL-divergences), then their natural latents contain approximately the same information about $X$ , to within some $O (ϵ)$ bound.

But that turns out to be false.

The Tiny Mixtures Counterexample

Let’s start with two distributions, $P^{0}$ and $Q^{0}$ , over $X$ . These won’t be our two agents’ distributions - we’re going to construct our two agents’ distributions by mixing these two together, as the name “tiny mixtures” suggests.

$P^{0}$ and $Q^{0}$ will have extremely different natural latents. Specifically:

$X_{1}$ consists of 1 million bits, $X_{2}$ consists of another 1 million bits
Under $P^{0}$ , $X_{1}$ is uniform, and $X_{2} = X_{1}$ . So, there is an exact natural latent $Λ^{P} = X_{1} = X_{2}$ under $P^{0}$ .
Under $Q^{0}$ , $X_{1}$ and $X_{2}$ are independent and uniform. So, the empty latent $Λ^{Q}$ is exactly natural under $Q^{0}$ .

Mental picture: we have a million-bit channel, under $P^{0}$ the output ( $X_{2}$ ) is equal to the input ( $X_{1}$ ), while under $Q^{0}$ the channel hardware is maintained by Comcast so they’re independent.

Now for our two agents’ distributions, $P$ and $Q$ . $P$ will be almost $P^{0}$ , and $Q$ will be almost $Q^{0}$ , but each agent puts a $\frac{1}{2^{50}}$ probability on the other distribution:

$P = (1 - \frac{1}{2^{50}}) P^{0} + \frac{1}{2^{50}} Q^{0}$
$Q = \frac{1}{2^{50}} P^{0} + (1 - \frac{1}{2^{50}}) Q^{0}$

First key observation: $D_{K L} (P | | Q)$ and $D_{K L} (Q | | P)$ are both roughly 50 bits. Calculation:

$D_{K L} (P | | Q) = \sum_{X_{1}, X_{2}} P [X] (log P [X] - log Q [X])$ $\approx \sum_{X_{1} = X_{2}} \frac{1}{2^{1000000}} (- 1000000 - log (\frac{1}{2^{2000000}} + \frac{1}{2^{50}} \frac{1}{2^{1000000}})) \approx 50$
$D_{K L} (Q | | P) = \sum_{X_{1}, X_{2}} Q [X] (log Q [X] - log P [X])$ $\approx \sum_{X_{1} \neq X_{2}} \frac{1}{2^{2000000}} (- 2000000 - log (\frac{1}{2^{50}} \frac{1}{2^{2000000}})) \approx 50$

Intuitively: since each distribution puts roughly $\frac{1}{2^{50}}$ on the other, it takes about 50 bits of evidence to update from either one to the other.
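The calculation above can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes the generalized setup with `n_bits` bits per variable and mixin probability `2**-m`, and the function names (`kl_closed_form`, `kl_brute_force`) are hypothetical. It relies on the fact that both distributions are uniform within the "diagonal" class ($X_{1} = X_{2}$) and within the "off-diagonal" class, so the KL divergence reduces to a two-term sum over class masses. A brute-force enumeration for small sizes is included as a sanity check.

```python
import math


def _term(mass_a, mass_b):
    """One class's contribution mass_a * log2(mass_a / mass_b), with 0 * log 0 = 0."""
    return mass_a * math.log2(mass_a / mass_b) if mass_a > 0 else 0.0


def kl_closed_form(n_bits, m):
    """Return (D_KL(P||Q), D_KL(Q||P)) in bits for the tiny-mixtures setup."""
    eps = math.ldexp(1.0, -m)      # mixin probability 2^-m
    t = math.ldexp(1.0, -n_bits)   # 2^-n_bits; underflows to 0.0 when n_bits is huge
    # Total probability mass on {x1 == x2} and on {x1 != x2} under each distribution.
    p_diag, p_off = (1 - eps) + eps * t, eps * (1 - t)
    q_diag, q_off = eps + (1 - eps) * t, (1 - eps) * (1 - t)
    kl_pq = _term(p_diag, q_diag) + _term(p_off, q_off)
    kl_qp = _term(q_diag, p_diag) + _term(q_off, p_off)
    return kl_pq, kl_qp


def kl_brute_force(n_bits, m):
    """Same quantities by enumerating every (x1, x2); only feasible for small n_bits."""
    eps = 2.0 ** -m
    size = 2 ** n_bits
    kl_pq = kl_qp = 0.0
    for x1 in range(size):
        for x2 in range(size):
            p0 = 1.0 / size if x1 == x2 else 0.0   # P^0: x1 uniform, x2 = x1
            q0 = 1.0 / size ** 2                   # Q^0: independent uniform
            p = (1 - eps) * p0 + eps * q0
            q = eps * p0 + (1 - eps) * q0
            kl_pq += p * math.log2(p / q)
            kl_qp += q * math.log2(q / p)
    return kl_pq, kl_qp


if __name__ == "__main__":
    print(kl_closed_form(1_000_000, 50))           # the post's parameters
    print(kl_closed_form(4, 3), kl_brute_force(4, 3))
```

With the post's parameters (a million bits, $\frac{1}{2^{50}}$ mixin), both divergences should come out within rounding error of 50 bits, matching the intuition that it takes about 50 bits of evidence to move from either distribution to the other.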

Second key observation: the empty latent is approximately natural under $Q$ , and the latent $Λ := X_{1}$ is approximately natural under $P$ . Epsilons:

Under $Q$ , the empty latent satisfies mediation to within about $\frac{1}{2^{50}} * 1000000 \approx \frac{1}{2^{30}}$ bits (this is just mutual information of $X_{1}$ and $X_{2}$ under $Q$ ), and redundancy exactly (since the empty latent can always be exactly computed from any input).
Under $P$ , $Λ := X_{1}$ satisfies mediation exactly (since $X_{1}$ mediates between $X_{1}$ and anything else), redundancy with respect to $X_{2}$ exactly ( $Λ = X_{1}$ can be exactly computed from just $X_{1}$ without $X_{2}$ ), and redundancy with respect to $X_{1}$ to within about $\frac{1}{2^{50}} * 1000000 \approx \frac{1}{2^{30}}$ bits (since there’s a $\frac{1}{2^{50}}$ chance that $X_{2}$ doesn’t tell us the relevant 1000000 bits).
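The two nonzero epsilons above can also be computed in closed form. This is a sketch under stated assumptions rather than the authors' own calculation: it takes the mediation error of the empty latent under $Q$ to be the mutual information $I(X_{1}; X_{2})$ (as the text says), and takes the redundancy error of $Λ := X_{1}$ under $P$ to be the conditional entropy $H(X_{1} | X_{2})$, which is an assumption about how that error is measured. The function names are hypothetical, and `m >= 1` is assumed.

```python
import math


def mediation_error_under_q(n_bits, m):
    """I(X1; X2) under Q, in bits: the mediation error of the empty latent."""
    eps = math.ldexp(1.0, -m)      # mixin probability 2^-m
    t = math.ldexp(1.0, -n_bits)   # 2^-n_bits; underflows to 0.0 when n_bits is huge
    q_diag = eps + (1 - eps) * t           # total mass on {x1 == x2}
    q_off = (1 - eps) * (1 - t)            # total mass on {x1 != x2}
    # On the diagonal, Q[x] / (Q[x1] Q[x2]) = eps * 2^n + (1 - eps).
    # Factor out 2^(n - m) so that a huge n_bits does not overflow.
    log_ratio_diag = (n_bits - m) + math.log2(
        1 + (1 - eps) * math.ldexp(1.0, m - n_bits)
    )
    # Off the diagonal the same ratio is simply (1 - eps).
    return q_diag * log_ratio_diag + q_off * math.log2(1 - eps)


def redundancy_error_under_p(n_bits, m):
    """H(X1 | X2) under P, in bits: how far X1 is from being computable from X2."""
    eps = math.ldexp(1.0, -m)
    t = math.ldexp(1.0, -n_bits)
    p_diag = (1 - eps) + eps * t           # also equals P[x1 == x2 | x2]
    p_off = eps * (1 - t)
    # Off the diagonal, P[x1 | x2] = eps * 2^-n, so -log2 of it is m + n_bits.
    return -p_diag * math.log2(p_diag) + p_off * (m + n_bits)


if __name__ == "__main__":
    n_bits, m = 1_000_000, 50
    print(mediation_error_under_q(n_bits, m))
    print(redundancy_error_under_p(n_bits, m))
    print(2.0 ** -m * n_bits)              # the post's rough estimate
```

For the post's parameters, both errors should land very close to the rough estimate $\frac{1}{2^{50}} * 1000000$, i.e. on the order of $\frac{1}{2^{30}}$ bits.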

… and of course the information those two latents tell us about $X$ differs by 1 million bits: one of them is empty, and the other directly tells us 1 million bits about $X_{1}$ .

Now, let’s revisit the claim we would’ve liked to make:

If the two agents’ distributions are within $ϵ$ of each other (as measured by some KL-divergences), then their natural latents contain approximately the same information about $X$ , to within some $O (ϵ)$ bound.

Tiny mixtures rule out any claim along those lines. Generalizing the counterexample to an $N$ bit channel (where $N = 1000000$ above) and a mixin probability of $\frac{1}{2^{M}}$ (where $M = 50$ above), we generally see that the two latents are natural over their respective distributions to within about $\frac{1}{2^{M}} N$ bits, the $D_{K L}$ between the distributions is about $- log \frac{1}{2^{M}} = M$ bits in either direction, yet one latent contains $N$ bits of information about $X$ while the other contains zero. By choosing $2^{M} \gg N$ , with both $M$ and $N$ large, we can get arbitrarily precise natural latents over the two distributions, with the difference in the latents exponentially large with respect to the $D_{K L}$ ’s between distributions.
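As a rough worked version of the general case (a sketch assuming $2^{- M} \ll 1$ and $N \gg M$ , with logs base 2; it simply repeats the calculation from the first key observation with $N$ and $M$ left symbolic): the sum is dominated by the $2^{N}$ outcomes with $X_{1} = X_{2}$ , which carry almost all of $P$ 's mass, so

$D_{K L} (P | | Q) \approx log \frac{2^{- N}}{2^{- M} 2^{- N} + 2^{- 2 N}} = M - log (1 + 2^{M - N}) \approx M$

The outcomes with $X_{1} \neq X_{2}$ carry only about $\frac{1}{2^{M}}$ of $P$ 's mass and contribute roughly $- \frac{M}{2^{M}}$ , which is negligible. Plugging in $N = 1000000$ and $M = 50$ recovers the roughly 50 bits computed earlier, and the same argument with $P$ and $Q$ swapped gives the other direction.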

What To Do Instead?

So the bound we’d ideally like is ruled out. What alternatives might we aim for?

Different Kind of Approximation

Looking at the counterexample, one thing which stands out is that $P$ and $Q$ are, intuitively, very different distributions. Arguably, the problem is that a “small” $D_{K L}$ just doesn’t imply that the distributions are all that close together; really we should use some other kind of approximation.

On the other hand, $D_{K L}$ is a pretty nice principled error-measure with nice properties, and in particular it naturally plugs into information-theoretic or thermodynamic machinery. And indeed, we are hoping to plug all this theory into thermodynamic-style machinery down the road. For that, we need global bounds, and they need to be information-theoretic.

Additional Requirements for Natural Latents

Coming from another direction: a 50-bit update can turn $Q$ into $P$ , or vice-versa. So one thing this example shows is that natural latents, as they’re currently formulated, are not necessarily robust to even relatively small updates, since 50 bits can quite dramatically change a distribution.

Interestingly, there do exist other natural latents over these two distributions which are approximately the same (under their respective distributions) as the two natural latents we used above, but more robust (in some ways) to turning one distribution into the other. In particular: we can always construct a natural latent with competitively optimal approximation via resampling. Applying that construction to $Q$ , we get a latent which is usually independent random noise (which gives the same information about $X$ as the empty latent), but there’s a $\frac{1}{2^{50}}$ chance that it contains the value of $X_{1}$ and another $\frac{1}{2^{50}}$ chance that it contains the value of $X_{2}$ . Similarly, we can use the resampling construction to find a natural latent for $P$ , and it will have a $\frac{1}{2^{50}}$ chance of containing random noise instead of $X_{1}$ , and an independent $\frac{1}{2^{50}}$ chance of containing random noise instead of $X_{2}$ .

Those two latents still differ in their information content about $X$ by roughly 1 million bits, but the distribution of $X$ given each latent differs by only about 100 bits in expectation. Intuitively: while the agents still strongly disagree about the distribution of their respective latents, they agree (to within ~100 bits) on what each value of the latent says about $X$ .

Does that generalize beyond this one example? We don’t know yet.

But if it turns out that the competitively optimal natural latent is generally robust to updates, in some sense, then it might make sense to add a robustness-to-updates requirement for natural latents - require that we use the “right” natural latent, in order to handle this sort of problem.

Same Distribution

A third possible approach is to formulate the theory around a single distribution $P [X]$ .

For instance, we could assume that the environment follows some “true distribution”, and both agents look for latents which are approximately natural over the “true distribution” (as far as they can tell, since the agents can’t observe the whole environment distribution directly). This would probably end up with a Fristonian flavor.

ADDED July 9: The Competitively Optimal Natural Latent from Resampling Always Works (At Least Mediocrely)

Recall that, for a distribution $P [X_{1}, \ldots, X_{n}]$ , we can always construct a competitively optimal natural latent (under strong redundancy) $X^{'}$ by resampling each component $X_{i}$ conditional on the others $X_{\bar{i}}$ , i.e.

$P [X = x, X^{'} = x^{'}] := P [X = x] \prod_{i} P [X_{i} = x_{i}^{'} | X_{\bar{i}} = x_{\bar{i}}]$
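To make the resampling construction concrete, here is a minimal sketch for small discrete distributions. It is an illustration under assumptions, not the authors' code: the joint distribution is represented as a dict from outcome tuples to probabilities, all components are assumed to share one alphabet, and the names `conditional` and `resample_latent` are hypothetical.

```python
from itertools import product


def conditional(p, i, value, x):
    """P[X_i = value | X_{-i} = x_{-i}] for a joint p given as {tuple: probability}."""
    others = x[:i] + x[i + 1:]
    # Marginal probability of the other components taking the values in x.
    denom = sum(prob for y, prob in p.items() if y[:i] + y[i + 1:] == others)
    return p.get(x[:i] + (value,) + x[i + 1:], 0.0) / denom


def resample_latent(p):
    """Joint over (x, x') with P[x, x'] = P[x] * prod_i P[X_i = x'_i | X_{-i} = x_{-i}]."""
    n = len(next(iter(p)))
    alphabet = sorted({v for x in p for v in x})   # assumes one shared alphabet
    joint = {}
    for x, px in p.items():
        if px == 0:
            continue                               # skip outcomes outside the support
        for x_new in product(alphabet, repeat=n):
            weight = px
            for i in range(n):
                # Each component of the latent is resampled given the *other*
                # components of the original x, independently of the rest.
                weight *= conditional(p, i, x_new[i], x)
            if weight > 0:
                joint[(x, x_new)] = weight
    return joint


if __name__ == "__main__":
    # One-bit analogue of P^0: x1 uniform and x2 = x1, so resampling reproduces x.
    p0 = {(0, 0): 0.5, (1, 1): 0.5}
    print(resample_latent(p0))
    # One-bit analogue of Q^0: independent uniform bits, so x' is pure noise.
    q0 = {x: 0.25 for x in product((0, 1), repeat=2)}
    print(len(resample_latent(q0)))
```

In the one-bit analogue of $P^{0}$ the latent $X^{'}$ is an exact copy of $X$ , while in the analogue of $Q^{0}$ it is independent noise, which mirrors the two extreme latents in the counterexample.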

We argued above that this specific natural latent works just fine in the tiny mixtures counterexample: roughly speaking, the resampling natural latent constructed for $P$ approximates the resampling natural latent constructed for $Q$ (to within an error comparable to how well $P$ approximates $Q$ ).

Now we'll show that that generalizes. Our bound will be mediocre, but it's any bound at all, so that's progress.

Specifically: suppose we have two distributions over the same variables, $P [X_{1}, \ldots, X_{n}]$ and $Q [X_{1}, \ldots, X_{n}]$ . We construct a competitively optimal natural latent $X^{'}$ via resampling for each distribution:

$P [X = x, X^{'} = x^{'}] := P [X = x] \prod_{i} P [X_{i} = x_{i}^{'} | X_{\bar{i}} = x_{\bar{i}}]$

$Q [X = x, X^{'} = x^{'}] := Q [X = x] \prod_{i} Q [X_{i} = x_{i}^{'} | X_{\bar{i}} = x_{\bar{i}}]$

Then, we'll use $E [D_{K L} (P [X^{'} | X] | | Q [X^{'} | X])]$ (with expectation taken over $X$ under distribution $P$ ) as a measure of how well $Q$ 's latent $X^{'}$ matches $P$ 's latent $X^{'}$ . Core result:

$E [D_{K L} (P [X^{'} | X] | | Q [X^{'} | X])] \leq n D_{K L} (P [X] | | Q [X])$

Proof:

$E [D_{K L} (P [X^{'} | X] | | Q [X^{'} | X])] = E [D_{K L} (\prod_{i} P [X_{i} = x_{i}^{'} | X_{\bar{i}} = x_{\bar{i}}] | | \prod_{i} Q [X_{i} = x_{i}^{'} | X_{\bar{i}} = x_{\bar{i}}])]$

$= \sum_{i} E [D_{K L} (P [X_{i} = x_{i}^{'} | X_{\bar{i}} = x_{\bar{i}}] | | Q [X_{i} = x_{i}^{'} | X_{\bar{i}} = x_{\bar{i}}])]$

$= \sum_{i} E [D_{K L} (P [X_{i} | X_{\bar{i}}] | | Q [X_{i} | X_{\bar{i}}])]$

$\leq \sum_{i} (E [D_{K L} (P [X_{i} | X_{\bar{i}}] | | Q [X_{i} | X_{\bar{i}}])] + D_{K L} (P [X_{\bar{i}}] | | Q [X_{\bar{i}}]))$

$= \sum_{i} D_{K L} (P [X] | | Q [X])$

$= n D_{K L} (P [X] | | Q [X])$

So we have a bound. Unfortunately, the factor of $n$ (number of variables) makes the bound kinda mediocre. We could sidestep that problem in practice by just using natural latents over a small number of variables at any given time (which is actually fine for many and arguably most use cases). But based on the proof, it seems like we should be able to improve a lot on that factor of $n$ ; we outright add $\sum_{i} D_{K L} (P [X_{\bar{i}}] | | Q [X_{\bar{i}}])$ , which should typically be much larger than the quantity we're trying to bound.
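The bound can be spot-checked numerically on small examples. The sketch below is illustrative only and rests on assumptions not in the post: binary variables, strictly positive random distributions (so every KL divergence is finite), and hypothetical helper names (`random_dist`, `cond`, `kl`, `latent_mismatch`). It uses the fact, from the first steps of the proof, that the left-hand side decomposes into a sum over components of expected KL divergences between the resampling conditionals.

```python
import math
import random
from itertools import product


def random_dist(n, rng):
    """A random strictly positive distribution over n binary variables."""
    outcomes = list(product((0, 1), repeat=n))
    weights = [rng.random() + 0.01 for _ in outcomes]
    total = sum(weights)
    return {x: w / total for x, w in zip(outcomes, weights)}


def cond(p, i, x):
    """Distribution of X_i given X_{-i} = x_{-i}, as {value: probability}."""
    probs = {v: p[x[:i] + (v,) + x[i + 1:]] for v in (0, 1)}
    total = sum(probs.values())
    return {v: pr / total for v, pr in probs.items()}


def kl(a, b):
    """D_KL(a || b) in bits for two dicts over the same keys."""
    return sum(pa * math.log2(pa / b[k]) for k, pa in a.items() if pa > 0)


def latent_mismatch(p, q):
    """E_{x ~ P}[ D_KL(P[X'|X=x] || Q[X'|X=x]) ] for the resampling latents."""
    n = len(next(iter(p)))
    # X' given X = x is a product over components, so the KL divergence is a
    # sum over i of the KL between the per-component resampling conditionals.
    return sum(
        px * sum(kl(cond(p, i, x), cond(q, i, x)) for i in range(n))
        for x, px in p.items()
    )


if __name__ == "__main__":
    rng = random.Random(0)
    n = 3
    for _ in range(5):
        p, q = random_dist(n, rng), random_dist(n, rng)
        print(latent_mismatch(p, q), "<=", n * kl(p, q))
```

On examples like these, comparing the two printed columns gives a feel for how much slack the factor of $n$ leaves, though random small distributions are of course not representative of every case.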

Coming from another direction: a 50-bit update can turn $Q$ into $P$ , or vice-versa. So one thing this example shows is that natural latents, as they’re currently formulated, are not necessarily robust to even relatively small updates, since 50 bits can quite dramatically change a distribution.

Are you sure this is undesired behavior? Intuitively, small updates (relative to the information-content size of the system regarding which we're updating) can drastically change how we're modeling a particular system, into what abstractions we decompose it. E. g., suppose we have two competing theories regarding how to predict the neural activity in the human brain, and a new paper comes out with some clever (but informationally compact) experiment that yields decisive evidence in favour of one of those theories. That's pretty similar to the setup in the post here, no? And reading this paper would lead to significant ontology shifts in the minds of the researchers who read it.

Which brings to mind How Many Bits Of Optimization Can One Bit Of Observation Unlock?, and the counter-example there...

Indeed, now that I'm thinking about it, I'm not sure the quantity $\frac{\text{bit-size of the update}}{\text{bit-size of the system}}$ is in any way interesting at all? Consider that the researchers' minds could be updated either from reading the paper and examining the experimental procedure in detail (a "medium" number of bits), or by looking at the raw output data and then doing a replication of the paper (a "large" number of bits), or just by reading the names of the authors and skimming the abstract (a "small" number of bits).

There doesn't seem to be a direct causal connection between the system's size and the amount of bits needed to drastically update on its structure at all? You seem to expect some sort of proportionality between the two, but I think the size of one is straight-up independent of the size of the other if you let the nature of the communication channel between the system and the agent-doing-the-updating vary freely (i. e., if you're uncertain regarding whether it's "direct observation of the system" OR "trust in science" OR "trust in the paper's authors" OR ...).^[1]

Indeed, merely describing how you need to update using high-level symbolic languages, rather than by throwing raw data about the system at you, already shaves off a ton of bits, decoupling "the size of the system" from "the size of the update".

Perhaps $D_{K L}$ really isn't the right metric to use, here? The motivation for having natural abstractions in your world-model is that they make the world easier to predict for the purposes of controlling said world. So similar-enough natural abstractions would recommend the same policies for navigating that world. Back-tracking further, the distributions that would give rise to similar-enough natural abstractions would be distributions that correspond to worlds the policies for navigating which are similar-enough...

I. e., the distance metric would need to take interventions/the $do$ operator into account. Something like SID comes to mind (but not literally SID, I expect).

^[1] Though there may be some more interesting claim regarding that entire channel? E. g., that if the agent can update drastically just based on a few bits output by this channel, we have to assume that the channel contains "information funnels" which compress/summarize the raw state of the system down? That these updates have to be entangled with at least however-many-bits describing the ground-truth state of the system, for them to be valid?

Which brings to mind How Many Bits Of Optimization Can One Bit Of Observation Unlock?, and the counter-example there...

We actually started from that counterexample, and the tiny mixtures example grew out of it.

In the context of alignment, we want to be able to pin down which concepts we are referring to, and natural latents were (as I understand it) partly meant to be a solution to that. However if there are multiple different concepts that fit the same natural latent but function very differently then that doesn't seem to solve the alignment aspect.

I do see the intuitive angle of "two agents exposed to mostly-similar training sets should be expected to develop the same natural abstractions, which would allow us to translate between the ontologies of different ML models and between ML models and humans", and that this post illustrated how one operationalization of this idea failed.

However if there are multiple different concepts that fit the same natural latent but function very differently

That's not quite what this post shows, I think? It's not that there are multiple concepts that fit the same natural latent, it's that if we have two distributions that are judged very close by the KL divergence, and we derive the natural latents for them, they may turn out drastically different. The $P$ agent and the $Q$ agent legitimately live in very epistemically different worlds!

Which is likely not actually the case for slightly different training sets, or LLMs' training sets vs. humans' life experiences. Those are very close on some metric $X$ , and now it seems that $X$ isn't (just) $D_{K L}$ .

Maybe one way to phrase it is that the X's represent the "type signature" of the latent, and the type signature is the thing we can most easily hope is shared between the agents, since it's "out there in the world" as it represents the outwards interaction with things. We'd hope to be able to share the latent simply by sharing the type signature, because the other thing that determines the latent is the agents' distribution, but this distribution is more an "internal" thing that might be too complicated to work with. But the proof in the OP shows that the type signature is not enough to pin it down, even for agents whose models are highly compatible with each other as-measured-by-KL-in-type-signature.

Sure, but what I question is whether the OP shows that the type signature wouldn't be enough for realistic scenarios where we have two agents trained on somewhat different datasets. It's not clear that their datasets would be different the same way $P$ and $Q$ are different here.

I may misunderstand (I’ve only skimmed), but it's not clear to me we want natural latents to be robust to small updates. Phase changes and bifurcation points seem like something you should expect here. I would however feel more comfortable if such points had small or infinitesimal measure.

Another angle to consider: in this specific scenario, would realistic agents actually derive natural latents for $P$ and $Q$ as a whole, as opposed to deriving two mutually incompatible latents for the $Q^{0}$ and $P^{0}$ components, then working with a probability distribution over those latents?

Intuitively, that's how humans operate if they have two incompatible hypotheses about some system. We don't derive some sort of "weighted-average" ontology for the system, we derive two separate ontologies and then try to distinguish between them.

This post comes to mind:

If you only care about betting odds, then feel free to average together mutually incompatible distributions reflecting mutually exclusive world-models. If you care about planning then you actually have to decide which model is right or else plan carefully for either outcome.

Like, "just blindly derive the natural latent" is clearly not the whole story about how world-models work. Maybe realistic agents have some way of spotting setups structured the way the OP is structured, and then they do something more than just deriving the latent.
