$500 Bounty Problem: Are (Approximately) Deterministic Natural Latents All You Need?

by johnswentworth, David Lorell
21st Apr 2025

24 comments, sorted by top scoring
[-]Alfred Harwood4mo230

This seems like an interesting problem! I've been thinking about it a little bit but wanted to make sure I understood before diving in too deep. Can I see if I understand this by going through the biased coin example?

Suppose I have 2^5 coins and each one is given a unique 5-bit string label covering all binary strings from 00000 to 11111. Call the string on the label Λ.

The label given to the coin indicates its 'true' bias. The string 00000 indicates that the coin with that label has p(heads)=0. The coin labelled 11111 has p(heads)=1. The ‘true’ p(heads) increases in equal steps going up from 00000 to 00001 to 00010 etc. Suppose I randomly pick a coin from this collection, toss it 200 times and call the number of heads X_1. Then I toss it another 200 times and call the number of heads X_2.

Now, if I tell you what the label on the coin was (which tells us the true bias of the coin), telling you X_1 would not give you any more information to help you guess X_2 (and vice versa). This is the first Natural Latent condition (Λ induces independence between X_1 and X_2). Alternatively, if I didn’t tell you the label, you could estimate it from either X_1 or X_2 equally well. This is the other two diagrams.

I think that the full label Λ will be an approximate stochastic natural latent. But if we consider only the first bit[1] of the label (which roughly tells us whether the bias is above or below 50% heads) then this bit will be a deterministic natural latent because with reasonably high certainty, you can guess the first bit of Λ from X_1 or X_2. This is because the conditional entropy H(first bit of Λ|X_1) is low. On the other hand H(Λ | X_1) will be high. If I get only 23 heads out of 200 tosses, I can be reasonably certain that the first bit of Λ is a 0 (i.e. the coin has a less than 50% chance of coming up heads) but can't be as certain what the last bit of Λ is. Just because Λ satisfies the Natural Latent conditions within ϵ, this doesn't imply that Λ satisfies H(Λ|X1)<ϵ. We can use X_1 to find a 5-bit estimate of Λ, but most of the useful information in that estimate is contained in the first bit. The second bit might be somewhat useful, but it's less certain than the first. The last bit of the estimate will largely be noise. This means that going from using Λ to using 'first bit of Λ' doesn't decrease the usefulness of the latent very much, since the stuff we are throwing out is largely random. As a result, the 'first bit of Λ' will still satisfy the natural latent conditions almost as well as Λ. By throwing out the later bits, we threw away the most 'stochastic' bits, while keeping the most 'latenty' bits.
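
(A quick numerical sketch of those two conditional entropies, assuming the setup above: 32 equally likely labels with biases evenly spaced from 0 to 1, and X_1 = number of heads in 200 flips. The code is just illustrative, not anything from the post.)

import numpy as np
from math import comb

n = 200
biases = np.linspace(0, 1, 32)      # 'true' P(heads) for labels 00000..11111, in order
prior = np.full(32, 1 / 32)

# P(X_1 = k | Λ = label): binomial pmf, shape (32, 201)
pmf = np.array([[comb(n, k) * b**k * (1 - b)**(n - k) for k in range(n + 1)]
                for b in biases])
joint = prior[:, None] * pmf        # joint P(Λ, X_1)
p_x = joint.sum(axis=0)             # marginal P(X_1 = k)

def cond_entropy(joint_lx, p_x):
    # H(L | X) in bits, from the joint P(L, X) and the marginal P(X)
    post = joint_lx / np.where(p_x > 0, p_x, 1.0)
    logs = np.where(post > 0, np.log2(post), 0.0)
    return -(joint_lx * logs).sum()

# coarse-grain Λ to its first bit: labels 10000..11111 are the upper half of the biases
first_bit = np.arange(32) >= 16
joint_bit = np.stack([joint[~first_bit].sum(axis=0), joint[first_bit].sum(axis=0)])

print("H(Λ | X_1)            ≈", round(cond_entropy(joint, p_x), 3), "bits")
print("H(first bit of Λ|X_1) ≈", round(cond_entropy(joint_bit, p_x), 3), "bits")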

So in this case, we have started from a stochastic natural latent and used it to construct a deterministic natural latent which is almost as good. I haven’t done the calculation, but hopefully we could say something like ‘if Λ satisfies the natural latent conditions within ϵ then the first bit of Λ satisfies the natural latent conditions within 2ϵ (or 3ϵ or something else)’. Would an explicit proof of a statement like this for this case be a special case of the general problem?

The problem question could be framed as something like: "Is there some standard process we can do for every stochastic natural latent, in order to obtain a deterministic natural latent which is almost as good (in terms of ϵ)". This process will be analogous to the 'throwing away the less useful/more random bits of Λ' which we did in the example above. Does this sound right?

Also, can all stochastic natural latents be thought of as 'more approximate' deterministic latents? If a latent satisfies the three natural latent conditions within ϵ1, we can always find a (potentially much bigger) ϵ2 such that this latent also satisfies the deterministic latent condition, right? This is why you need to specify that the problem is showing that a deterministic natural latent exists with 'almost the same' ϵ. Does this sound right?

 

 

  1. ^

    I'm going to talk about the 'first bit' but an equivalent argument might also hold for the 'first two bits' or something. I haven't actually checked the maths. 

[-]johnswentworth4mo50

Some details mildly off, but I think you've got the big picture basically right.

Alternatively, if I didn’t tell you the label, you could estimate it from either X_1 or X_2 equally well. This is the other two diagrams.

Minor clarification here: the other two diagrams say not only that I can estimate the label equally well from either X1 or X2, but that I can estimate the label (approximately) equally well from X1, X2, or the pair (X1,X2).

I think that the full label Λ will be an approximate stochastic natural latent.

I'd have to run the numbers to check that 200 flips is enough to give a high-confidence estimate of Λ (in which case 400 flips from the pair of variables will also put high confidence on the same value with high probability), but I think yes.

But if we consider only the first bit[1] of the label (which roughly tells us whether the bias is above or below 50% heads) then this bit will be a deterministic natural latent because with reasonably high certainty, you can guess the first bit of Λ from X1 or X2.

Not quite; I added some emphasis. The first bit will (approximately) satisfy the two redundancy conditions, i.e. X1→X2→1bit(Λ) and X2→X1→1bit(Λ), and indeed will be an approximately deterministic function of X. But it won't (approximately) satisfy the mediation condition X1←1bit(Λ)→X2; the two sets of flips will not be (approximately) independent given only the first bit. (At least not to nearly as good an approximation as the original label.)

That said, the rest of your qualitative reasoning is correct. As we throw out more low-order bits, the mediation condition becomes less well approximated, the redundancy conditions become better approximated, and the entropy of the coarse-grained latent given X falls.

So to build a proof along these lines, one would need to show that a bit-cutoff can be chosen such that bit_cutoff(Λ) still mediates (to an approximation roughly ϵ-ish), while making the entropy of bit_cutoff(Λ) low given X.

I do think this is a good angle of attack on the problem, and it's one of the main angles I'd try.

If a latent satisfies the three natural latent conditions within ϵ1, we can always find a (potentially much bigger) ϵ2 such that this latent also satisfies the deterministic latent condition, right? This is why you need to specify that the problem is showing that a deterministic natural latent exists with 'almost the same' ϵ. Does this sound right?

Yes. Indeed, if we allow large enough ϵ (possibly scaling with system size/entropy) then there's always a deterministic natural latent regardless; the whole thing becomes trivial.

[-]Donald Hobson4mo40

I'd have to run the numbers to check that 200 flips is enough to give a high-confidence estimate of Λ

It isn't enough. See plot. Also, 200 not being enough flips is part of what makes this interesting. With a million flips, this would pretty much just be the exact case. The fact that it's only 200 flips gives you a tradeoff in how many label_bits to include. 

[-]Alfred Harwood4mo40

Thanks for the clarifications, that all makes sense. I will keep thinking about this!

[-]Donald Hobson4mo40

Here is the probability mass function for the number of heads, plotted for each of your coins.

 

python code

import numpy as np
import matplotlib.pyplot as plt

biases = np.linspace(0, 1, 32)          # the 32 possible 'true' values of P(heads)

def f(a):
    # pmf of the number of heads in 200 flips of a coin with P(heads) = a,
    # built by repeatedly convolving the single-flip distribution with itself
    single = np.array([1 - a, a])       # index 0 = tails, index 1 = heads
    b = single
    for _ in range(199):
        b = np.convolve(b, single)
    return b                            # b[k] = P(k heads out of 200)

q = np.arange(201)                      # possible head counts
for a in biases:
    plt.plot(q, f(a))
plt.xlabel("heads")
plt.ylabel("prob")
plt.show()

[-]David Johnston4mo50

I've thought about it a bit, I have a line of attack for a proof, but there's too much work involved in following it through to an actual proof so I'm going to leave it here in case it helps anyone.

I'm assuming everything is discrete so I can work with regular Shannon entropy.

Consider the range R1 of the function g1:λ↦P(X1|Λ=λ) and R2 defined similarly. Discretize R1 and R2 (chop them up into little balls). Not sure which metric to use, maybe TV.

Define Λ′1(λ) to be the index of the ball into which P(X1|Λ=λ) falls, Λ′2 similar. So if d(P(X1|Λ=a),P(X1|Λ=b)) is sufficiently small, then Λ′1(a)=Λ′1(b).
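
(One concrete way to implement the "chop them up into little balls" step, just as an illustrative sketch: snap each conditional distribution to a grid, so that λ's whose conditionals are within the grid spacing of each other get merged. The function below takes a (|Λ| × |X1|) array of conditionals; it's mine, not anything standard.)

import numpy as np

def discretize_latent(P_x1_given_lam, step=0.1):
    # Λ'_1(λ): index of the grid cell containing P(X1 | Λ=λ).
    # Rounding each coordinate to a multiple of `step` merges λ's whose
    # conditionals are close (roughly within `step` per coordinate).
    cells = [tuple(np.round(row / step).astype(int)) for row in P_x1_given_lam]
    index = {c: i for i, c in enumerate(dict.fromkeys(cells))}
    return np.array([index[c] for c in cells])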

By the data processing inequality, conditions 2 and 3 still hold for Λ′=(Λ′1,Λ′2). Condition 1 should hold with some extra slack depending on the coarseness of the discretization.

It takes a few steps, but I think you might be able to argue that, with high probability, for each X2=x2, the random variable Q1:=P(X1|Λ′1) will be highly concentrated (n.b. I've only worked it through fully in the exact case, and I think it can be translated to the approximate case but I haven't checked). We then invoke the discretization to argue that H(Λ′1|X1) is bounded. The intuition is that the discretization forces nearby probabilities to coincide, so if Q1 is concentrated then it actually has to "collapse" most of its mass onto a few discrete values.

We can then make a similar argument switching the indices to get H(Λ′2|X2) bounded. Finally, maybe applying conditions 2 and 3 we can get H(Λ′1|X2) bounded as well, which then gives a bound on H(Λ|Xi).

I did try feeding this to Gemini but it wasn't able to produce a proof.

[-]J Bostock4mo70

I've been working on the reverse direction: chopping up P[Λ] by clustering the points (treating each distribution as a point in distribution space) given by P[Λ|X=x], optimizing for a deterministic-in-X latent Δ=Δ(X) which minimizes DKL(P[Λ|X]||P[Λ|Δ(X)]).

This definitely separates X1 and X2 to some small error, since we can just use Δ to build a distribution over Λ which should approximately separate X1 and X2.

To show that it's deterministic in X1 (and by symmetry X2) to some small error, I was hoping to use the fact that---given X1---X2 has very little information about Λ, so it's unlikely that P[Λ|X1] is in a different cluster to P[Λ|X1,X2]. This means that P[Δ|X1] would just put most of the weight on the cluster containing P[Λ|X1].

A constructive approach for Δ would be marginally more useful in the long-run, but it's also probably easier to prove things about the optimal Δ. It's also probably easier to prove things about Δ for a given number of clusters |Δ|, but then you also have to prove things about what the optimal value of |Δ| is.

[-]johnswentworth4mo70

Sounds like you've correctly understood the problem and are thinking along roughly the right lines. I expect a deterministic function of X won't work, though.

Hand-wavily: the problem is that, if we take the latent to be a deterministic function Δ(X), then P[X|Δ(X)] has lots of zeros in it - not approximate-zeros, but true zeros. That will tend to blow up the KL-divergences in the approximation conditions.

I'd recommend looking for a function Δ(Λ). Unfortunately that does mean that low entropy of Δ(Λ) given X has to be proven.

[-]Alex Gibson1mo92

I'm confused by this. The KL term we are looking at in the deterministic case is 
DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ]), right?

For simplicity, we imagine we have finite discrete spaces. Then this would blow up if P[X=(x1,x2),Λ=λ]≠0, and P[Λ=λ]P[X1=x1|Λ=λ]P[X2=x2|Λ=λ]=0. But this is impossible, because any of the terms in the product being 0 implies that P[X=(x1,x2),Λ=λ] is 0.

Intuitively, we construct an optimal code for encoding the distribution P[Λ]P[X1|Λ]P[X2|Λ], and the KL divergence measures how many more bits on average we need to encode a message than optimal, if the true distribution is given by P[X,Λ]. Issues occur when the true distribution P[X,Λ] takes on values which never occur according to P[Λ]P[X1|Λ]P[X2|Λ], i.e. the optimal code doesn't account for those values potentially occurring.

Potentially there are subtleties when we have continuous spaces. In any case I'd be grateful if you're able to elaborate.

[-]johnswentworth1mo92

Yeah, I've since updated that deterministic functions are probably the right thing here after all, and I was indeed wrong in exactly the way you're pointing out.

[-]J Bostock4mo41

Huh, I had vaguely considered that but I expected any P[X|Δ(X)]=0 terms to be counterbalanced by P[X,Δ(X)]=0 terms, which together contribute nothing to the KL-divergence. I'll check my intuitions though.

I'm honestly pretty stumped at the moment. The simplest test case I've been using is for X1 and X2 to be two flips of a biased coin, where the bias is known to be either k or 1−k with equal probability of either. As k varies, we want to swap from Δ≅Λ to the trivial case |Δ|=1 and back. This (optimally) happens at around k=0.08 and k=0.92. If we swap there, then the sum of errors for the three diagrams of Δ does remain less than 2(ϵ+ϵ+ϵ) at all times.

Likewise, if we do try to define Δ(X), we need to swap from a Δ which is equal to the number of heads, to |Δ|=1, and back.

In neither case can I find a construction of Δ(X) or Δ(Λ) which swaps from one phase to the other at the right time! My final thought is for Δ to be some mapping Λ→P(Λ) consisting of a ball in probability space of variable radius (no idea how to calculate the radius) which would take k→{k} at k≈1 and k→{k,1−k} at k≈0.5. Or maybe you have to map Λ→P(X) or something like that. But for now I don't even have a construction I can try to prove things for.

Perhaps a constructive approach isn't feasible, which probably means I don't have quite the right skillset to do this.
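
(For reference, a quick sketch of that k ≈ 0.08 crossover in the two-flip test case, taking the error of a latent to be the largest of its three condition errors. For Δ≅Λ the mediation error is exactly zero and the remaining terms are H(Λ|X1) = H(Λ|X2) = Hb(k); for the trivial Δ the determinism terms are zero and the remaining term is I(X1;X2) = 1 − Hb(k² + (1−k)²). Purely illustrative.)

import numpy as np

def Hb(p):
    # binary entropy in bits
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

k = np.linspace(0.001, 0.5, 2000)
err_full = Hb(k)                          # Δ ≅ Λ: error is H(Λ | X1) = H(Λ | X2)
err_trivial = 1 - Hb(k**2 + (1 - k)**2)   # |Δ| = 1: error is I(X1; X2)

crossover = k[np.argmin(np.abs(err_full - err_trivial))]
print("crossover at k ≈", round(float(crossover), 3))   # comes out near 0.08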

[-]J Bostock4mo*40

OK so some further thoughts on this: suppose we instead just partition the values of Λ directly by something like a clustering algorithm, based on DKL in P[X|Λ] space, and take Δ(Λ) to just be the cluster that λ is in:

Assuming we can do it with small clusters, we know that P[X|Λ]≈P[X|Δ] to within a pretty small error, so DKL(P[X]||P[X|Δ]) is also small.

And if we consider X2←X1→Λ, this tells us that learning X1 restricts us to a pretty small region of P[X2] space (since P[X2|X1]≈P[X2|X1,Λ]) so Δ should be approximately deterministic in X1. This second part is more difficult to formalize, though.

Edit: The real issue is whether or not we could have lots of Λ values which produce the same distribution over X2 but different distributions over X1, and all be pretty likely given X1=x1 for some x1. I think this just can't really happen for probable values of x1, because if these values of λ produce the same distribution over X2, but different distributions over X1, then that doesn't satisfy X1←X2→Λ, and secondly because if they produced wildly different distributions over X1, then that means they can't all have high values of P[X1=x1|Λ=λ], and so they're not gonna have high values of P[Λ=λ|X1=x1].

[-]johnswentworth4mo*40

Here's a trick which might be helpful for anybody tackling the problem.

First, note that f(Λ):=(x↦P[X=x|Λ]) is always a sufficient statistic of Λ for X, i.e.

Λ→f(Λ)→X

Now, we typically expect that the lower-order bits of f(Λ) are less relevant/useful/interesting. So, we might hope that we can do some precision cutoff on f(Λ), and end up with an approximate sufficient statistic, while potentially reducing the entropy (or some other information content measure) of f(Λ) a bunch. We'd broadcast the cutoff function like this:

g(Λ):=precision_cutoff(f(Λ))=(x↦precision_cutoff(P[X=x|Λ]))

Now we'll show a trick for deriving DKL bounds involving g(Λ).

First note that

E[DKL(P[X|Λ]||P[X|g(Λ)])]≤E[DKL(P[X|Λ]||g(Λ))]

This is a tricky expression, so let's talk it through. On the left, g(Λ) is treated informationally; it's just a generic random variable constructed as a generic function of Λ, and we condition on that random variable in the usual way. On the right, the output-value of g is being used as a distribution over X.

The reason this inequality holds is because a Bayes update is the "best" update one can make, as measured by expected DKL. Specifically, if I'm given the value of any function g(Λ), then the distribution Q (as a function of g(Λ)) which minimizes E[DKL(P[X|Λ]||Q)] is P[X|g(Λ)]. Since P[X|g(Λ)]  minimizes that expected DKL, any other distribution over X (as a function of g(Λ)) can only do "worse" - including g(Λ) itself, since that's a distribution over X, and is a function of g(Λ).

Plugging in the definition of g, that establishes

E[DKL(P[X|Λ]||P[X|g(Λ)])]≤E[DKL(P[X|Λ]||(x↦precision_cutoff(P[X=x|Λ])))]

Then the final step is to use the properties of whatever precision_cutoff function one chose, to establish that E[DKL(P[X|Λ]||(x↦precision_cutoff(P[X=x|Λ])))] can't be too far from E[DKL(P[X|Λ]||P[X|Λ])], i.e. 0. That produces an upper bound on E[DKL(P[X|Λ]||P[X|g(Λ)])], where the bound is 0 + (whatever terms came from the precision cutoff).
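
(To make the trick concrete, here's a small numerical sketch of the first inequality, with one arbitrary choice of precision_cutoff (snap probabilities to a grid, then renormalize) and a random toy distribution. Everything here is illustrative; nothing is pinned down by the argument above except the inequality itself.)

import numpy as np

rng = np.random.default_rng(0)

# toy joint: 6 values of Λ (uniform prior), 4 values of X
P_lam = np.full(6, 1 / 6)
P_x_given_lam = rng.dirichlet(np.ones(4), size=6)   # row λ: P[X | Λ=λ]

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def precision_cutoff(p, step=0.125):
    # one hypothetical cutoff: snap probabilities to a grid, then renormalize
    q = np.maximum(np.round(p / step) * step, step / 10)
    return q / q.sum()

# g(λ): the cutoff distribution itself, used as a coarse-grained label
g_vals = [tuple(precision_cutoff(row)) for row in P_x_given_lam]

# P[X | g(Λ)]: the Bayes posterior, i.e. the prior-weighted average of
# P[X | Λ=λ] over the λ's that share a g value
label_of = {v: i for i, v in enumerate(dict.fromkeys(g_vals))}
post = np.zeros((len(label_of), 4))
for lam, gv in enumerate(g_vals):
    post[label_of[gv]] += P_lam[lam] * P_x_given_lam[lam]
post /= post.sum(axis=1, keepdims=True)

lhs = sum(P_lam[l] * kl(P_x_given_lam[l], post[label_of[g_vals[l]]]) for l in range(6))
rhs = sum(P_lam[l] * kl(P_x_given_lam[l], np.asarray(g_vals[l])) for l in range(6))
print(f"E[KL(P[X|Λ] || P[X|g(Λ)])] = {lhs:.4f} <= E[KL(P[X|Λ] || g(Λ))] = {rhs:.4f}")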

[-]johnswentworth4mo20

@Alfred Harwood @David Johnston 

If anyone else would like to be tagged in comments like this one on this post, please eyeball-react on this comment. Alfred and David, if you would like to not be tagged in the future, please say so.

[-]Arjun Pitchanathan3mo*30

Epistemic status: Quick dump of something that might be useful to someone. o3 and Opus 4 independently agree on the numerical calculations for the bolded result below, but I didn't check the calculations myself in any detail.

When we say "roughly", e.g. 2ϵ or 3ϵ would be fine; it may be a judgement call on our part if the bound is much larger than that. 

Let X∼Ber(p). With probability r, set Z:=X, and otherwise draw Z∼Ber(p). Let Y∼Ber(1/2). Let A=X⊕Y and B=Y⊕Z. We will investigate latents for (A,B).

Set Λ:=Y, then note that the stochastic error ϵ:=I(A;Y|B) because Y induces perfect conditional independence and symmetry of A and B. Now compute the deterministic errors of Λ:=Y, Λ:=0, Λ:=A, which are equal to H(Y∣A),I(A;B),H(A|B) respectively.

Then it turns out that with p:=0.9,r:=0.44, all of these latents have error greater than 5ϵ, if you believe this claude opus 4 artifact (full chat here, corroboration by o3 here). Conditional on there not being some other kind of latent that gets better deterministic error, and the calculations being correct, I would expect that a bit more fiddling around could produce much better bounds, say 10ϵ or more, since I think I've explored very little of the search space. 

e.g. one could create more As and Bs by either adding more Ys, or more Xs and Zs. Or one could pick the probabilities p,r out of some discrete set of possibilities instead of having them be fixed.

[-]johnswentworth2mo53

Set Λ:=Y, then note that the stochastic error ϵ:=I(A;Y|B) because Y induces perfect conditional independence and symmetry of A and B.

I don't think Y induces perfect conditional independence? Conditional on Y, we have:

  • (Probability r) A = B, else
  • (Probability 1 - r) A and B are independent

... which means that learning the value of A tells me something about the value of B, conditional on Y (specifically, B is more likely to have the same value A had).

Am I missing something here?

(Also, for purposes of me tracking how useful LLMs are for research: assuming I'm not missing something and this was a mistake, was the mistake originally made by you or an LLM?)

[-]Arjun Pitchanathan1mo40

Yeah, my comment went through a few different versions and that statement doesn't apply to the final setting. I should've checked it better before hitting submit, sorry. I only used LLMs for writing code for numerical calculations, so the error is mine. [1]

I think that I didn't actually use this claim in the numerical calculations, so I'd hope that the rest of the comment continues to hold. I had hoped to verify that before replying, but given that it's been two weeks already, I don't know when I'll manage to get to it.

  1. ^

    I did try to see if it could write a message explaining the claim, but didn't use that

[-]Arjun Pitchanathan1mo*30

To check my understanding: for random variables A,B, the stochastic error of a latent Λ is the maximum among  I(A;B|Λ),I(A;Λ|B),I(B;Λ|A). The deterministic error is the maximum among I(A;B∣Λ),H(Λ|A),H(Λ|B). If so, the claim in my original comment holds -- I also wrote code (manually) to verify.  Here's the fixed claim:

Let X∼Ber(p). With probability r, set Z:=X, and otherwise draw Z∼Ber(p). Let Y∼Ber(1/2). Let A=X⊕Y and B=Y⊕Z. We will investigate latents for (A,B). Let ϵ be the stochastic error of latent Λ:=Y. Now compute the deterministic errors of each of the latents X, Y, Z, A, B, A⊕B, X⊕Y⊕Z. Then for p:=0.9,r:=0.44, all of these latents have deterministic error greater than 5ϵ.

It should be easy to modify the code to consider other latents. I haven't thought much about proving that there aren't any other latents better than these, though.
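
(A from-scratch sketch of that check for the latent Λ:=Y, using the error definitions above; this is not Arjun's code, just an independent reimplementation of the same calculation. Extending to the other latents just means building the joint over (A, B, that latent) instead.)

import itertools
import numpy as np

def joint_ABL(p, r):
    # joint P[A, B, Λ=Y] for the construction above (axes are A, B, Y; values 0/1)
    P = np.zeros((2, 2, 2))
    for x, y, z_new, copy in itertools.product((0, 1), repeat=4):
        z = x if copy else z_new          # copy=1: Z := X (prob r); else Z = z_new ~ Ber(p)
        pr = ((p if x else 1 - p) * 0.5 * (r if copy else 1 - r)
              * (p if z_new else 1 - p))
        P[x ^ y, y ^ z, y] += pr
    return P

def H(P):
    P = P[P > 0]
    return float(-(P * np.log2(P)).sum())

def cond_MI(P, i, j, k):
    # I(axis i ; axis j | axis k) = H(i,k) + H(j,k) - H(i,j,k) - H(k)
    rest = {0, 1, 2}
    return (H(P.sum(axis=(rest - {i, k}).pop())) + H(P.sum(axis=(rest - {j, k}).pop()))
            - H(P) - H(P.sum(axis=tuple(rest - {k}))))

def cond_H(P, i, k):
    # H(axis i | axis k) = H(i,k) - H(k)
    rest = {0, 1, 2}
    return H(P.sum(axis=(rest - {i, k}).pop())) - H(P.sum(axis=tuple(rest - {k})))

P = joint_ABL(p=0.9, r=0.44)              # axes: (A, B, Λ=Y)
stoch = max(cond_MI(P, 0, 1, 2), cond_MI(P, 0, 2, 1), cond_MI(P, 1, 2, 0))
det_Y = max(cond_MI(P, 0, 1, 2), cond_H(P, 2, 0), cond_H(P, 2, 1))
print("stochastic error of Y:   ", round(stoch, 4), "bits")
print("deterministic error of Y:", round(det_Y, 4), "bits; ratio ≈", round(det_Y / stoch, 2))
# per the parent comment, this ratio should come out above 5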

[-]Arjun Pitchanathan1mo*30

On this particular example you can achieve deterministic error ≈2.5ϵ with latent A∧B, but it seems easy to find other examples with ratio > 5 (including over latents A∧B,A∨B) in the space of distributions over (X,Y,Z) with a random-restart hill-climb. Anyway, my takeaway is that if you think you can derandomize latents in general you should probably try to derandomize the latent Λ:=Y for variables A:=X⊕Y, B:=Z⊕Y for distributions over boolean variables X,Y,Z.

(edited to fix typo in definition of B)

[-]Arjun Pitchanathan1mo*30

My impression is that prior discussion focused on discretizing Λ. Λ is already boolean here, so if the hypothesis is true then it's for a different reason.

[-]David Johnston4mo*10

Your natural latents seem to be quite related to the common construction of IID variables conditional on a latent - in fact, all of your examples are IID variables (or "bundles" of IID variables) conditional on that latent. Can you give me an interesting example of a natural latent that is not basically the conditionally IID case?

(I was wondering if the extensive literature on the correspondence between De Finetti type symmetries and conditional IID representations is of any help to your problem. I'm not entirely sure if it is, given that mostly addresses the issue of getting from a symmetry to a conditional independence, whereas you want to get from one conditional independence to another, but it's plausible some of the methods are applicable)

[-]johnswentworth4mo20

A natural latent is, by definition, a latent which satisfies two properties. The first is that the observables are IID conditional on the latent, i.e. the common construction you're talking about. That property by itself doesn't buy us much of interest, for our purposes, but in combination with the other property required for a natural latent, it buys quite a lot.

[-]David Johnston4mo65

Wait, I thought the first property was just independence, not also identically distributed.

In principle I could have e.g. two biased coins with their biases different but deterministically dependent.

[-]johnswentworth4mo40

Oh, you're right. Man, I was really not paying attention before bed last night! Apologies, you deserve somewhat less tired-brain responses than that.


[EDIT Aug 22 2025: This bounty is now closed! We have solved it ourselves and will accept a $500 bonus for doing so.]


Our posts on natural latents have involved two distinct definitions, which we call "stochastic" and "deterministic" natural latents. We conjecture that, whenever there exists a stochastic natural latent (to within some approximation), there also exists a deterministic natural latent (to within a comparable approximation). We are offering a $500 bounty to prove this conjecture.

Some Intuition From The Exact Case

In the exact case, in order for a natural latent to exist over random variables X1,X2, the distribution has to look roughly like this:

Each value of X1 and each value of X2 occurs in only one "block", and within the "blocks", X1 and X2 are independent. In that case, we can take the (exact) natural latent to be a block label.

Notably, that block label is a deterministic function of X.

However, we can also construct other natural latents for this system: we simply append some independent random noise to the block label. That natural latent is not a deterministic function of X; it's a "stochastic" natural latent.

In the exact case, if a stochastic natural latent exists, then the distribution must have the form pictured above, and therefore the block label is a deterministic natural latent. In other words: in the exact case, if a stochastic natural latent exists, then a deterministic natural latent also exists. The goal of the $500 bounty is to prove that this still holds in the approximate case.

Approximation Adds Qualitatively New Behavior

If you want to tackle the bounty problem, you should probably consider a distribution like this:

😏



Distributions like this can have approximate natural latents, while being qualitatively different from the exact picture. A concrete example is the biased die: X1 and X2 are each 500 rolls of a biased die of unknown bias (with some reasonable prior on the bias). The bias itself will typically be an approximate stochastic natural latent, but the lower-order bits of the bias are not approximately deterministic given X (i.e. they have high entropy given X).

The Problem

"Stochastic" Natural Latents

Stochastic natural latents were introduced in the original Natural Latents post. Any latent Λ over random variables X1,X2 is defined to be a stochastic natural latent when it satisfies these diagrams:

... and Λ is an approximate stochastic natural latent (with error ϵ) when it satisfies the approximate versions of those diagrams to within ϵ, i.e.

ϵ≥DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ])

ϵ≥DKL(P[X,Λ]||P[X2]P[X1|X2]P[Λ|X1])

ϵ≥DKL(P[X,Λ]||P[X1]P[X2|X1]P[Λ|X2])

Key thing to note: if Λ satisfies these conditions, then we can create another stochastic natural latent Λ′ by simply appending some random noise to Λ, independent of X. This shows that Λ can, in general, contain arbitrary amounts of irrelevant noise while still satisfying the stochastic natural latent conditions.

"Deterministic" Natural Latents

Deterministic natural latents were introduced in a post by the same name. Any latent Λ over random variables X1,X2 is defined to be a deterministic natural latent when it satisfies these diagrams:

... and Λ is an approximate deterministic natural latent (with error ϵ) when it satisfies the approximate versions of those diagrams to within ϵ, i.e.

ϵ≥DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ])

ϵ≥H(Λ|X1)

ϵ≥H(Λ|X2)

See the linked post for an explanation of a variable appearing multiple times in a diagram, and how the approximation conditions for those diagrams simplify to entropy bounds.

Note that the deterministic natural latent conditions, either with or without approximation, imply the stochastic natural latent conditions; a deterministic natural latent is also a stochastic natural latent.
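
For concreteness, here is a sketch of how one might compute both sets of approximation errors for a small discrete joint distribution, along with the noise-appended block label from the intuition section as a worked example. The function and example are illustrative only; they are not code from the linked posts.

import numpy as np

def natural_latent_errors(P):
    # P[x1, x2, lam]: joint distribution as an array of shape (|X1|, |X2|, |Λ|).
    # Returns the three stochastic-condition errors (KL divergences, in bits)
    # and the two determinism errors H(Λ|X1), H(Λ|X2).
    P = P / P.sum()
    P_x1x2 = P.sum(axis=2)
    P_lam = P.sum(axis=(0, 1))
    P_x1lam, P_x2lam = P.sum(axis=1), P.sum(axis=0)
    P_x1, P_x2 = P_x1lam.sum(axis=1), P_x2lam.sum(axis=1)

    def kl(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

    def safe(a):                      # avoid 0/0; numerators are 0 wherever these are 0
        return np.where(a > 0, a, 1.0)

    # mediation: KL(P[X,Λ] || P[Λ] P[X1|Λ] P[X2|Λ])
    Q_med = P_x1lam[:, None, :] * P_x2lam[None, :, :] / safe(P_lam)[None, None, :]
    # redundancy: KL(P[X,Λ] || P[X1,X2] P[Λ|X1]), and the X2 version
    Q_red1 = P_x1x2[:, :, None] * (P_x1lam / safe(P_x1)[:, None])[:, None, :]
    Q_red2 = P_x1x2[:, :, None] * (P_x2lam / safe(P_x2)[:, None])[None, :, :]

    def cond_H(joint):                # H(Λ | X) from a joint P(X, Λ)
        post = joint / safe(joint.sum(axis=1, keepdims=True))
        logs = np.where(post > 0, np.log2(post), 0.0)
        return float(-(joint * logs).sum())

    stochastic = [kl(P, Q_med), kl(P, Q_red1), kl(P, Q_red2)]
    deterministic = [cond_H(P_x1lam), cond_H(P_x2lam)]
    return stochastic, deterministic

# Example: exact block structure with an independent noise bit appended to the block label.
# X1, X2 ∈ {0,1,2,3}; block 0 = {0,1}, block 1 = {2,3}; Λ = (block, noise bit).
P = np.zeros((4, 4, 4))
for block in (0, 1):
    for noise in (0, 1):
        for x1 in (2 * block, 2 * block + 1):
            for x2 in (2 * block, 2 * block + 1):
                P[x1, x2, 2 * block + noise] = 1 / 16
stoch, det = natural_latent_errors(P)
print("stochastic errors:", np.round(stoch, 6))    # ≈ [0, 0, 0]
print("H(Λ|X1), H(Λ|X2):", np.round(det, 6))       # ≈ [1, 1] bit: the appended noise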

Also note that one can instead define an approximate deterministic natural latent via just one diagram, and this is also a fine starting point for purposes of this bounty:

What We Want For The Bounty

We'd like a proof that, if a stochastic natural latent exists over two variables X1,X2 to within approximation ϵ, then a deterministic natural latent exists over those two variables to within approximation roughly ϵ. When we say "roughly", e.g. 2ϵ or 3ϵ would be fine; it may be a judgement call on our part if the bound is much larger than that. 

We're probably not interested in bounds which don't scale to zero as ϵ goes to zero, though we could maybe make an exception if e.g. there's some way of amortizing costs across many systems such that costs go to zero-per-system in aggregate (though we don't expect the problem to require those sorts of tricks).

Bounds should be global, i.e. apply even when ϵ is large. We're not interested in e.g. first or second order approximations for small ϵ unless they provably apply globally.

We might also award some fraction of the bounty for a counterexample. That would be much more of a judgement call, depending on how thoroughly the counterexample kills hope of any conjecture vaguely along these lines.

In terms of rigor and allowable assumptions, roughly the level of rigor and assumptions in the posts linked above is fine.

Why We Want This

Deterministic natural latents are a lot cleaner both conceptually and mathematically than stochastic natural latents. Alas, they're less general... unless this conjecture turns out to be true, in which case they're not less general. That sure would be nice.

Mentioned in
(∃ Stochastic Natural Latent) Implies (∃ Deterministic Natural Latent)
$500 + $500 Bounty Problem: Does An (Approximately) Deterministic Maximal Redund Always Exist?
Apply for the 2025 Dovetail fellowship