$500 + $500 Bounty Problem: Does An (Approximately) Deterministic Maximal Redund Always Exist?

by johnswentworth, David Lorell
6th May 2025
3 min read
16 comments, sorted by top scoring
[-] quetzal_rainbow · 4mo

I don't know how obvious this is for people who want to try for the bounty, but I only now realized that you can express the criteria for a redund as inequalities with mutual information, and I find mutual information much nicer to work with, if only for convenience of notation. Proof:

Take the criterion for a redund $\Gamma$ w.r.t. $X_1, X_2$:

$$\epsilon \ge D_{KL}\big(P[X_1,X_2,\Gamma] \,\big\|\, P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]\big)$$

expand the expression for the KL divergence:

$$D_{KL}\big(P[X_1,X_2,\Gamma] \,\big\|\, P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]\big) = \sum_{x_1,x_2,\gamma} P(x_1,x_2,\gamma)\,\log\frac{P(x_1,x_2,\gamma)}{P(x_1)\,P(x_2|x_1)\,P(\gamma|x_2)}$$

expand the joint distribution:

$$D_{KL}\big(P[X_1,X_2,\Gamma] \,\big\|\, P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]\big) = \sum_{x_1,x_2,\gamma} P(x_1,x_2,\gamma)\,\log\frac{P(x_1)\,P(x_2|x_1)\,P(\gamma|x_1,x_2)}{P(x_1)\,P(x_2|x_1)\,P(\gamma|x_2)}$$

simplify:

$$D_{KL}\big(P[X_1,X_2,\Gamma] \,\big\|\, P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]\big) = \sum_{x_1,x_2,\gamma} P(x_1,x_2,\gamma)\,\log\frac{P(\gamma|x_1,x_2)}{P(\gamma|x_2)}$$

which is a conditional mutual information:

$$\sum_{x_1,x_2,\gamma} P(x_1,x_2,\gamma)\,\log\frac{P(\gamma|x_1,x_2)}{P(\gamma|x_2)} = I(X_1;\Gamma|X_2)$$

which results in:

$$D_{KL}\big(P[X_1,X_2,\Gamma] \,\big\|\, P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]\big) = I(X_1;\Gamma|X_2) \le \epsilon$$
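For anyone who wants to sanity-check this identity numerically, here's a quick sketch (Python/numpy; the alphabet sizes and the random joint distribution are arbitrary choices, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary strictly-positive joint distribution P(x1, x2, gamma).
P = rng.random((3, 4, 2))
P /= P.sum()

P_x2   = P.sum(axis=(0, 2))   # P(x2)
P_x1x2 = P.sum(axis=2)        # P(x1, x2)
P_x2g  = P.sum(axis=0)        # P(x2, gamma)

# Factorized distribution Q(x1, x2, gamma) = P(x1) P(x2|x1) P(gamma|x2),
# which equals P(x1, x2) * P(gamma|x2).
Q = P_x1x2[:, :, None] * (P_x2g / P_x2[:, None])[None, :, :]
dkl = np.sum(P * np.log(P / Q))

# Conditional mutual information I(X1; Gamma | X2), computed directly.
cmi = np.sum(P * np.log(P * P_x2[None, :, None]
                        / (P_x1x2[:, :, None] * P_x2g[None, :, :])))

assert np.isclose(dkl, cmi)   # the two quantities agree
```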

[-] johnswentworth · 4mo

Yup, and you can do that with all the diagrams in this particular post. In general, the $D_{KL}$ error for any of the diagrams $X \to Y \to Z$, $X \leftarrow Y \leftarrow Z$, or $X \leftarrow Y \to Z$ is $I(X;Z|Y)$, for any random variables $X, Y, Z$.

Notably, this includes as a special case the "approximately deterministic function" diagram $X \leftarrow Y \to X$; the error on that one is $I(X;X|Y)$, which is $H(X|Y)$.
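A quick numerical check of that special case (a sketch; the joint distribution of $X$ and $Y$ here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary joint distribution P(x, y).
P_xy = rng.random((3, 5))
P_xy /= P_xy.sum()
P_y = P_xy.sum(axis=0)

# H(X|Y), computed directly.
H = -np.sum(P_xy * np.log(P_xy / P_y[None, :]))

# I(X; X'|Y) where X' is a perfect copy of X: in the joint P(x, x', y),
# only diagonal terms x == x' are nonzero, each equal to P(x, y).
cmi = sum(p * np.log(p * P_y[y] / (p * p))
          for (x, y), p in np.ndenumerate(P_xy) if p > 0)

assert np.isclose(H, cmi)   # I(X;X|Y) == H(X|Y)
```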

[-] Gurkenglas · 4mo

Then how do you choose between the three directed representatives? Is there some connotation, some coming-apart of concepts that would become apparent after generalization, or did you just pick $X \leftarrow Y \to X$ because it's symmetric?

[-] johnswentworth · 4mo

Good question. The three directed graphical representations directly expand to different but equivalent expressions, and sometimes we want to think in terms of one of those expressions over another.

The most common place this comes up for us: sometimes we want to think about a latent $\Gamma$ in terms of its defining conditional distribution $P[\Gamma|X]$. When doing that, we try to write diagrams with $\Gamma$ downstream, e.g. $X_1 \to X_2 \to \Gamma$. Other times, we want to think of $\Gamma$ in terms of the distribution $P[X|\Gamma]$. When thinking that way, we try to write diagrams with $\Gamma$ upstream, e.g. $\Gamma \to X_1 \to X_2$.

In the case of $X \leftarrow Y \to X$, the natural information-theoretic expression for its error is $H(X|Y)$, or equivalently $I(X;X|Y)$. Those condition on $Y$, so we want $Y$ "upstream" in order to most directly express those quantities. You could derive the same errors from $X \leftarrow Y \leftarrow X$ or $X \to Y \to X$, but if you write out those $D_{KL}$ errors and try to reduce them to $H(X|Y)$, you'll probably find that it takes more steps.

[-] Gurkenglas · 4mo

If the expressions cease to be equivalent in some natural generalization of this setting, then I recommend that you try to find a proof there, because the proof space should be narrower and thus easier to search.

[-] David Lorell · 4mo

Note to bounty hunters, since it's come up twice: an "approximately deterministic function of $X$" is one where, conditional on $X$, the entropy of $F(X)$ is very small, though possibly nonzero. I.e., you have very nearly pinned down the value that $F(X)$ takes on once you learn $X$. For conceptual intuition on the graphical representation: $X$ approximately mediating between $F(X)$ and $F(X)$ (two random variables which always take on the same value as each other) means, as always, that there is approximately no further update on the value of either given the other (despite them being perfectly informative about each other) if one already knows the value of $X$. See this for more.
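To illustrate with a toy example (my own construction, not from the post): take an $F(X)$ that equals a fixed deterministic $f(X)$ except for a small chance of uniform noise, and check that $H(F(X)|X)$ is small but nonzero.

```python
import numpy as np

rng = np.random.default_rng(2)

# F(X) equals f(X) with probability 1 - delta, else a uniform draw:
# an "approximately deterministic" (stochastic) function of X.
n_x, n_f, delta = 8, 3, 0.01
f = rng.integers(n_f, size=n_x)               # underlying deterministic map
P_f_given_x = np.full((n_x, n_f), delta / n_f)
P_f_given_x[np.arange(n_x), f] += 1 - delta

# H(F(X) | X) under a uniform distribution on X.
P_x = np.full(n_x, 1 / n_x)
H_cond = -np.sum(P_x[:, None] * P_f_given_x * np.log(P_f_given_x))
print(H_cond)   # ~0.045 nats: F(X) is very nearly pinned down by X
```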

[-] johnswentworth · 4mo

(Note that F, in David's notation here, is a stochastic function in general.)

[-] Thane Ruthenis · 3mo

So, I ran into basically the same problem; it turns out this setup is a pretty straightforward generalization of Gács–Körner common information. I've spent a while trying to solve it, and I'm currently pessimistic that it's easily tractable. There are some outside-view reasons to think that, too: my understanding is that the field of information theory has been trying to crack the true name of multivariate redundant information for a while now, and all they have are special cases (Gács–Körner) and adhockery. This isn't exactly the same problem, but I expect it runs into the same convergent subproblems...

But maximal redunds are pretty central to the whole abstraction business, so getting at least some confirmation seems prudent. Have you tried running a dirty-but-practical empirical check on this? Basic idea:

  1. Set up a wide variety of sets of random variables $X = \{X_1, \ldots, X_n\}$.
  2. For each set, train a function $\Omega$ whose entropy is maximized under $\forall i: I(X_{-i};\Omega|X_i) \le \epsilon_2$ and $H(\Omega|X) \le \epsilon_{\mathrm{det}}$.
  3. Train sets of functions $R_j$, constrained to be approximately deterministic redunds, with the following objectives (either separate sets, or mix-and-match those objectives):
    1. Minimize $\sum_k I(R_j;R_k)$.
    2. Minimize $I(R_j;\Omega)$.
  4. Check whether $\Omega$ still manages to approximately screen off all $R_j$ from $X$.

It seems plausible that the distributions underlying real-world abstractions won't be more exotic than what we can contrive to set up in (1), and this should suffice to safeguard against the possibility that the intuitive idea of a maximal redund diverges from reality.
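A minimal sketch of the quantities such a check would need to evaluate, assuming discrete variables, $n = 2$, and a tabular conditional $P(\Omega|X)$ (the parametrization and names are mine; the actual optimization loop, e.g. gradient ascent on softmax logits, is omitted):

```python
import numpy as np

def h(P):
    """Shannon entropy (nats) of a probability array, with 0 log 0 = 0."""
    p = P[P > 0]
    return -np.sum(p * np.log(p))

def redund_scores(P_x, P_w_given_x):
    """Objective and constraints from step 2 above, for n = 2 variables.

    P_x         -- joint P(x1, x2), shape (n1, n2)
    P_w_given_x -- candidate P(omega | x1, x2), shape (n1, n2, nw)
    Returns H(Omega), [I(X2;Omega|X1), I(X1;Omega|X2)], and H(Omega|X).
    """
    P = P_x[:, :, None] * P_w_given_x          # joint P(x1, x2, omega)
    H_w_given_x = h(P) - h(P_x)                # H(Omega|X)
    I_redund = [
        (h(P.sum(axis=1)) - h(P_x.sum(axis=1))) - H_w_given_x,  # I(X2;W|X1)
        (h(P.sum(axis=0)) - h(P_x.sum(axis=0))) - H_w_given_x,  # I(X1;W|X2)
    ]
    return h(P.sum(axis=(0, 1))), I_redund, H_w_given_x

# Example: Omega = X1 exactly, over two independent fair coins. Omega is
# deterministic (H(Omega|X) = 0) but fails the redund constraint, since
# I(X1;Omega|X2) = ln 2.
P_x = np.full((2, 2), 0.25)
P_w_given_x = np.zeros((2, 2, 2))
P_w_given_x[np.arange(2), :, np.arange(2)] = 1.0
print(redund_scores(P_x, P_w_given_x))
```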

[-] Thane Ruthenis · 3mo

Actually, never mind: I think I see a counterexample to the exact formulation in the OP. I haven't fully formalized it, so there may be holes, but here it is:

Imagine a "toxic piece of shared information": some type of information such that it inextricably ties together some shared bits of information S with unique bits of information U, making it impossible to define a variable that contains only the shared bits from it. If all variables in a set X are "toxic" in this way, then any redund R, even ones that straight-up copy one of the Xj, would necessarily have I(R;X−i|Xi)≥U for all Xi except Xj.

Suppose we fix a specific $\epsilon_2$, i.e., state that only variables with fewer unique bits than that qualify as redunds.

Now consider a set of variables $A^n$, each variable $A^n_i$ of which is a set of $n$ independent clones of toxic-information-containing variables $\{X^1_i, \ldots, X^n_i\}$. Any variable which is a redund over only one set of clones is a valid redund over $A^n$, and so there are $n$ redunds like this, all mutually independent.

What would a maximal redund $\Omega^n$ look like? Well, to properly screen all other redunds off, it would need to include most of the information from each individual independent redund, and so its $I(\Omega^n; A^n_{-i} | A^n_i)$ would scale with $n$ (probably roughly equaling $nU$?). We can then set $n$ arbitrarily high, so that this quantity shoots well past any fixed $\epsilon_2$. And if we discard some $D$ bits from each redund to try and manage this, this would increasingly work against our ability to screen off redunds from $A^n_{-i}$, and we can again scale $n$ to make any $\Omega^n$ that tries to keep its $I(\Omega^n; A^n_{-i} | A^n_i)$ under control fail arbitrarily hard at the screen-off task for all-but-a-few redunds.

Concrete example: I think a binary symmetric channel should work here? I.e., set $X = \{X_1, X_2\}$, where $X_2 = \mathrm{XOR}(X_1, Z)$, $X_1$ is a fair coin, and $Z$ is a biased coin with bias $p$. We get $H(X_1) = H(X_2) = 1$, $I(X_1;X_2) = 1 + p\log_2 p + (1-p)\log_2(1-p) = K$, and $H(X_1|X_2) = H(X_2|X_1) = 1 - K$. Then $R = X_1$ has $I(X_1;X_2|R) = 0$, so it contains $K$ shared bits, and $I(R;X_1|X_2) = 1 - K$, so it qualifies as a redund if we set $\epsilon_2 \ge 1 - K$. Then, per the construction above, we design a set whose variables are sets of those basic variables' independent clones, and watch the maximal redund explode.
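A numerical sketch of this example (assuming $p = 0.1$; entropies in bits), checking the claimed quantities:

```python
import numpy as np

p = 0.1                                        # bias of the noise coin Z
Hb = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))   # binary entropy of Z

# X1 fair coin, X2 = XOR(X1, Z): joint P(x1, x2).
P = np.array([[(1 - p) / 2, p / 2],
              [p / 2, (1 - p) / 2]])
P1, P2 = P.sum(axis=1), P.sum(axis=0)

H1 = -(P1 * np.log2(P1)).sum()                 # H(X1) = 1
I = (P * np.log2(P / np.outer(P1, P2))).sum()  # I(X1;X2) = 1 - Hb = K
H_1g2 = H1 - I                                 # H(X1|X2) = Hb = 1 - K

# R = X1: I(X1;X2|R) = 0 since R pins down X1, while
# I(R;X1|X2) = H(X1|X2) = 1 - K, so R qualifies iff eps2 >= 1 - K.
print(I, H_1g2, 1 - Hb)
```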

The subtle caveat here is that I think we need to prove that, for any variable $\Omega$, decreasing the amount of unique information (i.e., driving $I(\Omega;X_1|X_2)$ down) trades off against shared information (i.e., $I(X_1;X_2|\Omega)$ ends up increasing), which leads to $I(X_1;R|\Omega)$ growing. Specifically, we need to prove that this happens at some nontrivial, non-asymptoting rate that leads $\Omega^n$'s $\epsilon_1$ in the cloned construction to scale with $n$, well past any small constant multiple of $\epsilon_2$.

The counter here would be a continuum of $\Omega$s which can drive $I(\Omega;X_1|X_2)$ arbitrarily low while the accompanying increase in $I(X_1;X_2|\Omega)$ flatlines. The full counter would prove that this is possible in the general case, not only in the BSC setting.

That it's impossible seems intuitively obvious (consider if that worked, what possibilities it'd unlock! the universe isn't that kind), but maybe annoying to prove and the result wouldn't be very exciting, so it's a low priority to me.[1] Still, throwing this out here in case you find this reasoning convincing.

(My guess is that "maximal redunds are possible for any sets of random variables" is too strong and we need to explicitly limit it to sets of variables that are approximately structured as sparse hierarchical graphs or something. This is probably okay, see my previous thoughts about the universe's physics selecting for that.)

  1. ^

    Here's o3's crack at it if you want a starting point. I've found it's sometimes helpful, but I'm currently fatigued from trying to paranoidally spot the place where it's trying to slip crap past me, so I've not verified it. Looks right at a glance, if you ignore some weird parts.

[-] Thane Ruthenis · 3mo

Some more spitballing regarding how to fix this:

One thing here is that, in the relative sense, $\Omega^n$ isn't any more "guilty" of caring about specific variables than any one $X^j_i$-copying $R$, given that it captures much more shared information about the $A^n_i$s than those $R$s do. So, to suit, maybe we should allow $\epsilon_3$ to scale with the amount of shared information in the max-redund? Which would be, uhh, $SH(\Omega) = H(\Omega) - H(\Omega|X) - \max_i I(\Omega;X_{-i}|X_i)$ or something.

Generalizing, maybe it should be "permitted" for redunds to contain more unique information the more shared information they have? Which would make the tuning parameter for "how pure of a redund $R$ is" not $\max_i I(R;X_{-i}|X_i)$ (aka your $\epsilon_2$), but $\frac{SH(R)}{H(R)}$, i.e., the fraction of the redund's total entropy[1] that is shared information. That makes intuitive sense to me.

  1. ^

Or its $X$-relevant entropy? Subtract $H(R|X)$ from the denominator too, then.

[-] XelaP · 4mo

A minor request: can you link "approximate deterministic functions" to some of your posts about them? I know you've written on them, and elaborated on them in more detail elsewhere.

A couple questions:

  • Can you elaborate more on why you care? Preferably, at a level of elaboration/succinctness that's above the little you've said in the post, and below "read the linked posts and my other posts". I feel that knowing why you care will help with understanding what the theorem really means. I am probably going to read a fair amount of those posts anyways.

Edit: It is clear now why you care; however, I still believe there may be people who would benefit from some elaboration in the post. For anyone wondering, the reason to care is in the first theorem of Natural Latents: The Math. For an easy special case of the ideas, see the extremely-easy-to-understand post on (approximately) deterministic natural latents, and note that by the "Dangly Bits Lemma" the redund condition is implied by the approximately-determined condition.

For the intuition behind what the redund condition means, note that by the rerooting rule the diagram X -> Y -> R is the same as (as in, equivalently constrains the probability distribution as) X <- Y -> R, which means that once you know Y, being told X doesn't help you predict R. The redund condition is that, plus the version with X, Y swapped. This makes it clear why it captures "R is redundantly encoded in X and Y".

  • It sounds like you strongly expect the theorem to be true. How hard do you expect it to be? Does it feel like the sort of thing that can be done using basically the tools you've used?

  • How hard have you tried?

  • Do you have examples of the theorem? That is, do you have some variables X with some approximate redund Omega and a bunch of redunds Gamma such that [statements] hold? Or, better, do you have some example where you can find a universal redund Omega?

There are some reasons I can think of for why you'd perhaps want to explicitly refuse to answer some of those questions: to prevent anchoring when a blind mind might do better, to not reveal info that might (if the info turns out one way, assuming truthtelling) discourage attempts (like saying it's super hard and you tried really hard), and to uphold a general policy of not revealing such info (e.g. perhaps you think this theorem is super easy and you haven't spent 5 minutes on it, but plan to put bounties on other theorems that are harder).

For example, part of my inclination to try this is my sense that, compared to most conjectures, it's

  • Incredibly low background - I've known the requisite probability/information theory for a long time at this point
  • A precise, single theorem is asked for. Not as precise as could be, given the freedom of "reasonable", but compared to stuff like "figure out a theory of how blank works", it's concrete and small.
  • Very little effort has been put into it. Of the conjectures I know enough to understand, no other has had only what looks like a single-digit number of people working on it, for not very long.
  • It intuitively seems like the sort of theorem that could be proven by some random person on the internet who has general mathematical ability ("maturity"/"proof ability", the mostly tacit knowledge) but not a bunch of theoretical knowledge, and perhaps not even very much raw-computing-power smarts (but probably high creativity, that spark of genius at the core of discovery and invention, which can partially trade off against a lower chance of success).
[-] johnswentworth · 4mo

Good questions. I apparently took too long to answer them, as it sounds like you've mostly found the answers yourself (well done!).

"Approximately deterministic functions" and their diagrammatic expression were indeed in Deterministic Natual Latents; I don't think we've talked about them much elsewhere. So you're not missing any other important source there.

On why we care, a recent central example which you might not have seen is our Iliad paper. The abstract reads:

Suppose two Bayesian agents each learn a generative model of the same environment. We will assume the two have converged on the predictive distribution (i.e. distribution over some observables in the environment), but may have different generative models containing different latent variables. Under what conditions can one agent guarantee that their latents can be faithfully expressed in terms of the other agent’s latents?

We give simple conditions under which such translation is guaranteed to be possible: the natural latent conditions. We also show that, absent further constraints, these are the most general conditions under which translatability is guaranteed.

(Bold added for emphasis.) That "faithfully expressed in terms of" part is key here. In the paper, we operationalize that as "Under what conditions can one agent guarantee that their latents are (approximately) a stochastic function of the other agent's latents?". But that requires dragging in stochastic functions, which are conceptually messier than plain old functions. As a result, there's a whole section in the paper called "Under what model?", and basically all the theorems require an annoying "independent latents" assumption (which basically says that all the interactions between the two agents' latents route through the environment).

We could cut all that conceptual mess and all those annoying assumptions out if we instead use (approximately) deterministic natural latents. Then, the core question would be operationalized as "Under what conditions can one agent guarantee that their latents are (approximately) a function of the other agent's latents?". Then it's just plain old functions, we don't need to worry conceptually about the "under what model?" question (because there's approximately only one way to put all the latents in one model), and the "independent latents" conditions can all be removed (because different deterministic functions of X are always independent given X). All the messiest and least-compelling parts of the paper could be dropped.

... but the cost would be generality. Stochastic functions are more general than functions, and much more relevant to Bayesians' world models. So narrowing from stochastic functions to functions would probably mean "assuming our way out of reality"... unless it turns out that (Approximately) Deterministic Natural Latents Are All You Need, which was the subject of our previous bounty post.

The conjecture in this bounty post is even stronger; it would imply that one, and therefore also allow us to simplify the paper a lot.

[-] Oliver Sourbut · 4mo

nit: I found your graphical quantifiers and implications a bit confusing to read. It's the middle condition that looks weird.

If I've understood, I think you want (given a set of $X_i$, there exists $\Omega$ with all of):

$$\forall i.\ [X_{\bar{i}} \to X_i \to \Omega]$$

$$\forall \Gamma.\ \big(\forall i.\ [X_{\bar{i}} \to X_i \to \Gamma]\big) \implies [X \to \Omega \to \Gamma]$$

$$X (\to \Omega) \to \Omega$$

(the third line is just my crude latex rendition of your third diagram)

Is that order of quantifiers and precedence correct?

In words, you want an Ω:

  1. which is a redund over the $X_i$
  2. where any $\Gamma$ which is a redund over the $X_i$ is screened from $X$ by $\Omega$
  3. which is approximately deterministic on $X$

An unverified intuition (I'm unlikely to work on this further): would the joint distribution of all candidate $\Gamma$s work as $\Omega$? It screens off the $X$ from any given $\Gamma$, right (90% confidence)? And it is itself a redund over the $X$, I think (again 90%-ish confidence)? I don't really grok what it means to be approximately deterministic, but it feels like this should be? Overall confidence 30%-ish, deflating for outside-view and definedness reasons.

[-] johnswentworth · 4mo

Yup, that's exactly the right idea, and indeed constructing Ω by just taking a tuple of all the redund Γ's (or a sufficient statistic for that tuple) is a natural starting point which works straightforwardly in the exact case. In the approximate case, the construction needs to be modified - for instance, regarding the third condition, that tuple of Γ's (or a sufficient statistic for it) will have much more entropy conditional on X than any individual Γ.

[-] quetzal_rainbow · 3mo

Probably (I didn't dig deep yet) relevant impossibility result.

[-] Thane Ruthenis · 3mo

That seems interesting! Though my understanding is that it's not directly applicable here. The maximal redund only needs to contain all redundant information that can be contained in any other redund; it doesn't attempt to capture all redundant information in total. So if there's some constraint on what types of redundancy can be captured by any such variable, that's fine.


[EDIT August 22 2025: This bounty is now $500, as the earlier bounty has been claimed. (By us.)]


A lot of our work involves "redunds".[1] A random variable $\Gamma$ is a(n exact) redund over two random variables $X_1, X_2$ exactly when both

$$X_1 \to X_2 \to \Gamma$$

$$X_2 \to X_1 \to \Gamma$$

Conceptually, these two diagrams say that $X_1$ gives exactly the same information about $\Gamma$ as all of $X$, and $X_2$ gives exactly the same information about $\Gamma$ as all of $X$; whatever information $X$ contains about $\Gamma$ is redundantly represented in $X_1$ and $X_2$. Unpacking the diagrammatic notation and simplifying a little, the diagrams say $P[\Gamma|X_1] = P[\Gamma|X_2] = P[\Gamma|X]$ for all $X$ such that $P[X] > 0$.

The exact redundancy conditions are too restrictive to be of much practical relevance, so we are more interested in approximate redunds. Approximate redunds are defined by approximate versions of the same two diagrams:

Unpacking the diagrammatic notation, these two diagrams say

$$\epsilon_{\mathrm{red}} \ge D_{KL}\big(P[X,\Gamma] \,\big\|\, P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]\big)$$

$$\epsilon_{\mathrm{red}} \ge D_{KL}\big(P[X,\Gamma] \,\big\|\, P[X_2]\,P[X_1|X_2]\,P[\Gamma|X_1]\big)$$
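For concreteness, here's a sketch of how one would compute these two errors for finite alphabets (an illustration of the definition; the alphabet sizes and the random candidate $\Gamma$ are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary strictly-positive joint distribution P(x1, x2, gamma).
P = rng.random((4, 4, 3))
P /= P.sum()

P_x1x2 = P.sum(axis=2)                       # P(x1, x2)
P_x1g, P_x2g = P.sum(axis=1), P.sum(axis=0)  # P(x1, gamma), P(x2, gamma)
P_x1, P_x2 = P.sum(axis=(1, 2)), P.sum(axis=(0, 2))

# Q1 = P[X1] P[X2|X1] P[Gamma|X2];  Q2 = P[X2] P[X1|X2] P[Gamma|X1].
Q1 = P_x1x2[:, :, None] * (P_x2g / P_x2[:, None])[None, :, :]
Q2 = P_x1x2[:, :, None] * (P_x1g / P_x1[:, None])[:, None, :]

eps1 = np.sum(P * np.log(P / Q1))   # error of diagram X1 -> X2 -> Gamma
eps2 = np.sum(P * np.log(P / Q2))   # error of diagram X2 -> X1 -> Gamma
print(max(eps1, eps2))  # Gamma is an approximate redund iff this <= eps_red
```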

This bounty problem is about the existence of a(n approximate) maximal redund $\Omega$: a redund which contains (approximately) all the information about $X$ contained in any other (approximate) redund. Diagrammatically, a maximal redund $\Omega$ satisfies $X \to \Omega \to \Gamma$ for every (approximate) redund $\Gamma$.

Finally, we'd like our maximal redund $\Omega$ to be an approximately deterministic function of $X$, i.e. $\epsilon_{\mathrm{det}} \ge H(\Omega|X)$, which is equivalent to the diagram $\Omega \leftarrow X \to \Omega$.

What We Want For The Bounty

This bounty pays out $500 for establishing that, over any two random variables $X_1, X_2$, there always exists an $\Omega$ satisfying the requirements above, i.e.

... with reasonable error bounds.

What counts as reasonable error bounds? We're expecting all the other $\epsilon$'s to scale with $\epsilon_2$ in the diagrams above, and ideally to be small constant multiples of $\epsilon_2$. In other words: the higher the approximation error on allowable redunds $\Gamma$, the more approximation error we allow for all the conditions on $\Omega$. Reasonableness of the error bounds is partially a judgement call on our part, but small constant (e.g. 2 or 3) multiples of $\epsilon_2$ would definitely suffice.

Bounds should be global, i.e. apply even when $\epsilon_2$ is large. We're not interested in e.g. first- or second-order approximations for small $\epsilon_2$ unless they provably apply globally.

If the error bounds are sufficiently reasonable, a solution to this problem will also solve our other open bounty problem, paying out another $500, for a total of $1000.

If people post major intermediate steps which are load-bearing for an eventual solution, we might allocate the bounty based on our judgement of different people's relative contributions; the hope is to incentivize sharing intermediate results.

We might also partially pay out for counterexamples. That's a judgement call depending on how thoroughly the counterexample kills hope of any theorem vaguely along the lines intended.

In terms of rigor and allowable assumptions, roughly the level of rigor and assumptions in the posts linked above is fine.

Some Intuition From The Exact Case

A proof in the exact case is straightforward; we'll walk through it here to build some intuition.

The exact redundancy conditions say

$$P[\Gamma|X] = P[\Gamma|X_1] = P[\Gamma|X_2]$$

wherever $P[X] > 0$. So, we can form a bipartite graph, where each vertex represents a value $x_1$ of $X_1$ or a value $x_2$ of $X_2$, and two vertices are connected iff $P[X_1 = x_1, X_2 = x_2] > 0$. Then the definition of redundancy says that, for ANY redund $\Gamma$, $P[\Gamma|X_1]$ and $P[\Gamma|X_2]$ are the same for ANY values of $X_1$ and $X_2$ (respectively) in the same connected component.

So, let $f(X)$ be defined as the connected component in which $X$ falls. For any redund $\Gamma$, $P[\Gamma|X]$ must be a function of $f(X)$ (since two $X$-values in the same connected component have the same $P[\Gamma|X]$). Thus

$$X \to f(X) \to \Gamma$$

$f(X)$ is therefore a(n exactly) deterministic function of $X$, and (exactly) mediates between $X$ and any redund $\Gamma$. Thus, $f(X)$ satisfies our requirements in the exact case (i.e. when $\epsilon_2 = 0$).
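To make the construction concrete, here's a small sketch (the support matrix is an arbitrary example with two blocks) that computes $f(X)$ as the connected component of the bipartite support graph:

```python
import numpy as np

# Support of P(x1, x2): True where P[x1, x2] > 0. The block structure
# here yields two connected components, so f(X) takes two values.
support = np.array([[1, 1, 0, 0],
                    [1, 0, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 0, 1]], dtype=bool)
n1, n2 = support.shape

# Union-find over the n1 + n2 bipartite vertices (x1-values, then x2-values).
parent = list(range(n1 + n2))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]   # path halving
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

for x1 in range(n1):
    for x2 in range(n2):
        if support[x1, x2]:
            union(x1, n1 + x2)

# f(X): the connected component of (x1, x2). Well-defined on the support,
# since x1 and x2 are joined by an edge whenever P(x1, x2) > 0.
f = {(x1, x2): find(x1)
     for x1 in range(n1) for x2 in range(n2) if support[x1, x2]}
print(f)
```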

Alas, this proof does not easily generalize to the approximate case.

Why We Want This

This is a conjecture which we've had on the back burner for a while, and it would be relevant in many places. It would solve our other open bounty problem, it would likely be our starting point for a new version of the Telephone Theorem which more explicitly connects to the natural latents framework, and it would directly simplify the problem of computing natural latents numerically. Probably there are other use-cases which we're not thinking about right at the moment.

 

  1. ^

    A lot of our work involves "redunds".
