A lot of our work involves "redunds".[1] A random variable $\Gamma$ is a(n exact) redund over two random variables $X_1, X_2$ exactly when both

$$X_1 \to X_2 \to \Gamma \qquad \text{and} \qquad X_2 \to X_1 \to \Gamma$$

Conceptually, these two diagrams say that $X_1$ gives exactly the same information about $\Gamma$ as all of $X = (X_1, X_2)$, and $X_2$ gives exactly the same information about $\Gamma$ as all of $X$; whatever information $X$ contains about $\Gamma$ is redundantly represented in $X_1$ and in $X_2$. Unpacking the diagrammatic notation and simplifying a little, the diagrams say $P[\Gamma|X_1] = P[\Gamma|X_2] = P[\Gamma|X]$ for all $X$ such that $P[X] > 0$.
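To make the exact conditions concrete, here is a small sanity check (a toy construction of our own, not from the post): take a shared bit $Z$ and independent noise bits $N_1, N_2$, set $X_1 = (Z, N_1)$, $X_2 = (Z, N_2)$, and $\Gamma = Z$. Then $\Gamma$ is an exact redund, and the condition $P[\Gamma|X_1] = P[\Gamma|X_2] = P[\Gamma|X]$ can be verified numerically:

```python
import itertools
from collections import defaultdict

# Toy example: a shared bit Z is copied into both X1 and X2 alongside
# independent noise bits N1, N2, and Gamma = Z. Then Gamma is an exact
# redund: P[Gamma|X1] = P[Gamma|X2] = P[Gamma|X1,X2] on the support.
dist = defaultdict(float)
for z, n1, n2 in itertools.product([0, 1], repeat=3):
    dist[((z, n1), (z, n2), z)] += 1 / 8  # P[X1, X2, Gamma]

def p_gamma_given(match):
    """P[Gamma = 1 | event], where match(x1, x2) defines the event."""
    num = sum(p for (x1, x2, g), p in dist.items() if g == 1 and match(x1, x2))
    den = sum(p for (x1, x2, g), p in dist.items() if match(x1, x2))
    return num / den

for (x1, x2, _) in list(dist):
    via_both = p_gamma_given(lambda a, b: (a, b) == (x1, x2))
    via_x1 = p_gamma_given(lambda a, b: a == x1)
    via_x2 = p_gamma_given(lambda a, b: b == x2)
    assert via_both == via_x1 == via_x2
print("exact redund conditions hold")
```

Here the shared bit carries all of the redundant information, so conditioning on $X_1$ alone, $X_2$ alone, or both gives the same distribution over $\Gamma$.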

The exact redundancy conditions are too restrictive to be of much practical relevance, so we are more interested in approximate redunds. Approximate redunds are defined by approximate versions of the same two diagrams:

$$X_1 \to X_2 \to \Gamma \ \text{(to within } \epsilon\text{)} \qquad X_2 \to X_1 \to \Gamma \ \text{(to within } \epsilon\text{)}$$

Unpacking the diagrammatic notation, these two diagrams say

$$D_{KL}\big(P[X_1, X_2, \Gamma] \,\big\|\, P[X_1, X_2]\, P[\Gamma|X_2]\big) \le \epsilon$$
$$D_{KL}\big(P[X_1, X_2, \Gamma] \,\big\|\, P[X_1, X_2]\, P[\Gamma|X_1]\big) \le \epsilon$$
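For a feel of the approximate case, here is a toy computation of our own (the noise parameter `flip` is an assumption for illustration): let $X_1$ and $X_2$ be independently-noised copies of a bit $Z$, and take $\Gamma = Z$. Then $\Gamma$ is only approximately a redund, with error shrinking as the noise shrinks:

```python
import itertools, math
from collections import defaultdict

# Toy approximate redund: X1 and X2 are noisy copies of a bit Z (each
# flipped independently with probability `flip`), and Gamma = Z. We compute
# the D_KL error of the diagram X1 -> X2 -> Gamma, i.e.
#   D_KL( P[X1,X2,Gamma] || P[X1,X2] P[Gamma|X2] ),
# which shrinks toward zero as flip -> 0.
def kl_error(flip):
    dist = defaultdict(float)
    for z, e1, e2 in itertools.product([0, 1], repeat=3):
        p = 0.5 * (flip if e1 else 1 - flip) * (flip if e2 else 1 - flip)
        dist[(z ^ e1, z ^ e2, z)] += p  # (x1, x2, gamma)

    def marginal(keep):
        out = defaultdict(float)
        for k, p in dist.items():
            out[tuple(k[i] for i in keep)] += p
        return out

    p12, p2g, p2 = marginal((0, 1)), marginal((1, 2)), marginal((1,))
    return sum(p * math.log2(p * p2[(x2,)] / (p12[(x1, x2)] * p2g[(x2, g)]))
               for (x1, x2, g), p in dist.items() if p > 0)

for flip in [0.2, 0.05, 0.01]:
    print(round(kl_error(flip), 4))  # shrinks as flip -> 0
```

At `flip = 0` the copies are exact and the error vanishes, recovering the exact redund condition.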
This bounty problem is about the existence of a(n approximate) maximal redund $\Omega$: a redund which contains (approximately) all the information about $X$ contained in any other (approximate) redund. Diagrammatically, a maximal redund $\Omega$ satisfies:

$$\forall \Gamma: \big(X_1 \to X_2 \to \Gamma \text{ and } X_2 \to X_1 \to \Gamma \text{, each to within } \epsilon\big) \implies \big(X \to \Omega \to \Gamma \text{ to within } \epsilon'\big)$$

Finally, we'd like our maximal redund $\Omega$ to be an approximately deterministic function of $X$, i.e. $H(\Omega|X) \approx 0$, which is equivalent to the diagram

$$\Omega \leftarrow X \to \Omega$$
What We Want For The Bounty

This bounty pays out $500 for establishing that, for any two random variables $X_1, X_2$, there always exists an $\Omega$ satisfying the requirements above, i.e.

$$X_1 \to X_2 \to \Omega \ \text{ and } \ X_2 \to X_1 \to \Omega \quad \text{(each to within some } \epsilon'\text{)}$$
$$\forall \Gamma: \big(X_1 \to X_2 \to \Gamma \text{ and } X_2 \to X_1 \to \Gamma \text{, each to within } \epsilon\big) \implies \big(X \to \Omega \to \Gamma \text{ to within some } \epsilon''\big)$$
$$\Omega \leftarrow X \to \Omega \quad \text{(to within some } \epsilon'''\text{)}$$

... with reasonable error bounds.

What counts as reasonable error bounds? We're expecting all the other $\epsilon$'s to scale with $\epsilon$ in the diagrams above, and ideally to be small constant multiples of $\epsilon$. In other words: the higher the approximation error $\epsilon$ on allowable redunds $\Gamma$, the more approximation error we allow for all the conditions on $\Omega$. Reasonableness of error bounds is partially a judgement call on our part, but small constant (e.g. 2 or 3) multiples of $\epsilon$ would definitely suffice.

Bounds should be global, i.e. apply even when $\epsilon$ is large. We're not interested in e.g. first or second order approximations for small $\epsilon$ unless they provably apply globally.

If the error bounds are sufficiently reasonable, a solution to this problem will also solve our other open bounty problem, paying out another $500, for a total of $1000.

If people post major intermediate steps which are load-bearing for an eventual solution, we might allocate the bounty based on our judgement of different peoples' relative contributions; the hope is to incentivize sharing intermediate results.

We might also partially pay out for counterexamples. That's a judgement call depending on how thoroughly the counterexample kills hope of any theorem vaguely along the lines intended.

In terms of rigor and allowable assumptions, roughly the level of rigor and assumptions in the posts linked above is fine.

Some Intuition From The Exact Case

A proof in the exact case is straightforward; we'll walk through it here to build some intuition.

The exact redundancy conditions say

$$P[\Gamma|X_1 = x_1] = P[\Gamma|X_2 = x_2] = P[\Gamma|X = (x_1, x_2)]$$

wherever $P[X_1 = x_1, X_2 = x_2] > 0$. So, we can form a bipartite graph, where each vertex represents a value $x_1$ of $X_1$ or a value $x_2$ of $X_2$, and two vertices are connected iff $P[X_1 = x_1, X_2 = x_2] > 0$. Then the definition of redundancy says that, for ANY redund $\Gamma$, $P[\Gamma|X_1 = x_1]$ and $P[\Gamma|X_2 = x_2]$ are the same for ANY values $x_1$ and $x_2$ (respectively) in the same connected component.

So, let $\Omega$ be defined as the connected component in which $X = (X_1, X_2)$ falls. For any redund $\Gamma$, $P[\Gamma|X]$ must be a function of $\Omega$ (since two $X$-values in the same connected component have the same $P[\Gamma|X]$). Thus

$$X \to \Omega \to \Gamma$$

$\Omega$ is therefore a(n exactly) deterministic function of $X$, and (exactly) mediates between $X$ and any redund $\Gamma$. Thus, $\Omega$ satisfies our requirements in the exact case (i.e. when $\epsilon = 0$).
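The construction above is easy to run on a toy distribution (our own illustration, not part of the original argument): with $X_1 = (Z, N_1)$ and $X_2 = (Z, N_2)$ sharing a bit $Z$, the connected components turn out to be exactly the two values of $Z$:

```python
import itertools

# The exact-case construction on a toy distribution: X1 = (Z, N1) and
# X2 = (Z, N2) share a bit Z. Build the bipartite graph on values of X1 and
# X2, with an edge x1 -- x2 whenever P[X1=x1, X2=x2] > 0, and take Omega to
# be the connected component containing the observed (x1, x2) pair.
support = {((z, n1), (z, n2))
           for z, n1, n2 in itertools.product([0, 1], repeat=3)}

parent = {}  # union-find over vertices ("x1", value) and ("x2", value)
def find(v):
    parent.setdefault(v, v)
    while parent[v] != v:
        parent[v] = parent[parent[v]]  # path halving
        v = parent[v]
    return v

for x1, x2 in support:
    parent[find(("x1", x1))] = find(("x2", x2))

omega = {(x1, x2): find(("x1", x1)) for x1, x2 in support}
# The components coincide with the values of the shared bit Z:
print(len(set(omega.values())))  # 2
```

The noise bits get absorbed into the components, so $\Omega$ recovers exactly the shared bit: a deterministic function of $X$ which mediates between $X$ and any redund.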

Alas, this proof does not easily generalize to the approximate case.

Why We Want This

This is a conjecture which we've had on the back burner for a while, and it would be relevant in many places. It would solve our other open bounty problem, it would likely be our starting point for a new version of the Telephone Theorem which more explicitly connects to the natural latents framework, and it would directly simplify the problem of computing natural latents numerically. Probably there are other use-cases which we're not thinking about right at the moment.

 

  1. ^

    A lot of our work involves "redunds".

11 comments

I don't know how obvious this is for people who want to try for the bounty, but I only now realized that you can express the redund criterion as an inequality on mutual information, and I find mutual information much nicer to work with, if only for pure convenience of notation. Proof:

Let's take the criterion for redund $\Gamma$ w.r.t. $X_2$:

$$D_{KL}\big(P[X_1, X_2, \Gamma] \,\big\|\, P[X_1, X_2]\, P[\Gamma|X_2]\big) \le \epsilon$$

expand the expression for KL divergence:

$$\sum_{x_1, x_2, \gamma} P[x_1, x_2, \gamma] \log \frac{P[x_1, x_2, \gamma]}{P[x_1, x_2]\, P[\gamma|x_2]} \le \epsilon$$

expand the joint distribution:

$$\sum_{x_1, x_2, \gamma} P[x_1, x_2, \gamma] \log \frac{P[x_1, x_2]\, P[\gamma|x_1, x_2]}{P[x_1, x_2]\, P[\gamma|x_2]} \le \epsilon$$

simplify:

$$\sum_{x_1, x_2, \gamma} P[x_1, x_2, \gamma] \log \frac{P[\gamma|x_1, x_2]}{P[\gamma|x_2]} \le \epsilon$$

which is a conditional mutual information:

$$I(X_1; \Gamma \mid X_2) \le \epsilon$$

which results in the redund conditions taking the form:

$$I(X_1; \Gamma \mid X_2) \le \epsilon \quad \text{and} \quad I(X_2; \Gamma \mid X_1) \le \epsilon$$
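The identity is easy to sanity-check numerically. Here is a short sketch of ours, computing the $D_{KL}$ error and the conditional mutual information by different routes for an arbitrary random joint distribution:

```python
import itertools, math, random
from collections import defaultdict

# Numerically check that the D_KL error of the diagram X1 -> X2 -> Gamma
# equals I(X1; Gamma | X2), computing the two sides by different routes.
random.seed(0)
keys = list(itertools.product([0, 1], repeat=3))  # (x1, x2, gamma)
w = [random.random() for _ in keys]
dist = {k: v / sum(w) for k, v in zip(keys, w)}

def marginal(keep):
    out = defaultdict(float)
    for k, p in dist.items():
        out[tuple(k[i] for i in keep)] += p
    return out

def H(d):
    return -sum(p * math.log2(p) for p in d.values() if p > 0)

p1, p12, p2g, p2 = marginal((0,)), marginal((0, 1)), marginal((1, 2)), marginal((1,))

# Left side: D_KL( P[x1,x2,g] || P[x1] P[x2|x1] P[g|x2] ), term by term.
kl = sum(p * math.log2(p / (p1[(x1,)] * (p12[(x1, x2)] / p1[(x1,)])
                            * (p2g[(x2, g)] / p2[(x2,)])))
         for (x1, x2, g), p in dist.items())

# Right side: I(X1; Gamma | X2) = H(X1,X2) + H(X2,Gamma) - H(X2) - H(X1,X2,Gamma).
cmi = H(p12) + H(p2g) - H(p2) - H(dist)

assert abs(kl - cmi) < 1e-9
print("D_KL error = I(X1; Gamma | X2)")
```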

Yup, and you can do that with all the diagrams in this particular post. In general, the $D_{KL}$ error for any of the diagrams $A \to B \to C$, $A \leftarrow B \to C$, or $A \leftarrow B \leftarrow C$ is $I(A; C \mid B)$, for any random variables $A, B, C$.

Notably, this includes as a special case the "approximately deterministic function" diagram $F(X) \leftarrow X \to F(X)$; the error on that one is $I(F(X); F(X) \mid X)$, which is $H(F(X) \mid X)$.
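That special case can be checked numerically as well. In this sketch of ours, we duplicate a variable into two perfectly correlated copies and compare the diagram's error to the conditional entropy computed directly (the random joint distribution is just an arbitrary test case):

```python
import itertools, math, random
from collections import defaultdict

# Check the "approximately deterministic" diagram F(X) <- X -> F(X):
# treating the two F(X)'s as two perfectly correlated copies, the diagram's
# error I(F(X); F(X) | X) equals the conditional entropy H(F(X) | X).
random.seed(1)
keys = list(itertools.product(range(3), repeat=2))  # (x, gamma)
w = [random.random() for _ in keys]
dist = {k: v / sum(w) for k, v in zip(keys, w)}

def H(d):
    return -sum(p * math.log2(p) for p in d.values() if p > 0)

def marginal(d, keep):
    out = defaultdict(float)
    for k, p in d.items():
        out[tuple(k[i] for i in keep)] += p
    return out

# Joint over (X, Gamma, Gamma') where Gamma' is a copy of Gamma.
dist3 = {(x, g, g): p for (x, g), p in dist.items()}

# I(Gamma; Gamma' | X) via entropies, vs H(Gamma | X) directly.
cmi = (H(marginal(dist3, (0, 1))) + H(marginal(dist3, (0, 2)))
       - H(marginal(dist3, (0,))) - H(dist3))
h_cond = H(dist) - H(marginal(dist, (0,)))
assert abs(cmi - h_cond) < 1e-9
print("I(F(X);F(X)|X) = H(F(X)|X)")
```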

Then how do you choose between the three directed representatives? Is there some connotation, some coming-apart of concepts that would become apparent after generalization, or did you just pick X <- Y -> X because it's symmetric?

Good question. The three directed graphical representations directly expand to different but equivalent expressions, and sometimes we want to think in terms of one of those expressions over another.

The most common place this comes up for us: sometimes we want to think about a latent $\Lambda$ in terms of its defining conditional distribution $P[\Lambda|X]$. When doing that, we try to write diagrams with $\Lambda$ downstream, like e.g. $X_1 \to X_2 \to \Lambda$. Other times, we want to think of $\Lambda$ in terms of the distribution $P[X|\Lambda]$. When thinking that way, we try to write diagrams with $\Lambda$ upstream, e.g. $X_1 \leftarrow \Lambda \to X_2$.

In the case of $F(X) \leftarrow X \to F(X)$, the natural information-theoretic expression for its error is $H(F(X)|X)$, or equivalently $I(F(X); F(X) \mid X)$. Those condition on $X$, so we want $X$ "upstream" in order to most directly express those quantities. You could derive the same errors from $F(X) \to X \to F(X)$ or $F(X) \leftarrow X \leftarrow F(X)$, but if you write out those $D_{KL}$ errors and try to reduce them to $H(F(X)|X)$, you'll probably find that it takes more steps.

If the expressions cease to be equivalent in some natural generalization of this setting, then I recommend that you try to find a proof there, because the proof space should be narrower and thus easier to search.

Note to bounty hunters, since it's come up twice: an "approximately deterministic function of X" is one where, conditional on X, the entropy of F(X) is very small, though possibly nonzero. I.e. you have very nearly pinned down the value that F(X) takes on once you learn X. For conceptual intuition on the graphical representation: X approximately mediating between F(X) and F(X) (two random variables which always take on the same value as each other) means, as always, that there is approximately no further update on the value of either given the other (despite each being perfectly informative about the other) if one already knows the value of X. See this for more.

(Note that F, in David's notation here, is a stochastic function in general.)

XelaP:

A minor request: can you link "approximate deterministic functions" to some of your posts about them? I know you've written on them, and elaborated on them in more detail elsewhere.

A couple questions:

  • Can you elaborate more on why you care? Preferably, at a level of elaboration/succinctness that's above the little you've said in the post, and below "read the linked posts and my other posts". I feel that knowing why you care will help with understanding what the theorem really means. I am probably going to read a fair amount of those posts anyways.

Edit: It is clear now why you care, however I still believe there may be people that would benefit from some elaboration in the post. For anyone wondering, the reason to care is in the first theorem in natural latents: the math. For an easy special case of the ideas, see the extremely easy to understand post on (approximately) deterministic natural latents, and note that by the "Dangly Bits Lemma" the redund condition is implied by the approximately determined condition.

For the intuition behind what the redund condition means, note that by the rerooting rule the diagram X -> Y -> R is the same as (as in, it equivalently constrains the probability distribution) X <- Y -> R, which means that once you know Y, being told X doesn't help you predict R. The redund condition is that + the version with X,Y swapped. This makes it clear why this captures "R is redundantly encoded in X and Y".

  • It sounds like you strongly expect the theorem to be true. How hard do you expect it to be? Does it feel like the sort of thing that can be done using basically the tools you've used?

  • How hard have you tried?

  • Do you have examples of the theorem? That is, do you have some variables X with some approximate redund Omega and a bunch of redunds Gamma such that [statements] hold? Or, better, do you have some example where you can find a universal redund Omega?

There are some reasons I can think of for why you'd perhaps want to explicitly refuse to answer some of those questions: to prevent anchoring when a blind mind might do better, to not reveal info that might (if the info turns out one way, assuming truthtelling) discourage attempts (like saying it's super hard and you tried really hard), and to uphold a general policy of not revealing such info (e.g. perhaps you think this theorem is super easy and you haven't spent 5 minutes on it, but plan to put bounties on other theorems that are harder).

For example, part of my inclination to try this is my sense that, compared to most conjectures, it's

  • Incredibly low background - I've known the requisite probability/information theory for a long time at this point
  • A precise, single theorem is asked for. Not as precise as could be, given the freedom of "reasonable", but compared to stuff like "figure out a theory of how blank works", it's concrete and small.
  • Very little effort has been put into it. No other conjecture I know of (and know enough to understand) has had what looks like only a single-digit number of people working on it, for not very long.
  • It intuitively seems like the sort of theorem that could be proven by some random person on the internet who has the general mathematical ability ("maturity"/"proof ability" - the mostly tacit knowledge) but not a bunch of theoretical knowledge, or who even isn't very smart in terms of raw computing power (but is probably high in creativity/that spark of genius that is at the core of discovery and invention, which can be partially traded off for a lower chance at success).

Good questions. I apparently took too long to answer them, as it sounds like you've mostly found the answers yourself (well done!).

"Approximately deterministic functions" and their diagrammatic expression were indeed in Deterministic Natural Latents; I don't think we've talked about them much elsewhere. So you're not missing any other important source there.

On why we care, a recent central example which you might not have seen is our Illiad paper. The abstract reads:

Suppose two Bayesian agents each learn a generative model of the same environment. We will assume the two have converged on the predictive distribution (i.e. distribution over some observables in the environment), but may have different generative models containing different latent variables. Under what conditions can one agent guarantee that their latents can be faithfully expressed in terms of the other agent’s latents?

We give simple conditions under which such translation is guaranteed to be possible: the natural latent conditions. We also show that, absent further constraints, these are the most general conditions under which translatability is guaranteed.

(Bold added for emphasis.) That "faithfully expressed in terms of" part is key here. In the paper, we operationalize that as "Under what conditions can one agent guarantee that their latents are (approximately) a stochastic function of the other agent's latents?". But that requires dragging in stochastic functions, which are conceptually messier than plain old functions. As a result, there's a whole section in the paper called "Under what model?", and basically-all the theorems require an annoying "independent latents" assumption (which basically says that all the interactions between the two agents' latents route through the environment).

We could cut all that conceptual mess and all those annoying assumptions out if we instead use (approximately) deterministic natural latents. Then, the core question would be operationalized as "Under what conditions can one agent guarantee that their latents are (approximately) a function of the other agent's latents?". Then it's just plain old functions, we don't need to worry conceptually about the "under what model?" question (because there's approximately only one way to put all the latents in one model), and the "independent latents" conditions can all be removed (because different deterministic functions of X are always independent given X). All the messiest and least-compelling parts of the paper could be dropped.

... but the cost would be generality. Stochastic functions are more general than functions, and much more relevant to Bayesians' world models. So narrowing from stochastic functions to functions would probably mean "assuming our way out of reality"... unless it turns out that (Approximately) Deterministic Natural Latents Are All You Need, which was the subject of our previous bounty post.

The conjecture in this bounty post is even stronger; it would imply that one, and therefore also allow us to simplify the paper a lot.

nit: I found your graphical quantifiers and implications a bit confusing to read. It's the middle condition that looks weird.

If I've understood, I think you want (given a set of $\Gamma_i$, there exists $\Omega$ with all of):

$$I(X_1; \Omega \mid X_2) \le \epsilon' \ \wedge \ I(X_2; \Omega \mid X_1) \le \epsilon'$$
$$\forall i: \ I(X; \Gamma_i \mid \Omega) \le \epsilon''$$
$$H(\Omega \mid X) \le \epsilon'''$$

(the third line is just my crude latex rendition of your third diagram)

Is that order of quantifiers and precedence correct?

In words, you want an $\Omega$:

  1. which is a redund over the $X_i$
  2. where any $\Gamma$ which is a redund over the $X_i$ is screened off from the $X_i$ by $\Omega$
  3. which is approximately deterministic on $X$

An unverified intuition (I'll unlikely work on this further): would the joint distribution of all candidate $\Gamma$s work as $\Omega$? It screens off the $X_i$ from any given $\Gamma$, right (90% confidence)? And it is itself a redund over the $X_i$, I think (again 90% confidence ish)? I don't really grok what it means to be approximately deterministic, but it feels like this should be? Overall confidence 30% ish, deflating for outside-view and definedness reasons.

Yup, that's exactly the right idea, and indeed constructing $\Omega$ by just taking a tuple of all the redund $\Gamma$'s (or a sufficient statistic for that tuple) is a natural starting point which works straightforwardly in the exact case. In the approximate case, the construction needs to be modified - for instance, regarding the third condition, that tuple of $\Gamma$'s (or a sufficient statistic for it) will have much more entropy conditional on $X$ than any individual $\Gamma$.
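That last point shows up in even the simplest example (a toy construction of our own): with $X_1 = (Z, N_1)$ and $X_2 = (Z, N_2)$ sharing a bit $Z$, both $\Gamma_1 = Z$ and $\Gamma_2 = Z \oplus F$ (for an independent fair coin $F$) are exact redunds, since every conditional of $\Gamma_2$ on $X_1$, $X_2$, or both is uniform. Yet their tuple carries a full bit of entropy conditional on $X$:

```python
import itertools, math
from collections import defaultdict

# X1 = (Z, N1) and X2 = (Z, N2) share a bit Z. Gamma1 = Z and
# Gamma2 = Z xor F (F an independent fair coin) are both exact redunds,
# but the tuple (Gamma1, Gamma2) has H(.|X) = H(F) = 1 bit,
# while Gamma1 alone has H(Gamma1|X) = 0.
def H(d):
    return -sum(p * math.log2(p) for p in d.values() if p > 0)

def marginal(d, keep):
    out = defaultdict(float)
    for k, p in d.items():
        out[tuple(k[i] for i in keep)] += p
    return out

dist = defaultdict(float)   # joint over (X, Gamma1, Gamma2), X = (x1, x2)
for z, n1, n2, f in itertools.product([0, 1], repeat=4):
    x = ((z, n1), (z, n2))
    dist[(x, z, z ^ f)] += 1 / 16

h_tuple = H(marginal(dist, (0, 1, 2))) - H(marginal(dist, (0,)))  # H(G1,G2|X)
h_g1 = H(marginal(dist, (0, 1))) - H(marginal(dist, (0,)))        # H(G1|X)
print(h_tuple, h_g1)  # 1.0 0.0
```

So the naive tuple badly fails the approximate-determinism condition even when every individual redund is exact, which is why the construction needs modification.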
