I don't know how obvious this is to people who want to try for the bounty, but I only now realized that you can express the criteria for a redund as inequalities on mutual information, and I find mutual information much nicer to work with, even if just for convenience of notation. Proof:

Let's take the criterion for a redund w.r.t. $X_2$, i.e. the diagram $X_1 \to X_2 \to \Gamma$:

$$\epsilon \ge D_{KL}\big(P[X,\Gamma] \,\big\|\, P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]\big)$$

expand the expression for the KL divergence:

$$\epsilon \ge \sum_{X,\Gamma} P[X,\Gamma] \log \frac{P[X,\Gamma]}{P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]}$$

expand the joint distribution:

$$\epsilon \ge \sum_{X,\Gamma} P[X,\Gamma] \log \frac{P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_1,X_2]}{P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]}$$

simplify:

$$\epsilon \ge \sum_{X,\Gamma} P[X,\Gamma] \log \frac{P[\Gamma|X_1,X_2]}{P[\Gamma|X_2]}$$

which is a conditional mutual information:

$$\epsilon \ge I(X_1; \Gamma \mid X_2)$$

which results in the pair of conditions:

$$\epsilon_{red} \ge I(X_1; \Gamma \mid X_2), \qquad \epsilon_{red} \ge I(X_2; \Gamma \mid X_1)$$
Yup, and you can do that with all the diagrams in this particular post. In general, the error for any of the diagrams $X \to Y \to Z$, $X \leftarrow Y \to Z$, or $X \leftarrow Y \leftarrow Z$ is $I(X; Z \mid Y)$ for any random variables $X, Y, Z$.
Notably, this includes as a special case the "approximately deterministic function" diagram $F(X) \leftarrow X \to F(X)$; the error on that one is $I(F(X); F(X) \mid X)$, which is $H(F(X) \mid X)$.
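To make that concrete, here's a quick numerical sanity check (my own sketch, not from the thread; it assumes numpy, and all variable names are mine): for a random joint distribution over (X, Y, Z), the KL error of the X → Y → Z factorization coincides with I(X; Z | Y).

```python
# Sanity check (sketch): the error of the diagram X -> Y -> Z, i.e.
# D_KL( P[X,Y,Z] || P[X] P[Y|X] P[Z|Y] ), equals I(X; Z | Y).
import numpy as np

rng = np.random.default_rng(0)

# Random strictly-positive joint distribution over (X, Y, Z), 3 values each.
P = rng.random((3, 3, 3))
P /= P.sum()

P_xy = P.sum(axis=2)                # P[X, Y]
P_yz = P.sum(axis=0)                # P[Y, Z]
P_y = P.sum(axis=(0, 2))            # P[Y]
P_z_given_y = P_yz / P_y[:, None]   # P[Z | Y]

# Factorization implied by X -> Y -> Z: Q[x,y,z] = P[x,y] * P[z|y]
# (note P[x] * P[y|x] = P[x,y]).
Q = P_xy[:, :, None] * P_z_given_y[None, :, :]
kl = np.sum(P * np.log(P / Q))

# I(X; Z | Y) = sum_xyz P[x,y,z] * log( P[x,y,z] P[y] / (P[x,y] P[y,z]) )
cmi = np.sum(P * np.log(P * P_y[None, :, None]
                        / (P_xy[:, :, None] * P_yz[None, :, :])))

print(kl, cmi, np.isclose(kl, cmi))  # the two agree
```

The agreement is an algebraic identity (the $P[x]\,P[y|x]$ factors cancel inside the log), which is why the same computation covers all three directed versions of the diagram.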
Then how do you choose between the three directed representatives? Is there some connotation, some coming-apart of concepts that would become apparent after generalization, or did you just pick X <- Y -> X because it's symmetric?
Good question. The three directed graphical representations directly expand to different but equivalent expressions, and sometimes we want to think in terms of one of those expressions over another.
The most common place this comes up for us: sometimes we want to think about a latent Λ in terms of its defining conditional distribution $P[\Lambda|X]$. When doing that, we try to write diagrams with Λ downstream, like e.g. $X \to \Lambda$. Other times, we want to think of $X$ in terms of the distribution $P[X|\Lambda]$. When thinking that way, we try to write diagrams with Λ upstream, e.g. $\Lambda \to X$.
In the case of $F(X) \leftarrow X \to F(X)$, the natural information-theoretic expression for its error is $H(F(X) \mid X)$ or equivalently $I(F(X); F(X) \mid X)$. Those condition on $X$, so we want $X$ "upstream" in order to most directly express those quantities. You could derive the same errors from $F(X) \to X \to F(X)$ or $F(X) \leftarrow X \leftarrow F(X)$, but if you write out those errors and try to reduce them to $H(F(X) \mid X)$, you'll probably find that it takes more steps.
If the expressions cease to be equivalent in some natural generalization of this setting, then I recommend that you try to find a proof there, because the proof space should be narrower and thus easier to search.
Note to bounty hunters, since it's come up twice: An "approximately deterministic function of X" is one where, conditional on X, the entropy of F(X) is very small, though possibly nonzero. I.e., you have very nearly pinned down the value that F(X) takes on once you learn X. For conceptual intuition on the graphical representation: X approximately mediating between F(X) and F(X) (two random variables which always take on the same value as each other) means, as always, that there is approximately no further update on the value of either given the other (despite them being perfectly informative about each other's value), if one already knows the value of X. See this for more.
A minor request: can you link "approximate deterministic functions" to some of your posts about them? I know you've written on them, and elaborated on them in more detail elsewhere.
A couple questions:
Edit: It is clear now why you care; however, I still believe there may be people who would benefit from some elaboration in the post. For anyone wondering, the reason to care is in the first theorem in Natural Latents: The Math. For an easy special case of the ideas, see the extremely easy-to-understand post on (approximately) deterministic natural latents, and note that by the "Dangly Bits Lemma" the redund condition is implied by the approximately-determined condition.
For the intuition behind what the redund condition means, note that by the rerooting rule the diagram X -> Y -> R is the same as (as in, equivalently constrains the probability distribution) X <- Y -> R, which means that once you know Y, being told X doesn't help you predict R. The redund condition is that + the version with X,Y swapped. This makes it clear why this captures "R is redundantly encoded in X and Y".
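Spelling out the rerooting step for anyone who wants it (my own rendering, not from the original comment):

```latex
% X -> Y -> R requires:   P[X,Y,R] = P[X] P[Y|X] P[R|Y]
% X <- Y -> R requires:   P[X,Y,R] = P[Y] P[X|Y] P[R|Y]
% But by the chain rule the right-hand sides are identical:
P[X]\,P[Y|X]\,P[R|Y] \;=\; P[X,Y]\,P[R|Y] \;=\; P[Y]\,P[X|Y]\,P[R|Y]
% so the two diagrams impose exactly the same constraint on the joint
% distribution: P[R|X,Y] = P[R|Y], i.e. I(X; R | Y) = 0 in the exact case.
```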
It sounds like you strongly expect the theorem to be true. How hard do you expect it to be? Does it feel like the sort of thing that can be done using basically the tools you've used?
How hard have you tried?
Do you have examples of the theorem? That is, do you have some variables X with some approximate redund Omega and a bunch of redunds Gamma such that [statements] hold? Or, better, do you have some example where you can find a universal redund Omega?
There are some reasons I can think of for why you'd perhaps want to explicitly refuse to answer some of those questions: to prevent anchoring when a blind mind might do better, to not reveal info that might (if the info turns out one way, assuming truthtelling) discourage attempts (like saying it's super hard and you tried really hard), and to uphold a general policy of not revealing such info (e.g. perhaps you think this theorem is super easy and you haven't spent 5 minutes on it, but plan to put bounties on other theorems that are harder).
For example, part of my inclination to try this is my sense that, compared to most conjectures, it's
Good questions. I apparently took too long to answer them, as it sounds like you've mostly found the answers yourself (well done!).
"Approximately deterministic functions" and their diagrammatic expression were indeed in Deterministic Natual Latents; I don't think we've talked about them much elsewhere. So you're not missing any other important source there.
On why we care, a recent central example which you might not have seen is our Illiad paper. The abstract reads:
Suppose two Bayesian agents each learn a generative model of the same environment. We will assume the two have converged on the predictive distribution (i.e. distribution over some observables in the environment), but may have different generative models containing different latent variables. Under what conditions can one agent guarantee that their latents can be faithfully expressed in terms of the other agent’s latents?
We give simple conditions under which such translation is guaranteed to be possible: the natural latent conditions. We also show that, absent further constraints, these are the most general conditions under which translatability is guaranteed.
(Bold added for emphasis.) That "faithfully expressed in terms of" part is key here. In the paper, we operationalize that as "Under what conditions can one agent guarantee that their latents are (approximately) a stochastic function of the other agent's latents?". But that requires dragging in stochastic functions, which are conceptually messier than plain old functions. As a result, there's a whole section in the paper called "Under what model?", and basically-all the theorems require an annoying "independent latents" assumption (which basically says that all the interactions between the two agents' latents route through the environment).
We could cut all that conceptual mess and all those annoying assumptions out if we instead used (approximately) deterministic natural latents. Then, the core question would be operationalized as "Under what conditions can one agent guarantee that their latents are (approximately) a function of the other agent's latents?". Then it's just plain old functions, we don't need to worry conceptually about the "under what model?" question (because there's approximately only one way to put all the latents in one model), and the "independent latents" conditions can all be removed (because different deterministic functions of X are always independent given X). All the messiest and least-compelling parts of the paper could be dropped.
... but the cost would be generality. Stochastic functions are more general than functions, and much more relevant to Bayesians' world models. So narrowing from stochastic functions to functions would probably mean "assuming our way out of reality"... unless it turns out that (Approximately) Deterministic Natural Latents Are All You Need, which was the subject of our previous bounty post.
The conjecture in this bounty post is even stronger; it would imply that one, and therefore also allow us to simplify the paper a lot.
nit: I found your graphical quantifiers and implications a bit confusing to read. It's the middle condition that looks weird.
If I've understood, I think you want (given a set of approximate redunds Γ over $X_1, X_2$, there exists an Ω with all of):

1. Ω is itself an (approximate) redund over $X_1, X_2$
2. $\epsilon_{det} \ge H(\Omega \mid X)$
3. $\epsilon \ge I(X; \Gamma \mid \Omega)$ for every Γ in the set

(the third line is just my crude latex rendition of your third diagram)

Is that order of quantifiers and precedence correct?
In words, you want an Ω that: is itself a redund over X1 and X2; (approximately) mediates between X and any other redund Γ; and is an (approximately) deterministic function of X.
An unverified intuition (it's unlikely I'll work on this further): would the joint distribution of all candidate Γ's work as Ω? It screens off the X from any given Γ, right (90% confidence)? And it is itself a redund over the X's, I think (again 90% confidence ish)? I don't really grok what it means to be approximately deterministic, but it feels like this should be? Overall confidence 30% ish, deflating for outside-view and definedness reasons.
Yup, that's exactly the right idea, and indeed constructing Ω by just taking a tuple of all the redund Γ's (or a sufficient statistic for that tuple) is a natural starting point which works straightforwardly in the exact case. In the approximate case, the construction needs to be modified - for instance, regarding the third condition, that tuple of Γ's (or a sufficient statistic for it) will have much more entropy conditional on X than any individual Γ.
A lot of our work involves "redunds". A random variable Γ is a(n exact) redund over two random variables $X_1, X_2$ exactly when both

$$X_1 \to X_2 \to \Gamma$$

$$X_2 \to X_1 \to \Gamma$$
Conceptually, these two diagrams say that $X_1$ gives exactly the same information about Γ as all of $X$, and $X_2$ gives exactly the same information about Γ as all of $X$; whatever information $X$ contains about Γ is redundantly represented in $X_1$ and $X_2$. Unpacking the diagrammatic notation and simplifying a little, the diagrams say $P[\Gamma|X_1] = P[\Gamma|X_2] = P[\Gamma|X]$ for all $X$ such that $P[X] > 0$.
The exact redundancy conditions are too restrictive to be of much practical relevance, so we are more interested in approximate redunds. Approximate redunds are defined by approximate versions of the same two diagrams:
Unpacking the diagrammatic notation, these two diagrams say
$$\epsilon_{red} \ge D_{KL}\big(P[X,\Gamma] \,\big\|\, P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]\big)$$

$$\epsilon_{red} \ge D_{KL}\big(P[X,\Gamma] \,\big\|\, P[X_2]\,P[X_1|X_2]\,P[\Gamma|X_1]\big)$$
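For a concrete feel of these quantities, here's a small worked example (my own sketch, not from the post; numpy assumed). Using the mutual-information form from the comments above, it computes the two redund errors for Γ = X1, where X2 is a copy of X1 flipped with probability p: at p = 0, Γ is an exact redund; for p > 0, the first condition is violated by exactly H(X1|X2).

```python
# Sketch: compute the redund errors eps >= I(X1; Gamma | X2) and
# eps >= I(X2; Gamma | X1) for Gamma = X1, with X2 a p-noisy copy of X1.
import numpy as np

def cmi(P, a, c, b):
    """I(A; C | B) for a joint array P; a, c, b give the axes of A, C, B."""
    P = np.transpose(P, (a, c, b))  # reorder axes to (A, C, B)
    P_b = P.sum(axis=(0, 1))        # P[B]
    P_ab = P.sum(axis=1)            # P[A, B]
    P_cb = P.sum(axis=0)            # P[C, B]
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = P * np.log(P * P_b[None, None, :]
                           / (P_ab[:, None, :] * P_cb[None, :, :]))
    return np.nansum(terms)         # 0 * log(0) terms drop out as nan

def redund_errors(p):
    # Joint over (X1, X2, Gamma) with Gamma = X1, X1 a fair bit,
    # and X2 equal to X1 flipped with probability p.
    P = np.zeros((2, 2, 2))
    for x1 in (0, 1):
        for x2 in (0, 1):
            P[x1, x2, x1] = 0.5 * (p if x1 != x2 else 1 - p)
    return cmi(P, 0, 2, 1), cmi(P, 1, 2, 0)  # I(X1;G|X2), I(X2;G|X1)

print(redund_errors(0.0))  # (0.0, 0.0): exact redund
print(redund_errors(0.1))  # (~0.33 nats, 0.0): only approximately redundant
```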
This bounty problem is about the existence of a(n approximate) maximal redund Ω: a redund which contains (approximately) all the information about X contained in any other (approximate) redund. Diagrammatically, a maximal redund Ω satisfies

$$X \to \Omega \to \Gamma$$

for every (approximate) redund Γ.
Finally, we'd like our maximal redund Ω to be an approximately deterministic function of X, i.e. $\epsilon_{det} \ge H(\Omega|X)$, which is equivalent to the diagram $\Omega \leftarrow X \to \Omega$.
What We Want For The Bounty
This bounty pays out $500 for establishing that, for any two random variables $X_1, X_2$, there always exists an Ω satisfying the requirements above, i.e.
... with reasonable error bounds.
What counts as reasonable error bounds? We're expecting all the other $\epsilon$'s to scale with $\epsilon_2$ in the diagrams above, and ideally be small constant multiples of $\epsilon_2$. In other words: the higher the approximation error on allowable redunds Γ, the more approximation error we allow for all the conditions on Ω. Reasonableness of error bounds is partially a judgement call on our part, but small constant (e.g. 2 or 3) multiples of $\epsilon_2$ would definitely suffice.

Bounds should be global, i.e. apply even when $\epsilon_2$ is large. We're not interested in e.g. first- or second-order approximations for small $\epsilon_2$ unless they provably apply globally.
If the error bounds are sufficiently reasonable, a solution to this problem will also solve our other open bounty problem, paying out another $500, for a total of $1000.
If people post major intermediate steps which are load-bearing for an eventual solution, we might allocate the bounty based on our judgement of different people's relative contributions; the hope is to incentivize sharing intermediate results.
We might also partially pay out for counterexamples. That's a judgement call depending on how thoroughly the counterexample kills hope of any theorem vaguely along the lines intended.
In terms of rigor and allowable assumptions, roughly the level of rigor and assumptions in the posts linked above is fine.
Some Intuition From The Exact Case
A proof in the exact case is straightforward; we'll walk through it here to build some intuition.
The exact redundancy conditions say

$$P[\Gamma|X] = P[\Gamma|X_1] = P[\Gamma|X_2]$$

wherever $P[X] > 0$. So, we can form a bipartite graph, where each vertex represents a value $x_1$ of $X_1$ or a value $x_2$ of $X_2$, and two vertices are connected iff $P[X_1{=}x_1, X_2{=}x_2] > 0$. Then the definition of redundancy says that, for ANY redund Γ, $P[\Gamma|X_1]$ and $P[\Gamma|X_2]$ are the same for ANY values of $X_1$ and $X_2$ (respectively) in the same connected component.

So, let $f(X)$ be defined as the connected component in which $X$ falls. For any redund Γ, $P[\Gamma|X]$ must be a function of $f(X)$ (since two $X$-values in the same connected component have the same $P[\Gamma|X]$). Thus

$$X \to f(X) \to \Gamma$$

$f(X)$ is therefore a(n exactly) deterministic function of $X$, and (exactly) mediates between $X$ and any redund Γ. Thus, $f(X)$ satisfies our requirements, in the exact case (i.e. when $\epsilon_2 = 0$).
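The construction above is easy to implement directly; here's a minimal sketch (mine, not the authors'; numpy assumed) that computes f(X) as a connected-component label via union-find on the support of P[X1, X2]:

```python
# Sketch of the exact-case construction: build the bipartite graph on values
# of X1 and X2 with edges where P[X1=x1, X2=x2] > 0, and let f(X) be the
# connected component in which X falls.
import numpy as np

def maximal_redund_exact(P):
    """Given a joint array P[x1, x2] (> 0 entries define the support),
    return f mapping each supported (x1, x2) to its connected component."""
    n1, n2 = P.shape
    # Union-find over the n1 + n2 vertices of the bipartite graph.
    parent = list(range(n1 + n2))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)
    for x1 in range(n1):
        for x2 in range(n2):
            if P[x1, x2] > 0:
                union(x1, n1 + x2)  # connect x1's vertex to x2's vertex
    # Any supported (x1, x2) has x1 and x2 in the same component, so f(X)
    # is well-defined as the component label of x1 alone.
    return {(x1, x2): find(x1)
            for x1 in range(n1) for x2 in range(n2) if P[x1, x2] > 0}

# Example: a block-diagonal support has two components, so f(X) is one bit.
P = np.array([[0.25, 0.25, 0.0],
              [0.0,  0.0,  0.5]])
print(maximal_redund_exact(P))  # {(0, 0): 3, (0, 1): 3, (1, 2): 4}
```

Any exact redund Γ then has $P[\Gamma|X]$ constant within each component, so the component label f(X) mediates between X and Γ, as in the proof above.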
Alas, this proof does not easily generalize to the approximate case.
Why We Want This
This is a conjecture which we've had on the back burner for a while, and it would be relevant in many places. It would solve our other open bounty problem, it would likely be our starting point for a new version of the Telephone Theorem which more explicitly connects to the natural latents framework, and it would directly simplify the problem of computing natural latents numerically. Probably there are other use-cases which we're not thinking about right at the moment.