$500 + $500 Bounty Problem: Does An (Approximately) Deterministic Maximal Redund Always Exist?

by johnswentworth, David Lorell
6th May 2025
3 min read
16 comments, sorted by top scoring
[-] quetzal_rainbow · 4mo

I don't know how obvious this is for people who want to try for the bounty, but I only now realized that you can express the criteria for a redund as inequalities with mutual information, and I find mutual information much nicer to work with, if only for convenience of notation. Proof:

Take the criterion for a redund $\Gamma$ w.r.t. $X_1, X_2$:

$$\epsilon \ge D_{KL}\big(P[X_1,X_2,\Gamma] \,\big\|\, P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]\big)$$

expand the expression for the KL divergence:

$$D_{KL}\big(P[X_1,X_2,\Gamma] \,\big\|\, P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]\big) = \sum_{x_1,x_2,\gamma} P(x_1,x_2,\gamma)\,\log\frac{P(x_1,x_2,\gamma)}{P(x_1)\,P(x_2|x_1)\,P(\gamma|x_2)}$$

expand the joint distribution:

$$D_{KL}\big(P[X_1,X_2,\Gamma] \,\big\|\, P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]\big) = \sum_{x_1,x_2,\gamma} P(x_1,x_2,\gamma)\,\log\frac{P(x_1)\,P(x_2|x_1)\,P(\gamma|x_1,x_2)}{P(x_1)\,P(x_2|x_1)\,P(\gamma|x_2)}$$

simplify:

$$D_{KL}\big(P[X_1,X_2,\Gamma] \,\big\|\, P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]\big) = \sum_{x_1,x_2,\gamma} P(x_1,x_2,\gamma)\,\log\frac{P(\gamma|x_1,x_2)}{P(\gamma|x_2)}$$

which is a conditional mutual information:

$$\sum_{x_1,x_2,\gamma} P(x_1,x_2,\gamma)\,\log\frac{P(\gamma|x_1,x_2)}{P(\gamma|x_2)} = I(X_1;\Gamma|X_2)$$

which results in:

$$D_{KL}\big(P[X_1,X_2,\Gamma] \,\big\|\, P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]\big) = I(X_1;\Gamma|X_2) \le \epsilon$$
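For anyone who wants to sanity-check this identity numerically, here's a quick sketch (Python/numpy; the alphabet sizes and the random joint distribution are arbitrary choices, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary strictly-positive joint distribution P(x1, x2, gamma).
P = rng.random((3, 4, 2))
P /= P.sum()

P_x2   = P.sum(axis=(0, 2))   # P(x2)
P_x1x2 = P.sum(axis=2)        # P(x1, x2)
P_x2g  = P.sum(axis=0)        # P(x2, gamma)

# Factorized distribution Q(x1, x2, gamma) = P(x1) P(x2|x1) P(gamma|x2),
# which equals P(x1, x2) * P(gamma|x2).
Q = P_x1x2[:, :, None] * (P_x2g / P_x2[:, None])[None, :, :]
dkl = np.sum(P * np.log(P / Q))

# Conditional mutual information I(X1; Gamma | X2), computed directly.
cmi = np.sum(P * np.log(P * P_x2[None, :, None]
                        / (P_x1x2[:, :, None] * P_x2g[None, :, :])))

assert np.isclose(dkl, cmi)   # the two quantities agree
```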

[-] johnswentworth · 4mo

Yup, and you can do that with all the diagrams in this particular post. In general, the $D_{KL}$ error for any of the diagrams $X \to Y \to Z$, $X \leftarrow Y \leftarrow Z$, or $X \leftarrow Y \to Z$ is $I(X;Z|Y)$, for any random variables $X, Y, Z$.

Notably, this includes as a special case the "approximately deterministic function" diagram $X \leftarrow Y \to X$; the error on that one is $I(X;X|Y)$, which is $H(X|Y)$.
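A quick numerical check of that special case (a sketch; the joint distribution of $X$ and $Y$ here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary joint distribution P(x, y).
P_xy = rng.random((3, 5))
P_xy /= P_xy.sum()
P_y = P_xy.sum(axis=0)

# H(X|Y), computed directly.
H = -np.sum(P_xy * np.log(P_xy / P_y[None, :]))

# I(X; X'|Y) where X' is a perfect copy of X: in the joint P(x, x', y),
# only diagonal terms x == x' are nonzero, each equal to P(x, y).
cmi = sum(p * np.log(p * P_y[y] / (p * p))
          for (x, y), p in np.ndenumerate(P_xy) if p > 0)

assert np.isclose(H, cmi)   # I(X;X|Y) == H(X|Y)
```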

[-] Gurkenglas · 4mo

Then how do you choose between the three directed representatives? Is there some connotation, some coming-apart of concepts that would become apparent after generalization, or did you just pick $X \leftarrow Y \to X$ because it's symmetric?

[-] johnswentworth · 4mo

Good question. The three directed graphical representations directly expand to different but equivalent expressions, and sometimes we want to think in terms of one of those expressions over another.

The most common place this comes up for us: sometimes we want to think about a latent $\Gamma$ in terms of its defining conditional distribution $P[\Gamma|X]$. When doing that, we try to write diagrams with $\Gamma$ downstream, e.g. $X_1 \to X_2 \to \Gamma$. Other times, we want to think of $\Gamma$ in terms of the distribution $P[X|\Gamma]$. When thinking that way, we try to write diagrams with $\Gamma$ upstream, e.g. $\Gamma \to X_1 \to X_2$.

In the case of $X \leftarrow Y \to X$, the natural information-theoretic expression for its error is $H(X|Y)$, or equivalently $I(X;X|Y)$. Those condition on $Y$, so we want $Y$ "upstream" in order to most directly express those quantities. You could derive the same errors from $X \leftarrow Y \leftarrow X$ or $X \to Y \to X$, but if you write out those $D_{KL}$ errors and try to reduce them to $H(X|Y)$, you'll probably find that it takes more steps.

[-] Gurkenglas · 4mo

If the expressions cease to be equivalent in some natural generalization of this setting, then I recommend that you try to find a proof there, because the proof space should be narrower and thus easier to search.

[-] David Lorell · 4mo

Note to bounty hunters, since it's come up twice: an "approximately deterministic function of $X$" is one where, conditional on $X$, the entropy of $F(X)$ is very small, though possibly nonzero. I.e., you have very nearly pinned down the value that $F(X)$ takes on once you learn $X$. For conceptual intuition on the graphical representation: $X$ approximately mediating between $F(X)$ and $F(X)$ (two random variables which always take on the same value as each other) means, as always, that there is approximately no further update on the value of either given the other (despite them being perfectly informative about each other) if one already knows the value of $X$. See this for more.
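To illustrate with a toy example (my own construction, not from the post): take an $F(X)$ that equals a fixed deterministic $f(X)$ except for a small chance of uniform noise, and check that $H(F(X)|X)$ is small but nonzero.

```python
import numpy as np

rng = np.random.default_rng(2)

# F(X) equals f(X) with probability 1 - delta, else a uniform draw:
# an "approximately deterministic" (stochastic) function of X.
n_x, n_f, delta = 8, 3, 0.01
f = rng.integers(n_f, size=n_x)               # underlying deterministic map
P_f_given_x = np.full((n_x, n_f), delta / n_f)
P_f_given_x[np.arange(n_x), f] += 1 - delta

# H(F(X) | X) under a uniform distribution on X.
P_x = np.full(n_x, 1 / n_x)
H_cond = -np.sum(P_x[:, None] * P_f_given_x * np.log(P_f_given_x))
print(H_cond)   # ~0.045 nats: F(X) is very nearly pinned down by X
```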

[-] johnswentworth · 4mo

(Note that F, in David's notation here, is a stochastic function in general.)

[-] Thane Ruthenis · 3mo

So, I ran into basically the same problem; it turns out this setup is a pretty straightforward generalization of Gács–Körner common information. I've spent a while trying to solve it, and I'm currently pessimistic that it's easily tractable. There are some outside-view reasons to think that, too: my understanding is that the field of information theory has been trying to crack the true name of multivariate redundant information for a while now, and all they have are special cases (Gács–Körner) and adhockery. This isn't exactly the same problem, but I expect it runs into the same convergent subproblems...

But maximal redunds are pretty central to the whole abstraction business, so getting at least some confirmation seems prudent. Have you tried running a dirty-but-practical empirical check on this? Basic idea:

  1. Set up a wide variety of sets of random variables $X = \{X_1, \ldots, X_n\}$.
  2. For each set, train a function $\Omega$ whose entropy is maximized under $\forall i: I(X_{-i};\Omega|X_i) \le \epsilon_2$ and $H(\Omega|X) \le \epsilon_{\mathrm{det}}$.
  3. Train sets of functions $R_j$, constrained to be approximately deterministic redunds, with the following objectives (either separate sets, or mix-and-match those objectives):
    1. Minimize $\sum_k I(R_j;R_k)$.
    2. Minimize $I(R_j;\Omega)$.
  4. Check whether $\Omega$ still manages to approximately screen off all $R_j$ from $X$.

It seems plausible that the distributions underlying real-world abstractions won't be more exotic than what we can contrive to set up in (1), and this should suffice to safeguard against the possibility that the intuitive idea of a maximal redund diverges from reality.
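A minimal sketch of the quantities such a check would need to evaluate, assuming discrete variables, $n = 2$, and a tabular conditional $P(\Omega|X)$ (the parametrization and names are mine; the actual optimization loop, e.g. gradient ascent on softmax logits, is omitted):

```python
import numpy as np

def h(P):
    """Shannon entropy (nats) of a probability array, with 0 log 0 = 0."""
    p = P[P > 0]
    return -np.sum(p * np.log(p))

def redund_scores(P_x, P_w_given_x):
    """Objective and constraints from step 2 above, for n = 2 variables.

    P_x         -- joint P(x1, x2), shape (n1, n2)
    P_w_given_x -- candidate P(omega | x1, x2), shape (n1, n2, nw)
    Returns H(Omega), [I(X2;Omega|X1), I(X1;Omega|X2)], and H(Omega|X).
    """
    P = P_x[:, :, None] * P_w_given_x          # joint P(x1, x2, omega)
    H_w_given_x = h(P) - h(P_x)                # H(Omega|X)
    I_redund = [
        (h(P.sum(axis=1)) - h(P_x.sum(axis=1))) - H_w_given_x,  # I(X2;W|X1)
        (h(P.sum(axis=0)) - h(P_x.sum(axis=0))) - H_w_given_x,  # I(X1;W|X2)
    ]
    return h(P.sum(axis=(0, 1))), I_redund, H_w_given_x

# Example: Omega = X1 exactly, over two independent fair coins. Omega is
# deterministic (H(Omega|X) = 0) but fails the redund constraint, since
# I(X1;Omega|X2) = ln 2.
P_x = np.full((2, 2), 0.25)
P_w_given_x = np.zeros((2, 2, 2))
P_w_given_x[np.arange(2), :, np.arange(2)] = 1.0
print(redund_scores(P_x, P_w_given_x))
```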

[-] Thane Ruthenis · 3mo

Actually, never mind: I think I see a counterexample to the exact formulation in the OP. I haven't fully formalized it, so there may be holes, but here it is:

Imagine a "toxic piece of shared information": some type of information such that it inextricably ties together some shared bits of information S with unique bits of information U, making it impossible to define a variable that contains only the shared bits from it. If all variables in a set X are "toxic" in this way, then any redund R, even ones that straight-up copy one of the Xj, would necessarily have I(R;X−i|Xi)≥U for all Xi except Xj.

Suppose we fix a specific $\epsilon_2$, i.e., state that only variables with fewer unique bits than that qualify as redunds.

Now consider a set of variables $A^n$, each variable $A^n_i$ of which is a set of $n$ independent clones of toxic-information-containing variables $\{X^1_i, \ldots, X^n_i\}$. Any variable which is a redund over only one set of clones is a valid redund over $A^n$, and so there are $n$ redunds like this, all mutually independent.

What would a maximal redund $\Omega^n$ look like? Well, to properly screen all other redunds off, it would need to include most of the information from each individual independent redund, and so its $I(\Omega^n; A^n_{-i} | A^n_i)$ would scale with $n$ (probably roughly equaling $nU$?). We can then set $n$ arbitrarily high, so that this quantity shoots well past any fixed $\epsilon_2$. And if we discard some $D$ bits from each redund to try and manage this, this would increasingly work against our ability to screen off redunds from $A^n_{-i}$, and we can again scale $n$ to make any $\Omega^n$ that tries to keep its $I(\Omega^n; A^n_{-i} | A^n_i)$ under control fail arbitrarily hard at the screen-off task for all-but-a-few redunds.

Concrete example: I think a binary symmetric channel should work here? I.e., set $X = \{X_1, X_2\}$, where $X_2 = \mathrm{XOR}(X_1, Z)$, $X_1$ is a fair coin, and $Z$ is a biased coin with bias $p$. We get $H(X_1) = H(X_2) = 1$, $I(X_1;X_2) = 1 + p\log_2 p + (1-p)\log_2(1-p) = K$, and $H(X_1|X_2) = H(X_2|X_1) = 1 - K$. Then $R = X_1$ has $I(X_1;X_2|R) = 0$, so it contains $K$ shared bits, and $I(R;X_1|X_2) = 1 - K$, so it qualifies as a redund if we set $\epsilon_2 \ge 1 - K$. Then, per the construction above, we design a set whose variables are sets of those basic variables' independent clones, and watch the maximal redund explode.
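A numerical sketch of this example (assuming $p = 0.1$; entropies in bits), checking the claimed quantities:

```python
import numpy as np

p = 0.1                                        # bias of the noise coin Z
Hb = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))   # binary entropy of Z

# X1 fair coin, X2 = XOR(X1, Z): joint P(x1, x2).
P = np.array([[(1 - p) / 2, p / 2],
              [p / 2, (1 - p) / 2]])
P1, P2 = P.sum(axis=1), P.sum(axis=0)

H1 = -(P1 * np.log2(P1)).sum()                 # H(X1) = 1
I = (P * np.log2(P / np.outer(P1, P2))).sum()  # I(X1;X2) = 1 - Hb = K
H_1g2 = H1 - I                                 # H(X1|X2) = Hb = 1 - K

# R = X1: I(X1;X2|R) = 0 since R pins down X1, while
# I(R;X1|X2) = H(X1|X2) = 1 - K, so R qualifies iff eps2 >= 1 - K.
print(I, H_1g2, 1 - Hb)
```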

The subtle caveat here is that I think we need to prove that, for any variable $\Omega$, decreasing the amount of unique information (i.e., driving $I(\Omega;X_1|X_2)$ down) trades off against shared information (i.e., $I(X_1;X_2|\Omega)$ ends up increasing), which leads to $I(X_1;R|\Omega)$ growing. Specifically, we need to prove that this happens at some nontrivial, non-asymptoting rate that leads $\Omega^n$'s $\epsilon_1$ in the cloned construction to scale with $n$, well past any small constant multiple of $\epsilon_2$.

The counter here would be a continuum of $\Omega$s which can drive $I(\Omega;X_1|X_2)$ arbitrarily low while the accompanying increase in $I(X_1;X_2|\Omega)$ flatlines. The full counter would prove that this is possible in the general case, not only in the BSC setting.

That it's impossible seems intuitively obvious (consider if that worked, what possibilities it'd unlock! the universe isn't that kind), but maybe annoying to prove and the result wouldn't be very exciting, so it's a low priority to me.[1] Still, throwing this out here in case you find this reasoning convincing.

(My guess is that "maximal redunds are possible for any sets of random variables" is too strong and we need to explicitly limit it to sets of variables that are approximately structured as sparse hierarchical graphs or something. This is probably okay, see my previous thoughts about the universe's physics selecting for that.)

  1. ^

    Here's o3's crack at it if you want a starting point. I've found it's sometimes helpful, but I'm currently fatigued from trying to paranoidally spot the place where it's trying to slip crap past me, so I've not verified it. Looks right at a glance, if you ignore some weird parts.

[-] Thane Ruthenis · 3mo

Some more spitballing regarding how to fix this:

One thing here is that, in the relative sense, $\Omega^n$ isn't any more "guilty" of caring about specific variables than any one $X^j_i$-copying $R$, given that it captures much more shared information about the $A^n_i$s than those $R$s do. So, to suit, maybe we should allow $\epsilon_3$ to scale with the amount of shared information in the max-redund? Which would be, uhh, $SH(\Omega) = H(\Omega) - H(\Omega|X) - \max_i I(\Omega;X_{-i}|X_i)$ or something.

Generalizing, maybe it should be "permitted" for redunds to contain more unique information the more shared information they have? Which would make the tuning parameter for "how pure of a redund $R$ is" not $\max_i I(R;X_{-i}|X_i)$ (aka your $\epsilon_2$), but $\frac{SH(R)}{H(R)}$, i.e., the fraction of the redund's total entropy[1] that is shared information. That makes intuitive sense to me.

  1. ^

Or its $X$-relevant entropy? Subtract $H(R|X)$ from the denominator too, then.

[-] XelaP · 4mo

A minor request: can you link "approximate deterministic functions" to some of your posts about them? I know you've written on them, and elaborated on them in more detail elsewhere.

A couple questions:

  • Can you elaborate more on why you care? Preferably, at a level of elaboration/succinctness that's above the little you've said in the post, and below "read the linked posts and my other posts". I feel that knowing why you care will help with understanding what the theorem really means. I am probably going to read a fair amount of those posts anyways.

Edit: It is clear now why you care; however, I still believe there may be people who would benefit from some elaboration in the post. For anyone wondering, the reason to care is in the first theorem of Natural Latents: The Math. For an easy special case of the ideas, see the extremely-easy-to-understand post on (approximately) deterministic natural latents, and note that by the "Dangly Bits Lemma" the redund condition is implied by the approximately-determined condition.

For the intuition behind what the redund condition means, note that by the rerooting rule the diagram X -> Y -> R is the same as (as in, equivalently constrains the probability distribution as) X <- Y -> R, which means that once you know Y, being told X doesn't help you predict R. The redund condition is that, plus the version with X, Y swapped. This makes it clear why it captures "R is redundantly encoded in X and Y".

  • It sounds like you strongly expect the theorem to be true. How hard do you expect it to be? Does it feel like the sort of thing that can be done using basically the tools you've used?

  • How hard have you tried?

  • Do you have examples of the theorem? That is, do you have some variables X with some approximate redund Omega and a bunch of redunds Gamma such that [statements] hold? Or, better, do you have some example where you can find a universal redund Omega?

There are some reasons I can think of for why you'd perhaps want to explicitly refuse to answer some of those questions: to prevent anchoring when a blind mind might do better, to not reveal info that might (if the info turns out one way, assuming truthtelling) discourage attempts (like saying it's super hard and you tried really hard), and to uphold a general policy of not revealing such info (e.g. perhaps you think this theorem is super easy and you haven't spent 5 minutes on it, but plan to put bounties on other theorems that are harder).

For example, part of my inclination to try this is my sense that, compared to most conjectures, it's

  • Incredibly low background - I've known the requisite probability/information theory for a long time at this point
  • A precise, single theorem is asked for. Not as precise as could be, given the freedom of "reasonable", but compared to stuff like "figure out a theory of how blank works", it's concrete and small.
  • Very little effort has been put into it. Of the conjectures I know enough to understand, no other has had only what looks like a single-digit number of people working on it, for not very long.
  • It intuitively seems like the sort of theorem that could be proven by some random person on the internet who has general mathematical ability ("maturity"/"proof ability", the mostly tacit knowledge) but not a bunch of theoretical knowledge, and perhaps not even very much raw-computing-power smarts (but probably high creativity, that spark of genius at the core of discovery and invention, which can partially trade off against a lower chance of success).
[-] johnswentworth · 4mo

Good questions. I apparently took too long to answer them, as it sounds like you've mostly found the answers yourself (well done!).

"Approximately deterministic functions" and their diagrammatic expression were indeed in Deterministic Natual Latents; I don't think we've talked about them much elsewhere. So you're not missing any other important source there.

On why we care, a recent central example which you might not have seen is our Iliad paper. The abstract reads:

Suppose two Bayesian agents each learn a generative model of the same environment. We will assume the two have converged on the predictive distribution (i.e. distribution over some observables in the environment), but may have different generative models containing different latent variables. Under what conditions can one agent guarantee that their latents can be faithfully expressed in terms of the other agent’s latents?

We give simple conditions under which such translation is guaranteed to be possible: the natural latent conditions. We also show that, absent further constraints, these are the most general conditions under which translatability is guaranteed.

(Bold added for emphasis.) That "faithfully expressed in terms of" part is key here. In the paper, we operationalize that as "Under what conditions can one agent guarantee that their latents are (approximately) a stochastic function of the other agent's latents?". But that requires dragging in stochastic functions, which are conceptually messier than plain old functions. As a result, there's a whole section in the paper called "Under what model?", and basically all the theorems require an annoying "independent latents" assumption (which basically says that all the interactions between the two agents' latents route through the environment).

We could cut all that conceptual mess and all those annoying assumptions out if we instead use (approximately) deterministic natural latents. Then, the core question would be operationalized as "Under what conditions can one agent guarantee that their latents are (approximately) a function of the other agent's latents?". Then it's just plain old functions, we don't need to worry conceptually about the "under what model?" question (because there's approximately only one way to put all the latents in one model), and the "independent latents" conditions can all be removed (because different deterministic functions of X are always independent given X). All the messiest and least-compelling parts of the paper could be dropped.

... but the cost would be generality. Stochastic functions are more general than functions, and much more relevant to Bayesians' world models. So narrowing from stochastic functions to functions would probably mean "assuming our way out of reality"... unless it turns out that (Approximately) Deterministic Natural Latents Are All You Need, which was the subject of our previous bounty post.

The conjecture in this bounty post is even stronger; it would imply that one, and therefore also allow us to simplify the paper a lot.

[-] Oliver Sourbut · 4mo

nit: I found your graphical quantifiers and implications a bit confusing to read. It's the middle condition that looks weird.

If I've understood, I think you want (given a set of $X_i$, there exists $\Omega$ with all of):

$$\forall i.\ [X_{\bar{i}} \to X_i \to \Omega]$$

$$\forall \Gamma.\ \big(\forall i.\ [X_{\bar{i}} \to X_i \to \Gamma]\big) \implies [X \to \Omega \to \Gamma]$$

$$X (\to \Omega) \to \Omega$$

(the third line is just my crude latex rendition of your third diagram)

Is that order of quantifiers and precedence correct?

In words, you want an Ω:

  1. which is a redund over the $X_i$
  2. where any $\Gamma$ which is a redund over the $X_i$ is screened from $X$ by $\Omega$
  3. which is approximately deterministic on $X$

An unverified intuition (I'm unlikely to work on this further): would the joint distribution of all candidate $\Gamma$s work as $\Omega$? It screens off the $X$ from any given $\Gamma$, right (90% confidence)? And it is itself a redund over the $X$, I think (again 90%-ish confidence)? I don't really grok what it means to be approximately deterministic, but it feels like this should be? Overall confidence 30%-ish, deflating for outside-view and definedness reasons.

[-] johnswentworth · 4mo

Yup, that's exactly the right idea, and indeed constructing Ω by just taking a tuple of all the redund Γ's (or a sufficient statistic for that tuple) is a natural starting point which works straightforwardly in the exact case. In the approximate case, the construction needs to be modified - for instance, regarding the third condition, that tuple of Γ's (or a sufficient statistic for it) will have much more entropy conditional on X than any individual Γ.

[-] quetzal_rainbow · 3mo

Probably (I didn't dig deep yet) relevant impossibility result.

[-] Thane Ruthenis · 3mo

That seems interesting! Though my understanding is that it's not directly applicable here. The maximal redund only needs to contain all redundant information that can be contained in any other redund; it doesn't attempt to capture all redundant information in total. So if there's some constraint on what types of redundancy can be captured by any such variable, that's fine.


[EDIT August 22 2025: This bounty is now $500, as the earlier bounty has been claimed. (By us.)]


A lot of our work involves "redunds".[1] A random variable $\Gamma$ is a(n exact) redund over two random variables $X_1, X_2$ exactly when both

$$X_1 \to X_2 \to \Gamma$$

$$X_2 \to X_1 \to \Gamma$$

Conceptually, these two diagrams say that $X_1$ gives exactly the same information about $\Gamma$ as all of $X$, and $X_2$ gives exactly the same information about $\Gamma$ as all of $X$; whatever information $X$ contains about $\Gamma$ is redundantly represented in $X_1$ and $X_2$. Unpacking the diagrammatic notation and simplifying a little, the diagrams say $P[\Gamma|X_1] = P[\Gamma|X_2] = P[\Gamma|X]$ for all $X$ such that $P[X] > 0$.

The exact redundancy conditions are too restrictive to be of much practical relevance, so we are more interested in approximate redunds. Approximate redunds are defined by approximate versions of the same two diagrams:

Unpacking the diagrammatic notation, these two diagrams say

$$\epsilon_{\mathrm{red}} \ge D_{KL}\big(P[X,\Gamma] \,\big\|\, P[X_1]\,P[X_2|X_1]\,P[\Gamma|X_2]\big)$$

$$\epsilon_{\mathrm{red}} \ge D_{KL}\big(P[X,\Gamma] \,\big\|\, P[X_2]\,P[X_1|X_2]\,P[\Gamma|X_1]\big)$$
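For concreteness, here's a sketch of how one would compute these two errors for finite alphabets (an illustration of the definition; the alphabet sizes and the random candidate $\Gamma$ are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary strictly-positive joint distribution P(x1, x2, gamma).
P = rng.random((4, 4, 3))
P /= P.sum()

P_x1x2 = P.sum(axis=2)                       # P(x1, x2)
P_x1g, P_x2g = P.sum(axis=1), P.sum(axis=0)  # P(x1, gamma), P(x2, gamma)
P_x1, P_x2 = P.sum(axis=(1, 2)), P.sum(axis=(0, 2))

# Q1 = P[X1] P[X2|X1] P[Gamma|X2];  Q2 = P[X2] P[X1|X2] P[Gamma|X1].
Q1 = P_x1x2[:, :, None] * (P_x2g / P_x2[:, None])[None, :, :]
Q2 = P_x1x2[:, :, None] * (P_x1g / P_x1[:, None])[:, None, :]

eps1 = np.sum(P * np.log(P / Q1))   # error of diagram X1 -> X2 -> Gamma
eps2 = np.sum(P * np.log(P / Q2))   # error of diagram X2 -> X1 -> Gamma
print(max(eps1, eps2))  # Gamma is an approximate redund iff this <= eps_red
```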

This bounty problem is about the existence of a(n approximate) maximal redund $\Omega$: a redund which contains (approximately) all the information about $X$ contained in any other (approximate) redund. Diagrammatically, a maximal redund $\Omega$ satisfies $X \to \Omega \to \Gamma$ for every (approximate) redund $\Gamma$.

Finally, we'd like our maximal redund $\Omega$ to be an approximately deterministic function of $X$, i.e. $\epsilon_{\mathrm{det}} \ge H(\Omega|X)$, which is equivalent to the diagram $\Omega \leftarrow X \to \Omega$.

What We Want For The Bounty

This bounty pays out $500 for establishing that, over any two random variables $X_1, X_2$, there always exists an $\Omega$ satisfying the requirements above, i.e.

... with reasonable error bounds.

What counts as reasonable error bounds? We're expecting all the other $\epsilon$'s to scale with $\epsilon_2$ in the diagrams above, and ideally to be small constant multiples of $\epsilon_2$. In other words: the higher the approximation error on allowable redunds $\Gamma$, the more approximation error we allow for all the conditions on $\Omega$. Reasonableness of the error bounds is partially a judgement call on our part, but small constant (e.g. 2 or 3) multiples of $\epsilon_2$ would definitely suffice.

Bounds should be global, i.e. apply even when $\epsilon_2$ is large. We're not interested in e.g. first- or second-order approximations for small $\epsilon_2$ unless they provably apply globally.

If the error bounds are sufficiently reasonable, a solution to this problem will also solve our other open bounty problem, paying out another $500, for a total of $1000.

If people post major intermediate steps which are load-bearing for an eventual solution, we might allocate the bounty based on our judgement of different people's relative contributions; the hope is to incentivize sharing intermediate results.

We might also partially pay out for counterexamples. That's a judgement call depending on how thoroughly the counterexample kills hope of any theorem vaguely along the lines intended.

In terms of rigor and allowable assumptions, roughly the level of rigor and assumptions in the posts linked above is fine.

Some Intuition From The Exact Case

A proof in the exact case is straightforward; we'll walk through it here to build some intuition.

The exact redundancy conditions say

$$P[\Gamma|X] = P[\Gamma|X_1] = P[\Gamma|X_2]$$

wherever $P[X] > 0$. So, we can form a bipartite graph, where each vertex represents a value $x_1$ of $X_1$ or a value $x_2$ of $X_2$, and two vertices are connected iff $P[X_1 = x_1, X_2 = x_2] > 0$. Then the definition of redundancy says that, for ANY redund $\Gamma$, $P[\Gamma|X_1]$ and $P[\Gamma|X_2]$ are the same for ANY values of $X_1$ and $X_2$ (respectively) in the same connected component.

So, let $f(X)$ be defined as the connected component in which $X$ falls. For any redund $\Gamma$, $P[\Gamma|X]$ must be a function of $f(X)$ (since two $X$-values in the same connected component have the same $P[\Gamma|X]$). Thus

$$X \to f(X) \to \Gamma$$

$f(X)$ is therefore a(n exactly) deterministic function of $X$, and (exactly) mediates between $X$ and any redund $\Gamma$. Thus, $f(X)$ satisfies our requirements in the exact case (i.e. when $\epsilon_2 = 0$).
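To make the construction concrete, here's a small sketch (the support matrix is an arbitrary example with two blocks) that computes $f(X)$ as the connected component of the bipartite support graph:

```python
import numpy as np

# Support of P(x1, x2): True where P[x1, x2] > 0. The block structure
# here yields two connected components, so f(X) takes two values.
support = np.array([[1, 1, 0, 0],
                    [1, 0, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 0, 1]], dtype=bool)
n1, n2 = support.shape

# Union-find over the n1 + n2 bipartite vertices (x1-values, then x2-values).
parent = list(range(n1 + n2))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]   # path halving
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

for x1 in range(n1):
    for x2 in range(n2):
        if support[x1, x2]:
            union(x1, n1 + x2)

# f(X): the connected component of (x1, x2). Well-defined on the support,
# since x1 and x2 are joined by an edge whenever P(x1, x2) > 0.
f = {(x1, x2): find(x1)
     for x1 in range(n1) for x2 in range(n2) if support[x1, x2]}
print(f)
```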

Alas, this proof does not easily generalize to the approximate case.

Why We Want This

This is a conjecture which we've had on the back burner for a while, and it would be relevant in many places. It would solve our other open bounty problem, it would likely be our starting point for a new version of the Telephone Theorem which more explicitly connects to the natural latents framework, and it would directly simplify the problem of computing natural latents numerically. Probably there are other use-cases which we're not thinking about right at the moment.

 

  1. ^

    A lot of our work involves "redunds".
