Maxent and Abstractions: Current Best Arguments

johnswentworth

This post is not-very-distilled and doesn’t contain much background; it’s intended for people who already have the context of at least these four posts. I’m putting it up mainly as a reference for people who might want to work directly on the math of natural abstractions, and as a technical reference post.

There are various hints that, in most real-world cases, the distribution of low-level state given high-level natural abstractions should take the form of a maximum entropy distribution, in which:

The “features” are sums over local terms, and
The high-level variables are (isomorphic to) the Lagrange multipliers

More formally: we have a low-level causal model (aka Bayes net) $P [X^{L}] = \prod_{i} P [X_{i}^{L} | X_{p a (i)}^{L}]$. Given the high-level variables $X^{H}$, the distribution of low-level variable values should look like

$P [X^{L} | X^{H}] = \frac{1}{Z} P [X^{L}] e^{λ^{T} (X^{H}) \sum_{i} f_{i} (X_{i}^{L}, X_{p a (i)}^{L})}$

… i.e. the maximum-entropy distribution subject to constraints of the form $E [\sum_{i} f_{i} (X_{i}^{L}, X_{p a (i)}^{L}) | X^{H}] = μ (X^{H})$ . (Note: $λ$ , $f_{i}$ , and $μ$ are all vector-valued.)
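For reference, the exponential form follows from a standard Lagrange-multiplier calculation. The sketch below fixes one value of $X^{H}$, writes $q$ for the candidate distribution over $X^{L}$, and abbreviates the feature vector as $F(x) = \sum_{i} f_{i}(x_{i}, x_{pa(i)})$. The symbols $q$, $F$, $\nu$, and $\mathcal{L}$ are notation introduced here for illustration, not taken from the post.

```latex
% Maximize entropy relative to the prior P, for one fixed value of X^H.
\begin{aligned}
\text{maximize over } q:\quad & -\sum_{x} q(x) \ln \frac{q(x)}{P[x]} \\
\text{subject to}:\quad & \sum_{x} q(x)\, F(x) = \mu, \qquad \sum_{x} q(x) = 1 \\[4pt]
\mathcal{L}(q, \lambda, \nu) &= -\sum_{x} q(x) \ln \frac{q(x)}{P[x]}
  + \lambda^{T}\Big(\sum_{x} q(x) F(x) - \mu\Big)
  + \nu\Big(\sum_{x} q(x) - 1\Big) \\[4pt]
\frac{\partial \mathcal{L}}{\partial q(x)} = 0
  \;&\Rightarrow\; -\ln \frac{q(x)}{P[x]} - 1 + \lambda^{T} F(x) + \nu = 0 \\
  &\Rightarrow\; q(x) = \frac{1}{Z} P[x]\, e^{\lambda^{T} F(x)},
  \qquad Z = \sum_{x} P[x]\, e^{\lambda^{T} F(x)}
\end{aligned}
```

The multipliers are then fixed by the constraint, via $\partial \ln Z / \partial \lambda = \mu$. Since $\mu$ depends on $X^{H}$, so does $\lambda$, which is how the multipliers pick up their dependence on the high-level variables.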

This is the sort of form we see in statistical mechanics. It’s also the form which the generalized Koopman-Pitman-Darmois (gKPD) theorem seems to hint at.

I don’t yet have a fully-satisfying general argument that this is the main form which abstractions should take, but I have two partial arguments. This post will go over both of them.

Maxent Telephone Argument

[Figure: two different nested layers of Markov blankets on the same underlying causal DAG]

Quick recap of the Telephone Theorem: information about some variable $X$ passes through a nested sequence of Markov blankets $M_{1}, M_{2}, \dots$ . Information about $X$ can only be lost as it propagates. In the limit, all information is either perfectly conserved or completely lost. Mathematically, in the limit $P [X | M_{n}] = P [X | F_{n} (M_{n})]$ for some $F$ such that $F_{n} (M_{n}) = F_{n + 1} (M_{n + 1})$ with probability approaching 1 as $n \to \infty$ ; $F$ is the perfectly-conserved-in-the-limit information carrier.

In this setup, we can also argue that the limiting distribution $\lim_{n \to \infty} P [X | M_{n}]$ should have a maxent form. (Note: this is a hand-wavy argument, not a proper proof.)

Think about how the distribution $(x \mapsto P [X = x | M_{n}])$ transforms as we increment $n$ by 1. We have

$P [X | M_{n + 1}] = \sum_{M_{n}} P [X | M_{n}] P [M_{n} | M_{n + 1}]$

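First key property of this transformation: it’s a convex combination for each $M_{n + 1}$ value, i.e. it’s mixing. Mixing, in general, cannot decrease the entropy of a distribution, only increase it or leave it the same: entropy is concave, so the entropy of a mixture is at least the weighted average of the entropies of its components. So, the entropy of $P [X | M_{n}]$, averaged over blanket values, will not decrease with $n$.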
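The mixing claim is easy to check numerically. The following is a minimal sketch with made-up sizes and randomly drawn distributions (all values are hypothetical): it builds one mixing step for a single value of $M_{n+1}$ and compares the entropy of the mixture with the weighted average entropy of the components.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

rng = np.random.default_rng(0)

# Hypothetical toy sizes: X takes 4 values, M_n takes 3 values.
n_x, n_m = 4, 3

# Rows are P[X | M_n = m]; each row is a distribution over X.
p_x_given_m = rng.dirichlet(np.ones(n_x), size=n_m)

# Mixing weights P[M_n = m | M_{n+1} = m'] for one fixed value m'.
weights = rng.dirichlet(np.ones(n_m))

# P[X | M_{n+1} = m'] = sum_m P[X | M_n = m] P[M_n = m | M_{n+1} = m']
mixed = weights @ p_x_given_m

avg_component_entropy = float(
    sum(w * entropy(p) for w, p in zip(weights, p_x_given_m))
)
mixed_entropy = entropy(mixed)

print(f"average entropy before mixing: {avg_component_entropy:.4f}")
print(f"entropy after mixing:          {mixed_entropy:.4f}")
```

The inequality is between the mixture and the average of its components; an individual component can have higher entropy than the mixture.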

When will the entropy stay the same? Well, our transformation may perfectly conserve some quantities. Since the transformation is linear, those quantities should have the form $\sum_{X} f (X) P [X | M_{n}]$ for some $f$ , i.e. they’re expected values. They’re conserved when $E [f (X) | M_{n}] = E [f (X) | M_{n + 1}]$ with probability 1.

Intuitively, we’d expect the entropy of everything except the conserved quantities to strictly increase. So, we’d expect the distribution $P [X | M_{n}]$ to approach maximum entropy subject to constraints of the form $E [f (X) | M_{n}] = μ (M_{n})$ , where $E [f (X) | M_{n}] = E [f (X) | M_{n + 1}]$ with probability 1 (at least in the limit of large $n$ ). Thus, we have the maxent form

$P [X | M_{n}] = \frac{1}{Z} P [X] e^{λ^{T} (M_{n}) f (X)}$

(Note on the $P [X]$ in there: I’m actually maximizing relative entropy, relative to the prior on $X$ , which is almost always what one should actually do when maximizing entropy. That results in a $P [X]$ term. We should find that $E [\ln P [X] | M_{n}]$ is a conserved quantity anyway, so it shouldn’t actually matter whether we include the $P [X]$ multiplier or not; we’ll get the same answer either way.)
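To make the relative-entropy version concrete, here is a small sketch for a single scalar feature on a four-valued $X$. The prior, feature values, and target mean are invented for illustration, and bisection is just one convenient way to solve for the multiplier. It finds $λ$ so that the tilted distribution $\frac{1}{Z} P [X] e^{λ f (X)}$ matches a target expectation, then checks that a nearby distribution satisfying the same constraint is further from the prior in KL divergence.

```python
import numpy as np

def tilted(prior, f, lam):
    """Return (1/Z) * prior(x) * exp(lam * f(x)) over a finite set of x values."""
    logits = np.log(prior) + lam * f
    logits -= logits.max()  # stabilise the exponentials
    w = np.exp(logits)
    return w / w.sum()

def solve_lambda(prior, f, mu, lo=-50.0, hi=50.0, iters=200):
    """Bisect for lam with E_lam[f] = mu; E_lam[f] is nondecreasing in lam."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tilted(prior, f, mid) @ f < mu:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def kl(a, b):
    """KL divergence D(a || b) in nats, for strictly positive a and b."""
    return float(np.sum(a * np.log(a / b)))

# Hypothetical prior, feature, and constraint value.
prior = np.array([0.1, 0.2, 0.3, 0.4])
f = np.array([0.0, 1.0, 2.0, 3.0])
mu = 1.2  # the prior mean of f is 2.0, so lam should come out negative

lam = solve_lambda(prior, f, mu)
p = tilted(prior, f, lam)

# Perturb along a direction that preserves both sum(q) = 1 and E_q[f] = mu.
d = np.array([1.0, -2.0, 1.0, 0.0])
q = p + 0.01 * d

print("lambda:", lam)
print("KL(p || prior):", kl(p, prior))
print("KL(q || prior):", kl(q, prior))
```

Minimizing KL divergence to the prior is the same as maximizing relative entropy, so the tilted distribution `p` should have the smaller KL of the two.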

Shortcomings of This Argument

Obviously it’s a bit handwavy. Other than that, the main issue is that the Telephone Theorem doesn’t really leverage the spatial distribution of information; information only propagates along a single dimension. As a result, there’s not really a way to talk about the conserved $f$ ’s being a sum over local terms, i.e. $f (X) = \sum_{i} f_{i} (X_{i}, X_{p a (i)})$ .

Despite the handwaviness, it’s an easy result to verify computationally for small systems, and I have checked that it works.
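As an illustration of what such a check can look like (this is not the check referred to above, just a sketch under strong simplifying assumptions), take a Markov chain $X = M_{0} \to M_{1} \to M_{2} \to \dots$ whose transition matrix has two closed blocks. The block label is then perfectly conserved, and the limiting $P [X | M_{n}]$ should be the prior restricted to the block of $M_{n}$. That is the $λ \to \infty$ boundary case of the exponential form, with $f$ the block indicator. The transition matrix and prior below are made up.

```python
import numpy as np

# Hypothetical chain: states {0, 1} form one closed block, {2, 3, 4} another.
T = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.3, 0.2],
    [0.0, 0.0, 0.2, 0.7, 0.1],
    [0.0, 0.0, 0.3, 0.3, 0.4],
])
block = np.array([0, 0, 1, 1, 1])
prior = np.array([0.1, 0.3, 0.2, 0.15, 0.25])  # P[X], with X = M_0

n = 200
Tn = np.linalg.matrix_power(T, n)       # Tn[x, m] = P[M_n = m | X = x]
joint = prior[:, None] * Tn             # joint[x, m] = P[X = x, M_n = m]
posterior = joint / joint.sum(axis=0)   # column m is P[X | M_n = m]

# Prediction: P[X | M_n = m] = P[X] I[block(X) = block(m)] / Z
same_block = (block[:, None] == block[None, :]).astype(float)
predicted = prior[:, None] * same_block
predicted /= predicted.sum(axis=0)

gap = float(np.abs(posterior - predicted).max())
print("max |posterior - predicted|:", gap)
```

Perfect conservation is the easy case; the more interesting checks involve approximately conserved quantities, which this sketch does not cover.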

Resampling + gKPD Argument

Another approach is to start from the redundancy + resampling formulation of abstractions. In this approach, we run an MCMC process on our causal model. Any information which is highly redundant in the system - i.e. the natural abstractions - is near-perfectly conserved under resampling a single variable at a time; other information is all wiped out. Call the initial (low-level) state of the MCMC process $X^{0}$ , and the final state $X$ . Then we have

$P [X | X^{0}] = P [X | F (X^{0})] = P [X | F (X)] P [F (X) | F (X^{0})] = \frac{1}{Z} P [X] I [F (X) = F (X^{0})]$

… where $F$ is conserved by the resampling process with probability 1.
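A toy instance may help make the formula concrete. The sketch below uses a hypothetical three-variable binary chain $X_{1} \to X_{2} \to X_{3}$ in which $X_{2}$ is an exact copy of $X_{1}$ (perfect redundancy) and $X_{3}$ is a noisy copy of $X_{2}$. Instead of sampling, it builds exact transition matrices for single-variable resampling and compares the long-run $P [X | X^{0}]$ with $\frac{1}{Z} P [X] I [F (X) = F (X^{0})]$, taking $F (X) = X_{1}$. The model, its probabilities, and the choice of $F$ are assumptions made for this example.

```python
import itertools
import numpy as np

states = list(itertools.product([0, 1], repeat=3))  # (x1, x2, x3)
index = {s: k for k, s in enumerate(states)}

def joint(x1, x2, x3):
    """Hypothetical model: X2 copies X1 exactly, X3 copies X2 with prob 0.8."""
    p1 = 0.3 if x1 == 1 else 0.7
    p2 = 1.0 if x2 == x1 else 0.0
    p3 = 0.8 if x3 == x2 else 0.2
    return p1 * p2 * p3

P = np.array([joint(*s) for s in states])

def gibbs_kernel(i):
    """Transition matrix for resampling variable i from P[X_i | X_{-i}]."""
    K = np.zeros((8, 8))
    for s in states:
        options = [s[:i] + (v,) + s[i + 1:] for v in (0, 1)]
        mass = np.array([P[index[o]] for o in options])
        if mass.sum() == 0:
            # Conditional is undefined off the support: leave the state as is.
            K[index[s], index[s]] = 1.0
        else:
            for o, m in zip(options, mass / mass.sum()):
                K[index[s], index[o]] += m
    return K

# One sweep resamples X1, then X2, then X3 (row-vector convention).
sweep = gibbs_kernel(0) @ gibbs_kernel(1) @ gibbs_kernel(2)
many_sweeps = np.linalg.matrix_power(sweep, 50)

def predicted(s0):
    """(1/Z) P[X] I[F(X) = F(X^0)] with F(X) = X_1."""
    mask = np.array([s[0] == s0[0] for s in states], dtype=float)
    w = P * mask
    return w / w.sum()

supported = [s for s in states if P[index[s]] > 0]
max_gap = float(max(
    np.abs(many_sweeps[index[s0]] - predicted(s0)).max() for s0 in supported
))
print("max gap between resampled and predicted distributions:", max_gap)
```

Here the redundant value shared by $X_{1}$ and $X_{2}$ survives resampling, while the noise in $X_{3}$ is wiped out.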

It turns out that $P [X | X^{0}]$ factors over the same DAG as the underlying causal model:

$P [X | X^{0}] = \prod_{i} P [X_{i} | X_{p a (i)}, X^{0}]$

If the conserved quantities $F (X)$ are much lower-dimensional than $X$ itself, then we can apply the gKPD theorem: we have a factorization of $P [X | X^{0}]$ , we have a low-dimensional summary statistic $F (X)$ which summarizes all the info in $X$ relevant to $X^{0}$ , so the gKPD theorem says that the distribution must have the form

$P [X | X^{0}] = \frac{1}{Z} e^{λ^{T} (X^{0}) \sum_{i \notin E} f_{i} (X_{i}, X_{p a (i)})} \prod_{i \notin E} P [X_{i} | X_{p a (i)}, X^{0} = (X^{0})^{*}] \prod_{i \in E} P [X_{i} | X_{p a (i)}, X^{0} = X^{0}]$

… where $E$ is a relatively-small set of “exceptional” indices, and $(X^{0})^{*}$ is some fixed reference value of $X^{0}$ . This is slightly different from our intended form: there are the exception terms, and we have $\prod_{i \notin E} P [X_{i} | X_{p a (i)}, X^{0} = (X^{0})^{*}]$ rather than just $\prod_{i \notin E} P [X_{i} | X_{p a (i)}]$ . The latter problem is easily fixed by absorbing the log of $\prod_{i \notin E} \frac{P [X_{i} | X_{p a (i)}, X^{0} = (X^{0})^{*}]}{P [X_{i} | X_{p a (i)}]}$ into $f$ (at the cost of possibly increasing the summary dimension by 1), so that’s not really an issue, but the exception terms are annoying. Absorbing and assuming (for convenience) no exception terms, we get the desired form:

$P [X | X^{0}] = \frac{1}{Z} e^{λ^{T} (X^{0}) \sum_{i} f_{i} (X_{i}, X_{p a (i)})} P [X]$
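The absorption step can be written out explicitly. In the sketch below, $g_{i}$, $\tilde{f}_{i}$, and $\tilde{\lambda}$ are names introduced here for illustration, and the rewrite assumes $P [X_{i} | X_{p a (i)}] > 0$ wherever the reference-value conditional is positive, so that the log ratio is well defined.

```latex
\begin{aligned}
\prod_{i \notin E} P[X_i \mid X_{pa(i)}, X^0 = (X^0)^*]
  &= \prod_{i \notin E} P[X_i \mid X_{pa(i)}] \cdot
     \exp\Big( \sum_{i \notin E} g_i(X_i, X_{pa(i)}) \Big), \\
g_i(X_i, X_{pa(i)}) &:= \ln \frac{P[X_i \mid X_{pa(i)}, X^0 = (X^0)^*]}{P[X_i \mid X_{pa(i)}]} \\[4pt]
\tilde{f}_i := (f_i,\; g_i), \quad \tilde{\lambda}(X^0) := (\lambda(X^0),\; 1)
  \;&\Rightarrow\;
  \tilde{\lambda}^{T}(X^0) \sum_{i} \tilde{f}_i
  = \lambda^{T}(X^0) \sum_{i} f_i + \sum_{i} g_i
\end{aligned}
```

The extra component $g_{i}$ always carries multiplier 1, which is where the possible increase of the summary dimension by 1 comes from. With no exception terms, $\prod_{i} P [X_{i} | X_{p a (i)}] = P [X]$, giving the $P [X]$ factor in the final form.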

Note that this is maxentropic subject to constraints of the form $E [\sum_{i} f_{i} (X_{i}, X_{p a (i)}) | X^{0}] = μ (X^{0})$ . Since the summary statistic $F (X) = \sum_{i} f_{i} (X_{i}, X_{p a (i)})$ is conserved by the resampling process, we must have $μ (X^{0}) = \sum_{i} f_{i} (X_{i}^{0}, X_{p a (i)}^{0})$ , so the conservation equation is

$E [\sum_{i} f_{i} (X_{i}, X_{p a (i)}) | X^{0}] = \sum_{i} f_{i} (X_{i}^{0}, X_{p a (i)}^{0})$

Shortcomings of This Argument

Obviously there are the exception terms. Other than that, the main issue with this argument is an issue with the resampling approach more generally: once we allow approximation, it’s not clear that the natural abstractions from the resampling formulation are the same natural abstractions which make the Telephone Theorem work. Both are independently useful: information dropping to zero at a distance is an easy property to leverage for planning/inference, and knowing the quantities conserved by MCMC makes MCMC-based planning and inference much more scalable. And in the limit of perfect conservation and infinite “distance”, the two match. But it’s not clear whether they match under realistic approximations, and I don’t yet have efficient methods to compute the natural abstractions both ways in large systems in order to check.

That said, resampling + gKPD does give us basically the result we want, at least for redundancy/resampling-based natural abstractions.

Comment by adamShimi:

I approximately followed the technical discussion, and now I'm wondering what that would buy us if you are correct.

Max entropy distributions seem nicely behaved and well-studied, so maybe we get some computations, properties, and derivations for free? (Basically applying a productive frame to the problem of abstraction)
It would reduce computing the influence of the summary statistics on the model to computing the constraints, as I'm guessing that this is the hard part in computing the max entropy distribution (?)

Are these correct, and what am I missing?

Reply by johnswentworth:

That's basically correct; the main immediate gain is that it makes it much easier to compute abstractions and compute using abstractions.

One additional piece is that it hints towards a probably-more-fundamental derivation of the theorems in which maximum entropy plays a more central role. The maximum entropy Telephone Theorem already does that, but the resampling + gKPD approach routes awkwardly through gKPD instead; there's probably a nice way to do it directly via constrained maximization of entropy. That, in turn, would probably yield stronger and simpler theorems.
