Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post is not-very-distilled and doesn’t contain much background; it’s intended for people who already have the context of at least these four posts. I’m putting it up mainly as a reference for people who might want to work directly on the math of natural abstractions, and as a technical reference post.

There’s various hints that, in most real-world cases, the distribution of low-level state given high-level natural abstractions should take the form of a maximum entropy distribution, in which:

  • The “features” are sums over local terms, and
  • The high-level variables are (isomorphic to) the Lagrange multipliers

More formally: we have a low-level causal model (aka Bayes net) . Given the high-level variables , the distribution of low-level variable values should look like

… i.e. the maximum-entropy distribution subject to constraints of the form . (Note: , and  are all vector-valued.)

This is the sort of form we see in statistical mechanics. It’s also the form which the generalized Koopman-Pitman-Darmois (gKPD) theorem seems to hint at.

I don’t yet have a fully-satisfying general argument that this is the main form which abstractions should take, but I have two partial arguments. This post will go over both of them.

Maxent Telephone Argument

Two different nested layers of Markov blankets on the same underlying causal DAG

Quick recap of the Telephone Theorem: information about some variable  passes through a nested sequence of Markov blankets . Information about  can only be lost as it propagates. In the limit, all information is either perfectly conserved or completely lost. Mathematically, in the limit  for some  such that  with probability approaching 1 as  is the perfectly-conserved-in-the-limit information carrier.

In this setup, we can also argue that the limiting distribution  should have a maxent form. (Note: this is a hand-wavy argument, not a proper proof.)

Think about how the distribution  transforms as we increment  by 1. We have

First key property of this transformation: it’s a convex combination for each  value, i.e. it’s mixing. Mixing, in general, cannot decrease the entropy of a distribution, only increase it or leave it the same. So, the entropy of  will not decrease with .

When will the entropy stay the same? Well, our transformation may perfectly conserve some quantities. Since the transformation is linear, those quantities should have the form  for some , i.e. they’re expected values. They’re conserved when  with probability 1.

Intuitively, we’d expect the entropy of everything except the conserved quantities to strictly increase. So, we’d expect the distribution  to approach maximum entropy subject to constraints of the form , where  with probability 1 (at least in the limit of large ). Thus, we have the maxent form

(Note on the  in there: I’m actually maximizing relative entropy, relative to the prior on , which is almost always what one should actually do when maximizing entropy. That results in a  term. We should find that  is a conserved quantity anyway, so it shouldn’t actually matter whether we include the  multiplier or not; we’ll get the same answer either way.)

Shortcomings of This Argument

Obviously it’s a bit handwavy. Other than that, the main issue is that the Telephone Theorem doesn’t really leverage the spatial distribution of information; information only propagates along a single dimension. As a result, there’s not really a way to talk about the conserved ’s being a sum over local terms, i.e. .

Despite the handwaviness, it’s an easy result to verify computationally for small systems, and I have checked that it works.

Resampling + gKPD Argument

Another approach is to start from the redundancy + resampling formulation of abstractions. In this approach, we run an MCMC process on our causal model. Any information which is highly redundant in the system - i.e. the natural abstractions - is near-perfectly conserved under resampling a single variable at a time; other information is all wiped out. Call the initial (low-level) state of the MCMC process , and the final state . Then we have

… where  is conserved by the resampling process with probability 1.

It turns out that  factors over the same DAG as the underlying causal model:

If the conserved quantities  are much lower-dimensional than  itself, then we can apply the gKPD theorem: we have a factorization of , we have a low-dimensional summary statistic  which summarizes all the info in  relevant to , so the gKPD theorem says that the distribution must have the form

… where  is a relatively-small set of “exceptional” indices, and  is some fixed reference value of . This is slightly different from our intended form - there’s the exception terms, and we have  rather than just . The latter problem is easily fixed by absorbing  into  (at the cost of possibly increasing the summary dimension by 1), so that’s not really an issue, but the exception terms are annoying. Absorbing and assuming (for convenience) no exception terms, we get the desired form:

Note that this is maxentropic subject to constraints of the form . Since the summary statistic  is conserved by the resampling process, we must have , so the conservation equation is

Shortcomings of This Argument

Obviously there’s the exception terms. Other than that, the main issue with this argument is an issue with the resampling approach more generally: once we allow approximation, it’s not clear that the natural abstractions from the resampling formulation are the same natural abstractions which make the Telephone Theorem work. Both are independently useful: information dropping to zero at a distance is an easy property to leverage for planning/inference, and knowing the quantities conserved by MCMC makes MCMC-based planning and inference much more scalable. And in the limit of perfect conservation and infinite “distance”, the two match. But it’s not clear whether they match under realistic approximations, and I don’t yet have efficient methods to compute the natural abstractions both ways in large systems in order to check.

That said, resampling + gKPD does give us basically the result we want, at least for redundancy/resampling-based natural abstractions.

New to LessWrong?

New Comment
2 comments, sorted by Click to highlight new comments since: Today at 5:37 PM

I followed approximately the technical discussion, and now I'm wondering what that would buy us if you are correct.

  • Max entropy distributions seem nicely behaved and well-studied, so maybe we get some computations, properties, derivation for free? (Basically applying a productive frame to the problem of abstraction)
  • It would reduce computing the influence of the summary statistics on the model to computing the constraints, as I'm guessing that this is the hard part in computing the max entropy distribution (?)

Are these correct, and what am I missing?

That's basically correct; the main immediate gain is that it makes it much easier to compute abstractions and compute using abstractions.

One additional piece is that it hints towards a probably-more-fundamental derivation of the theorems in which maximum entropy plays a more central role. The maximum entropy Telephone Theorem already does that, but the resampling + gKPD approach routes awkwardly through gKPD instead; there's probably a nice way to do it directly via constrained maximization of entropy. That, in turn, would probably yield stronger and simpler theorems.