The Lightcone Theorem: A Better Foundation For Natural Abstraction?

[-]Rohin Shah2yΩ132316

The Lightcone Theorem says: conditional on $X^{0}$, any sets of variables in $X$ which are a distance of at least $2 T$ apart in the graphical model are independent.

I am confused. This sounds to me like:

If you have sets of variables that start with no mutual information (conditioning on $X^{0}$ ), and they are so far away that nothing other than $X^{0}$ could have affected both of them (distance of at least $2 T$ ), then they continue to have no mutual information (independent).

Some things that I am confused about as a result:

I don't see why you are surprised, or why you would have said it wouldn't work for finite T. (It seems obviously true to me from the statement, which makes me think I'm missing some subtlety.)
I don't understand why the distribution of $X^{0}$ must be the same as the distribution of $X$ . It seems like it should hold for arbitrary $X^{0}$ .
I don't see why this is relevant for natural abstractions. To me, the interesting part about abstractions is that it is generally fine to keep track of a small amount of information, even though there is tons and tons of information that "could have" been relevant (and does affect outcomes but in a way that is "noise" rather than "signal"). But this theorem is only telling you that you can throw away information that could never possibly have been relevant.

[-]johnswentworth2yΩ792

If you have sets of variables that start with no mutual information (conditioning on $X^{0}$), and they are so far away that nothing other than $X^{0}$ could have affected both of them (distance of at least $2 T$ ), then they continue to have no mutual information (independent).

Yup, that's basically it. And I agree that it's pretty obvious once you see it - the key is to notice that distance $2 T$ implies that nothing other than $X^{0}$ could have affected both of them. But man, when I didn't know that was what I should look for? Much less obvious.

I don't understand why the distribution of $X^{0}$ must be the same as the distribution of $X$ . It seems like it should hold for arbitrary $X^{0}$ .

It does, but then $X^{T}$ doesn't have the same distribution as the original graphical model (unless we're running the sampler long enough to equilibrate). So we can't view $X^{0}$ as a latent generating that distribution.

But this theorem is only telling you that you can throw away information that could never possibly have been relevant.

Not quite - note that the resampler itself throws away a ton of information about $X^{0}$ while going from $X^{0}$ to $X^{T}$ . And that is indeed information which "could have" been relevant, but almost always gets wiped out by noise. That's the information we're looking to throw away, for abstraction purposes.

So the reason this is interesting (for the thing you're pointing to) is not that it lets us ignore information from far-away parts of $X^{T}$ which could not possibly have been relevant given $X^{0}$ , but rather that we want to further throw away information from $X^{0}$ itself (while still maintaining conditional independence at a distance).

[-]Thane Ruthenis2yΩ8149

Yup, that's basically it. And I agree that it's pretty obvious once you see it - the key is to notice that distance $2 T$ implies that nothing other than $X^{0}$ could have affected both of them. But man, when I didn't know that was what I should look for? Much less obvious.

... I feel compelled to note that I'd pointed out a very similar thing a while ago.

Granted, that's not exactly the same formulation, and the devil's in the details.

[-]Rohin Shah2y*Ω560

Okay, that mostly makes sense.

note that the resampler itself throws away a ton of information about $X^{0}$ while going from $X^{0}$ to $X^{T}$ . And that is indeed information which "could have" been relevant, but almost always gets wiped out by noise. That's the information we're looking to throw away, for abstraction purposes.

I agree this is true, but why does the Lightcone theorem matter for it?

It is also a theorem that a Gibbs resampler initialized at equilibrium will produce $X^{T}$ distributed according to $X$ , and as you say it's clear that the resampler throws away a ton of information about $X^{0}$ in computing it. Why not use that theorem as the basis for identifying the information to throw away? In other words, why not throw away information from $X^{0}$ while maintaining $X^{T} \sim X$ ?

EDIT: Actually, conditioned on $X^{0}$ , it is not the case that $X^{T}$ is distributed according to $X$ .

(Simple counterexample: Take a graphical model where node A can be 0 or 1 with equal probability, and A causes B through a chain of > 2T steps, such that we always have B = A for a true sample from X. In such a setting, for a true sample from X, B should be equally likely to be 0 or 1, but $B^{T} ∣ X^{0} = B^{0}$ , i.e. it is deterministic.)

Of course, this is a problem for both my proposal and for the Lightcone theorem -- in either case you can't view $X^{0}$ as a latent that generates $X$ (which seems to be the main motivation, though I'm still not quite sure why that's the motivation).

[-]johnswentworth2yΩ440

Sounds like we need to unpack what "viewing as a latent which generates $X$ " is supposed to mean.

I start with a distribution $P [X]$ . Let's say $X$ is a bunch of rolls of a biased die, of unknown bias. But I don't know that's what $X$ is; I just have the joint distribution of all these die-rolls. What I want to do is look at that distribution and somehow "recover" the underlying latent variable (bias of the die) and factorization, i.e. notice that I can write the distribution as $P [X] = \sum_{Λ} \prod_{i} P [X_{i} | Λ] P [Λ]$ , where $Λ$ is the bias in this case. Then when reasoning/updating, we can usually just think about how an individual die-roll interacts with $Λ$ , rather than all the other rolls, which is useful insofar as $Λ$ is much smaller than all the rolls.

Note that $P [X | Λ]$ is not supposed to match $P [X]$ ; then the representation would be useless. It's the marginal $\sum_{Λ} \prod_{i} P [X_{i} | Λ] P [Λ]$ which is supposed to match $P [X]$ .
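To make the factorization concrete, here is a minimal sketch under simplifying assumptions that are not in the comment above: a coin stands in for the die, and the bias $Λ$ takes one of two hypothetical values with equal prior probability. It computes the marginal $P [X]$ by summing over $Λ$, and shows that the rolls are dependent marginally even though they are independent given $Λ$.

```python
from itertools import product

# Hypothetical setup: a coin instead of a die, with two possible biases.
P_LAMBDA = {0.2: 0.5, 0.8: 0.5}  # P[Λ]: prior over the unknown bias


def p_rolls_given_bias(rolls, bias):
    """P[X | Λ] = prod_i P[X_i | Λ]: rolls are independent once the bias is known."""
    p = 1.0
    for x in rolls:
        p *= bias if x == 1 else 1.0 - bias
    return p


def p_rolls(rolls):
    """Marginal P[X] = sum_Λ P[Λ] * prod_i P[X_i | Λ]."""
    return sum(p_l * p_rolls_given_bias(rolls, bias) for bias, p_l in P_LAMBDA.items())


def p_second_is_one_given_first(first):
    """P[X_2 = 1 | X_1 = first] under the marginal, i.e. without knowing Λ."""
    return p_rolls((first, 1)) / sum(p_rolls((first, x)) for x in (0, 1))


def total_probability(n):
    """Sanity check: the marginal over all length-n roll sequences sums to 1."""
    return sum(p_rolls(r) for r in product((0, 1), repeat=n))
```

With these numbers, `p_second_is_one_given_first(1)` is 0.68 and `p_second_is_one_given_first(0)` is 0.32: without $Λ$, one roll carries information about the next, and all of that information flows through the bias.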

The lightcone theorem lets us do something similar. Rather than all the $X_{i}$ 's being independent given $Λ$ , only those $X_{i}$ 's sufficiently far apart are independent, but the concept is otherwise similar. We express $P [X]$ as $\sum_{X^{0}} P [X | X^{0}] P [X^{0}]$ (or, really, $\sum_{Λ} P [X | Λ] P [Λ]$ , where $Λ$ summarizes info in $X^{0}$ relevant to $X$ , which is hopefully much smaller than all of $X$ ).

[-]Rohin Shah2yΩ440

Okay, I understand how that addresses my edit.

I'm still not quite sure why the lightcone theorem is a "foundation" for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques) but I think I should just wait for future posts, since I don't really have any concrete questions at the moment.

[-]Thane Ruthenis2yΩ340

I'm still not quite sure why the lightcone theorem is a "foundation" for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques)

My impression is that it being a concrete example is the why. "What is the right framework to use?" and "what is the environment-structure in which natural abstractions can be defined?" are core questions of this research agenda, and this sort of multi-layer locality-including causal model is one potential answer.

The fact that it loops-in the speed of causal influence is also suggestive — it seems fundamental to the structure of our universe, crops up in a lot of places, so the proposition that natural abstractions are somehow downstream of it is interesting.

[-]Brandeis2y*161

I think it might be useful to mention an analogy between your considerations and actual particle physics, where people are stuck with a functionally similar problem. They have tried (and so far failed) to make much progress, but perhaps you can find some inspiration from studying their attempts.

The most immediate shortcoming of the Telephone Theorem and the resampling argument is that they talk about behavior in infinite limits. To use them, either we need to have an infinitely large graphical model, or we need to take an approximation.

In particle physics, there is a quantity called the Scattering matrix; loosely speaking, the S-matrix connects a number of asymptotically free "in" states to a number of asymptotically free "out" states, where "in" means the state is projected to the infinite past, and "out" means projected to the infinite future. For example, if I were trying to describe a 2->2 electron scattering process, I would take two electrons "in" the far past, two electrons "out" in the far future, and sandwich an S-matrix between the two states which contains a bunch of "interaction" information, in particular about the probability (we're considering quantum mechanical entities) of such a process happening.

long-range interactions in a probabilistic graphical model (in the long-range limit) are mediated by quantities which are conserved (in the long-range limit).

The S-matrix can also be almost completely constrained by global symmetries (by Noether's theorem, these imply conserved quantities) using what's known as Bootstrapping. The entries of the S-matrix themselves are Lorentz invariant, so light-cone type causality is baked into the formalism.

In physics, it's perfectly fine to take these infinite limits if the background space-time has the appropriate asymptotic conditions, i.e. there exists a good definition of what constitutes the far past/future. This is great for particle physics experiments, where the scales are so small that the background spacetime is practically flat, and you can take these limits safely. The trouble is that when we scale up, we seem to live in an expanding universe (de Sitter space) whose geometry doesn't support the taking of such limits. It's an open problem in physics to formulate something like an S-matrix on de Sitter space so that we can do particle physics on large scales.

People have tried all sorts of things (like what you have: splitting the universe up into a bunch of hypersurfaces X_i, doing asymptotics there, and then somehow gluing), but they run into many technical problems, like the initial data hypersurface not being properly Cauchy, finite entropy problems, and so on.

[-]Alexander Gietelink Oldenziel2y60

Do you think this is really the same problem such that these issues will be obstacles for John's approach to Natural Abstractions?

[-]philip_b2y136

Can you formulate the theorem statement in a precise and self-sufficient way that is usually used in textbooks and papers so that a reader can understand it just by reading it and looking up the used definitions?

[-]johnswentworth2y170

Let $X^{0}$ be the initial state of a Gibbs sampler on an undirected probabilistic graphical model, and $X^{T}$ be the final state. Assume the sampler is initialized in equilibrium, so the distribution of both $X^{0}$ and $X^{T}$ is the distribution given by the graphical model.

Take any subsets $X_{R_{1}}^{T}, \ldots, X_{R_{m}}^{T}$ of $X^{T}$ , such that the variables in each subset are at least a distance $2 T$ away from the variables in the other subsets (with distance given by shortest path length in the graph). Then $X_{R_{1}}^{T}, \ldots, X_{R_{m}}^{T}$ are all mutually independent given $X^{0}$ .
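In symbols, the statement above might be written as follows. The distance notation $d(R_{j}, R_{k})$ is introduced here for illustration only and does not appear in the post.

```latex
\[
d(R_j, R_k) \ge 2T \ \text{ for all } j \ne k
\quad\Longrightarrow\quad
P\left[X^T_{R_1}, \ldots, X^T_{R_m} \mid X^0\right]
= \prod_{k=1}^{m} P\left[X^T_{R_k} \mid X^0\right],
\]
\[
\text{where } d(R_j, R_k) = \min_{a \in R_j,\ b \in R_k} (\text{shortest-path length from } a \text{ to } b \text{ in the graph}),
\]
\[
\text{and, by the equilibrium assumption, } P\left[X^T\right] = \sum_{X^0} P\left[X^T \mid X^0\right] P\left[X^0\right]
\text{ is the distribution of the graphical model.}
\]
```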

[-]Thane Ruthenis2yΩ580

Hmm. I may be currently looking at it from the wrong angle, but I'm skeptical that it's the right frame for defining abstractions. It seems to group low-level variables based on raw distance, rather than the detailed environment structure? Which seems like a very weak constraint. That is,

By further iteration, we can conclude that any number of sets of variables which are all separated by a distance of $2 T$ are independent given $X^{0}$ . That’s the full Lightcone Theorem.

We can make literally any choice of those sets subject to this condition: we can draw the boundaries any way we want. Which means the abstractions we'd recover are not going to be convergent: there's a free parameter of the boundary choice.

Ah, no, I suppose that part is supposed to be handled by whatever approximation process we define for $Λ$ ? That is, the "correct" definition of the "most minimal approximate summary" would implicitly constrain the possible choices of boundaries for which $Λ$ is equivalent to $X_{0}$ ?

The eigendecomposition/mesoscale-approximation/gKPD approaches seem like they might move in that direction, though I admit I don't see their implications at a first glance.

If we ignore the sketchy part - i.e. pretend that regions $X_{R_{1}}^{0}, \ldots, X_{R_{m}}^{0}$ cover all of $X^{0}$ and are all independent given $X$ - then gKPD would say roughly: if $Λ$ can be represented as $n / 2$ dimensional or smaller

What's the $n / 2$ here? Is it meant to be $m / 2$ ?

[-]johnswentworth2yΩ480

Ah, no, I suppose that part is supposed to be handled by whatever approximation process we define for $Λ$ ? That is, the "correct" definition of the "most minimal approximate summary" would implicitly constrain the possible choices of boundaries for which $Λ$ is equivalent to $X_{0}$ ?

Almost. The hope/expectation is that different choices yield approximately the same $Λ$ , though still probably modulo some conditions (like e.g. sufficiently large $T$ ).

What's the $n / 2$ here? Is it meant to be $m / 2$ ?

System size, i.e. number of variables.

[-]Thane Ruthenis2yΩ471

By the way, do we need the proof of the theorem to be quite this involved? It seems we can just note that for any two (sets of) variables $X_{1}$ , $X_{2}$ separated by distance $D$ , the earliest sampling-step at which their values can intermingle (= their lightcones intersect) is $D / 2$ (since even in the "fastest" case, they can't do better than moving towards each other at 1 variable per 1 sampling-step).
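The counting argument above can be sketched in code. This is an illustration under an assumption the thread does not spell out: at each sampling step, a node's new value is computed from its own fresh randomness and the previous-step values of itself and its graph neighbours. Under that assumption, given $X^{0}$, a region of $X^{T}$ is a function of the randomness in its backward cone only, so disjoint cones imply conditional independence. The helper names (`noise_cone`, `cones_disjoint`, `path_graph`) are hypothetical.

```python
from collections import deque


def distances_from(adj, sources):
    """Breadth-first shortest-path distance from a set of source nodes."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def noise_cone(adj, region, T):
    """(node, step) pairs whose fresh randomness can reach `region` by step T.

    Assumption: randomness drawn at a node at step t needs one further step
    per edge to travel, so it reaches `region` by step T only if the node is
    within distance T - t of the region.
    """
    dist = distances_from(adj, region)
    return {(v, t) for v, d in dist.items() for t in range(1, T + 1) if d <= T - t}


def cones_disjoint(adj, region_a, region_b, T):
    """True if no randomness is shared, i.e. the two lightcones do not intersect."""
    return noise_cone(adj, region_a, T).isdisjoint(noise_cone(adj, region_b, T))


def path_graph(n):
    """Adjacency lists for a line of n nodes, used as a toy graphical model."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
```

On `path_graph(12)` with `T = 2`, the regions `{0}` and `{4}` (distance $2 T$) have disjoint cones, while `{0}` and `{2}` share the randomness drawn at node 1 in the first step.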

[-]johnswentworth2yΩ340

Yeah, that probably works.

[-]Thane Ruthenis2yΩ460

Almost. The hope/expectation is that different choices yield approximately the same $Λ$ , though still probably modulo some conditions (like e.g. sufficiently large $T$ ).

Can you elaborate on this expectation? Intuitively, $Λ$ should consist of a number of higher-level variables as well, and each of them should correspond to a specific set of lower-level variables: abstractions and the elements they abstract over. So for a given $Λ$ , there should be a specific "correct" way to draw the boundaries in the low-level system.

But if ~any way of drawing the boundaries yields the same $Λ$ , then what does this mean?

Or perhaps the "boundaries" in the mesoscale-approximation approach represent something other than the factorization of $X$ into individual abstractions?

[-]johnswentworth2yΩ340

$Λ$ is conceptually just the whole bag of abstractions (at a certain scale), unfactored.

[-]Thane Ruthenis2yΩ340

Sure, but isn't the goal of the whole agenda to show that $Λ$ does have a certain correct factorization, i.e. that abstractions are convergent?

I suppose it may be that any choice of low-level boundaries results in the same $Λ$ , but the $Λ$ itself has a canonical factorization, and going from $Λ$ back to $X^{T}$ reveals the corresponding canonical factorization of $X^{T}$ ? And then depending on how close the initial choice of boundaries was to the "correct" one, $Λ$ is easier or harder to compute (or there's something else about the right choice that makes it nice to use).

[-]johnswentworth2yΩ340

Yes, there is a story for a canonical factorization of $Λ$ , it's just separate from the story in this post.

[-]romeostevensit2yΩ460

Is there a good primer somewhere on how causal models interact with the standard model of physics?

[-]Nicolas Macé2y40

Perhaps of interest that people in quantum many-body physics have related results. One keyword is "scrambling". Like in your case, they have a network of interacting units, and since interactions are local they have a lightcone outside of which correlations are exactly zero.

They can say more than that: Because excitations typically propagate slower than the theoretical max speed (the speed of light or whatever thing is analogous) there's a region near the edge of the lightcone where correlations are almost zero. And then there's the bulk of correlations. They can say all sorts of things in the large time limit. For instance the correlation front typically starts having a universal shape if one waits for long enough. See e.g. this or that.

[-]Shmi2y40

I'm wondering if you are reinventing lattice waves, phonons and maybe even phase transitions in the Ising model.

[-]johnswentworth2y20

Phase transitions are definitely on the todo list of things to reinvent. Haven't thought about lattice waves or phonons; I generally haven't been assuming any symmetry (including time symmetry) in the Bayes net, which makes such concepts trickier to port over.

[-]Shmi2y20

I guess even without symmetry if one assumes finite interaction time, and the nearest-neighbor-only interaction, an analog of the light cone emerges from these two assumptions. As in, Nth neighbor is unaffected until the time Nt where t is the characteristic interaction time. But I assume you are claiming something much less trivial than that.
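As a toy illustration of that emergent light cone, the sketch below runs two copies of a nearest-neighbour system that share the same noise and differ at a single site, then tracks how far the difference has spread. The update rule (a sum mod 2 of a site, its two neighbours and a noise bit) and the names `step` and `spread` are assumptions chosen for simplicity; one step plays the role of the characteristic interaction time.

```python
import random


def step(state, noise):
    """One synchronous update: each site depends only on itself and its nearest neighbours."""
    n = len(state)
    new = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        new.append((left + state[i] + right + noise[i]) % 2)
    return new


def spread(n=41, steps=10, seed=0):
    """Flip one site at time 0 and record the radius of the affected region after each step."""
    rng = random.Random(seed)
    a = [rng.randint(0, 1) for _ in range(n)]
    b = list(a)
    centre = n // 2
    b[centre] ^= 1  # the two copies differ only at the centre site
    radii = []
    for _ in range(steps):
        noise = [rng.randint(0, 1) for _ in range(n)]  # same noise for both copies
        a, b = step(a, noise), step(b, noise)
        changed = [abs(i - centre) for i in range(n) if a[i] != b[i]]
        radii.append(max(changed) if changed else 0)
    return radii
```

After `k` steps, no site farther than `k` from the flipped one can differ between the two copies, which is the sense in which the Nth neighbour is unaffected until N interaction times have passed.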

[-]Thane Ruthenis2yΩ120

Do you have any cached thoughts on the matter of "ontological inertia" of abstract objects? That is:

We usually think about abstract environments in terms of DAGs. In particular, ones without global time, and with no situations where we update-in-place a variable. A node in a DAG is a one-off.
However, that's not faithful to reality. In practice, objects have a continued existence, and a good abstract model should have a way to track e. g. the state of a particular human across "time"/the process of system evolution. But if "Alice" is a variable/node in our DAG, she only exists for an instant...
The model in this post deals with this by assuming that the entire causal structure is "copied" every timestep. So every timestep has an "Alice" variable, and $Alice (t + 1)$ is a function of $Alice (t)$ plus some neighbours...
But that's not right either. Structure does change; people move around (acquire new causal neighbours and lose old ones) and are born (new variables are introduced), etc.

I think we want our model of the environment to be "flexible" in the sense that it doesn't assume the graph structure gets copied over fully every timestep, but that it has some language for talking about "ontological inertia"/one variable being an "updated version" of another variable. But I'm not quite sure how to describe this relationship.

At the bare minimum, $Alice (t + 1)$ has to be of the same "type" as $Alice (t)$ (e.g., "human") and be directly causally connected to $Alice (t)$ , and $Alice (t + 1)$ 's value has to be largely determined by $Alice (t)$ 's value... But that's not enough, because by this definition Alice's newborn child will probably also count as Alice.

Or maybe I'm overcomplicating this, and every variable in the model would just have an "identity" signifier baked-in? Such that $ID (Alice (t)) = ID (Alice (t + 1)) \neq ID (any-other-var (t + 1))$ ?

Going up or down the abstraction levels doesn't seem to help either. ( $Alice (t)$ isn't necessarily an abstraction over the same set of lower-level variables as $Alice (t + 1)$ , nor does she necessarily have the same relationship with the higher-level variables.)

Back to my question: do you have any cached thoughts on that?

^{^}

I’ve omitted from the post various standard things about Gibbs samplers, e.g. explaining why we can model the variables of the graphical model as the output of a Gibbs sampler, how big $T$ needs to be in order to resample all the variables at least once, how to generate $X^{0}$ from $X$ (rather than vice-versa), etc. Leave a question in the comments if you need more detail on that.

^{^}

Notation convention: capital-letter indices like $X_{A}$ indicate index-tuples, i.e. if $A = (1, 2, 3)$ then $X_{A} = (X_{1}, X_{2}, X_{3})$ .
