Background: Testing The Natural Abstraction Hypothesis
Given a world-model, direct computation of natural abstractions basically amounts to figuring out which -values give the same -distributions , for various choices of the variables and in some large Bayes net. In general, this computation is #P-Hard. (That’s like NP-Hard, but Harder.) That doesn’t necessarily mean that it’s usually intractable in practice, but it means that we need to use heuristics and approximations right off the bat.
A linear-Guassian approximation is a natural starting point. It’s relatively simple - all of the relevant computations are standard matrix operations. But it’s also relatively general: it should be a good approximation for any system with smooth dynamics and small noise/uncertainty. For instance, if our system is a solid object made of molecules and sitting in a heat bath, then linear-Gaussian noise should be a good model. (And that’s a good physical picture to keep in mind for the rest of this post - solid object made of molecules, all the little chunks interacting elastically, but also buffeted about by thermal noise.)
So, we have some Bayes net in which each variable is Gaussian - a linear combination of its parents plus some IID Gaussian noise. For the underlying graph, I’m using a 2D Delaunay triangulation, mainly because it does a good job reflecting physically-realistic space/time structure. (In particular, as in physical space, most pairs of nodes are far apart - as opposed to the usual Erdos-Renyi random graph, where the distance between most points is logarithmic. We’ll revisit the choice of graph structure later.) Embedded in 2D, an example graph looks like this:
… although the actual experiments below will use a graph with about 500X more nodes (50k, whereas the visual above uses 100). Each of those nodes is a Gaussian random variable, with the arrows showing its parents in the Bayes net. Weights on the edges (i.e. coefficients of each parent) are all random normal, with variance ⅔ (another choice we’ll revisit later).
Let’s start with the basic experiment: if we pick two far-apart neighborhoods of nodes, and , can the information from which is relevant to fit into a much-lower-dimensional summary? In a Gaussian system, we can answer this by computing the covariance matrix , taking an SVD, and looking at the number of nonzero singular values. Each left singular vector with a nonzero singular value gives the weights of one variable in the “summary data” of which is relevant to . (Or, for information in relevant to , we can take the right singular vectors; the dimension of the summary is the same either way.)
We’ll use these three randomly-chosen neighborhoods of nodes (green is neighborhood 1, red is 2, purple is 3):
Here are the 10 largest singular values of the neighborhood 1 - neighborhood 2 covariance matrix:
array([5.98753213e+05, 1.21862101e+02, 1.91973783e-01, 1.03200621e-03, 2.01888771e-04, 1.05877548e-05, 5.34855466e-06, 4.94833571e-10, 2.55557198e-10, 2.17092875e-10])
The values of order 1e-10 or lower are definitely within numerical error of zero, and I expect the values of order 1e-3 or lower are as well, so we see at most 7 nonzero singular values and probably more like 3. (I know 1e-3 seems high for numerical error, but the relevant number here is the ratio between a given singular value and the largest singular value, so it’s really more like 1e-8. In principle a double provides 1e-16 precision, but in practice it’s common to lose half the bits in something like SVD, and 1e-8 is about how much I usually trust these things.) Meanwhile, the two neighborhoods contain 109 and 129 nodes, respectively.
To summarize all that: we picked out two neighborhoods of ~110 variables each in a 50k-node linear Gaussian network. To summarize all the information from one neighborhood relevant to the other, we need at most a 7-variable summary (which is also linear and Gaussian). That’s already a pretty interesting result - the folks at SFI would probably love it.
But it gets even better.
We can do the same thing with the neighborhood 1 - neighborhood 3 covariance matrix. The first 10 singular values are:
array([7.50028046e+11, 5.87974398e+10, 2.50629095e+08, 2.23881004e+03, 1.46035030e+01, 2.52752074e+00, 1.01656991e-01, 1.79916658e-02, 5.89831843e-04, 3.52074803e-04])
… and by the same criteria as before, we see at most 8 nonzero singular values, and probably more like 3. Now, we can compare the neighborhood-1-singular-vectors from our two SVDs. Simply computing the correlations between each pair of nonzero singular vectors from the two decompositions yields:
Key thing to notice here: the first two singular vectors are near-perfectly correlated! (Both correlations are around .996.) The correlations among the next few are also large, after which it drops off into noise. (I did say that everything after the first three singular components was probably mostly numerical noise.)
In English: not only can a three-variable summary of neighborhood 1 probably contain all the info relevant to either of the other two neighborhoods, but two of the three summary variables are the same.
This is exactly the sort of thing the natural abstraction hypothesis predicts: information relevant far away fits into a compact summary - the “abstract summary” - and roughly-the-same abstract summary is relevant even from different “vantage points”. Or, to put it differently, the abstraction is reasonably robust to changes in which “far away” neighborhood we pick.
Our 2D-embedded linear-Gaussian world abstracts well.
Of course, this was just an example from one run of the script, but the results are usually pretty similar - sometimes the summary has 2 or 4 or 5 dimensions rather than 3, sometimes more or fewer summary-dimensions align, but these results are qualitatively representative. The main case where it fails is when the RNG happens to pick two neighborhoods which are very close or overlap.
Is This Just The Markov Blanket?
Notice that our “neighborhoods” of variables include some nodes strictly in the interior of the neighborhood. The variables on the boundary of the neighborhood should form a Markov blanket - i.e. they are themselves an abstract summary for the whole neighborhood, and of course they’re the same regardless of which second neighborhood we pick. So perhaps our results are really just finding the variables on the neighborhood boundary?
The results above already suggest that this is not the case: we can see at a glance that the dimension of the summary is quite a bit smaller than the number of variables on the boundary. But let’s run another test: we’ll fix the point at the “center” of the neighborhood, then expand the neighborhood’s “radius” (in terms of undirected steps in the graph), and see how both the boundary size and the summary size grow.
Here are results from one typical run (number of boundary variables in blue, number of summary variables in red):
As we expand the neighborhood, the number of boundary-variables grows, but the number of summary variables stays flat. So there’s definitely more going on here than just the neighborhood’s Markov blanket (aka boundary variables) acting as an abstract summary.
The above toy model made two choices which one could imagine not generalizing, even in other sparse linear-Gaussian systems.
First, the graph structure. The random graphs most often used in mathematics are Erdos-Renyi (ER) random graphs. “Information relevant far away” doesn’t work very well on these, because nodes are almost never far apart. The number of steps between the two most-distant nodes in such graphs is logarithmic. So I wouldn’t expect environments with such structure to abstract well. On the other hand, I would usually expect such environments to be a poor model for systems in our physical world - the constraints of physical space-time make it rather difficult to have large numbers of variables all interacting with only logarithmic causal-distances between them. In practice, there are usually mediating variables with local interactions. Even the real-world systems which most closely resemble ER random graphs, like social networks or biochemical regulatory networks, tend to have much more “localized” connections than a true ER random graph - e.g. clusters in social networks or modules in biochemical networks.
(That said, I did try running the above experiments with ER graphs, and they work surprisingly well, considering that the neighborhoods are only a few steps apart on the graph. Summary sizes are more like 10 or 12 variables rather than 2 or 4, and alignment of the summary-variables is hit-and-miss.)
Second, the weights. There’s a phase shift phenomenon here: if the weights are sufficiently low, then noise wipes out all information over a distance. If the weights are sufficiently high, then the whole system ends up with one giant principal component. Both of these abstract very well, but in a boring-and-obvious way. The weights used in these experiments were chosen to be right at the boundary between these two behaviors, where more interesting phenomena could potentially take place. If the system abstracts well right at the critical point, then it should abstract well everywhere.
So What Could We Do With This?
I mainly intend to use this sort of model to look for theoretical insights into abstraction which will generalize beyond the linear-Gaussian case. Major goals include data structures for representing whole sets of abstractions on one environment, as well as general conditions under which the abstractions are learned by a model trained on the environment.
But there are potential direct use-cases for this kind of technique, other than looking for hints at more-general theory.
One example would be automatic discovery of abstractions in physical systems. As long as noise is small and the system is non-chaotic (so noise stays small), a sparse linear-Gaussian model should work well. (In practice, this mainly means the system needs to have some kind of friction in most degrees of freedom.) In this case, I’d expect the method used here to work directly: just compute covariances between far-apart neighborhoods of variables, and those covariances will likely be low-rank.
In principle, this method could also be directly applied to machine learning systems, although the range of applicability would be fairly narrow. It would only apply to systems where linear approximations work very well (or maybe quadratic approximations, which give a linear approximation of the gradient). And ideally the random variables would be Gaussian. And it should have a fairly large and deep computational graph, so most neighborhoods of variables are not too close together. If one just happened to have an ML model which met those criteria, we’d expect these results to carry over directly: compute the covariance matrix between far-apart neighborhoods of the random variables in the model, and those matrices should be low-rank. That sounds like a potentially-very-powerful transparency tool... if one had a model which fit the criteria. Hypothetically.
Build a random linear-Gaussian Bayes net with a random planar graph structure (in this case generated from a Delaunay mesh). Pick two far-apart neighborhoods of variables in this net, and use an SVD to compute the rank of their covariance matrix. Empirically, we find rank <10 even with 110-variable neighborhoods in a 50k-node network. In other words: a very small number of variables typically suffices to summarize all the information from a neighborhood of variables which is relevant far away. The system abstracts well.
Furthermore: the largest singular vectors are the same even for different choices of the second (far-away) neighborhood. Not only does a small summary suffice, but approximately the same small summary suffices. The abstractions are robust to changes in our “far away” reference neighborhood.