This post presents a brief progress update on the research I am doing as part of the renormalization research group at Principles of Intelligence (PIBBSS). The code to generate synthetic datasets based on the percolation data model is available in this repository. It employs a newly developed algorithm that constructs a dataset in a way that explicitly and iteratively reveals its innate hierarchical structure. Increasing the number of data points corresponds to representing the same dataset at a more fine-grained level of abstraction.
Introduction
Ambitious mechanistic interpretability requires understanding the structure that neural networks uncover from data. A quantitative theoretical model of natural data's organizing structure would be of great value for AI safety. In particular, it would allow researchers to build interpretability tools that decompose neural networks along their natural scales of abstraction, and to create principled synthetic datasets to validate and improve those tools.
A useful data structure model should reproduce natural data's empirical properties:
Sparse: relevant latent variables occur, and co-occur, rarely.
Hierarchical: these variables interact compositionally at many levels.
Low-dimensional: representations can be compressed because the space of valid samples is highly constrained.
Power-law-distributed: meaningful categories exist over many scales, with a long tail.
To this end, I'm investigating a data model based on high-dimensional percolation theory that describes statistically self-similar, sparse, and power-law-distributed structure in data distributions. I originally developed this model to better understand neural scaling laws. In my current project, I'm creating concrete synthetic datasets based on the percolation model. Because these datasets have associated ground-truth latent features, I will explore the extent to which they can provide a testbed for developing improved interpretability tools. By applying the percolation model to interpretability, I also hope to test its predictive power, for example by investigating whether similar failure modes (e.g. feature splitting) occur across synthetic and natural data distributions.
The motivation behind this research is to develop a simple, analytically tractable model of multiscale data structure that, to the extent possible, usefully predicts the structure of concepts learned by optimal AI systems. From the viewpoint of theoretical AI alignment, this research direction complements approaches that aim to develop a theory of concepts.
Percolation Theory
The branch of physics concerned with analyzing the properties of clusters of randomly occupied units on a lattice is called percolation theory (Stauffer & Aharony, 1994). In this framework, sites (or bonds) are occupied independently at random with probability p, and connected sites form clusters. While direct numerical simulation of percolation on a high-dimensional lattice is intractable due to the curse of dimensionality, the high-dimensional problem is exactly solvable analytically. Clusters are vanishingly unlikely to have loops (in high dimensions, a random path doesn't self-intersect), and the problem can be approximated by modeling the lattice as an infinite tree[1]. In particular, percolation clusters on a high-dimensional lattice (at or above the upper critical dimension d≥6) that are at or near criticality can be accurately modeled using the Bethe lattice, an infinite treelike graph in which each node has identical degree z. For site or bond percolation on the Bethe lattice, the percolation threshold is pc=1/(z−1). Using the Bethe lattice as an approximate model of a hypercubic lattice of dimension d gives z=2d and pc=1/(2d−1). A brief self-contained review based on standard references can be found in Brill (2025, App. A).
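To see these formulas in action, the sketch below grows critical clusters via the Bethe-lattice branching process described above: the root offers z candidate neighbors, every later site offers z − 1, and each candidate is occupied independently with probability p. At p = pc = 1/(z − 1) the cluster-size distribution is heavy-tailed, which is one way the power-law property mentioned in the Introduction arises. This is an illustrative simulation, not code from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def bethe_cluster_size(z: int, p: float, cap: int = 10**4) -> int:
    """Grow one percolation cluster outward from a root on a Bethe lattice of
    coordination number z. The root offers z candidate neighbors; every later
    site offers z - 1. Each candidate is occupied independently with
    probability p. Returns the cluster size, truncated at `cap`."""
    size, frontier = 1, rng.binomial(z, p)
    while frontier > 0 and size < cap:
        size += 1
        frontier += rng.binomial(z - 1, p) - 1  # new candidates minus the popped site
    return size

d = 6                   # upper critical dimension of the hypercubic lattice
z = 2 * d               # Bethe-lattice coordination number used as a proxy
p_c = 1.0 / (z - 1)     # critical threshold p_c = 1/(z - 1)

sizes = np.array([bethe_cluster_size(z, p_c) for _ in range(10_000)])
for s in (1, 10, 100, 1000):
    print(f"P(size >= {s:>4}) = {np.mean(sizes >= s):.4f}")
# At criticality the tail falls off roughly as a power law, so clusters
# appear over many scales rather than at one characteristic size.
```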
Algorithm
The repository implements an algorithm to simulate a data distribution modeled as a critical percolation cluster distribution on a large high-dimensional lattice, using an explicitly hierarchical approach. The algorithm consists of two stages. First, in the generation stage, a set of percolation clusters is generated iteratively. Each iteration represents a single "fine-graining" step in which a single data point (site) is decomposed into two related data points. The generation stage produces a set of undirected, treelike graphs representing the clusters, and a forest of binary latent features that denote each point's membership in a cluster or subcluster. Each point has an associated value that is a function of its latent subcluster membership features. Second, in the embedding stage, the graphs are embedded into a vector space following a branching random walk.
In the generation stage, each iteration follows one of two alternatives. With probability create_prob, a new cluster with a single point is created. Otherwise, an existing point is selected at random and removed, becoming a latent feature. This parent is replaced by two new child points, a and b, connected to each other by a new edge, and each child is assigned a value as a stochastic function of the parent's value. Each former neighbor of the parent is then connected either to a with probability split_prob or to b with probability 1 - split_prob. The parameter values that yield the correct cluster structure can be shown to be create_prob = 1/3 and split_prob = 0.2096414. The derivations of these values and full details on the algorithm will be presented in a forthcoming publication.
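For concreteness, here is a minimal sketch of a single generation iteration following the description above. The graph bookkeeping mirrors the text; the data structures and the Gaussian rule for assigning child values are illustrative assumptions rather than the repository's implementation.

```python
import random

CREATE_PROB = 1 / 3        # probability of starting a new one-point cluster
SPLIT_PROB = 0.2096414     # probability a former neighbor attaches to child a

def generation_step(active, neighbors, values, next_id):
    """One fine-graining iteration. `active` is the set of current point ids,
    `neighbors` maps each point to its set of neighbors, and `values` maps
    points to scalar values. Returns the next unused id. Sketch only; the
    Gaussian value rule is an assumption, not the repository's rule."""
    if not active or random.random() < CREATE_PROB:
        new = next_id
        active.add(new)
        neighbors[new] = set()
        values[new] = random.gauss(0.0, 1.0)      # seed value for a new cluster
        return next_id + 1

    parent = random.choice(tuple(active))         # point to fine-grain
    a, b = next_id, next_id + 1
    active.remove(parent)                         # parent becomes a latent feature
    active.update((a, b))
    neighbors[a], neighbors[b] = {b}, {a}         # children joined by a new edge
    for child in (a, b):                          # child values depend on the parent's
        values[child] = values[parent] + random.gauss(0.0, 0.1)

    for nbr in neighbors.pop(parent):             # reattach the parent's old edges
        neighbors[nbr].remove(parent)
        child = a if random.random() < SPLIT_PROB else b
        neighbors[nbr].add(child)
        neighbors[child].add(nbr)
    return next_id + 2

# Build a small forest of clusters.
active, neighbors, values, next_id = set(), {}, {}, 0
for _ in range(1000):
    next_id = generation_step(active, neighbors, values, next_id)
```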
Caveats
Because the data generation and embedding procedures are stochastic, any studies should be repeated on multiple datasets generated with different random seeds.
The embedding procedure relies on the statistical tendency of random vectors to be approximately orthogonal in high dimensions. An embedding dimension of O(100) or greater is recommended to avoid rare mismatches between nearest neighbors in the percolation graph and nearest neighbors among the embedded data points (a small demonstration follows these caveats).
A generated dataset represents a data distribution, i.e. the set of all possible data points that could theoretically be observed. To obtain a realistic analog of a machine learning dataset, only a tiny subset of a generated dataset should be used for training.
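To illustrate the embedding-dimension caveat, the sketch below embeds a small path-shaped tree with a branching random walk (each node's vector is its parent's vector plus an independent Gaussian step) and checks how often each embedded point's nearest neighbor is also a graph neighbor. The match rate tends to degrade at low dimension and is typically perfect at a few hundred dimensions. The helper embed_tree, the step size, and the test tree are illustrative assumptions, not the repository's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_tree(parents, dim, step=1.0):
    """Branching-random-walk embedding: each node's vector is its parent's
    vector plus an independent Gaussian step. `parents[i]` is the parent of
    node i (-1 for the root); parents must precede children. Sketch only."""
    emb = np.zeros((len(parents), dim))
    for i, p in enumerate(parents):
        if p >= 0:
            emb[i] = emb[p] + step * rng.normal(size=dim)
    return emb

# A 64-node path graph: node i's only graph neighbors are i - 1 and i + 1.
parents = [-1] + list(range(63))
for dim in (4, 128):
    emb = embed_tree(parents, dim)
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)
    match = np.mean(np.abs(nn - np.arange(len(parents))) == 1)
    print(f"dim={dim:>3}: embedded nearest neighbor is a graph neighbor "
          f"{match:.0%} of the time")
```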
Next Steps
In the coming months, I hope to share more details on this work as I scale up the synthetic datasets, train neural networks on the data, and interpret those networks. The data model intrinsically defines scales of reconstruction quality corresponding to learning more clusters and interpolating them at higher resolution. Because of this, I'm particularly excited about the potential to develop interpretability metrics for these datasets that trade off the breadth and depth of recovered concepts in a principled way.
[1] Percolation on a tree can be thought of as the mean-field approximation for percolation on a lattice, neglecting the possibility of closed loops.