This opportunity space was developed with Dmitry Vaintrob and Lucas Teixeira as part of PIBBSS' horizon scanning initiative. A detailed roadmap can be found here.
The basic premise is to view different aspects of a neural network (NN) as a complex statistical system, as follows:
The data-generating process
An NN is trained on empirically obtained, labeled datasets drawn from some "ground truth" distribution about which it must learn to make predictions. By modeling the data-generating process as a physical system (e.g., a Gaussian field theory), we can use renormalization to reason about the ontology of the dataset independently of the NN's representation.
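As a toy illustration of what this could look like (our own sketch, not taken from the works cited below), the snippet treats the data-generating process as a two-dimensional Gaussian random field and coarse-grains it by block averaging, tracking how a simple summary statistic changes with scale. The grid size, correlation length, and choice of statistic are all illustrative.

```python
# Toy sketch: a Gaussian random field as a stand-in data-generating process,
# coarse-grained by block averaging while tracking its two-point statistics.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

def gaussian_field(n=256, corr_length=4.0):
    """White noise smoothed by a Gaussian kernel gives a correlated Gaussian field."""
    field = gaussian_filter(rng.standard_normal((n, n)), sigma=corr_length)
    return field / field.std()          # normalize to unit variance

def block_average(field, b=2):
    """One coarse-graining (RG-like) step: average over non-overlapping b x b blocks."""
    n = field.shape[0] // b * b
    return field[:n, :n].reshape(n // b, b, n // b, b).mean(axis=(1, 3))

field = gaussian_field()
for step in range(4):
    # Nearest-neighbour correlation is a simple statistic to follow along the flow.
    corr = np.corrcoef(field[:, :-1].ravel(), field[:, 1:].ravel())[0, 1]
    print(f"step {step}: grid {field.shape[0]:>3}, nn-correlation {corr:.3f}")
    field = block_average(field)
```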
NN activations
We can view the NN itself as a statistical system that transforms an input to an output via a series of intermediate activations. While a trained neural net is deterministic, interpretability seeks suitably "simple", "sparse", or otherwise "interpretable" explanations by coarse-graining the full information about the NN's activations into a smaller number of "nicer" summary variables, or features. This process loses information, sacrificing determinism for "nice" statistical properties. We can then try to implement renormalization by introducing notions of locality and coarse-graining on the neural net's inputs and activations, and looking for relevant features via an associated renormalization group (RG) flow. While interpreting a trained model via renormalization is an explicit process, we note that this is done implicitly in diffusion models (e.g., Wang et al., 2023), which sequentially extract coarse-to-fine features at various noise scales.
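One crude way to operationalize "coarse-graining activations into a few summary variables" is a low-rank projection: keep only the top principal components of a layer's activations and measure how much of the original activation statistics is discarded. This is our own minimal illustration, not the programme's prescribed coarse-graining, and the synthetic activations and dimensions below are placeholders.

```python
# Minimal sketch: coarse-grain a layer's activations by projecting onto their
# top-k principal components ("summary variables"), and measure the loss.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for activations of one layer on a batch of inputs:
# a few strong directions ("features") plus broadband noise.
n_samples, width, k = 2000, 512, 8
signal = rng.standard_normal((n_samples, k)) @ rng.standard_normal((k, width))
activations = signal + 0.3 * rng.standard_normal((n_samples, width))

# PCA via SVD of the centered activation matrix.
centered = activations - activations.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

coarse = centered @ Vt[:k].T            # k summary variables per input
reconstruction = coarse @ Vt[:k]        # best rank-k reconstruction

explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(f"variance kept by {k} summary variables: {explained:.1%}")
print(f"variance discarded (the price of coarse-graining): {1 - explained:.1%}")
```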
The learning process
Because the network is randomly initialized in the weight landscape, learning is an inherently stochastic process, controlled by the architecture, that outputs a function of the input data distribution, implemented by the learned weights.
Of all the processes discussed here, this one looks most like a "physical" statistical theory: in certain limits of simple systems (Lee et al., 2023), it is very well described by either a free (vacuum) statistical field theory or a computationally tractable perturbative theory controlled by Feynman diagram expansions. Though these approximations fail in more realistic, general cases (Lippl & Stachenfeld; Perin & Deny), we nevertheless expect them to hold locally in certain contexts and for a suitable notion of locality. This is analogous to the way experiments on activation addition show that, while interesting neural nets are not linear, they behave "kind of linearly" in many contexts (Turner et al.; Schoenholz et al., 2016; Lavie et al., 2024; Cohen et al., 2019).
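As a rough numerical check of the free-theory limit (our sketch, in the spirit of the infinite-width results rather than a reproduction of any cited experiment), the snippet below samples an ensemble of randomly initialized MLPs and shows that, as the width grows, the distribution of their outputs at a fixed input becomes increasingly Gaussian, here diagnosed by the excess kurtosis shrinking toward zero. The architecture, ensemble size, and diagnostic are illustrative choices.

```python
# Rough check of the "free theory" limit: outputs of randomly initialized MLPs
# at a fixed input approach a Gaussian as the width grows, so their excess
# kurtosis should shrink toward zero.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)             # one fixed input
n_nets = 5000                           # ensemble of random initializations

def random_mlp_output(width):
    # Two-hidden-layer tanh MLP with standard 1/sqrt(fan_in) initialization.
    W1 = rng.standard_normal((width, x.size)) / np.sqrt(x.size)
    W2 = rng.standard_normal((width, width)) / np.sqrt(width)
    w3 = rng.standard_normal(width) / np.sqrt(width)
    return w3 @ np.tanh(W2 @ np.tanh(W1 @ x))

for width in (4, 16, 64, 256):
    outputs = np.array([random_mlp_output(width) for _ in range(n_nets)])
    z = (outputs - outputs.mean()) / outputs.std()
    excess_kurtosis = (z ** 4).mean() - 3.0   # 0 for an exact Gaussian
    print(f"width {width:>4}: excess kurtosis {excess_kurtosis:+.3f}")
```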
We will support research projects in line with three core aims:
Projects should also fall within the scope of one of the following programmes.
Previous work (Roberts et al. 2021, Berman et al. 2023, Erbin et al. 2022, Halverson et al. 2020, Ringel et al. 2025) hints at ‘natural’ notions of scale and renormalization in NNs, but – like in physics – there is no one ‘right’ way to operationalize the array of tools and techniques we have at hand. This programme aims to probe the respective regimes of validity of different approaches so that they can be built into a coherent renormalization framework for AI safety. By engineering situations in which renormalization has a ‘ground truth’ interpretation, we seek comprehensive theoretical and empirical descriptions of NN training and inference. Where current theories fall short, we aim to identify the model-natural (instead of physical) information needed for renormalization to provide a robust, reliable explanation of AI systems. In addition to physics, this research may also bring insights from fields like neuroscience and biology to inform our understanding of scaling behavior in AI systems. We would also be excited to support the development of an implicit renormalization framework.
A non-exhaustive list of topics that projects in this programme may address:
You may be a good fit for this programme if you have:
Inspired by existing work (Fischer et al. 2024, Berman et al. 2023), we think that explicit renormalization can be used to find features that are nonlinear, principled, and causally decoupled with respect to the computation. The ultimate goal of this programme is to operationalize renormalization for optimally interpreting the statistical system representing the AI’s reality. The resulting techniques should be capable of finding unsupervised features that perform better than state-of-the-art interpretability tools like sparse autoencoders (SAEs) (Anders et al.).
The central problem for causal decomposition remains the extraction of principled features. While SAEs are a particularly interpretable and practical unsupervised technique for obtaining interesting (linear) features of activations, they are not optimized to provide a complete decomposition into causal primitives of computation – SAEs do not give the “correct” features in general (Mendel 2024). When interpreting behaviors more sophisticated than simple grammar circuits, SAEs by themselves are simply not enough to give a strong interpretation, let alone a causal decomposition (Leask et al., 2025).
However, SAEs may be, for explicit renormalization, what early renormalization was for the RG – an ad-hoc approach to ‘engineer’ away unphysical divergences that nevertheless laid the foundation for a theoretical formalism. We see them – and the associated linear representation hypothesis (LRH) – as a first-order ansatz from which to build.
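For concreteness, here is a minimal sparse autoencoder sketch of the kind referred to above: a linear encoder with a ReLU and an L1 sparsity penalty, plus a linear decoder, trained to reconstruct activations. The dimensions, penalty weight, and training data below are placeholders, not a recommended recipe or any particular group's implementation.

```python
# Minimal SAE sketch: reconstruct activations through a sparse, overcomplete
# feature layer. All hyperparameters and data here are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

d_model, d_features, l1_coeff = 256, 1024, 1e-3
sae = SparseAutoencoder(d_model, d_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

# Placeholder for a batch of activations collected from a model of interest.
activations = torch.randn(4096, d_model)

for step in range(200):
    reconstruction, features = sae(activations)
    loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("mean active features per input:", (features > 0).float().sum(dim=-1).mean().item())
```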
Projects in this programme may:
You might be a good fit for this programme if you have:
The final two programmes build on the first two, so their direction will be set at a later time. For now, we present a rough scope to set our intentions for future work.
Programme 3: Leveraging insights gained from implicit renormalization for ‘principled safety’ of AI systems
Inspired by the separation of scales between different effective field theories along an RG flow, this programme seeks to provide both theoretical justification and empirical validation for a causal separation of scales in neural networks. This work depends on our ability to show that, under appropriate conditions, fine-grained behaviors can be conditionally isolated given coarser scales, thereby enhancing our capacity to design AI systems with principled safety guarantees. It also depends on a better operationalization of AI safety concepts like ‘deception’, in order to understand whether complex features like deception are in fact separated from finer features at some scale. Another aim is to find candidate methods for identifying alternative RG flows with ‘safer’ critical points (with carefully defined metrics to measure this).
Programme 4: Applying field theory ‘in the wild’
This programme puts what we have learned in the first three programmes to work, testing our framework against empirical signatures of scale, renormalization, and other field-theoretic (FT) structure as they naturally occur in large-scale, real-world systems. The aim is to build more general evidence for, and theoretical development of, implicit renormalization in SOTA environments. A core goal would be to use these results to develop an analog of the NN-QFT correspondence far from Gaussianity.