This opportunity space was developed with Dmitry Vaintrob and Lucas Teixeira as part of PIBBSS' horizon scanning initiative. A detailed roadmap can be found here.
The basic premise is to view different aspects of a neural network (NN) as a complex statistical system, as follows:
The data-generating process
An NN is trained on empirically obtained, labeled datasets drawn from some "ground truth" distribution about which it must learn to make predictions. By modeling the data-generating process as a physical system (e.g., a Gaussian field theory), we can use renormalization to reason about the ontology of the dataset independently of the NN's representation.
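As a toy illustration of what this could look like (our own sketch, not taken from the works cited below), the snippet treats the data-generating process as a two-dimensional Gaussian random field and coarse-grains it by block averaging, tracking how a simple summary statistic changes with scale. The grid size, correlation length, and choice of statistic are all illustrative.

```python
# Toy sketch: a Gaussian random field as a stand-in data-generating process,
# coarse-grained by block averaging while tracking its two-point statistics.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

def gaussian_field(n=256, corr_length=4.0):
    """White noise smoothed by a Gaussian kernel gives a correlated Gaussian field."""
    field = gaussian_filter(rng.standard_normal((n, n)), sigma=corr_length)
    return field / field.std()          # normalize to unit variance

def block_average(field, b=2):
    """One coarse-graining (RG-like) step: average over non-overlapping b x b blocks."""
    n = field.shape[0] // b * b
    return field[:n, :n].reshape(n // b, b, n // b, b).mean(axis=(1, 3))

field = gaussian_field()
for step in range(4):
    # Nearest-neighbour correlation is a simple statistic to follow along the flow.
    corr = np.corrcoef(field[:, :-1].ravel(), field[:, 1:].ravel())[0, 1]
    print(f"step {step}: grid {field.shape[0]:>3}, nn-correlation {corr:.3f}")
    field = block_average(field)
```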
NN activations
We can view the NN itself as a statistical system that transforms an input to an output via a series of intermediate activations. While a trained neural net is deterministic, interpretability seeks suitably "simple", "sparse", or otherwise "interpretable" explanations by coarse-graining the full information about the NN's activations into a smaller number of "nicer" summary variables, or features. This process loses information, sacrificing determinism for "nice" statistical properties. We can then try to implement renormalization by introducing notions of locality and coarse-graining on the neural net's inputs and activations, and looking for relevant features via an associated renormalization group (RG) flow. While interpreting a trained model via renormalization is an explicit process, we note that this is done implicitly in diffusion models (e.g., Wang et al., 2023), which sequentially extract coarse-to-fine features at various noise scales.
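One crude way to operationalize "coarse-graining activations into a few summary variables" is a low-rank projection: keep only the top principal components of a layer's activations and measure how much of the original activation statistics is discarded. This is our own minimal illustration, not the programme's prescribed coarse-graining, and the synthetic activations and dimensions below are placeholders.

```python
# Minimal sketch: coarse-grain a layer's activations by projecting onto their
# top-k principal components ("summary variables"), and measure the loss.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for activations of one layer on a batch of inputs:
# a few strong directions ("features") plus broadband noise.
n_samples, width, k = 2000, 512, 8
signal = rng.standard_normal((n_samples, k)) @ rng.standard_normal((k, width))
activations = signal + 0.3 * rng.standard_normal((n_samples, width))

# PCA via SVD of the centered activation matrix.
centered = activations - activations.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

coarse = centered @ Vt[:k].T            # k summary variables per input
reconstruction = coarse @ Vt[:k]        # best rank-k reconstruction

explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(f"variance kept by {k} summary variables: {explained:.1%}")
print(f"variance discarded (the price of coarse-graining): {1 - explained:.1%}")
```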
The learning process
Because the network is randomly initialized in the weight landscape, learning is an inherently stochastic process, controlled by the architecture, that outputs a function of the input data distribution, implemented by the learned weights.
Of all the processes discussed here, this one looks most like a "physical" statistical theory: in certain limits of simple systems (Lee et al., 2023), it is very well described by either a free (vacuum) statistical field theory or a computationally tractable perturbative theory controlled by Feynman diagram expansions. Though these approximations fail in more realistic, general cases (Lippl & Stachenfeld; Perin & Deny), we nevertheless expect them to hold locally in certain contexts and for a suitable notion of locality. This is analogous to the way experiments on activation addition show that, while interesting neural nets are not linear, they behave "kind of linearly" in many contexts (Turner et al.; Schoenholz et al., 2016; Lavie et al., 2024; Cohen et al., 2019).
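As a rough numerical check of the free-theory limit (our sketch, in the spirit of the infinite-width results rather than a reproduction of any cited experiment), the snippet below samples an ensemble of randomly initialized MLPs and shows that, as the width grows, the distribution of their outputs at a fixed input becomes increasingly Gaussian, here diagnosed by the excess kurtosis shrinking toward zero. The architecture, ensemble size, and diagnostic are illustrative choices.

```python
# Rough check of the "free theory" limit: outputs of randomly initialized MLPs
# at a fixed input approach a Gaussian as the width grows, so their excess
# kurtosis should shrink toward zero.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)             # one fixed input
n_nets = 5000                           # ensemble of random initializations

def random_mlp_output(width):
    # Two-hidden-layer tanh MLP with standard 1/sqrt(fan_in) initialization.
    W1 = rng.standard_normal((width, x.size)) / np.sqrt(x.size)
    W2 = rng.standard_normal((width, width)) / np.sqrt(width)
    w3 = rng.standard_normal(width) / np.sqrt(width)
    return w3 @ np.tanh(W2 @ np.tanh(W1 @ x))

for width in (4, 16, 64, 256):
    outputs = np.array([random_mlp_output(width) for _ in range(n_nets)])
    z = (outputs - outputs.mean()) / outputs.std()
    excess_kurtosis = (z ** 4).mean() - 3.0   # 0 for an exact Gaussian
    print(f"width {width:>4}: excess kurtosis {excess_kurtosis:+.3f}")
```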
We will support research projects in line with three core aims:
Projects should also fall within the scope of one of the following programmes.
Previous work (Roberts et al. 2021, Berman et al. 2023, Erbin et al. 2022, Halverson et al. 2020, Ringel et al. 2025) hints at ‘natural’ notions of scale and renormalization in NNs, but – like in physics – there is no one ‘right’ way to operationalize the array of tools and techniques we have at hand. This programme aims to probe the respective regimes of validity of different approaches so that they can be built into a coherent renormalization framework for AI safety. By engineering situations in which renormalization has a ‘ground truth’ interpretation, we seek comprehensive theoretical and empirical descriptions of NN training and inference. Where current theories fall short, we aim to identify the model-natural (instead of physical) information needed for renormalization to provide a robust, reliable explanation of AI systems. In addition to physics, this research may also bring insights from fields like neuroscience and biology to inform our understanding of scaling behavior in AI systems. We would also be excited to support the development of an implicit renormalization framework.
A non-exhaustive list of topics that projects in this programme may address:
You may be a good fit for this programme if you have:
Inspired by existing work (Fischer et al. 2024, Berman et al. 2023), we think that explicit renormalization can be used to find features that are nonlinear, principled, and causally decoupled with respect to the computation. The ultimate goal of this programme is to operationalize renormalization for optimally interpreting the statistical system representing the AI’s reality. The resulting techniques should be capable of finding unsupervised features that perform better than state-of-the-art interpretability tools like sparse autoencoders (SAEs) (Anders et al.).
The central problem for causal decomposition remains the extraction of principled features. While SAEs are a particularly interpretable and practical unsupervised technique for obtaining interesting (linear) features of activations, they are not optimized to provide a complete decomposition into causal primitives of computation – SAEs do not give the “correct” features in general (Mendel 2024). When interpreting behaviors more sophisticated than simple grammar circuits, SAEs by themselves are simply not enough to give a strong interpretation, let alone a causal decomposition (Leask et al., 2025).
However, SAEs may be, for explicit renormalization, what early renormalization was for the RG – an ad-hoc approach to ‘engineer’ away unphysical divergences that nevertheless laid the foundation for a theoretical formalism. We see them – and the associated linear representation hypothesis (LRH) – as a first-order ansatz from which to build.
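For concreteness, here is a minimal sparse autoencoder sketch of the kind referred to above: a linear encoder with a ReLU and an L1 sparsity penalty, plus a linear decoder, trained to reconstruct activations. The dimensions, penalty weight, and training data below are placeholders, not a recommended recipe or any particular group's implementation.

```python
# Minimal SAE sketch: reconstruct activations through a sparse, overcomplete
# feature layer. All hyperparameters and data here are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

d_model, d_features, l1_coeff = 256, 1024, 1e-3
sae = SparseAutoencoder(d_model, d_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

# Placeholder for a batch of activations collected from a model of interest.
activations = torch.randn(4096, d_model)

for step in range(200):
    reconstruction, features = sae(activations)
    loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("mean active features per input:", (features > 0).float().sum(dim=-1).mean().item())
```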
Projects in this programme may:
You might be a good fit for this programme if you have:
The final two programmes build on the first two, so their direction will be set at a later time. For now, we present a rough scope to set our intentions for future work.
Programme 3: Leveraging insights gained from implicit renormalization for ‘principled safety’ of AI systems
Inspired by the separation of scales between different effective field theories along an RG flow, this programme seeks to provide both theoretical justification and empirical validation for a causal separation of scales in neural networks. This work depends on our ability to show that, under appropriate conditions, fine-grained behaviors can be conditionally isolated given coarser scales, thereby enhancing our capacity to design AI systems with principled safety guarantees. It also depends on a better operationalization of AI safety concepts like ‘deception’, in order to understand whether complex features like deception are in fact separated from finer features at some scale. Another aim is to find candidate methods for identifying alternative RG flows with ‘safer’ critical points (with carefully defined metrics to measure this).
Programme 4: Applying field theory ‘in the wild’
This programme puts what we have learned in the first three programmes to work, testing our framework against empirical signatures of scale, renormalization, and other field-theoretic (FT) structure as they naturally occur in large-scale, real-world systems. The aim is to build more general evidence for, and theoretical development of, implicit renormalization in SOTA environments. A core goal would be to use these results to develop an analog of the NN-QFT correspondence far from Gaussianity.