Statistical Physics for Ambitious Interpretability: A Workshop Retrospective

Lauren Greenspan; Lucas Teixeira; Claudine Lim

In April, we held our second workshop on statistical physics and AI safety, organized around a research agenda we've been building over the past year: Physics Inspired Ambitious Mechanistic Interpretability (PIAMI). (Stay tuned for a PIAMI research roadmap, to be shared soon.) This post is a summary of the week's talks, as well as my personal takeaways.

The PIAMI Agenda and PrincInt’s Perspective

Standard interpretability, including methods to probe, steer, and explain model internals, falls short of the perfect-faithfulness standard of its most ambitious version. In the keynote talk, Lucas Teixeira introduced PIAMI and its potential to provide the high-confidence statistical guarantees needed to support a robust understanding of a model’s capabilities and failure modes as AI systems scale. Beyond statistical physics’ power to model neural networks as complex systems with many interacting components, the underlying bet rests on physics' proven ability to develop theoretical models and measurement tools in tandem and iteratively: the models to describe experimental observations, the tools to probe them.

Dmitry Vaintrob built on these ideas to lay out a learning theory agenda capable of grounding the robust, principled tools necessary for ambitious interpretability. The agenda aims to bridge statistical physics methods like mean field theory with the desiderata of a theory of interpretability, including neuron sparsity, stochasticity of the training process, and the causal or hierarchical structure of latent features. The goal is to route candidate theories through model organisms with the expressivity and realism properties of state-of-the-art models. Andrew Mack presented an agenda for developing ambitious mechanistic interpretability (AMI) tools, meant to close the gap between tractability and faithfulness. The talk focused on early results of a hierarchical SAE architecture with a tractable alignment tax, and described work to build interpretable-from-scratch architectures with similar design principles. As a complement to this by-design approach, Jennifer Lin talked about putting the theoretical machinery of the eNTK to work finding learned features post-hoc.

Ari Brill discussed a data models agenda to develop metrics and validation methods to test theoretical hypotheses and improve interpretability tools. This line of work uses physics theories, like high-dimensional percolation, to model realistic structural properties of natural data and generate synthetic datasets with those properties, encoded by a tractable hierarchical structure.

Theory

An improved theoretical understanding of learning in deployed models is foundationally important for their robust interpretation. How could existing theoretical hypotheses guide the definitions and assumptions we bake into AMI? How can statistical physics machinery be used – exactly or heuristically – to develop actionable tools to understand learned features and guide their development?

In the workshop's first discussion session, Jamie Simon (UC Berkeley / Imbue) introduced learning mechanics: a predictive scientific theory of the learning process analogous to classical mechanics. In a way that is deeply synergistic with the PIAMI research agenda, he made the case that the field of deep learning is ready for such a theory, which should incorporate and develop existing work on, for example, solvable toy settings, limiting behavior, and macroscopic phenomena^[1].

Another driver of this agenda, Daniel Kunin (UC Berkeley), gave an update on alternating gradient flows (AGF) as an ansatz for feature learning in linear or quadratic two-layer neural networks. He discussed extending this work to groups – providing a mechanistic decomposition of features necessary for algebraic tasks – as well as grids, providing a link with biological networks and the emergence of grid cells.

Many of the workshop’s other theory talks could serve as source material for a theory of learning mechanics. From a math perspective, Ben Gerraty spoke about a toy model of renormalization based on singular learning theory, and Yevgeny Liokumovich (U Toronto) led a discussion on the geometry of activation space. From physics and computational neuroscience, Blake Bordelon (Harvard) gave a talk on mean-field limits and scaling laws, Haim Sompolinsky (Harvard / Hebrew University) on the lineage from spin glasses to LLMs, Cengiz Pehlevan (Harvard) on scaling laws and their limits, and Zohar Ringel (Hebrew University) on RG approaches to learning dynamics. Noa Rubin (Hebrew University) led a tutorial on applying a mean-field approximation to neural networks, both as a way to theoretically describe feature learning in terms of kernel fluctuations around the self-consistent mean and as a way to select for the most likely circuits in a given learned representation. This sparked a discussion about how statistical physics redefines the fundamental units of interpretation (features) as circuits.

Application

For interpretability, the power of theory lies in its ability to predict and measure observables in real-world AI systems. Which aspects of existing theories should we use to guide the design principles of interpretability tools? How can the empirical regularities of neural networks feed back into improved theoretical hypotheses?

From the physics community, Indranil Halder (Harvard) explained recent work deriving jailbreak scaling laws from a spin-glass model, focusing on how the attack success rate scales with the number of inference time samples and validating results for a range of realistic attack strategies. Gaia Grosso (IAIFI / MIT / Harvard) introduced the group to SparKer, a sparse, self-organizing ensemble of local kernels capable of localizing rare signals in high-dimensional data without a priori assumptions of the anomalous signal. The work applies a method for statistical anomaly detection in experimental particle physics to the problem of monitoring anomalous internal states in deployed language models, with an eye toward detecting safety-relevant phenomena like alignment faking. Together, these talks represent two approaches to applying statistical physics for interpretability: the top-down use of theory to explain empirical regularities, and the bottom-up construction of model-based tools from data.

From the interpretability community, Logan Riggs Smith discussed an interpretable-by-design architecture built on bilinear activations, which make the resulting networks easy to reverse-engineer. These networks trade a modest cost in performance for a gain in analyzability, making them particularly suited to high-stakes settings. Finally, Yonatan Belinkov (Technion / Harvard) distinguished between local and distributed representations, and spoke about using interpretability methods to localize the bag of heuristics that a model uses to perform a given task. He spoke about how and why localization emerges, framed its onset as a phase transition during training, and sparked a discussion with a dictionary between thermodynamic phenomena in physics and analogous neural network phenomena that are relevant for interpretability.
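To make the bilinear point concrete, here is a minimal sketch of why such layers lend themselves to reverse-engineering. It assumes one common form of bilinear layer, the elementwise product of two linear maps, which is not necessarily the architecture presented in the talk. The names `bilinear_layer`, `interaction_matrices`, `W`, and `V` are illustrative. Under that assumption, each output is an exact quadratic form in the input, so a symmetric interaction matrix can be read directly off the weights.

```python
import numpy as np

def bilinear_layer(x, W, V):
    """Assumed bilinear form: elementwise product of two linear maps."""
    return (W @ x) * (V @ x)

def interaction_matrices(W, V):
    """Return symmetric B with bilinear_layer(x, W, V)[k] == x @ B[k] @ x."""
    # Outer product of row k of W with row k of V, one matrix per output.
    B = np.einsum("ki,kj->kij", W, V)
    # Symmetrizing leaves the quadratic form unchanged.
    return 0.5 * (B + B.transpose(0, 2, 1))

# Small random example (hypothetical sizes, for illustration only).
rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W = rng.normal(size=(d_out, d_in))
V = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)

B = interaction_matrices(W, V)
direct = bilinear_layer(x, W, V)
via_quadratic_form = np.einsum("i,kij,j->k", x, B, x)

# The two computations agree up to floating-point error.
print(np.allclose(direct, via_quadratic_form))
```

Because each `B[k]` is symmetric, it can be analyzed with standard tools such as `np.linalg.eigh(B[k])` without running the network on any data, which is one sense in which weights-based analysis becomes tractable.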

Standards of Practice and Theory in the Age of AI

Pierfrancesco Beneventano (MIT) asked how much ML theory research can be automated today, and introduced the pAI/MSc system – a multi-agent workflow that proposes, derives, and verifies novel theoretical research. The talk made us think about the role of AI in scientific discovery and about how to conceptualize, define, and evaluate ‘research taste’ in artificial agentic systems.

The importance – and hardness – of measurement and metrics was also the crux of Savannah Thais’ (CUNY) talk about the standards of theory and practice in science and AI. The similarities and differences between these sparked a productive discussion about the role of models and their generalization properties in AI systems, and of how to work with incomplete theoretical or phenomenological models when ground-truth understanding is absent.

Takeaways

It is possible to lower the barrier for entry into a new field by meeting people where they are.

The event brought together a mix of physicists, mechanistic interpretability researchers, mathematicians, neuroscientists, and machine learning theorists. Talks were deliberately not grouped by topic or approach to encourage participants out of their own comfort zones. Overall, everyone made a genuine effort to understand one another’s work and points of view. Compared to our last workshop, there was much more time for discussion. This, combined with our time spent in the interim communicating these ideas (in posts, in papers, and in person), helped everyone start on a more (but not completely) even footing.

The cost of lowering this barrier trades off against the scope and format of a workshop, as it is hard to do many things at once (build up a shared language, transmit the culture of a new field, coordinate on interesting problems).

Different disciplines tend to weigh in on machine learning research in parallel, subject to their own norms, tastes, and values. At the same time, many participants had only a partial understanding of the scope of AI safety research, missing key works from LW, independent research, and smaller AI safety orgs. This means that many were encountering for the first time not only each other's work, but also the differences in language and culture between academic and AI safety research. Talks were interesting and useful for downloading another person’s point of view, but despite a more relaxed schedule relative to the first workshop, it still felt stacked. I plan on many fewer talks at the next meeting, now that we share more of the same context.

I also tried to give the participants as little homework as possible, which meant some talks were either entirely 'physics machinery' or 'interpretability application' (though several did successfully bridge this gap!). This will work itself out if repeat attendees engage across this divide, but in future I will ask contributors to contextualize their talks with a framing prompt. Similarly, many of the discussion sessions devolved into seminar-style talks, indicating a failure on my part to clarify each participant’s role. Even so, discussions – both planned and impromptu – surfaced important confusions that are impossible to predict a priori, which is the strongest argument for protecting unstructured time in future.

Academics with the ability to bring new ideas to interpretability are interested and willing to try new things, if presented with the right opportunities.

At the end of the workshop, many attendees expressed interest in collaborating or pivoting to work on PIAMI-relevant topics, suggesting that these communities are eager to build a common language and mobilize on a shared agenda. We hope to share more about these follow-up projects – including workshops, projects, and research groups – within the next few months.

What's Next?

Hosting the first two workshops in the US prevented many talented researchers from attending, so the next PIAMI workshop will take place in Europe or the UK in the fall of 2026. This round will include minimal orienting talks, lightning updates of ongoing work, and plenty of time for guided and unstructured discussions. Suggestions for activation partners (in academia and industry) are welcome -- please get in touch!

PrincInt is an AI safety field-building organization supporting interdisciplinary collaborations aimed at providing high-assurance safety guarantees for advanced AI systems. Find out more about PrincInt’s internal research division, see our available opportunities, or use this form to get in touch.

^[1] See the companion paper, There Will Be a Scientific Theory of Deep Learning, and list of open problems.
