Thanks for this post. I am a theoretical physicist and I agree that LLM interpretability is a physics problem. I think more and more physicists are becoming interested in this problem because it is obvious to us that physics methods, ideas, and tools will be useful here. The whole field of physics is just mechanistic interpretability of the universe. Physics is quite broad and there are no fixed rules. The only goal is to increase understanding in any possible way, and in fact one should always change and redefine the rules of the game. LLMs (as a toy model of intelligence) are just another complex system, and one with which it is very easy to experiment.
I agree! LLMs are in many ways self-organized complex systems. And they are very easy to experiment on.
Which questions about LLM interpretability do you think theoretical physics can help investigate?
I have some objections to this, from the perspective of a doc I'm writing (and will possibly post in a few weeks). I think you're using biology as a synonym for feature/circuit microscopy, but there are biologically motivated perspectives, like self-organised criticality or systems neuroscience, that use statistical-physics formalisms yet are primarily biological in nature. Likewise, physics is not only about smooth, universal regularities. Phase transitions, renormalisation and critical phenomena are central to modern physics, and they are violently non-smooth. That side of physics is almost completely absent from the piece, and overall I would say the rhetorical contrast isn't as clear-cut as the article depicts.
I agree it would be nice if we could get a second law of thermodynamics for LLMs. But safety interventions are usually enacted locally (gradient nudges, RLHF reward shaping, inference-time steering) and a thermodynamic state variable a la "entropy of the latent field" is almost certainly too coarse to guarantee that the next token is non-harmful. I think you underplay a manipulability criterion, where a variable is only valuable if you can steer it cheaply and predictably, which is why we might care about critical windows.
Finally, I would also add that the messiness is in some ways the point. I don't have a picture of misalignment as necessarily stemming from these really neat simplicities; I think there's a lot of risk in being insufficiently granular if we elevate only the cleanest "order parameters" and discard messy local details. I would guess alignment failures often don't present as a scalar drifting past a threshold, but rather as narrow-band exploits or corner-case correlations that live at the same granularity as the messy feature and circuit probes you describe as a distraction. If you can jailbreak a frontier model with a one-sentence rhyme, then any interpretability story that averages over millions of parameters until it returns a single macro-variable is, by construction, blind to the event we need to prevent.
Thank you for the comment, I appreciate this perspective.
A couple of things to clarify:
As I quoted from "Emergence of brains", I think the more traditional approach of biology is to embrace the complexity of living systems. In contrast, physics tends to seek more and more general and unifying principles (symmetries, order, curvature) that apply broadly across the universe, whether it's matter or living systems, or even abstract phenomena like financial markets.
In the context of interpretability, I use this analogy to describe two (main) distinct approaches: one is self-proclaimed "biological" and is nowadays mostly about breaking down models into features and circuits; the other is more emergentist and seeks to find, for example, unifying principles in how models structure internal representations.
Again, it's mostly an analogy, and both approaches are based on math (which is incidentally often similar to stat-mech math). And by "second law of thermodynamics" I did not mean it literally, but as an illustration of a principle which, without being very precise or practical, could provide a general guiding framework (e.g., perpetual motion is not possible).
To conclude, I'm not saying the messiness of features and circuits is not useful; I actually find it fascinating. But I gently push for more recognition of the alternative approaches, which I believe will become very illuminating.
Interpretability is the nascent science of making the vast complexity of billion-parameter AI models more comprehensible to the human mind. Currently, the mainstream approach is reductionist: dissecting a model into many smaller components, much like a biologist mapping cellular pathways. Here, I describe and advocate for the complementary perspective of seeking emergent simplicities[1]: underlying principles, following physics' march towards universality.
Large language models are not engineered; they are grown -- like a sourdough starter or a bonsai[2]. The analogy has become a trope among the (expanding) circle of researchers digging into the inner workings of LLMs, attempting to elucidate how new words are computed from an input sequence of tokens. It is certainly an upgrade from the vaguer "black box" narrative of the early days of ChatGPT. In fact, the reference to organisms runs deeper than it might first appear.
In the eclectic community of interpretability researchers, a core is now crystallizing around an approach that claims inspiration from biology. For example, Anthropic recently published a paper called On the biology of a large language model[3], which relies on a set of probing techniques referred to therein as a "microscope". More recently, an "embryology of a language model"[4], showcasing an embryo-looking UMAP plot, has caught a lot of attention. Not to mention that it all started with... (artificial) neurons.
This mainstream "biological" interpretability is mobilizing substantial human and computational resources -- at places such as Anthropic, DeepMind, Goodfire, or Transluce. Its main concern is to expand on a special kind of model decomposition into features and circuits. Features (in this context) are combinations of neurons that correspond to human-understandable concepts, for example [ducks] or [words ending in -ing]. Circuits are networks of features which seem to implement comprehensible reasoning pathways by combining concepts.
This approach has been successful in providing clear, even actionable findings. It has helped locate bugs and sources of hallucinations. Once discovered, features can be tuned to steer language models toward particular traits, such as speaking like a pirate[5]. Anthropic recently uncovered "persona vectors"[6], which could be used as little knobs to nudge a model towards certain personality traits.
Upon reflection, however, while features and circuits seem to elucidate certain reasoning mechanisms that LLMs have tacitly implemented during their training, they also make things a little messy. Models have to be pipefitted with a whole array of accessory neural nets (sparse auto-encoders, transcoders, crosscoders) that produce millions of features, which then have to be interpreted as human-friendly concepts. And these features can assemble into an astronomical number of combinatorial circuits.
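To make the pipefitting concrete, here is a minimal sketch of what one such accessory network looks like. The dimensions, the ReLU encoder, and the L1 sparsity penalty are generic illustrative choices of mine, not taken from any particular paper.

```python
# Minimal sketch of a sparse autoencoder trained on a model's internal
# activations; each coordinate of `features` is a candidate interpretable concept.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 4096, d_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse, overcomplete code
        reconstruction = self.decoder(features)            # should match the input
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruction error keeps the features faithful to the model;
    # the L1 term keeps only a few features active per token.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```

Each of the (here 65,536) feature coordinates then has to be labeled with a human-friendly concept, which is where the millions of features come from.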
So, instead of unifying language models around more familiar mathematical objects, the current trend adds to the complexity. Here again, the similarity with a traditional biological perspective is noticeable: biology is messy and thrives on diversity, whether it's expanding the taxonomy of insect species or discovering new genetic pathways.
Is there an alternative scientific approach to simplifying the vast complexity of the billions of weights forming the backbone of these grown networks? I propose: physics.
Modern physics has long been concerned with inferring simple laws from systems made of enormous numbers of interacting particles. The success of thermodynamics was to show that all the randomly jiggling molecules in a gas collectively collapse onto a small set of variables of interest, such as pressure and temperature. Thermodynamics then spawned statistical mechanics, a more general framework that has been applied to things as diverse as financial markets, bird flocks, and, indeed, systems of neurons, whether biological or artificial.
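As a toy numerical aside of my own (not from the cited work), the collapse is easy to see: the average kinetic energy of a box of simulated molecules fluctuates less and less as the number of molecules grows, even though every microstate is different.

```python
# Concentration of a macro-variable: the mean kinetic energy of N random
# "molecules" becomes sharper as N grows (standard error shrinks like 1/sqrt(N)).
import numpy as np

rng = np.random.default_rng(0)
for n_molecules in (10**2, 10**4, 10**6):
    # Maxwell-Boltzmann-like velocities: independent Gaussian components.
    velocities = rng.normal(size=(n_molecules, 3))
    kinetic_energy = 0.5 * (velocities**2).sum(axis=1)
    std_of_mean = kinetic_energy.std() / np.sqrt(n_molecules)
    print(f"N={n_molecules:>9}: mean KE = {kinetic_energy.mean():.4f} "
          f"(standard error ~ {std_of_mean:.2e})")
```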
So it's no coincidence that the 2024 Physics Nobel Prize went to neural-network pioneers Hinton and Hopfield. Reflecting on the prize, Princeton's Bill Bialek, a physicist who has consistently pushed stat-mech outside of its conventional bounds[7], recently wrote[8]:
Physics, at least in part, is a search for principles that are simple and universal. Biology, at least in part, is a celebration of the complexity and diversity of life.
The physics approach to interpretability hopes to find universal patterns (e.g., scaling laws) and foundational principles (e.g., conservation laws) underlying the convoluted edifice of weights and biases described above. Rather than pulling out new computational circuits the way biologists chase molecular pathways, a physicist's dream would be to find a second law of thermodynamics for LLMs (and AI systems in general).
Getting there will take some time, a gradual distillation toward unification, but compelling evidence is already emerging. AI models, regardless of architecture or training details, appear to converge toward similar patterns. Some studies show that their training trajectories "explore the same low-dimensional manifold"[9], and that token dynamics in latent space follow similar pathways across tokens and models[10]. Representation analysis further reveals that different networks can be linearly aligned, supporting the Platonic Representation Hypothesis that models converge towards a common statistical representation of the world.
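To illustrate what "linearly aligned" means here, one can fit a least-squares linear map from one model's hidden states to another's on the same inputs and measure how much variance it explains. The sketch below is my own and uses synthetic stand-ins for real model activations.

```python
# Sketch of a linear-alignment test between two sets of representations.
import numpy as np

def linear_alignment_score(reps_a: np.ndarray, reps_b: np.ndarray) -> float:
    """reps_a: (n_samples, d_a), reps_b: (n_samples, d_b) hidden states for the
    same inputs. Returns an R^2-like score of the best linear map reps_a -> reps_b."""
    a = reps_a - reps_a.mean(axis=0)          # center both representation sets
    b = reps_b - reps_b.mean(axis=0)
    w, *_ = np.linalg.lstsq(a, b, rcond=None)  # least-squares map: a @ w ~ b
    residual = b - a @ w
    return 1.0 - residual.var() / b.var()

# Toy check: two "models" that are different linear views of the same latent
# structure align almost perfectly (score close to 1.0).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 32))
reps_a = latent @ rng.normal(size=(32, 64))
reps_b = latent @ rng.normal(size=(32, 48))
print(linear_alignment_score(reps_a, reps_b))
```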
The emergentist approach thus seems successful at surfacing universal principles that govern both the internal computation and external behavior of models. It might not produce directly actionable items like the biological circuits of mainstream interpretability, but given a little time it should generate broadly applicable guiding principles for design and applications.
In particular, the physics lens, rather than the biological one, is, in my view, the most likely to answer some of the deepest questions about the perplexing new form of intelligence that has emerged from silicon and electrons. Some that come to mind include:
How is next-token prediction, a very localized, short-range action, capable of inducing long-range order and apparent planning over hundreds of words?
Are new emergent capabilities, new phase transitions, and new scaling laws still to be expected if models grow even larger?
And ultimately: to what extent can these systems be thought of as alive or sentient?
Meanwhile, several lower-hanging fruits are within reach in the next few months or couple of years, assuming proper resources are allocated. They will be the subject of a later post, but as a spoiler, I think extending the view of LLMs as dynamical systems will be fruitful. For one, it could inform how to more reliably steer models along features without creating diverging trajectories that break the output. In parallel, we might be able to put LLM behavior into equations, as has been done with C. elegans and other organisms[11], with implications for safety and alignment.
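To give a flavor of what "steering without diverging trajectories" could look like, here is a hypothetical sketch of inference-time steering with a cap on how far the hidden state is allowed to move. The model, layer index, and feature direction are placeholders of mine, not a tested recipe.

```python
# Sketch: add a feature direction to a hidden layer, but bound the perturbation
# relative to the hidden state's own scale so the trajectory stays near the
# region the model normally visits.
import torch

def steering_hook(feature_direction: torch.Tensor, strength: float, max_shift: float = 0.1):
    # Assumes feature_direction has the layer's hidden dimension and matches
    # the activations' device/dtype.
    direction = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        typical_norm = hidden.norm(dim=-1).mean().item()
        shift = min(strength, max_shift * typical_norm)  # cap the push
        steered = hidden + shift * direction
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook

# Usage (names are illustrative, assuming a Hugging Face-style transformer):
# handle = model.transformer.h[layer_idx].register_forward_hook(
#     steering_hook(feature_direction, strength=4.0))
# ... generate text ...
# handle.remove()
```

The cap is the design choice worth noting: from a dynamical-systems viewpoint, an unbounded push along a feature direction can kick the trajectory off the manifold the model normally occupies and break the output.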
Biology and physics, in the traditional sense, have been exceedingly successful over the past century when joining forces. There are good reasons to believe the same could apply here, and that our multi-level understanding of LLMs will flourish when pursued through multiple lenses.
In science, the most difficult thing is often finding the right questions to ask.
In our case, what are we hoping to understand about large language models -- and for what purpose? A better microscope will reveal finer details about the internal wiring of LLMs: how information is encoded and passed along; the elementary computational circuits that resolve syntax, semantics, grammar, meaning, planning. A wider telescope might reveal new objects gliding along high-dimensional orbits, hidden attractors and intrinsic curvatures, all together describing new laws and unifying principles. Who knows, it might even reveal whether there is something akin to consciousness hiding somewhere in the latent space.
These thoughts follow from fascinating conversations with many different people, most notably: Jacob Dunefsky, Chris Earls, Toni Liu, Haley Moller, XJ Xu, and researchers at Goodfire. Bialek's Emergence of Brains paper helped crystallize the main idea presented here.
[1] Phrase by Sri Iyer-Biswas and Charlie Wright in Emergent Simplicities in the Living Histories of Individual Cells (2025)
[2] Eric Ho, On Optimism for Interpretability, Goodfire AI blog
[3] Lindsey et al., On the Biology of a Large Language Model, Transformer Circuits Thread (2025)
[4] Wang et al., Embryology of a Language Model (2025)
[5] McGrath et al., Mapping the latent space of Llama 3.3 70B, Goodfire AI Research
[6] Chen et al., Persona Vectors: Monitoring and Controlling Character Traits in Language Models (2025)
[7] ...and, incidentally, Dario Amodei's PhD co-advisor
[8] William Bialek, Emergence of Brains, PRX Life (2025)
[9] Mao et al., The training process of many deep networks explores the same low-dimensional manifold, PNAS (2024)
[10] Sarfati et al., Lines of Thought in Large Language Models, ICLR (2025)
[11] Stephens et al., Dimensionality and Dynamics in the Behavior of C. elegans, PLoS Computational Biology (2008)