Large Language Models and the Critical Brain Hypothesis

by David Africa
9th Sep 2025
AI Alignment Forum
7 min read

Summary: I argue for a picture of developmental interpretability informed by neuroscience. A useful way to study and control frontier-scale language models is to treat their training as a sequence of physical phase transitions. Singular learning theory (SLT) rigorously characterizes singularities and learning dynamics in overparameterized models; developmental interpretability hypothesizes that large networks pass through phase transitions, regimes where behaviour changes qualitatively. The Critical Brain Hypothesis (CBH) proposes that biological neural systems may operate near such phase transitions, and I claim that modern language models behave analogously. Some transformer phenomena may be closely analogous: in-context learning and long-range dependency handling strengthen with scale and training distribution (sometimes appearing threshold-like), short fine-tunes or jailbreak prompts can substantially shift behaviour on certain evaluations, and “grokking” shows delayed generalization with qualitative similarities to critical slowing down. Finally, I outline tests and implications under this speculative lens.

Phase Transitions

Developmental interpretability proposes that we use singular learning theory (SLT) to study the development of language models in terms of phase transitions. This helps because the common phases models share in their development, rather than their individual idiosyncrasies and quirks, are more likely to apply to models much larger than the ones we have today. Further, we should expect phase transitions to be easy to observe, even without SLT tools; by definition, a phase transition should be highly visible in the right indicators. This should make building a catalogue for engineering-style interventions somewhat easier. Finally, phase transitions seem especially relevant for safety purposes. Within a phase, model behavior changes slowly and is easier to predict. During a transition, behavior can change quickly and unpredictably, possibly in erratic or dangerous ways, and the model is generally more malleable, which may make it more susceptible to adversarial pressure.
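
To make the “easy to observe” claim concrete, here is a minimal sketch of what that instrumentation could look like: track any scalar observable across checkpoints (validation loss, or an estimate of the SLT local learning coefficient if you have one) and flag checkpoints where it jumps unusually fast. The observable, the z-score detector, and the threshold are all illustrative choices of mine, not part of the developmental interpretability toolkit itself.

```python
# Minimal sketch: flag candidate phase transitions in a training run by looking
# for jumps in some scalar observable tracked across checkpoints. The observable
# could be validation loss, an estimate of the SLT local learning coefficient,
# or any other per-checkpoint summary; here it is just a list of floats.
import numpy as np

def candidate_transitions(observable, z_thresh=3.0):
    """Return checkpoint indices where the observable changes unusually fast.

    `observable` is a 1-D array of per-checkpoint values. We z-score the
    absolute first differences and flag points more than `z_thresh` standard
    deviations above the typical step-to-step change.
    """
    obs = np.asarray(observable, dtype=float)
    steps = np.abs(np.diff(obs))
    z = (steps - steps.mean()) / (steps.std() + 1e-12)
    return np.where(z > z_thresh)[0] + 1  # +1: index of the checkpoint after the jump

# Toy usage with a synthetic "loss" curve containing one sharp drop.
loss = np.concatenate([np.linspace(2.0, 1.8, 50),
                       np.linspace(1.1, 1.0, 50)])  # abrupt drop at checkpoint 50
print(candidate_transitions(loss))  # -> [50]
```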

The Critical Brain Hypothesis (CBH) and Alignment

I claim the Critical Brain Hypothesis is a useful lens to understand developmental interpretability.

The CBH claims that biological brains spend most of their time in the narrow “edge-of-order-and-disorder” regime that statistical physicists would call a continuous phase transition. In this sliver of phase space, we notice a few interesting things. First, correlation length grows, which means that local perturbations are less local and propagate system-wide. Second, dynamic range is high, which is to say that the same substrate can represent both very small and very large signals. Finally, susceptibility (in the statistical-physics sense) peaks: the system is maximally sensitive to inputs and can switch global states with minimal energy.

This is reminiscent of some observations we might make about LLMs. First, long-range credit assignment appears fairly suddenly at the “emergent” context window length. Critical windows also show up during sampling, where narrow time intervals largely determine which features appear in the final output. Second, a few gradient updates or a single prompt can sometimes flip the model from “helpful assistant” to harmful behaviour. Finally, grokking shows delayed generalization after long plateaus, qualitatively similar to critical slowing down and finite-size scaling.

We might use the CBH as an organizing metaphor under which emergence anecdotes can be re-organised: many of the qualitative jumps we see during scaling reflect the thermodynamic phase changes of a high-dimensional learning dynamical system. This also provides a neat analogy to classical machine learning. Data scientists used to balance carefully between the bias and variance terms, worried they might overparametrize their model; they had to pick the best model that wasn’t too complicated for the job. When neural networks scaled up, we found that they worked because once we had more parameters than datapoints, the space of functions we could interpolate through blew up and we could find really simple solutions after a period of double descent (where the test loss spiked again before converging to an even lower plateau). In this sense we might think of the period of interpolation through heavily overfitted solutions as the critical period, except that the model stays in the critical period all the time, since the problem it interpolates through is a constantly moving world that keeps introducing new datapoints.

If you entertain the possibility that a transformer keeps skating along this surface, alignment can be framed as a matter of state-space control[1], where the training loop is a sequence of movements toward and away from the critical manifold. In sub-critical phases we harvest stability and low loss on easy data. In super-critical bursts we encourage exploration and task transfer. In the critical window, where susceptibility is highest, we administer value-laden nudges. If this picture is correct, then safety interventions delivered elsewhere are less cost-effective. I am not sure whether it should make us more or less confident about monitoring CoT[2].

A natural objection is that apparent power-law behaviour[3] in neural networks might be an artefact of finite size or unmodelled heterogeneity rather than evidence of criticality. That is a real possibility, and the history of criticality claims in biology is littered with premature conclusions. But for alignment purposes the burden of proof is asymmetrical. If even a small subset of the observed “phase-transition” stories in large models turns out to be authentic, then ignoring them leaves safety gains on the table. Conversely, making training protocols conditional on criticality metrics is low-risk: it would mostly amount to sparse, low-overhead instrumentation added to training runs you would do anyway. If the metrics turn out to be red herrings, they will at worst add benign regularisation and interpretability checkpoints.
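
To illustrate why the objection has teeth, here is a small sketch (numpy only, with all choices illustrative) showing that heavy-tailed but non-critical data, such as lognormal samples over a finite range, can yield a plausible-looking straight-line fit on a log-log plot. A fitted exponent alone is therefore weak evidence; likelihood comparisons against alternatives and the finite-size scaling checks of footnote 3 carry more weight.

```python
# Sketch of the "power law or artefact?" worry: over a finite range, heavy-tailed
# but non-critical data (here, lognormal samples) can look like a power law on a
# log-log plot, so an eyeballed straight line or a fitted slope alone is weak
# evidence of criticality.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.lognormal(mean=1.0, sigma=1.5, size=100_000)

# Log-binned histogram, then a least-squares slope in log-log space.
bins = np.logspace(np.log10(samples.min()), np.log10(samples.max()), 40)
counts, edges = np.histogram(samples, bins=bins, density=True)
centers = np.sqrt(edges[:-1] * edges[1:])
mask = counts > 0
slope, intercept = np.polyfit(np.log(centers[mask]), np.log(counts[mask]), 1)
residuals = np.log(counts[mask]) - (slope * np.log(centers[mask]) + intercept)

print(f"fitted 'exponent': {-slope:.2f}, RMS log-residual: {residuals.std():.2f}")
# The fit can look respectable even though no power law is present, which is why
# likelihood-ratio tests against lognormal fits and finite-size scaling collapse
# (footnote 3) are the more convincing checks.
```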

You might also extend the objection to the analogy to brains in general: brains do a very different type of computation from large language models. I think that is reasonable, and it seems that brains are hardcoded in many different and idiosyncratic ways which neural networks don’t replicate. But part of the claim above is that we might expect neural networks to be critical based on other pieces of evidence, independent of the brain analogy. And my low-confidence guess would be that self-organized criticality is a characteristic of generally intelligent systems, in the same way that many companies run their supply chains with just-in-time inventory and many financial markets hover near the edge of liquidity crises.

Critical Daydreaming

Several people (Will Kirby and Geoffrey Irving at UK AISI) have mentioned that this is related to LLM daydreaming. I imagine that the daydreaming loop suggested there is a kind of slow-motion criticality: the loop asks the model to keep doing cheap background probes, and stores the result if it seems interesting. The CBH would suggest that such nudges create useful global changes when the system is critical, and that the system is incentivized to burn some extra compute to stay near this edge. If the CBH story is right, the daydreaming loop is what any agent at criticality will end up doing to stay ready for its next phase change. There is one difference between the two stories, though: the daydreaming loop moves through activation space, since the weights are frozen, whereas critical surfing (below) would also move through weight space.
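
As a toy picture of that loop, here is a sketch in Python. The `generate` and `score_interestingness` callables are placeholders for whatever model-call machinery one actually has; nothing here corresponds to a real daydreaming implementation.

```python
# Toy sketch of a daydreaming loop as described above: in idle cycles, the model
# is asked to combine random stored ideas, and the result is kept only if an
# "interestingness" filter accepts it. `generate` and `score_interestingness`
# are placeholders for whatever model-call machinery you actually have; nothing
# here is a real API.
import random

def daydream(memory, generate, score_interestingness,
             n_probes=100, threshold=0.8):
    """Run cheap background probes and store the interesting ones."""
    for _ in range(n_probes):
        if len(memory) < 2:
            break
        a, b = random.sample(memory, 2)           # pick two stored ideas
        thought = generate(f"Connect these ideas: {a} || {b}")
        if score_interestingness(thought) > threshold:
            memory.append(thought)                # fold the insight back in
    return memory
```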

You can think of this loop as a criticality tax: extra compute is spent on noisy internal exploration because this keeps the representational landscape loose and re-writable, all so that it is easier to fuse distant ideas, invent new tools, or reinterpret your existing goals. This, I imagine, is why human insight feels stochastic at times. You think very hard on some research problem, and then go for a run or for lunch or for a climb (paying the tax). Then you come back after having the problem untangle in the back of your mind, and it strikes you like a bolt of lightning. I guess these are behavioral tricks so that you can hover near the phase boundary long enough for a useful avalanche to occur.

Critical Surfing and some practical thoughts

Critical systems are simultaneously a boon and a bane: high susceptibility means small alignment gradients go a long way, which makes learning easier, but it also means that unintended mis-specifications and malicious inductive biases get amplified. This should imply some things about how we treat alignment.

What I think should follow is a control theory of “critical surfing” (a rough sketch in code follows the list):

  1. Detect when the model is entering a critical window[4] (something like a spike in the SLT local learning coefficient λ, Hessian flattening, or gradient non-Gaussianity).

  2. Switch to an “alignment-first curriculum”. Here I think you could do some consistency training (if consistency proves to be good for alignment), or some really high quality RLHF.
  3. Exit criticality by regularisation once we satisfy certain safety constraints.
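
As a sketch of how these three steps could fit together, the snippet below uses gradient non-Gaussianity (excess kurtosis of recent per-parameter gradients) as a stand-in detector. The detector choice, the thresholds, and the `pretraining_step` / `alignment_step` / `increase_regularisation` / `safety_ok` hooks are all assumptions of mine rather than a worked-out method.

```python
# Sketch of the "critical surfing" loop above, using gradient non-Gaussianity
# (excess kurtosis of recent per-parameter gradients) as a stand-in criticality
# detector. The detector choice and thresholds are assumptions, and the callables
# are placeholders for whatever your training stack provides.
import numpy as np

def excess_kurtosis(x):
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / (x.std() + 1e-12)
    return float((z ** 4).mean() - 3.0)  # 0 for a Gaussian

def surf(recent_grads, pretraining_step, alignment_step, increase_regularisation,
         safety_ok, enter_thresh=1.0, exit_thresh=0.2):
    """One scheduling decision: which kind of step to take next."""
    k = excess_kurtosis(recent_grads)
    if k > enter_thresh and not safety_ok():
        alignment_step()              # 2. inside the window: alignment-first curriculum
    elif k > exit_thresh and safety_ok():
        increase_regularisation()     # 3. constraints met: damp back out of criticality
    else:
        pretraining_step()            # 1. sub-critical: ordinary training
```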

A few ways to verify this come to mind. First, we might check whether the correlation length of representations scales with model depth. Second, we can look for “activation avalanches,” i.e. contiguous cascades of large activations across layers or timesteps; if the criticality picture is correct, their size distribution should approximate a power law with an exponent of about 1.5 just before new capabilities emerge. Third, we can run matched fine-tuning experiments. I would guess that models nudged while still inside the critical window should need roughly one to two orders of magnitude (ten- to one-hundred-fold) less RLHF reward signal to attain a given alignment metric, whereas the same intervention applied after convergence should leave them more brittle to off-distribution probes.
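
For the avalanche check in particular, a minimal version might look like the following: threshold an activity trace, group contiguous above-threshold stretches into avalanches, and estimate the tail exponent with a Clauset-style MLE to compare against the roughly 1.5 prediction. The choice of trace, threshold, and x_min are all assumptions.

```python
# Sketch of the "activation avalanches" check: threshold an activity trace
# (e.g. the fraction of large activations per layer or per timestep), group
# contiguous above-threshold stretches into avalanches, and estimate the tail
# exponent of their size distribution with a Clauset-style MLE. The threshold,
# the choice of trace, and x_min are all assumptions.
import numpy as np

def avalanche_sizes(activity, threshold):
    """Sum of activity over each contiguous above-threshold stretch."""
    sizes, current = [], 0.0
    for a in activity:
        if a > threshold:
            current += a
        elif current > 0:
            sizes.append(current)
            current = 0.0
    if current > 0:
        sizes.append(current)
    return np.array(sizes)

def powerlaw_exponent_mle(sizes, x_min):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / x_min))."""
    tail = sizes[sizes >= x_min]
    return 1.0 + len(tail) / np.sum(np.log(tail / x_min))

# Usage: `activity` could be, say, the per-layer fraction of activations above
# some magnitude cut, concatenated over a batch of prompts.
# sizes = avalanche_sizes(activity, threshold=activity.mean())
# print(powerlaw_exponent_mle(sizes, x_min=np.percentile(sizes, 50)))
```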

This still leaves a couple of open questions. Two that occur to me: First, can we deliberately widen or narrow the critical window? In pre-training, messing with the temperature scheduling and injecting noise sound like obvious ways to do this. I don’t have as good a sense of how to do this in post-training, but at inference time we could perhaps induce massive activations in the model or intervene in the sampling process. Second, is there some multiplexing of criticality going on? That is, are there nested phase transitions (syntax → semantics → meta-reasoning) that require different interventions at different epochs?
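
For the noise-injection knob specifically, the mechanical version is easy to write down even though whether it actually widens the window is exactly the open question. A toy PyTorch-style sketch, with a hypothetical `in_critical_window()` predicate:

```python
# Toy sketch of one knob mentioned above: injecting Gaussian gradient noise while
# a (hypothetical) `in_critical_window()` predicate says the run is near a
# transition. Whether this widens critical windows is the open question; this is
# just what the intervention would look like mechanically.
import torch

def noisy_step(model, optimizer, loss, in_critical_window, sigma=1e-3):
    optimizer.zero_grad()
    loss.backward()
    if in_critical_window():
        for p in model.parameters():
            if p.grad is not None:
                p.grad.add_(torch.randn_like(p.grad), alpha=sigma)  # Langevin-style noise
    optimizer.step()
```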

I think that if frontier capabilities continue to improve, future models will be trained at scales where new critical windows appear (think: latent planning, tool use, multi-modal fusion). Regardless, if we think of models as sitting on the edge of criticality the entire time, then perhaps, past a certain point, the whole training run is one extended phase transition.

Collaborate with us

If this was interesting to you, we’d be very excited to collaborate on future research. If you’d like to collaborate with UK AISI or apply for funding, you can apply to the Alignment Project, where we will award grants ranging from £50,000 to £1,000,000 for work that accelerates progress in AI alignment research. Our research agenda is detailed in full here.

 

  1. ^

     Geoffrey Irving at UK AISI adds that part of the goal should be understanding whether we can keep some good dimensions skating along the boundary while keeping undesirable ones in fast-decay mode. In this way, alignment would do the dimension separation, and control would assume the model is always critical along the dangerous dimensions.

  2. ^

     If most real computation is done through “neuronal avalanches”, millisecond bursts of activity followed by seconds of quiescence, it would make the important LLM “thoughts” easier to find but harder to parse. I also don’t know whether the two systems work on comparable timescales. There might also be some analogy to massive activations, which anchor attention probabilities on the corresponding tokens.

  3. ^

     That is, at continuous phase transitions, power-law exponents (together with finite-size scaling and scaling collapse) are evidence of criticality.

  4. ^

     Which you could think of as “per-dimension,” in that some dimensions of weight space are in a critical window while others are not. You could probably constrain them even further to certain highly active patches of a certain frame.