This is a linkpost for https://theinsideview.ai/jesse

Jesse Hoogland is a research assistant working on AI safety at David Krueger's lab in Cambridge. He has recently been publishing on LessWrong about how to apply Singular Learning Theory to alignment, and last week he organized a workshop on the topic in Berkeley.

I thought it made sense to interview him to get a high-level overview of Singular Learning Theory (and of more general approaches like developmental interpretability).

Below are some highlighted quotes from our conversation (available on YouTube, Spotify, Google Podcasts, and Apple Podcasts). For the full context of each quote, see the accompanying transcript.

Interpreting Neural Networks: The Phase Transition View

Studying Phase Transitions Could Help Detect Deception

"We want to be able to know when these dangerous capabilities are first acquired because it might be too late. They might become sort of stuck and crystallized and hard to get rid of. And so we want to understand how dangerous capabilities, how misaligned values develop over the course of training. Phase transitions seem particularly relevant for that because they represent kind of the most important structural changes, the qualitative changes in the shape of these models internals.

Now, beyond that, another reason we’re interested in phase transitions is that phase transitions in physics are understood to be a kind of point of contact between the microscopic world and the macroscopic world. So it’s a point where you have more control over the behavior of a system than you normally do. That seems relevant to us from a safety engineering perspective. Why do you have more control in a physical system during phase transitions?" (context)

A Concrete Example of a Phase Transition in Physics and an Analogous Example Inside Neural Networks

"Jesse: If you heat a magnet to a high enough temperature, then it’s no longer a magnet. It no longer has an overall magnetization. And so if you bring another magnet to it, they won’t stick. But if you cool it down, at some point it reaches this Curie temperature. If you push it lower, then it will become magnetized. So the entire thing will all of a sudden get a direction. It’ll have a north pole and a south pole. So the thing is though, like, which direction will that north pole or south pole be? And so it turns out that you only need an infinitesimally small perturbation to that system in order to point it in a certain direction. And so that’s the kind of sensitivity you see, where the microscopic structure becomes very sensitive to tiny external perturbations.

Michaël: And so if we bring this back to neural networks, if the weights are slightly different, the overall model could be deceptive or not. Is it something similar?

Jesse: This is speculative. There are more concrete examples. So there are these toy models of superposition studied by Anthropic. And that’s a case where you can see that it’s learning some embeddings and unembeddings. So it’s trying to compress data. You can see that the way it compresses data involves this kind of symmetry breaking, this sensitivity, where it selects one solution at a phase transition. So that’s a very concrete example of this." (context)
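
To make the symmetry-breaking picture concrete, here is a minimal numerical sketch (mine, not from the interview) of the mean-field Curie-Weiss magnet Jesse describes: below the critical temperature, the sign of an infinitesimally small external field h decides which of two macroscopically different magnetizations the system settles into. The parameter values are arbitrary illustrations.

```python
import numpy as np

def magnetization(T, h, J=1.0, iters=2000):
    """Solve the Curie-Weiss self-consistency equation m = tanh((J*m + h) / T)
    by fixed-point iteration, starting from an almost unmagnetized state."""
    m = 1e-12
    for _ in range(iters):
        m = np.tanh((J * m + h) / T)
    return m

# Above the critical temperature (T > J), a tiny field barely matters:
print(magnetization(T=1.5, h=+1e-6))  # ~ +2e-6
print(magnetization(T=1.5, h=-1e-6))  # ~ -2e-6

# Below it, the sign of an infinitesimal field selects one of two
# macroscopically different solutions:
print(magnetization(T=0.5, h=+1e-6))  # ~ +0.96
print(magnetization(T=0.5, h=-1e-6))  # ~ -0.96
```

The analogous (and, as Jesse says, speculative) claim for neural networks is that near a transition, small perturbations to the weights or the data can select which of several qualitatively different solutions training converges to.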

Developmental Interpretability

"Suppose it’s possible to understand what’s going on inside of neural networks, largely understand them. First assumption. Well then, it’s still going to be very difficult to do that at one specific moment in time. I think intractable. The only way you’re actually going to build up an exhaustive idea of what structure the model has internally, is to look at how it forms over the course of training. You want to look at each moment, where you learn specific concepts and skills, and isolate those. Because that tells you where to look, where to look for structure. Developmental interpretability is this idea that you should study how structure forms in neural networks. That actually might be much more tractable than trying to understand how structure is at the end of training." (context)

Singular Learning Theory

Some Background On Singular Learning Theory

"Jesse: I think the best way to think about singular learning theory is that it’s something like the thermodynamics of learning, or maybe the statistical physics of learning. So it takes a bunch of ideas that we understand pretty well from physics and applies them to neural networks to make sense of them. It’s a relatively new field, about two decades old, all invented by this one brilliant guy in Japan, Sumio Watanabe, who just saw the ingredients, clicked them into place, and started to apply them to neural networks and other systems like that.

Michaël: That was a few years ago or is that still going on? When did he do that?

Jesse: This was more than a decade ago that he saw the ingredients that would lead to singular learning theory. It’s actually a very general theory that applies to many other systems. Only recently have people started to see that it could be relevant for alignment." (context)

The Intuition Behind Singular Learning Theory

"Singular learning theory is a theory of hierarchical models. Hierarchical models like neural networks, like hidden Markov models, or other kinds of models. And these hierarchical models are very special because the mapping from parameters to functions is not one-to-one. So you can have different models. If you look at the weights, they seem like very different models, but they're actually implementing the same function. It's actually a very special feature of these systems, and it leads to all kinds of special properties." (context)

Does Singular Learning Theory Predict Anything?

"The most interesting predictions made by Singular Learning Theory are phase transitions. It tells us that we should expect there to be phase transitions, so discrete changes, sudden changes, in the kinds of computations being performed by a model over the course of training. I should give that a little caveat. A lot of this theory is built in another learning paradigm, Bayesian learning. There’s still some work to apply this to the learning paradigm we have in the case of neural networks, which is with stochastic gradient descent. What it does tell us is it predicts these phase transitions, and that would tell us something very significant about what makes neural networks different from other kinds of models." (context)
