Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Announcing Timaeus

The strongest critique of developmental interpretability we know is the following: while it is established that phase transitions exist in neural network training, it is not yet clear how common they are, and whether they make a good target for alignment.

Is it established that phase transitions exist in the training of *non-toy* neural networks?

There are clearly s-shaped loss curves in many *non-toy* cases, but I'm not aware of any known cases which are clearly *phase transitions* as defined here (which is how the term is commonly used in e.g. physics and how I think this post wants to use the term).

For instance, while formation of induction-like attention heads [1] probably results in *s-shaped loss curves* in at least some cases, my understanding is that this probably has nothing to do with changes in the minima of some notion of energy (as the definition linked above would require, I think). I think the effect is probably the one described in Multi-Component Learning and S-Curves. Or is there some notion of energy under which this multi-component case of s-shaped loss curves is well described as a *phase transition*, and is that what this post is discussing?

Some important disclaimers:

- I'm not very familiar with the notion of *phase transition* I think this post is using.
- It seems as though there are cases where real *phase transitions* occur during the training of toy models, and there are also *phase transitions* in the final configuration of toy models as various hyperparameters change. (I'm pretty confident that Toy models of superposition has phase transitions both during training and in final configurations; for final configurations there seems to be one with respect to "does superposition occur". There is also this example of a two layer ReLU model; however, I haven't read or understood this example, I'm just trusting the authors' claim that there is a *phase transition* here.)

[1] These attention heads probably do a bunch of stuff which isn't that well described as induction, so I'm reluctant to call them "induction heads".

Great question, thanks. tl;dr: it depends on what you mean by established; the obstacle to establishing such a thing is probably lower than you think.

To clarify the two types of phase transitions involved here, in the terminology of Chen et al:

- *Bayesian phase transition in number of samples:* as discussed in the post you link to in Liam's sequence, where the concentration of the Bayesian posterior shifts suddenly from one region of parameter space to another as the number of samples increases past some critical sample size. There are also Bayesian phase transitions with respect to hyperparameters (such as variations in the true distribution) but those are not what we're talking about here.
- *Dynamical phase transitions:* the "backwards S-shaped loss curve". I don't believe there is an agreed-upon formal definition of what people mean by this kind of phase transition in the deep learning literature, but what we mean by it is that the SGD trajectory is for some time strongly influenced by (e.g. in the neighbourhood of) one critical point and then strongly influenced by another critical point. In the clearest case there are two plateaus, the one with higher loss corresponding to the first critical point and the one with lower loss corresponding to the second. In larger systems there may not be a clear plateau (e.g. in the case of induction heads that you mention) but it may still be reasonable to think of the trajectory as dominated by the critical points.

The former kind of phase transition is a first-order phase transition in the sense of statistical physics, once you relate the posterior to a Boltzmann distribution. The latter is a notion that belongs more to the theory of dynamical systems or potentially catastrophe theory. The link between these two notions is, as you say, not obvious.

However, Singular Learning Theory (SLT) does provide a link, which we explore in Chen et al. SLT says that the phases of Bayesian learning are also dominated by critical points of the loss, and so you can ask whether a given dynamical phase transition has "standing behind it" a Bayesian phase transition where at some critical sample size the posterior shifts from being concentrated near the first critical point to being concentrated near the second.
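To make the idea of a critical sample size concrete, here is a minimal sketch. It assumes the standard leading-order asymptotic from singular learning theory, in which the free energy of a phase at sample size n is approximately n·L + λ·log n, where L is the loss at the critical point and λ is its local learning coefficient. The function names and the numbers mentioned afterwards are hypothetical and chosen only for illustration; none of this is taken from Chen et al.

```python
import math


def free_energy(n, loss, llc):
    """Leading-order free energy of a phase: n * loss + llc * log(n).

    This is an asymptotic approximation (lower-order terms are dropped),
    so it is only meaningful for reasonably large n.
    """
    return n * loss + llc * math.log(n)


def critical_sample_size(loss_a, llc_a, loss_b, llc_b):
    """Sample size at which phase b starts to have lower free energy than phase a.

    Phase a is the higher-loss phase, phase b the lower-loss phase.
    Returns None when this approximation predicts no crossover from a to b.
    """
    d_loss = loss_a - loss_b  # accuracy advantage of phase b
    d_llc = llc_b - llc_a     # complexity cost of phase b
    if d_loss <= 0 or d_llc <= 0:
        # Either b does not fit better, or b is both better and simpler,
        # in which case it is preferred at every n and nothing switches.
        return None

    def gap(n):
        # F_b(n) - F_a(n); positive means phase a is currently preferred.
        return -n * d_loss + d_llc * math.log(n)

    lo = d_llc / d_loss  # the gap is maximal here and decreases afterwards
    if gap(lo) <= 0:
        return None  # phase a is never preferred under this approximation
    hi = 2 * lo
    while gap(hi) > 0:
        hi *= 2
    for _ in range(100):  # bisection on the decreasing branch of the gap
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi
```

With illustrative values such as `critical_sample_size(0.5, 1.0, 0.4, 3.0)`, the higher-loss phase has the lower free energy below the returned sample size and the lower-loss phase takes over above it. If the lower-loss phase does not also have the higher learning coefficient, the function returns `None`, because under this approximation the lower-loss phase would be preferred at every sample size.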

It turns out that, at least for sufficiently large sample sizes, the only real obstruction to this Bayesian phase transition existing is that the local learning coefficient near the second (lower-loss) critical point should be higher than near the first. This will be hard to prove theoretically in non-toy systems, but we can estimate the local learning coefficients, compare them, and thereby provide evidence that a Bayesian phase transition exists.
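As a rough illustration of what estimating a local learning coefficient can look like, the sketch below evaluates an estimator of the form nβ(E[L(w)] − L(w*)) with β = 1/log n, where the expectation is taken over a tempered posterior proportional to exp(−nβ L(w)). This form is an assumption based on published local learning coefficient estimators, not something stated in this thread. In practice the expectation is approximated by sampling (for example with SGLD) around w* in a high-dimensional network, whereas here it is computed by brute-force quadrature for one-dimensional toy losses with a flat prior and w* = 0. All names are hypothetical.

```python
import math


def llc_estimate(loss, n, half_width=2.0, grid=40001):
    """Toy estimate of the local learning coefficient of `loss` at w* = 0.

    Computes n * beta * (E[L(w)] - L(0)) with beta = 1 / log(n), where the
    expectation is over a density proportional to exp(-n * beta * L(w)) on
    the interval [-half_width, half_width], using a simple Riemann sum.
    """
    beta = 1.0 / math.log(n)
    t = n * beta  # effective inverse temperature
    step = 2.0 * half_width / (grid - 1)
    base = loss(0.0)
    numerator = 0.0
    normaliser = 0.0
    for i in range(grid):
        w = -half_width + i * step
        excess = loss(w) - base          # L(w) - L(w*)
        weight = math.exp(-t * excess)   # unnormalised tempered posterior
        numerator += excess * weight
        normaliser += weight
    return t * numerator / normaliser


def quadratic(w):
    return w ** 2  # regular (non-degenerate) minimum


def quartic(w):
    return w ** 4  # flatter, more degenerate minimum
```

Calling `llc_estimate(quadratic, 10_000)` gives a value near 1/2, and `llc_estimate(quartic, 10_000)` gives a value near 1/4, so the flatter critical point has the lower learning coefficient. Comparing two such estimates is the kind of evidence the paragraph above describes, although real networks need sampling-based estimates rather than quadrature.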

This has been done in the Toy Model of Superposition in Chen et al, and we're in the process of looking at a range of larger systems including induction heads. We're not ready to share those results yet, but I would point you to Nina Rimsky and Dmitry Vaintrob's nice post on modular addition which I would say provides evidence for a Bayesian phase transition in that setting.

There are some caveats and details that I can go into if you're interested. I would say the existence of Bayesian phase transitions in non-toy neural networks is not established yet, but at this point I think we can be reasonably confident they exist.

Thanks for the detailed response!

So, to check my understanding:

The toy cases discussed in Multi-Component Learning and S-Curves are clearly *dynamical phase transitions*. (In general, it's easy to establish *dynamical phase transitions* from observation alone. In these cases we can also verify that the property holds for the corresponding differential equations, which are a good model because step size is unimportant.) Also, I speculate it's easy to prove the existence of a *Bayesian phase transition in the number of samples* for these toy cases given how simple they are.
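For readers who want to see such a curve, here is a minimal sketch of a two-component toy in the spirit of (not reproduced from) Multi-Component Learning and S-Curves. It assumes a loss of the form 0.5·(1 − a·b)², so that the useful signal is the product of two small parameters. The function names, learning rate, and initial values are hypothetical choices for illustration.

```python
def train(steps=1000, lr=0.01, init=0.01):
    """Plain gradient descent on L(a, b) = 0.5 * (1 - a * b) ** 2.

    Both components start small, so the gradient on each one is scaled by
    the other and stays tiny at first. Returns the loss before each step.
    """
    a = b = init
    losses = []
    for _ in range(steps):
        residual = 1.0 - a * b
        losses.append(0.5 * residual ** 2)
        grad_a = -residual * b  # dL/da
        grad_b = -residual * a  # dL/db
        a -= lr * grad_a
        b -= lr * grad_b
    return losses


def steepest_step(losses):
    """Index of the step with the largest single-step drop in loss."""
    drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
    return max(range(len(drops)), key=drops.__getitem__)
```

With the default settings, `train()` stays close to the initial loss for a few hundred steps, then drops quickly to near zero, and `steepest_step` locates the middle of that drop. This is the purely observational sense of "s-shaped loss curve"; it says nothing by itself about whether a Bayesian phase transition stands behind it.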

More generally, I wish that when people used the term "phase transition", they clarified whether they meant "s-shaped loss curves" or some more precise notion. Often, people are making a non-mechanistic claim when they say "phase transition" (we observed a loss curve with an s-shape), but there are also mechanistic claims which require additional evidence.

In particular, when citing other work somewhere, it would be nice to clarify what notion of phase transition the other work is discussing.

Thanks Akash. Speaking for myself, I have plenty of experience supervising MSc and PhD students and running an academic research group, but scientific institution building is a next-level problem. I have spent time reading and thinking about it, but it would be great to be connected to people with first-hand experience or who have thought more deeply about it, e.g.

- People with advice on running distributed scientific research groups
- People who have thought about scientific institution building in general (e.g. those with experience starting FROs in biosciences or elsewhere)
- People with experience balancing fundamental research with product development within an institution

I am seeking advice from people within my institution (University of Melbourne) but Timaeus is not a purely academic org and their experience does not cover all the hard parts.

Congratulations on launching!

Added you to the map, and your Discord to the list of communities, which is now a sub-page of aisafety.com.

One question: interpretability might well lead to systems which are powerful enough to be an x-risk long before we have a strong enough understanding to direct a superintelligence, so publish-by-default seems risky. Are you considering adopting a non-publish-by-default policy? I know you talk about capabilities risks in general terms, but is this specific policy on the table?

phase transitions organize learning and that detecting, locating, and understanding these transitions

Only the third ability sounds like it could be useful for alignment. The others seem like they'd be firing all the time, and if you only had that and, for instance, the ability to understand the semantics of a single neuron, then I don't think that gets you very far. Or am I missing something obvious?

I think it is too early to know how many phase transitions there are in e.g. the training of a large language model. If there are many, it seems likely to me that they fall along a spectrum of "scale" and that it will be easier to find the more significant ones than the less significant ones (e.g. we discover transitions like the onset of in-context learning first, because they dramatically change how the whole network computes).

As evidence for that view, I would put forward the fact that putting features into superposition is known to be a phase transition in toy models (based on the original post by Elhage et al and also our work in Chen et al) and therefore seems likely to be a phase transition in larger models as well. That gives an example of phase transitions at the "small" end of the scale. At the "big" end of the scale, the evidence in Olsson et al that induction heads and in-context learning appear in a phase transition seems convincing to me.

On general principles, understanding "small" phase transitions (where the scale is judged relative to the overall size of the system, e.g. number of parameters) is like probing a physical system at small length scales / high energy, and will require more sophisticated tools. So I expect that we'll start by gaining a good understanding of "big" phase transitions and then, as the experimental methodology and theory improve, move down the spectrum towards smaller transitions.

On these grounds I don't expect us to be swamped by the smaller transitions, because they're just hard to see in the first place; the major open problem in my mind is how far we can get down the scale with reasonable amounts of compute. Maybe one way that SLT & developmental interpretability fails to be useful for alignment is if there is a large "gap" in the spectrum, where beyond the "big" phase transitions that are easy to see (and for which you may not need fancy new ideas) there is just a desert / lack of transitions, and all the transitions that matter for alignment are "small" enough that a lot of compute and/or very sophisticated ideas are necessary to study them. We'll see!

Thank you, that was helpful. If I'm getting this right, you think the "big" transitions plausibly correspond to important capability gains. So under that theory, "chain of thought" and "reflection" arose due to big phase transitions in GPT-3 and 4. I think it'd be great if researchers could, if not access training checkpoints of these models, then at least make bids for experiments to be performed on said models.


Timaeus is a new AI safety research organization dedicated to making fundamental breakthroughs in technical AI alignment using deep ideas from mathematics and the sciences. Currently, we are working on singular learning theory and developmental interpretability. Over time we expect to work on a broader research agenda, and to create understanding-based evals informed by our research.

## Activities

Our primary focus is research. For now, we're a remote-first organization. We collaborate primarily through online seminars and the DevInterp Discord, with regular in-person meetings at workshops and conferences (see below). We're also investing time in academic outreach to increase the general capacity for work in technical AI alignment.

## Research

We believe singular learning theory, a mathematical subject founded by Sumio Watanabe, will lead to a better fundamental understanding of large-scale learning machines and the computational structures that they learn to represent. It has already given us concepts like the learning coefficient and insights into phase transitions in Bayesian learning. We expect significant advances in the theory to be possible, and that these advances can inform new tools for alignment.

Developmental interpretability is an approach to understanding the emergence of structure in neural networks, which is informed by singular learning theory but also draws on mechanistic interpretability and ideas from statistical physics and developmental biology. The key idea is that phase transitions organize learning and that detecting, locating, and understanding these transitions could pave a road to evaluation tools that prevent the development of dangerous capabilities, values, and behaviors. We're engaged in a research sprint to test the assumptions of this approach.

We see these as two particularly promising research directions, and they are our focus for now. Like any ambitious research, they are not guaranteed to succeed, but there's plenty more water in the well. Broadly speaking, the research agenda of Timaeus is oriented towards solving problems in technical AI alignment using deep ideas from across many areas of mathematics and the sciences, with a "full stack" approach that integrates work from pure mathematics through to machine learning experiments.

The outputs we have contributed to so far:

## Academic Outreach

AI safety remains bottlenecked on senior researchers and mentorship capacity. The young people already in the field will grow into these roles. However, given the scale and urgency of the problem, we think it is important to open inroads to academia and encourage established scientists to spend their time on AI safety.

Singular learning theory and developmental interpretability can serve as a natural bridge between the emerging discipline of AI alignment and existing disciplines of mathematics and science, including physics and biology. We plan to spend part of our time onboarding scientists into alignment via concrete projects in these areas.

## Conferences

We're organizing conferences, retreats, hackathons, etc. focusing on singular learning theory and developmental interpretability. These have included and will include:

## Team

## Core Team

- Jesse Hoogland
- Stan van Wingerden (Operations & Finances) has an MSc in theoretical physics from the University of Amsterdam and was previously the CTO of an algorithmic trading firm.
- Alexander Gietelink Oldenziel

The research agenda that we are contributing to was established by Daniel Murfet, who is a mathematician at the University of Melbourne and an expert in singular learning theory, algebraic geometry, and mathematical logic.

## Research Assistants

We just concluded a round of hiring and are excited to bring on board several very talented Research Assistants (RAs), starting with

- George Wang

## Friends and Collaborators

Here are some of the people we are actively collaborating with:

- Susan Wei (Statistician, University of Melbourne)
- Calin Lazaroiu (Mathematical physicist, UNED-Madrid and IFIN-HH Bucharest)
- Simon Pepin Lehalleur (Mathematician, Postdoc at University of Amsterdam)
- Tom Burns (Neuroscientist, Postdoc at ICERM, Brown University)
- Matthew Farrugia-Roberts (Research assistant, David Krueger's AI Safety Lab, University of Cambridge)
- Liam Carroll (Independent AI safety researcher)
- Jake Mendel (Research scientist, Apollo Research)
- Zach Furman (Independent AI safety researcher)
- Rumi Salazar (Independent AI safety researcher)

Inclusion on this list does not imply endorsement of Timaeus' views.

## Advisors

We're advised by Evan Hubinger and David ("Davidad") Dalrymple.

## FAQ

## Where can I learn more, and contact you?

Learn more on the Timaeus webpage. You can email Jesse Hoogland.

## What about capabilities risk?

There is a risk that fundamental progress in either singular learning theory or developmental interpretability could contribute to further acceleration in AI capabilities in the medium term. We take this seriously and are seeking advice from other alignment researchers and organizations. By the end of our current research sprint we will have in place institutional forms to help us navigate this risk.

Likewise, there is a risk that outreach which aims to involve more scientists in AI alignment work will also accelerate progress in AI capabilities. However, those of us in academia can already see that as the risks become more visible, scientists are starting to think about these problems on their own. So the question is not whether a broad range of scientists will become interested in alignment but when they will start to contribute and what they work on.

It is part of Timaeus' mission to help scientists to responsibly contribute to technical AI alignment, while minimizing these risks.

## Are phase transitions really the key?

The strongest critique of developmental interpretability we know is the following: while it is established that phase transitions exist in neural network training, it is not yet clear how common they are, and whether they make a good target for alignment.

We think developmental interpretability is a good investment in a world where many of the important structures (e.g., circuits) in neural networks form in phase transitions. Figuring out whether we live in such a world is one of our top priorities. It's not trivial because even if transitions exist they may not necessarily be visible to naive probes. Our approach is to systematically advance the fundamental science of finding and classifying transitions, starting with smaller systems where transitions can be definitively shown to exist.

## How are you funded?

We're funded through a $142k Manifund grant led primarily by Evan Hubinger, Ryan Kidd, Rachel Weinberg, and Marcus Abramovitch. We are fiscally sponsored by Ashgro.

## "Timaeus"? How do I even pronounce that?

Pronounce it however you want.

Timaeus is the eponymous character in the dialogue where Plato introduces his theory of forms. The dialogue posits a correspondence between the elements that make up the world and the Platonic solids. That's wrong, but it contains the germ of the idea of the unreasonable effectiveness of mathematics in understanding the natural world.

We read the Timaeus dialogue with a spirit of hope, in the capacity of the human intellect to understand and solve wicked problems. The narrow gate to human flourishing is preceded by a narrow path.

We'll see you on that path.