Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Timaeus was announced in late October 2023, with the mission of making fundamental breakthroughs in technical AI alignment using deep ideas from mathematics and the sciences. This is our first progress update.

In service of the mission, our first priority has been to support and contribute to ongoing work in Singular Learning Theory (SLT) and developmental interpretability, with the aim of laying theoretical and empirical foundations for a science of deep learning and neural network interpretability. 

Our main uncertainties in this research were:

  • Is SLT useful in deep learning? While SLT is mathematically established, it was not clear whether the central quantities of SLT could be estimated at sufficient scale, or whether SLT's predictions actually held for realistic models (especially language models).
  • Does structure in neural networks form in phase transitions? The idea of developmental interpretability was to view phase transitions as a core primitive in the study of internal structure in neural networks. However, it was not clear how common phase transitions are, and whether we can detect them.

The research Timaeus has conducted over the past four months, in collaboration with Daniel Murfet's group at the University of Melbourne and several independent AI safety researchers, has significantly reduced these uncertainties, as we explain below. As a result we are now substantially more confident in the research agenda.

While we view this fundamental work in deep learning science and interpretability as critical, the mission of Timaeus is to make fundamental contributions to alignment, and these investments in basic science are to be judged relative to that end goal. We are impatient about making direct contact between these ideas and central problems in alignment. This impatience has resulted in the research directions outlined at the end of this post.

Contributions

Timaeus conducts two main activities: (1) research, that is, developing new tools for interpretability, mechanistic anomaly detection, etc., and (2) outreach, that is, introducing and advocating for these techniques to other researchers and organizations.

Research Contributions

What we learned

Regarding the question of whether SLT is useful in deep learning, we have learned that:

  • The Local Learning Coefficient (LLC) can be accurately estimated in deep linear networks at scales up to 100M parameters (see Lau et al. 2023 and Furman and Lau 2024). We think there is a good chance that LLC estimation can be scaled usefully to much larger (nonlinear) models.
  • Changes in the LLC predict phase transitions in toy models (Chen et al. 2023), deep linear networks (Furman and Lau 2024), and (language) transformers at scales up to 3M parameters (Hoogland et al. 2024).
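
To make the LLC less abstract, here is a minimal sketch of the kind of estimator involved. It assumes the form λ̂ = nβ (E[L(w)] − L(w*)) with β = 1/log n, where the expectation is taken over a posterior localized around w* by a quadratic penalty of strength γ; this is our reading of the estimator in Lau et al. 2023 and should be checked against the paper. The sketch swaps in full-gradient Langevin dynamics on analytic toy losses in place of minibatch SGLD on a real network, and every name in it (`estimate_llc`, `quadratic_loss`, `quartic_loss`, and so on) is hypothetical and is not the devinterp package's API.

```python
import math
import random


def estimate_llc(loss, grad, w_star, n, gamma=1.0, step=1e-4,
                 n_chains=4, n_burn=1000, n_draws=20000, seed=0):
    """Toy LLC estimate: n * beta * (E[L(w)] - L(w_star)), beta = 1 / log(n).

    Samples w from a density proportional to
        exp(-n * beta * L(w) - gamma / 2 * |w - w_star|^2)
    using unadjusted Langevin dynamics with full gradients (an assumption;
    real networks would use minibatch gradients, i.e. SGLD).
    """
    rng = random.Random(seed)
    beta = 1.0 / math.log(n)
    noise_scale = math.sqrt(step)
    total, count = 0.0, 0
    for _ in range(n_chains):
        w = list(w_star)  # every chain starts at the point being probed
        for t in range(n_burn + n_draws):
            g = grad(w)
            # Langevin step: half a step along the log-density gradient + noise.
            w = [
                wi + 0.5 * step * (-n * beta * gi - gamma * (wi - ci))
                + noise_scale * rng.gauss(0.0, 1.0)
                for wi, gi, ci in zip(w, g, w_star)
            ]
            if t >= n_burn:
                total += loss(w)
                count += 1
    return n * beta * (total / count - loss(w_star))


# Regular minimum: L(w) = |w|^2 / 2 in d dimensions, theoretical value d / 2.
def quadratic_loss(w):
    return 0.5 * sum(x * x for x in w)


def quadratic_grad(w):
    return list(w)


# Degenerate (singular) minimum: L(w) = w^4 in 1 dimension, theoretical value 1 / 4.
def quartic_loss(w):
    return sum(x ** 4 for x in w)


def quartic_grad(w):
    return [4.0 * x ** 3 for x in w]


if __name__ == "__main__":
    regular = estimate_llc(quadratic_loss, quadratic_grad, [0.0] * 4, n=10_000)
    singular = estimate_llc(quartic_loss, quartic_grad, [0.0], n=10_000, step=1e-3)
    print(f"quadratic, d=4 (theory 2.00): {regular:.2f}")
    print(f"quartic,   d=1 (theory 0.25): {singular:.2f}")
```

The two toy losses are chosen because their learning coefficients are known in closed form: a nondegenerate quadratic minimum gives d/2, while the flatter quartic minimum gives a smaller value than the 1/2 a one-dimensional quadratic would. The step sizes and chain lengths are illustrative choices for these toy losses, not recommended hyperparameters.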

Regarding whether structure in neural networks forms in phase transitions, we have learned that:

  • This is true in the toy model of superposition introduced by Elhage et al., where, in a special case, Chen et al. (2023) analyzed the critical points and the transitions between them in detail. The LLC detects these transitions, which appear to be consistent with the kind of phase transition defined mathematically in SLT.
  • Learning is organized around discrete developmental stages. By looking for SLT-predicted changes in structure and geometry, we successfully discovered new "hidden" transitions (Hoogland et al. 2024) in a 3M parameter language transformer and a 50k parameter linear regression transformer. While these transitions are consistent with phase transitions in the Bayesian sense in SLT, they are not necessarily fast when measured in SGD steps (that is, the relevant developmental timescale is not steps). For this reason we have started following the developmental biologists in referring to the phenomena as "developmental stages".
  • Trajectory PCA (aka "essential dynamics") can be used to discover emergent behaviors in small language models (Hoogland et al. 2024). Combined with LLC estimation, this provides some evidence that developmental interpretability can lead to methods for automatically eliciting capabilities rather than manually testing hand-picked capabilities. 

Though we do not predict that all structure forms in SLT-defined developmental stages, we now expect that enough important structure forms in discrete stages for developmental interpretability to be a viable perspective in interpretability. 

Papers

The aforementioned work has been undertaken by members of the Timaeus core team (Jesse Hoogland), Timaeus RAs (George Wang and Zach Furman), a group at the University of Melbourne, and independent AI safety researchers:

Note that the first two papers above were mostly/entirely completed before Timaeus was launched in October. 

Outreach Contributions

Posts

Posts authored by Timaeus core team members Jesse Hoogland and Stan van Wingerden:

We've developed other resources such as a list of learning materials for SLT and a list of open problems in DevInterp & SLT. Our collaborators distilled Chen et al. (2023) in Growth and Form in a Toy Model of Superposition. Also of interest (but not written or supported by Timaeus) is Joar Skalse’s criticism of Singular Learning Theory with extensive discussion in the comments.

Code

We published a devinterp repo and Python package for using the techniques we've introduced. This is supported by documentation and introductory notebooks to help newcomers get started.

Talks

We've given talks at DeepMind, Anthropic, OpenAI, 80,000 Hours, Constellation, MATS, the Topos Institute, Carnegie Mellon University, Monash University, and the University of Melbourne. We have planned talks with FAR, the Tokyo Technical AI Safety Conference, and the Foresight Institute.

Events

  • Workshop on SLT (program here): a 4-day workshop on SLT organized with collaborators at the University of Amsterdam. (This was hosted in September, before the official Timaeus announcement in late October.)
  • DevInterp & SLT Hackathon: a 2-day hackathon organized together with AI Safety Melbourne and Responsible AI Development, hosted in early October (before the announcement). This was followed by a Demo Day in November for people to showcase work-in-progress on SLT & DevInterp.
  • Developmental Interpretability Conference: a 7-day conference on SLT & DevInterp in Oxford, bringing in academics from Spain, the Netherlands, Australia and Japan (talks here). This was followed by a 2-day program for senior researchers, whom we introduced to long-term open problems in a subfield of our research agenda.

Organization

Our team currently consists of six people: three core team members, two research assistants, and one research lead. We continue to collaborate closely with Daniel Murfet's research group at the University of Melbourne.

What we need

  • Senior Researchers. We're currently bottlenecked on senior researchers who can set research directions and lead projects.
  • Compute. More compute will enable us to iterate faster and to scale to larger models (our aim is to analyze GPT-2 using DevInterp tools by the end of the year). 
  • Funding. We are still fundraising for this year (current runway until October), and we are currently paying well below standard rates within AI safety research. Our team is willing to take a hit because we believe in the mission, but long-term this will lead to us losing out on excellent researchers. 

What's next 

Six months ago, we first put our minds to the question of what SLT could have to say about alignment. This led to developmental interpretability. But this was always a first step, not the end goal. Having now validated the basic premises of this research agenda, we are starting to work on additional points of contact with alignment.

Our current research priorities:

  • Developmental interpretability is an application of SLT for studying the internal structure of NNs. We expect the next generations of tools to go beyond detecting when circuits form to locating where they form, identifying what they mean, and relating how they're connected, i.e., automated circuit discovery and analysis. Through 2024 this will remain where the bulk of our research effort is spent, and we expect by mid-year to have some idea of whether the more ambitious goals of automated circuit discovery and analysis are feasible.
  • Structural generalization is a new research direction that has come out of SLT for studying out-of-distribution generalization, in a way that allows us to talk formally about the internal structures that are involved in how predictions change as the input distribution changes. Unlike developmental interpretability (which has so far been primarily empirical) this will require some nontrivial theoretical progress in SLT. We expect this to lead eventually to new tools for mechanistic anomaly detection.
  • Geometry of program synthesis is a new research direction that has come out of SLT for studying inductive biases. We expect this to lead to a better theoretical understanding of phenomena including goal-directedness and power-seeking. 

We will provide an outline of this new research agenda (as we did with our original post announcing developmental interpretability) in the coming months. Structural generalization and geometry of program synthesis are ambitious multi-year research programs in their own right. However, we will move aggressively to empirically validate the theoretical ideas, maintain tight feedback loops between empirical and theoretical progress, and publish incremental progress on a regular basis. Our capacity to live up to these principles has been demonstrated in the developmental interpretability agenda over the past six months.

We will have a next round of updates and results to report in late May. 

To stay up to date, join the DevInterp Discord. If you're interested in contributing or learning more, don't hesitate to reach out.

  1. George joined this project later and is thus not yet included on the current preprint version.

Comments

That's a lot of things done, congratulations!

I would love to see something like Vanessa's LTA reading list but for devinterp.

You can find a v0 of an SLT/devinterp reading list here. Expect an updated reading list soon (which we will cross-post to LW). 


(Moderation note: added to the Alignment Forum from LessWrong.)

Generalization, from thermodynamics to statistical physics (Hoogland 2023) — A review of and introduction to generalization theory from classical to contemporary. 

FYI this link is broken
