Stan van Wingerden

Towards Developmental Interpretability

Developmental interpretability is a research agenda that has grown out of a meeting of the Singular Learning Theory (SLT) and AI alignment communities. To mark the completion of the first SLT & AI alignment summit we have prepared this document as an outline of the key ideas.

As the name suggests, developmental interpretability (or "devinterp") is inspired by recent progress in the field of mechanistic interpretability, specifically work on phase transitions in neural networks and their relation to internal structure. Our two main motivating examples are the work by Olsson et al. on In-context Learning and Induction Heads and the work by Elhage et al. on Toy Models of Superposition. Developmental interpretability studies how structure incrementally emerges through phase transitions during training.

Mechanistic interpretability emphasizes features and circuits as the fundamental units of analysis and usually aims at understanding a fully trained neural network. In contrast, developmental interpretability:

* is organized around phases and phase transitions as defined mathematically in SLT, and
* aims at an incremental understanding of the development of internal structure in neural networks, one phase transition at a time.

The hope is that an understanding of phase transitions, integrated over the course of training, will provide a new way of looking at the computational and logical structure of the final trained network.

We term this developmental interpretability because of the parallel with developmental biology, which aims to understand the final state of a different class of complex self-assembling systems (living organisms) by analyzing the key steps in development from an embryonic state.[1]

In the rest of this post, we explain why we focus on phase transitions, the relevance of SLT, and how we see developmental interpretability contributing to AI alignment.

> Thank you to @DanielFilan, @bilalchughtai, @Liam Carroll for reviewing early drafts of this

194 · Jul 12, 2023

Timaeus in 2024

> TLDR: We made substantial progress in 2024:
>
> * We published a series of papers that verify key predictions of Singular Learning Theory (SLT) [1, 2, 3, 4, 5, 6].
> * We scaled key SLT-derived techniques to models with billions of parameters, eliminating our main concerns around...

Feb 20, 2025 · 100

Timaeus is hiring researchers & engineers

TLDR: We're hiring for research & engineering roles across different levels of seniority. Hires will work on applications of singular learning theory to alignment, including developmental interpretability.

About Us

Timaeus' mission is to empower humanity by making breakthrough scientific progress on alignment. Our research focuses on applications of singular learning...

Jan 17, 2025 · 65

Stan van Wingerden's Shortform

Dec 11, 2024 · 5

Timaeus is hiring!

TLDR: We’re hiring two research assistants to work on advancing developmental interpretability and other applications of singular learning theory to alignment.

About Us

Timaeus’s mission is to empower humanity by making breakthrough scientific progress on alignment. Our research focuses on applications of singular learning theory to foundational problems within alignment,...

Jul 12, 2024 · 67

Timaeus's First Four Months

Timaeus was announced in late October 2023, with the mission of making fundamental breakthroughs in technical AI alignment using deep ideas from mathematics and the sciences. This is our first progress update. In service of the mission, our first priority has been to support and contribute to ongoing work in...

Feb 28, 2024 · 173

Announcing Timaeus

Timaeus is a new AI safety research organization dedicated to making fundamental breakthroughs in technical AI alignment using deep ideas from mathematics and the sciences. Currently, we are working on singular learning theory and developmental interpretability. Over time we expect to work on a broader research agenda, and to create...

Oct 22, 2023 · 188

You’re Measuring Model Complexity Wrong

TLDR: We explain why you should care about model complexity, why the local learning coefficient is arguably the correct measure of model complexity, and how to estimate its value. In particular, we review a new set of estimation techniques introduced by Lau et al. (2023). These techniques are foundational to...

Oct 11, 2023 · 93
