Developmental Interpretability

Developmental interpretability is an AI alignment research agenda studying how structure forms in neural networks.

Devinterp→
Learn more about developmental interpretability and its applications to AI safety.
Conferences→
Check out the 2023 SLT & alignment conferences for the necessary background and plan.

Developmental Interpretability

Developmental interpretability is an alignment research agenda grounded in singular learning theory (SLT), statistical physics, and developmental biology. The aim of developmental interpretability is to build tools for detecting, locating, and interpreting phase transitions that govern training and in-context learning. This has the potential to reduce the alignment tax for existing techniques and inform scalable new methods for interpreting neural networks.

Check out the research agenda.

2023 SLT & alignment summit

The first SLT & alignment summit ("Singularities against the singularity") was run in June 2023. In the first week, we recorded more than 20 hours of lectures on the necessary background, all of which you can find here. In the second week, we started research collaborations on a dozen open problems.

A second summit is planned for November 2023. Stay tuned for more details.

Community

Join our Discord to ask questions, find collaborators, and stay up to date on the latest developments.

