Many interpretability researchers (ourselves included) believe that neural networks store knowledge in superposition—that is, networks encode more facts than they have individual components. A natural extension of this idea is that networks also perform computation on knowledge that lives in superposition. Despite the centrality of this concept, there are few...
Many thanks to Michael Hanna and Joshua Batson for useful feedback and discussion. Kat Dearstyne and Kamal Maher conducted experiments during the SPAR Fall 2025 Cohort. TL;DR: Cross-layer transcoders (CLTs) enable circuit tracing that can extract high-level mechanistic explanations for arbitrary prompts and are emerging as general-purpose infrastructure for mechanistic...
Zephaniah Roe (mentee) and Rick Goldstein (mentor) conducted these experiments during continued work following the SPAR Spring 2025 cohort. Disclaimer / Epistemic status: We spent roughly 30 hours on this post. We are not confident in these findings, but we think they are interesting and worth sharing. We assume some...
Why this post: I’ve been doing MI research full-time for about three months. Since my current grant is ending soon and I recently received a Lightspeed rejection, now seems like a good time to step back from object-level work and reflect on direction and next steps. I...
1) Introduction: In February, Stephen Casper posted two Mechanistic Interpretability challenges. The first of these challenges asks participants to uncover a secret labeling function from a trained CNN and was solved by Stefan Heimersheim and Marius Hobbhahn. The second of these challenges, which will be the focus of this post,...
Hey Everyone! I recently left my FAANG job to split my time between doing Alignment Research (70%) and investigating start-up ideas (30%). If I decide to fully commit to Alignment Research, what is the best way to go about applying for and/or getting funding? (In a perfect world, this funding...