How To Become A Mechanistic Interpretability Researcher
Last updated Sept 2 2025

Note: if you want to pursue a career in this kind of research, apply to my MATS stream! Due Dec 23

TL;DR

* This post is about the mindset and process I recommend if you want to do mechanistic interpretability research. I aim to give a clear sense of direction, so I give opinionated advice and concrete recommendations.
* Mech interp is high-leverage, impactful, and learnable on your own with short feedback loops and modest compute.
* Learn the minimum viable basics, then do research. Mech interp is an empirical science.
* Three stages:
    * Learn the ropes (≤1 month): learn the essentials, go breadth-first.
    * Learn with research mini-projects: practice basic research skills with 1-5 day mini-projects, and focus on the skills with fast feedback loops.
    * Work up to full projects: do 1-2 week research sprints and continue the best ones. Explore deeper skills and the mindset of a great researcher.
* Stage 1: Learning the Ropes
    * Breadth over depth; get a good baseline, not perfection.
    * Learn the basics: code a transformer from scratch, key mech interp techniques, the landscape of the field, linear algebra intuitions, how to write mech interp code (ARENA is your friend).
    * Get your hands dirty: do not just read things. Mech interp is a fundamentally empirical science.
    * Move on after a month. Don’t expect to feel “done” or to have covered all of the ropes; learn more when needed. You won’t stumble across great research insights without starting to do something real.
    * Use LLMs extensively. They’re not perfect, but they are better at mech interp than you right now! They’re a crucial learning tool (when used right!).
* Unpacking the research process:
    * Many skills; categorise them by their feedback loops.
        * Fast skills (minutes-hours), like writing, running, and debugging experiments
        * Slow (weeks), like how to prioritise and when to pivot
        * Very slow (months), like generating good research ideas
    * Do not try to learn all skills at once. Focus o
A lot, because the only thing that is recurrent is the text/vector CoT. The residual stream is very rich, but the number of sequential steps of computation in a single forward pass is bounded by the number of layers, and there is no recurrence that sends intermediate information back to the beginning.
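
To make the point about bounded sequential depth concrete, here is a toy sketch. It is not a transformer: `forward_pass` and `generate_with_cot` are hypothetical names, and each "layer" is assumed to perform exactly one unit of sequential work (here, adding 1). The only thing it illustrates is that feeding each pass's output back in as the next input, as happens with emitted CoT tokens, multiplies the number of sequential steps available.

```python
def forward_pass(x: int, n_layers: int) -> int:
    """Toy 'forward pass': each layer performs one sequential step (+1).

    Assumption for illustration only: one layer == one unit of serial work.
    """
    for _ in range(n_layers):
        x = x + 1
    return x


def generate_with_cot(x: int, n_layers: int, n_tokens: int) -> int:
    """Run several forward passes, feeding each output back in as the next input.

    This stands in for chain-of-thought: the emitted token is the only
    channel that carries a late-layer result back to the first layer.
    """
    for _ in range(n_tokens):
        x = forward_pass(x, n_layers)
    return x


# A single pass can do at most n_layers sequential steps.
single_pass_steps = forward_pass(0, n_layers=4)

# With the output fed back in, the budget grows to n_layers * n_tokens.
chained_steps = generate_with_cot(0, n_layers=4, n_tokens=3)

print(single_pass_steps, chained_steps)
```

In this toy, one pass with 4 layers performs 4 sequential steps, while 3 chained passes perform 12. The real picture is richer, since each pass also reads earlier positions through attention, but the serial-depth bound per pass is the part the comment above is pointing at.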