Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Announcing the first academic Mechanistic Interpretability workshop, held at ICML 2024! I think this is an exciting development that's a lagging indicator of mech interp gaining legitimacy as an academic field, and a good chance for field building and sharing recent progress! 

We'd love to get papers submitted if any of you have relevant projects! Deadline May 29; submissions can be up to 4 or up to 8 pages. We welcome anything that brings us closer to a principled understanding of model internals, even if it's not "traditional" mech interp. Check out our website for example topics! There's $1750 in best paper prizes. We also welcome less standard submissions, like open source software, models or datasets, negative results, distillations, or position pieces.

And if anyone is attending ICML, you'd be very welcome at the workshop! We have a great speaker line-up: Chris Olah, Jacob Steinhardt, David Bau and Asma Ghandeharioun. And a panel discussion, hands-on tutorial, and social. I’m excited to meet more people into mech interp! And if you know anyone who might be interested in attending/submitting, please pass this on.

Twitter thread | Website

Thanks to my great co-organisers: Fazl Barez, Lawrence Chan, Kayo Yin, Mor Geva, Atticus Geiger and Max Tegmark


Would a tooling paper be appropriate for this workshop?

I wrote a tool that helps ML researchers to analyze the internals of a neural network:

It is not directly research on mechanistic interpretability, but this could be useful for many people working in the field.

Looks relevant to me on a skim! I'd probably want to see some arguments in the submission for why this is useful tooling for mech interp people specifically (though being useful to non-mech-interp people too is a bonus!)

It just seems intuitively like a natural fit: Everyone in mech interp needs to inspect models. This tool makes it easier to inspect models.

Does it need to be more specific than that?

One thing that comes to mind: The tool allows you to categorize different training steps and records them separately, and you can define categories arbitrarily. This can be used to compare what the network does internally in two different scenarios of interest. E.g. the categories could be "the race of the character in the story" or some other real-life condition you would want to know the impact of.

The tool will then let you quickly compare KPIs of tensors all across the network for these categories. It's less about testing a specific hypothesis and more about quickly getting an overview and intuition, and finding anomalies.
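To make the idea concrete, here is a minimal sketch of category-conditioned activation comparison using plain PyTorch forward hooks. This is a hypothetical illustration of the general technique, not the tool's actual API; the model, category names, and the choice of mean activation as the "KPI" are all stand-ins.

```python
# Hypothetical sketch: record a simple per-layer statistic ("KPI") under
# user-defined categories, then compare the categories side by side.
from collections import defaultdict

import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model; in practice this would be the network under study.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# stats[category][layer_name] -> list of mean activations, one per batch
stats = defaultdict(lambda: defaultdict(list))
current_category = None  # set before each forward pass


def make_hook(name):
    def hook(module, inputs, output):
        # Mean activation as an example KPI; any tensor statistic works.
        stats[current_category][name].append(output.detach().mean().item())
    return hook


for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_hook(name))

# Run the model under two arbitrary categories of interest
# (here faked by scaling the input distribution).
for category, scale in [("condition_a", 1.0), ("condition_b", 3.0)]:
    current_category = category
    model(scale * torch.randn(32, 8))

# Compare the recorded KPIs across categories, layer by layer.
for name in stats["condition_a"]:
    a = sum(stats["condition_a"][name]) / len(stats["condition_a"][name])
    b = sum(stats["condition_b"][name]) / len(stats["condition_b"][name])
    print(f"layer {name}: condition_a={a:.3f} condition_b={b:.3f}")
```

The point of the pattern is that the categories are arbitrary labels attached at call time, so the same hooks can compare any two real-world conditions without changing the model code.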

Makes sense! Sounds like a fairly good fit

> It just seems intuitively like a natural fit: Everyone in mech interp needs to inspect models. This tool makes it easier to inspect models.

Another way of framing it: Try to write your paper in such a way that a mech interp researcher reading it says "huh, I want to go and use this library for my research". Eg give examples of things that were previously hard that are now easy.

I'm looking for other tools to contrast it with and found TransformerLens. Are there any other tools it would make sense to compare it to?

nnsight, pyvene, inseq, and TorchLens are other libraries that come to mind; it would be good to discuss them in a related work section. penzai (in JAX) is also worth a look.