Want to try your hand at neural network interpretability? A nice way to get started is to find an existing Python notebook that uses one or more interpretability techniques, ideally one written with beginners in mind. In a click or two you can make a copy, which you can typically run without any modification, and then start tweaking it to look at what you're interested in.

Fortunately, many such notebooks already exist, thanks to helpful members of the interpretability community! This post is simply a list of them, mostly Colab notebooks. I have no personal experience with many of them, but all of them have been recommended by people who know what they're doing. This list will probably stay acceptably current through late 2024 or so; after that, you should use a more up-to-date resource if one exists (though if one existed now I would have used it instead of writing this one, so there may or may not be another one then).

Suggestions for similarly useful starter notebooks in other areas are extremely welcome!

The main list is in no particular order, so there's no need to go top to bottom.


Thanks to @Jesse Hoogland and @CallumMcDougall for extremely useful input! 

2 comments

This seems like a useful resource, thanks for making it! I think it would be more useful if you enumerated the different ARENA notebooks; my guess is that many readers won't click through to the link, and they're more likely to if they see the individual names. And IMO the ARENA tutorials are much higher production quality than the other notebooks on that list.

That seems reasonable! When I get a minute I'll list out the individual ARENA notebooks and give them more emphasis. (I did personally really like that exploratory analysis demo because of how well it situates the techniques in the context of a concrete problem. Maybe the ARENA version does too; I haven't gone through it.)

[EDIT - done]