LESSWRONG
Useful starting code for interpretability

by eggsyntax
13th Feb 2024
Neel Nanda (2y):

This seems like a useful resource, thanks for making it! I think it would be more useful if you enumerated the different ARENA notebooks. My guess is many readers won't click through to the link, and are more likely to if they see the different names. And IMO the ARENA tutorials are much higher production quality than the other notebooks on that list.

eggsyntax (2y):

That seems reasonable! When I get a minute I'll list out the individual ARENA notebooks and give them more emphasis. (I did personally really like that exploratory analysis demo because of how well it situates the techniques in the context of a concrete problem. Maybe the ARENA version does too; I haven't gone through it.)

[EDIT - done]


Want to try your hand at neural network interpretability? A very nice way to get started is to find an existing Python notebook that uses one or more interpretability techniques, ideally one written with beginners in mind. In a click or two you can make a copy of it, which you can typically run without any modification, and then start tweaking it to look at what you're interested in.

Fortunately, many such notebooks already exist, thanks to helpful members of the interp community! This post is just a list of those, mostly Colab notebooks. I have no personal experience with many of them, but all of them have been recommended by people who know what they're doing. This list will probably be acceptably current through late 2024 or so; after that you should use a more up-to-date resource if one exists (although if one existed now I would have used it instead of writing this one, so there may or may not be one by then).

Suggestions for similarly useful starter notebooks in other areas are very welcome!

  • Notebooks for understanding machine learning (as background): Transformers From Scratch, some other ML technique notebooks, reinforcement learning.

The main list is in no particular order, so there's no need to go top to bottom.

  • @Neel Nanda's exploratory analysis demo for TransformerLens walks you through many of the basic mech interp techniques and is highly recommended; he has others as well.
  • Another intro to mech interp from ARENA, along with several other excellent notebooks reproducing some important mech interp results:
    • Indirect Object Identification
    • Superposition & Sparse Autoencoders
    • Function Vectors & Model Steering
    • OthelloGPT
    • Balanced Bracket Classifier
    • Grokking and Modular Arithmetic
  • Two activation steering notebooks, based on "Steering GPT-2-XL by adding an activation vector" (bonus: several different implementations from @Annah) (extra bonus: quick and dirty representation engineering on Mistral)
  • Developmental interpretability and singular learning theory notebooks, from @Jesse Hoogland.
  • A smallish notebook on using the tuned lens technique (successor to the logit lens).
  • Mech interp on Mamba using nnsight.
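
If you'd like a feel for the core idea behind the activation steering notebooks before opening one, here is a minimal toy sketch in plain Python. It is not taken from any of the notebooks above: the two-layer network, its weights, the inputs, and the function names are all made up for illustration, and the real notebooks do this on a transformer's residual stream (e.g. GPT-2-XL) rather than on a tiny MLP. The idea it shows is just this: take the difference between the hidden activations for two contrasting inputs, then add a scaled copy of that difference to the hidden activations while running some other input.

```python
# Toy illustration of activation steering (activation addition).
# ASSUMPTION: the network, weights, and inputs below are invented for this
# sketch; real notebooks intervene on a transformer's residual stream.

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """Elementwise ReLU."""
    return [max(0.0, x) for x in vector]

# A tiny fixed two-layer network standing in for a real model.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # 2 inputs -> 3 hidden units
W2 = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]    # 3 hidden units -> 2 output logits

def hidden(x):
    """Hidden-layer activations: the place where we read and intervene."""
    return relu(matvec(W1, x))

def forward(x, steering_vector=None, coeff=1.0):
    """Run the network, optionally adding coeff * steering_vector to the hidden layer."""
    h = hidden(x)
    if steering_vector is not None:
        h = [a + coeff * s for a, s in zip(h, steering_vector)]
    return matvec(W2, h)

# 1. Take the difference of hidden activations for two contrasting inputs.
input_a = [1.0, 0.0]
input_b = [0.0, 1.0]
steering_vector = [a - b for a, b in zip(hidden(input_a), hidden(input_b))]

# 2. Run a neutral input with and without the steering vector added.
neutral = [0.5, 0.5]
baseline_logits = forward(neutral)
steered_logits = forward(neutral, steering_vector, coeff=2.0)

print("steering vector:", steering_vector)  # [1.0, -1.0, 1.0]
print("baseline logits:", baseline_logits)  # [0.5, 0.5]
print("steered logits: ", steered_logits)   # [3.5, -2.5]
```

The neutral input gives equal logits on its own, and adding the steering vector pushes the output toward whatever `input_a` favours. In the real notebooks the same addition is typically done with a forward hook at a chosen layer, and the coefficient and layer are things you tune by hand.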

 

Thanks to @Jesse Hoogland and @CallumMcDougall for extremely useful input!