Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I'm excited about trying different formats for mechanistic interpretability education! I've made a video walkthrough where we replicate my paper, Progress Measures for Grokking via Mechanistic Interpretability. With Jess Smith, one of my co-authors, we record ourselves coding a replication and discussed what we did at each step. This is a three part walkthrough and you can see the accompanying code for the walkthrough here:

  • In part 1, we train a model to perform modular addition, and see that it does grok!
  • In part 2, we take this model and reverse-engineer the trig-based circuit it's learned to do modular addition. We show that you can both read out intermediate steps of the circuit from the activations, and that you can just read off some of the algorithm's steps from the model weights.
  • In part 3, we define some progress measures that let us distinguish progress towards the generalising and the memorising algorithm. We then look at the model during training and watch how the circuits develop, and use this to understand why it groks.

This is an experiment with a new format, and I'd love to hear about how useful you find it!

New Comment