Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

New paper walkthrough: In-Context Learning and Induction Heads. This is the second paper in Anthropic's Transformer Circuits thread, a series of papers trying to reverse engineer transformer language models. I read through it with Charles Frye (from Full-Stack Deep Learning), and we discuss the paper and give our takes and intuitions. See the original paper and a Twitter thread of my paper takeaways.

This is pitched so that it's hopefully accessible to people who haven't read the paper (very interested in feedback on this!), but I expect you'll get more out of it if you understand transformers, and especially if you've read A Mathematical Framework. We only got partway through the paper, so there's a more in-the-weeds Part 2 in the works where we finish it off - let me know if you're interested in seeing it!

Disclaimer: I worked on this paper, along with Catherine Olsson, Nelson Elhage and Chris Olah, when I was at Anthropic, but I have since left, and the takes in this video are purely my own!

If you find this useful, check out my previous walkthroughs: A Mathematical Framework for Transformer Circuits and Interpretability in the Wild.

And I'd be excited to see other researchers do these kinds of walkthroughs! The effort-to-usefulness ratio is way better than writing papers (and to my tastes, it's much more fun!)
