Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Arthur Conmy's Automated Circuit Discovery is a great paper that makes initial forays into automating parts of mechanistic interpretability (specifically, automatically finding a sparse subgraph for a circuit). In this three-part series of YouTube videos, I interview him about the paper, and we walk through it and discuss the key results and takeaways. We cover the high-level point of the paper and what researchers should take away from it, the ACDC algorithm and its key nuances, existing baselines and how the authors adapted them to be relevant to circuit discovery, how well the algorithm works, and how you can even evaluate how well an interpretability method works.



Awesome, thanks for all of these videos.