This work was produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with mentorship from Adrià Garriga-Alonso. Equal contribution by Niels and Iván.
Summary
In this post we analyse the circuit evaluation metrics proposed by Wang et al. (IOI paper) and show that their specific formalisation admits counterintuitive examples that do not always reflect the natural-language properties the metrics aim to capture. For example, some incomplete circuits can be made complete by removing components, and there exist faithful, complete and minimal circuits that contain smaller circuits which are also faithful, complete and minimal.
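To make the first claim concrete, here is a minimal toy sketch. It is not an example from Wang et al. or from the rest of this post: the three-node model, the performance table `F`, the threshold `eps` and the helper names are all hypothetical. It assumes the completeness criterion takes the form "for every subset K of the circuit C, |F(C \ K) - F(M \ K)| is small", where M is the full model.

```python
from itertools import combinations

# Hypothetical performance table F over every subset of a toy model {a, b, c}.
# Intuition for the made-up numbers: node a does the task, node c disrupts a,
# and node b cancels c's disruption.
F = {
    frozenset(): 0.0,
    frozenset("a"): 1.0,
    frozenset("b"): 0.0,
    frozenset("c"): 0.0,
    frozenset("ab"): 1.0,
    frozenset("ac"): 0.0,
    frozenset("bc"): 0.0,
    frozenset("abc"): 1.0,
}
MODEL = frozenset("abc")


def subsets(nodes):
    """Yield every subset of `nodes` as a frozenset."""
    items = sorted(nodes)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def incompleteness(circuit, model=MODEL, perf=F):
    """Worst-case |F(C \\ K) - F(M \\ K)| over all K that are subsets of C."""
    return max(abs(perf[circuit - k] - perf[model - k]) for k in subsets(circuit))


def is_complete(circuit, eps=0.1):
    """Assumed criterion: complete if the worst-case gap is below eps."""
    return incompleteness(circuit) < eps


if __name__ == "__main__":
    for name in ("ab", "a"):
        circuit = frozenset(name)
        print(sorted(circuit), incompleteness(circuit), is_complete(circuit))
```

Under these made-up numbers, the circuit {a, b} fails the check because knocking out b leaves F({a}) = 1 while F({a, c}) = 0, whereas the smaller circuit {a} passes for both of its subsets. Removing b from the circuit therefore turns an incomplete circuit into a complete one.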
Introduction
The main goal of Mechanistic Interpretability is to reverse engineer neural networks to make the learned weights into human-interpretable...