Yes -- by design, the circuits discovered in this manner might miss how or when something is computed. But we argue that identifying the important representations at bottlenecks, and how they change across layers, still provides useful information about the model.
One of our future directions, in the spirit of crosscoders, is "Layer Output Buffer SAEs" that aim to tackle the computation between bottlenecks; a very rough sketch of the idea is below.
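Since the name alone is opaque, here is a purely hypothetical sketch of one way such an SAE could look: a single dictionary reading the stacked outputs of every layer between two bottlenecks, crosscoder-style. All class and variable names are invented for illustration and do not describe an existing implementation:

```python
import torch
import torch.nn as nn

class LayerOutputBufferSAE(nn.Module):
    """Hypothetical sketch: one SAE over the stacked outputs of all
    layers between two bottlenecks, so features can describe the
    computation happening across that span (crosscoder-style)."""
    def __init__(self, n_layers: int, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(n_layers * d_model, d_dict)
        self.decoder = nn.Linear(d_dict, n_layers * d_model)

    def forward(self, layer_outputs: torch.Tensor):
        # layer_outputs: (batch, n_layers, d_model) -> flatten the buffer
        x = layer_outputs.flatten(start_dim=1)
        f = torch.relu(self.encoder(x))  # shared sparse features
        recon = self.decoder(f).view_as(layer_outputs)
        return recon, f
```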
Thanks a lot for this review!
On strengths, we also believe we are the first to examine using only a few SAEs for scalable circuit discovery.
On weaknesses,
A core question shared in the community is whether the idea of circuits remains plausible as models continue to scale up. Current automated methods are either too computationally expensive or generate a subgraph too large to examine. We explore the idea of a few equally spaced SAEs with the goal of addressing both issues (a rough sketch of the layer-selection idea is below). That said, as you mentioned, a more thorough comparison between circuits built with different numbers of SAEs is needed.
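For concreteness, a minimal sketch of what "a few equally spaced SAEs" means in code; the `SparseAutoencoder` class and all names here are illustrative assumptions, not our actual implementation:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Illustrative SAE: overcomplete dictionary with a ReLU bottleneck."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(f), f

def equally_spaced_layers(n_layers: int, n_saes: int) -> list[int]:
    """Pick n_saes layer indices spread evenly over the model depth."""
    step = n_layers / n_saes
    return [int(i * step) for i in range(n_saes)]

# e.g. a 32-layer model probed at 4 bottlenecks -> layers [0, 8, 16, 24]
print(equally_spaced_layers(32, 4))
saes = {l: SparseAutoencoder(d_model=4096, d_dict=32768)
        for l in equally_spaced_layers(32, 4)}
```

The point being that the number of trained SAEs, and hence the size of the discovered subgraph, grows with `n_saes` rather than with model depth.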
Couldn't agree more with this post! I used to be wary of long notebooks, but they are powerful in letting me just think.
That said, when writing a script I tend to use VS Code's `# %%` cell markers to run cells inside the script and test things (example below). My notebooks usually contain a bunch of analysis code that doesn't need to be rerun, but should stay.
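For anyone who hasn't used it: in a plain `.py` file, VS Code's Python extension treats `# %%` comments as cell boundaries you can execute interactively. A tiny illustrative example (the script contents are made up):

```python
# analysis.py -- a normal script, but VS Code treats each "# %%"
# comment as a runnable cell (Shift+Enter executes just that cell).

# %%
import numpy as np

data = np.random.default_rng(0).normal(size=1_000)

# %%
# Scratch cell: run this alone while iterating, without re-executing
# the rest of the script.
print(data.mean(), data.std())
```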
Makes sense, thanks! In that case, we could potentially reduce the width, which (along with a smaller dataset) might help scale SAEs toward understanding mechanisms in big models?
Great work! Is there such a thing as too narrow a dataset? For refusal, what do you think happens if we specifically train on a bunch of examples that show signs of refusal?