Yes, by design, the circuits discovered in this manner might miss how or when something is computed. But we argue that finding the important representations at bottlenecks, and how they change across layers, can provide useful information about the model.
One of our future directions, in the spirit of crosscoders, is to have "Layer Output Buffer SAEs" that aim to tackle the computation between bottlenecks.
Thanks a lot for this review!
On strengths, we also believe that we are the first to examine “few SAEs” for scalable circuit discovery.
On weaknesses,
[This is an interim report and continuation of the work from the research sprint done in MATS winter 7 (Neel Nanda's Training Phase)]
Try out binary masking for a few residual SAEs in this Colab notebook: [Github Notebook] [Colab Notebook]
We propose a novel approach to:
Our discovered circuits paint a clear picture of how Gemma does a given task, with one...
Can't agree more with this post! I used to be afraid of long notebooks but they are powerful in allowing me to just think.
Although, when writing a script, I tend to use VS Code's "#%%" cell markers to run cells inside the script and test things. My notebooks usually contain a bunch of analysis code that doesn't need to be run but should stay.
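For anyone who hasn't used these markers, here is a minimal sketch of what such a script can look like, assuming the VS Code Python/Jupyter extensions are installed. The variable names and values are made up purely for illustration.

```python
# %% Set up some data (hypothetical example values)
values = [3, 1, 4, 1, 5]

# %% Quick check: run just this cell interactively while developing
total = sum(values)
mean = total / len(values)
print(mean)
```

Each "# %%" line starts a new cell that can be run on its own in the interactive window, while the file as a whole stays an ordinary Python script.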
Makes sense! Thanks! In that case, we can potentially reduce the width, which might (along with a smaller dataset) help scale SAEs to understanding mechanisms in big models?
Great work! Is there such a thing as too narrow a dataset? For refusal, what do you think happens if we specifically train on a bunch of examples that show signs of refusal?
- You are correct that the current method will only give a set of features at each selected layer. The edges are intended to show the attention direction within the architecture. We updated it to make it more clear and fix some small issues.
- We think there are a few reasons why the results of the ACDC paper do not transfer to our domain:
- ACDC and EAP (Syed et al.) rely on overlap with a manual circuit as their metric, whereas we rely on faithfulness and completeness. Because the metrics are different, the comparison isn’t apples-to-apples.
- The major difference
...