You are correct that the current method only gives a set of features at each selected layer. The edges are intended to show the direction of attention flow within the architecture. We have updated it to make this clearer and fixed some small issues.
We think there are a few reasons why the results of the ACDC paper do not transfer to our domain:
ACDC and EAP (Syed et al.) rely on overlap with a manual circuit as their metric, whereas we rely on faithfulness and completeness. Because the metrics are different, the comparison isn’t apples-to-apples.
The major difference between methods, as you mentioned, is that we are finding circuits in the SAE basis. This quite
Yes -- by design, circuits discovered this way may miss how or when something is computed. But we argue that finding the important representations at bottlenecks, and how they change across layers, provides useful information about the model.
One of our future directions, in the spirit of crosscoders, is "Layer Output Buffer SAEs" that aim to capture the computation between bottlenecks.
On strengths, we also believe we are the first to examine "few SAEs" for scalable circuit discovery.
On weaknesses: while we plan to do a more thorough sweep of SAE placements and comparisons, the first weakness remains true for this post.
Our major argument in support of using few SAEs is to view them as interpretable bottlenecks. Because they are so minimal and interpretable, they let us understand the blocks of the transformer between them functionally (in terms of inputs and outputs). We were going to include more intuition about this but worried it might add unnecessary complication. We mention the fact about the residual stream to highlight that
Scaling SAE Circuits to Large Models: By placing sparse autoencoders only in the residual stream at intervals, we find circuits in models as large as Gemma 9B without requiring SAEs to be trained for every transformer layer.
Finding Circuits: We develop an improved circuit-finding algorithm. Our method optimizes a binary mask over SAE latents, which proves significantly more effective than existing thresholding-based methods such as Attribution Patching or Integrated Gradients.
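To illustrate the general idea of mask optimization (not the authors' actual implementation), here is a minimal plain-numpy sketch. It assumes a toy setting where each latent contributes a fixed scalar effect to the output; we learn a relaxed binary mask (sigmoid of logits) that keeps the masked output faithful to the full output while a sparsity penalty prunes unimportant latents. All numbers and names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical per-latent effects on the model's output.
effects = np.array([5.0, 4.0, 0.1, 0.05, 0.02])
target = effects.sum()   # output of the unmasked model
lam = 0.05               # sparsity penalty weight
logits = np.zeros_like(effects)

lr = 0.5
for _ in range(2000):
    m = sigmoid(logits)
    residual = (m * effects).sum() - target   # faithfulness error
    # d(loss)/d(m_i) for loss = residual^2 + lam * sum(m)
    grad_m = 2.0 * residual * effects + lam
    logits -= lr * grad_m * m * (1.0 - m)     # chain rule through sigmoid

mask = sigmoid(logits)
print(np.round(mask, 2))
```

In this toy run, the two high-effect latents end up with mask values near 1 while the low-effect latents are driven toward 0: a latent survives only if the faithfulness cost of dropping it outweighs the sparsity penalty of keeping it. In practice one would anneal the relaxation (or use a straight-through estimator) to reach a truly binary mask.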
Can't agree more with this post! I used to be afraid of long notebooks, but they are powerful in letting me just think.
That said, while creating a script I tend to use VS Code's "#%%" cells to run parts of the script and test things. My notebooks usually contain a bunch of analysis code that doesn't need to be run, but should stay.
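For anyone unfamiliar with the feature: VS Code (and similar editors) treat a `# %%` comment in an ordinary `.py` file as a cell delimiter, so the same script can be run top-to-bottom or cell-by-cell in an interactive window. A minimal sketch:

```python
# %% setup cell: runs as part of the ordinary script,
# but can also be executed on its own in the interactive window
data = [1, 2, 3, 4]

# %% analysis cell: kept in the script for reference,
# even if it is usually only run interactively during exploration
total = sum(data)
print(total)  # → 10
```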
Makes sense, thanks! In that case, could we potentially reduce the width, which might (along with a smaller dataset) help scale SAEs to understanding mechanisms in big models?
Great work! Is there such a thing as too narrow a dataset? For refusal, what do you think happens if we specifically train on a bunch of examples that show signs of refusal?