You are correct that the current method only gives a set of features at each selected layer. The edges are intended to show the direction of attention flow within the architecture. We have updated it to make this clearer and fixed some small issues.
We think there are a few reasons why the results of the ACDC paper do not transfer to our domain:
ACDC and EAP (Syed et al.) rely on overlap with a manual circuit as their metric, whereas we rely on faithfulness and completeness. Because the metrics are different, the comparison isn’t apples-to-apples.
The major difference between methods, as you mentioned, is that we are finding circuits in the SAE basis. This quite
Yes -- by design, circuits discovered this way may miss how or when something is computed. But we argue that finding the important representations at bottlenecks, and how they change across layers, provides useful information about the model.
One of our future directions, in the spirit of crosscoders, is "Layer Output Buffer SAEs" that aim to capture the computation between bottlenecks.
On strengths, we also believe we are the first to examine "few SAEs" for scalable circuit discovery.
On weaknesses: while we plan to do a more thorough sweep of SAE placements and comparisons, the first weakness remains true for this post.
Our major argument in support of using few SAEs is to view them as interpretable bottlenecks. Because they are so minimal and interpretable, they let us understand the blocks of the transformer between them functionally (in terms of inputs and outputs). We were going to include more intuition about this but worried it might add unnecessary complication. We mention the fact about the residual stream to highlight that
Scaling SAE Circuits to Large Models: By placing sparse autoencoders only in the residual stream at intervals, we find circuits in models as large as Gemma 9B without requiring SAEs to be trained for every transformer layer.
Finding Circuits: We develop an improved circuit-finding algorithm. Our method optimizes a binary mask over SAE latents, which proves significantly more effective than existing thresholding-based methods such as Attribution Patching or Integrated Gradients.
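To illustrate the general idea of mask optimization (not the authors' actual implementation), here is a minimal plain-numpy sketch. It assumes a toy setting where each latent contributes a fixed scalar effect to the output; we learn a relaxed binary mask (sigmoid of logits) that keeps the masked output faithful to the full output while a sparsity penalty prunes unimportant latents. All numbers and names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical per-latent effects on the model's output.
effects = np.array([5.0, 4.0, 0.1, 0.05, 0.02])
target = effects.sum()   # output of the unmasked model
lam = 0.05               # sparsity penalty weight
logits = np.zeros_like(effects)

lr = 0.5
for _ in range(2000):
    m = sigmoid(logits)
    residual = (m * effects).sum() - target   # faithfulness error
    # d(loss)/d(m_i) for loss = residual^2 + lam * sum(m)
    grad_m = 2.0 * residual * effects + lam
    logits -= lr * grad_m * m * (1.0 - m)     # chain rule through sigmoid

mask = sigmoid(logits)
print(np.round(mask, 2))
```

In this toy run, the two high-effect latents end up with mask values near 1 while the low-effect latents are driven toward 0: a latent survives only if the faithfulness cost of dropping it outweighs the sparsity penalty of keeping it. In practice one would anneal the relaxation (or use a straight-through estimator) to reach a truly binary mask.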
Can't agree more with this post! I used to be afraid of long notebooks, but they are powerful in letting me just think.
That said, while creating a script I tend to use VS Code's "#%%" cells to run parts of the script and test things. My notebooks usually contain a bunch of analysis code that doesn't need to be run, but should stay.
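For anyone unfamiliar with the feature: VS Code (and similar editors) treat a `# %%` comment in an ordinary `.py` file as a cell delimiter, so the same script can be run top-to-bottom or cell-by-cell in an interactive window. A minimal sketch:

```python
# %% setup cell: runs as part of the ordinary script,
# but can also be executed on its own in the interactive window
data = [1, 2, 3, 4]

# %% analysis cell: kept in the script for reference,
# even if it is usually only run interactively during exploration
total = sum(data)
print(total)  # → 10
```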
Makes sense, thanks! In that case, could we potentially reduce the width, which might (along with a smaller dataset) help scale SAEs to understanding mechanisms in big models?
Great work! Is there such a thing as too narrow a dataset? For refusal, what do you think happens if we specifically train on a bunch of examples that show signs of refusal?