Nikita Balagansky
Comments
SAE Training Dataset Influence in Feature Matching and a Hypothesis on Position Features
Nikita Balagansky · 6mo · 20

Interesting! Glad to see our method being utilized in future research.

Do you have any metrics (e.g., Explained Variance or CE loss difference) on how SAEs trained on a specific dataset perform when applied to others? I suspect that if there is a small gap between the explained variance on the training dataset and other datasets, we might infer that, even though there’s no one-to-one correspondence between features learned across datasets, the combination of features retains a degree of similarity.
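For concreteness, a minimal sketch of the explained-variance comparison I have in mind (all names are hypothetical; `acts` and `recon` stand in for model activations and SAE reconstructions on a given dataset):

```python
import numpy as np

def explained_variance(acts: np.ndarray, recon: np.ndarray) -> float:
    """Fraction of activation variance captured by the SAE reconstruction.
    1.0 means perfect reconstruction; a train-vs-transfer gap in this
    number is the quantity of interest."""
    residual = acts - recon
    return 1.0 - residual.var() / acts.var()

# Stand-in data: pretend the SAE reconstructs activations up to small noise.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))                 # model activations
recon = acts + 0.1 * rng.normal(size=acts.shape)   # SAE reconstruction
ev = explained_variance(acts, recon)
```

Computing this once on the training dataset and once on a transfer dataset would make the gap explicit.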

Additionally, it would be intriguing to investigate whether features across datasets become more aligned as training steps increase. I suspect a clear correlation between the number of training steps and the percentage of matched features, up to a saturation point.
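One crude way to operationalize "percentage of matched features" across checkpoints (a sketch only, not the paper's matching procedure; the 0.9 cosine threshold is a hypothetical choice, and `W_a`/`W_b` stand in for decoder direction matrices from two SAEs):

```python
import numpy as np

def matched_feature_fraction(W_a: np.ndarray, W_b: np.ndarray,
                             threshold: float = 0.9) -> float:
    """Fraction of feature directions in W_a whose best cosine match
    in W_b exceeds `threshold` (greedy best-match, not one-to-one)."""
    A = W_a / np.linalg.norm(W_a, axis=1, keepdims=True)
    B = W_b / np.linalg.norm(W_b, axis=1, keepdims=True)
    sims = A @ B.T                      # pairwise cosine similarities
    return float((sims.max(axis=1) > threshold).mean())

# Stand-in dictionaries: identical SAEs should match fully,
# unrelated random directions should barely match at all.
rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64))
same = matched_feature_fraction(W, W)
cross = matched_feature_fraction(W, rng.normal(size=(128, 64)))
```

Plotting this fraction against training steps would show whether it rises and then saturates as I suspect.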

Reply
[PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Nikita Balagansky · 6mo · 20

While the authors claim that their approach is fundamentally different from transcoders, from my perspective it addresses the same issue: finding interpretable circuits. I agree that it modulates MLPs in a different way (e.g., connecting sparse inputs to sparse outputs rather than modulating the MLP directly). However, it would be great to see the difference between the circuits identified by transcoders and those found by JSAEs.

We also discuss this similarity in Appendix F of our recent work (where the transition T is analogous to f), though we do not consider gradient-based approaches. Nevertheless, the most similar features between input and output can still explain feature dynamics reasonably well, making them a good baseline.
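A minimal sketch of that nearest-feature baseline (hypothetical names; `W_in` and `W_out` stand in for input-side and output-side feature direction matrices, and the mapping is plain cosine nearest-neighbor, not our transition T):

```python
import numpy as np

def nearest_output_feature(W_in: np.ndarray, W_out: np.ndarray) -> np.ndarray:
    """For each input-side feature direction, return the index of its most
    cosine-similar output-side direction: a crude baseline that assumes
    each input feature simply maps to its nearest output feature."""
    A = W_in / np.linalg.norm(W_in, axis=1, keepdims=True)
    B = W_out / np.linalg.norm(W_out, axis=1, keepdims=True)
    return (A @ B.T).argmax(axis=1)

# Sanity check: identical dictionaries should map each feature to itself.
rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16))
idx = nearest_output_feature(W, W)
```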

Reply