Nikita Balagansky
Comments
SAE Training Dataset Influence in Feature Matching and a Hypothesis on Position Features
Nikita Balagansky · 6mo · 20

Interesting! Glad to see our method being utilized in future research.

Do you have any metrics (e.g., Explained Variance or CE loss difference) on how SAEs trained on a specific dataset perform when applied to others? I suspect that if there is a small gap between the explained variance on the training dataset and other datasets, we might infer that, even though there’s no one-to-one correspondence between features learned across datasets, the combination of features retains a degree of similarity.
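For concreteness, a minimal sketch of the explained-variance comparison I have in mind (all names are hypothetical; `acts` and `recon` stand in for model activations and SAE reconstructions on a given dataset):

```python
import numpy as np

def explained_variance(acts: np.ndarray, recon: np.ndarray) -> float:
    """Fraction of activation variance captured by the SAE reconstruction.
    1.0 means perfect reconstruction; a train-vs-transfer gap in this
    number is the quantity of interest."""
    residual = acts - recon
    return 1.0 - residual.var() / acts.var()

# Stand-in data: pretend the SAE reconstructs activations up to small noise.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))                 # model activations
recon = acts + 0.1 * rng.normal(size=acts.shape)   # SAE reconstruction
ev = explained_variance(acts, recon)
```

Computing this once on the training dataset and once on a transfer dataset would make the gap explicit.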

Additionally, it would be intriguing to investigate whether features across datasets become more aligned as training steps increase. I suspect a clear correlation between the number of training steps and the percentage of matched features, up to a saturation point.
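One crude way to operationalize "percentage of matched features" across checkpoints (a sketch only, not the paper's matching procedure; the 0.9 cosine threshold is a hypothetical choice, and `W_a`/`W_b` stand in for decoder direction matrices from two SAEs):

```python
import numpy as np

def matched_feature_fraction(W_a: np.ndarray, W_b: np.ndarray,
                             threshold: float = 0.9) -> float:
    """Fraction of feature directions in W_a whose best cosine match
    in W_b exceeds `threshold` (greedy best-match, not one-to-one)."""
    A = W_a / np.linalg.norm(W_a, axis=1, keepdims=True)
    B = W_b / np.linalg.norm(W_b, axis=1, keepdims=True)
    sims = A @ B.T                      # pairwise cosine similarities
    return float((sims.max(axis=1) > threshold).mean())

# Stand-in dictionaries: identical SAEs should match fully,
# unrelated random directions should barely match at all.
rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64))
same = matched_feature_fraction(W, W)
cross = matched_feature_fraction(W, rng.normal(size=(128, 64)))
```

Plotting this fraction against training steps would show whether it rises and then saturates as I suspect.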

Reply
[PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Nikita Balagansky · 6mo · 20

While the authors claim that their approach is fundamentally different from transcoders, from my perspective it addresses the same issue: finding interpretable circuits. I agree that it modulates MLPs in a different way (e.g., connecting sparse inputs to sparse outputs rather than modulating the MLP directly). However, it would be great to see the difference between the circuits identified by transcoders and those found by JSAEs.

We also discuss this similarity in Appendix F of our recent work (where the transition T is analogous to f), though we do not consider gradient-based approaches. Nevertheless, the most similar features between input and output can still explain feature dynamics reasonably well, making them a good baseline.
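A minimal sketch of that nearest-feature baseline (hypothetical names; `W_in` and `W_out` stand in for input-side and output-side feature direction matrices, and the mapping is plain cosine nearest-neighbor, not our transition T):

```python
import numpy as np

def nearest_output_feature(W_in: np.ndarray, W_out: np.ndarray) -> np.ndarray:
    """For each input-side feature direction, return the index of its most
    cosine-similar output-side direction: a crude baseline that assumes
    each input feature simply maps to its nearest output feature."""
    A = W_in / np.linalg.norm(W_in, axis=1, keepdims=True)
    B = W_out / np.linalg.norm(W_out, axis=1, keepdims=True)
    return (A @ B.T).argmax(axis=1)

# Sanity check: identical dictionaries should map each feature to itself.
rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16))
idx = nearest_output_feature(W, W)
```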

Reply