Interpretability (ML & AI) · Sparse Autoencoders (SAEs) · AI
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

by Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks, Neel Nanda
11th Dec 2024
AI Alignment Forum
2 min read
This is a linkpost for https://www.neuronpedia.org/sae-bench/info

6 comments, sorted by top scoring
Bogdan Ionut Cirstea · 8mo

We find that these evaluation results are nuanced and there is no one ideal SAE configuration - instead, the best SAE varies depending on the specifics of the downstream task. Because of this, we cannot combine the results into a single number without obscuring tradeoffs. Instead, we provide a range of quantitative metrics so that researchers can measure the nuanced effects of experimental changes.

It might be interesting (perhaps in the not-very-near future) to study whether automated scientists (maybe roughly in the shape of existing ones, like https://sakana.ai/ai-scientist/) using these evals as proxy metrics might be able to come up with better (e.g. Pareto-improved) SAE architectures, hyperparameters, etc., and whether adding more metrics might help; as an analogy, this seems to be the case for using more LLM-generated unit tests for LLM code generation, see Dynamic Scaling of Unit Tests for Code Reward Modeling.

Adam Karvonen · 8mo

SAEs are early enough that there's tons of low-hanging fruit and ideas to try. They also require relatively little compute (often around $1 for a training run), so AI agents could afford to test many ideas. I wouldn't be surprised if SAE improvements were a good early target for automated AI research, especially if the feedback loop is just "Come up with idea, modify existing loss function, train, evaluate, get a quantitative result".

Bogdan Ionut Cirstea · 8mo

They also require relatively little compute (often around $1 for a training run), so AI agents could afford to test many ideas.

Ok, this seems surprisingly cheap. Can you say more about what such a $1 training run typically looks like (what the hyperparameters are)? I'd also be very interested in any analysis of how SAE (computational) training costs scale vs. base LLM pretraining costs.

I wouldn't be surprised if SAE improvements were a good early target for automated AI research, especially if the feedback loop is just "Come up with idea, modify existing loss function, train, evaluate, get a quantitative result".

This sounds spiritually quite similar to what's already been done in Discovering Preference Optimization Algorithms with and for Large Language Models, and I'd expect something roughly like that to produce something interesting, especially if a training run only costs $1.

Adam Karvonen · 8mo

A $1 training run would be training 6 SAEs across 6 sparsities at 16K width on Gemma-2-2B for 200M tokens. This includes generating the activations, and it would be cheaper if the activations are precomputed. In practice this seems like large enough scale to validate ideas such as the Matryoshka SAE or the BatchTopK SAE.
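For concreteness, here is a minimal sketch of how a run at that scale might be parameterized. Only the numbers stated above (Gemma-2-2B, 16K width, 200M tokens, six sparsity levels) come from the comment; every other value below is an assumed placeholder, not the actual SAEBench training setup.

```python
from dataclasses import dataclass, replace

@dataclass
class SAETrainConfig:
    model_name: str = "google/gemma-2-2b"   # from the comment
    hook_layer: int = 12                    # assumed residual-stream layer
    d_model: int = 2304                     # Gemma-2-2B hidden size
    dict_size: int = 16_384                 # "16K width"
    num_tokens: int = 200_000_000           # "200M tokens"
    batch_size: int = 4096                  # assumed
    lr: float = 3e-4                        # assumed
    l1_coef: float = 1e-3                   # sparsity penalty, swept below

base = SAETrainConfig()
# Six sparsity levels, here expressed as six L1 penalties (values are assumptions).
configs = [replace(base, l1_coef=c) for c in (2e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2)]
```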

Neel Nanda · 8mo

Yeah, if you're doing this, you should definitely precompute and save activations.
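A minimal sketch of precomputing and caching residual-stream activations with a forward hook, assuming a HuggingFace model; the layer index, placeholder corpus, and file name are illustrative choices rather than a prescribed pipeline.

```python
# Sketch: cache residual-stream activations once so repeated SAE training runs
# don't have to re-run the base model. Layer choice and storage are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

layer_idx = 12  # assumed hook point
cached = []

def save_hidden_states(module, inputs, output):
    # HF decoder layers return a tuple; output[0] is the hidden-state tensor.
    cached.append(output[0].detach().to(torch.float16).cpu())

handle = model.model.layers[layer_idx].register_forward_hook(save_hidden_states)

texts = ["Example text to cache activations for."]  # placeholder corpus
with torch.no_grad():
    batch = tok(texts, return_tensors="pt")
    model(**batch)
handle.remove()

# Assumes equal sequence lengths across batches; trivially true with one batch.
torch.save(torch.cat(cached, dim=0), f"activations_layer{layer_idx}.pt")
```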

matijasever · 9mo

Sparsity’s a mess. SAEBench’s nuanced approach is a game changer. No “one-size” magic.


Adam Karvonen*, Can Rager*, Johnny Lin*, Curt Tigges*, Joseph Bloom*, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Samuel Marks, Neel Nanda (*equal contribution)

TL;DR

  • We are releasing SAE Bench, a suite of 8 diverse sparse autoencoder (SAE) evaluations including unsupervised metrics and downstream tasks. Use our codebase to evaluate your own SAEs!
  • You can compare 200+ SAEs of varying sparsity, dictionary size, architecture, and training time on Neuronpedia.
  • Think we're missing an eval? We'd love for you to contribute it to our codebase! Email us.

🔍 Explore the Benchmark & Rankings

📊 Evaluate your SAEs with SAEBench

✉️ Contact Us

Introduction

Sparse Autoencoders (SAEs) have become one of the most popular tools for AI interpretability. A lot of recent interpretability work has focused on studying SAEs, and in particular on improving them, e.g. the Gated SAE, TopK SAE, BatchTopK SAE, ProLU SAE, JumpReLU SAE, Layer Group SAE, Feature Choice SAE, Feature Aligned SAE, and Switch SAE. But how well do any of these improvements actually work?
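As a concrete example of one such variant, here is a minimal sketch of a TopK SAE forward pass (keep only the k largest pre-activations per token and reconstruct from those); the dimensions are placeholders, and training details such as decoder-norm constraints and auxiliary losses are omitted.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK SAE sketch; dimensions are placeholders, training details omitted."""

    def __init__(self, d_model: int = 2304, dict_size: int = 16_384, k: int = 64):
        super().__init__()
        self.k = k
        self.W_enc = nn.Parameter(torch.randn(d_model, dict_size) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(dict_size))
        self.W_dec = nn.Parameter(torch.randn(dict_size, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Encode, then keep only the k largest pre-activations per token.
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        top = torch.topk(pre, self.k, dim=-1)
        acts = torch.zeros_like(pre).scatter_(-1, top.indices, torch.relu(top.values))
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts
```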

The core challenge is that we don't know how to measure how good an SAE is. The fundamental premise of SAEs is that they are a useful interpretability tool that unpacks concepts from model activations. The lack of ground-truth labels for a model's internal features has led the field to measure and optimize the proxy of sparsity instead. This objective has successfully produced interpretable SAE latents, but sparsity has known problems as a proxy, such as feature absorption and composition of independent features. Yet most SAE improvement work merely measures whether reconstruction improves at a given sparsity, potentially missing problems like uninterpretable high-frequency latents or increased composition.
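For reference, a small sketch of the two standard proxy measurements discussed here, L0 sparsity (average number of active latents per token) and reconstruction error, reusing the TopK SAE sketch above with random placeholder activations in place of real cached model activations.

```python
import torch

def proxy_metrics(sae, acts: torch.Tensor):
    """Return (L0, MSE): average active latents per token and reconstruction error."""
    recon, latents = sae(acts)
    l0 = (latents != 0).float().sum(dim=-1).mean()
    mse = (recon - acts).pow(2).mean()
    return l0.item(), mse.item()

# Placeholder activations; in practice these would be cached model activations.
sae = TopKSAE(d_model=2304, dict_size=16_384, k=64)
l0, mse = proxy_metrics(sae, torch.randn(8, 2304))
```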

In the absence of a single, ideal metric, we argue that the best way to measure SAE quality is to give a more detailed picture with a range of diverse metrics. In particular, SAEs should be evaluated according to their performance on downstream tasks, a robust signal of usefulness.

Our comprehensive benchmark provides insight into fundamental questions about SAEs, like what the ideal sparsity, training time, and other hyperparameters are. To showcase this, we've trained a custom suite of 200+ SAEs of varying dictionary size, sparsity, training time, and architecture (holding all else constant). Browse the evaluation results covering Pythia-70m and Gemma-2-2B on Neuronpedia.
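As an illustration of what such a sweep looks like (varying dictionary size, sparsity, and architecture while holding everything else constant), here is a small sketch of enumerating a configuration grid; the specific values are assumptions, not the exact grid behind the released suite.

```python
from itertools import product

# Illustrative sweep; the released suite's exact values may differ.
dict_sizes = [4_096, 16_384, 65_536]
sparsity_levels = [20, 40, 80, 160, 320, 640]  # e.g. target L0 / k values (assumed)
architectures = ["standard", "topk", "gated", "jumprelu"]

sweep = [
    {"dict_size": d, "sparsity": s, "architecture": a}
    for d, s, a in product(dict_sizes, sparsity_levels, architectures)
]
print(len(sweep), "SAE training configurations")
```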

SAEBench enables a range of use cases, such as measuring progress with new SAE architectures, revealing unintended SAE behavior, tuning training hyperparameters, and selecting the best SAE for a particular task. We find that these evaluation results are nuanced and there is no one ideal SAE configuration - instead, the best SAE varies depending on the specifics of the downstream task. Because of this, we cannot combine the results into a single number without obscuring tradeoffs. Instead, we provide a range of quantitative metrics so that researchers can measure the nuanced effects of experimental changes.

We are releasing a beta version of SAEBench, including a convenient demonstration notebook that evaluates custom SAEs on multiple benchmarks and plots the results. Our flexible codebase allows you to easily add your own evaluations.


Check out the original post with interactive plots for more details on metrics and takeaways!
