Anthropic's JumpReLU training method is really good

by chanind, Adrià Garriga-alonso
3rd Oct 2025

This work was done as part of MATS 7.1.

TL;DR: If you've given up on training JumpReLU SAEs, try Anthropic's JumpReLU training method. It's now supported in SAELens!

Back in January, Anthropic published some updates on how they train JumpReLU SAEs. The post didn't include sample code, benchmarks, or a theoretical justification for the changes, so it seems like the community basically shrugged and ignored it. After all, we already have the original GDM implementation in the Dictionary Learning and SAELens libraries, and most practitioners don't use JumpReLU SAEs anyway, since BatchTopK SAEs are much easier to train and are also considered state-of-the-art.

Why has JumpReLU not been popular?

The biggest issue I've had with the original GDM version of JumpReLU, and why I suspect JumpReLU SAEs are rarely used in practice, is that they are very difficult to train successfully. In my experience, training a JumpReLU SAE requires a very long run (~2 billion tokens or so). For most of that run, training will look broken: the L0 doesn't drop much, and increasing the L0 coefficient seems to have no effect until roughly 1 billion tokens into training. I have also never managed to get a GDM-style JumpReLU SAE to work in a toy model.
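For context on what's being trained here: the GDM recipe learns a per-latent threshold through the hard gate using kernel-based pseudo-gradients, which are nonzero only for pre-activations that land within a narrow bandwidth of the threshold. Below is a minimal, simplified PyTorch sketch of that idea (the class name, the rectangle surrogate, and the bandwidth default are illustrative, not the paper's exact pseudo-derivatives):

```python
import torch

class JumpStep(torch.autograd.Function):
    """Hard 0/1 gate H(x - theta) with a surrogate gradient.
    Forward: exact Heaviside step. Backward: pretend the step were a
    rectangle of width `bandwidth` around the threshold, so pre-activations
    and per-latent thresholds both receive (sparse) gradient signal."""

    @staticmethod
    def forward(ctx, x, theta, bandwidth):
        ctx.save_for_backward(x, theta)
        ctx.bandwidth = bandwidth
        return (x > theta).float()

    @staticmethod
    def backward(ctx, grad_out):
        x, theta = ctx.saved_tensors
        # Rectangle surrogate: nonzero only for pre-activations that land
        # within bandwidth/2 of this latent's threshold.
        near = ((x - theta).abs() < ctx.bandwidth / 2).float() / ctx.bandwidth
        return grad_out * near, -(grad_out * near).sum(dim=0), None

def jumprelu_with_l0(pre_acts, log_theta, bandwidth=0.001):
    theta = log_theta.exp()                  # keep thresholds positive
    gate = JumpStep.apply(pre_acts, theta, bandwidth)
    acts = pre_acts * gate                   # JumpReLU output
    l0 = gate.sum(dim=-1).mean()             # differentiable via the surrogate
    return acts, l0                          # loss = recon_mse + coeff * l0
```

Note that each threshold only gets gradient from the rare pre-activations that happen to fall inside that narrow window, which is one plausible reading of why thresholds move so slowly early in training.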

This is unfortunate since in theory JumpReLU SAEs should be superior to BatchTopK. JumpReLU allows each SAE latent to learn its own firing threshold, while BatchTopK enforces a single global threshold for the whole SAE. If there are cases where different latents should have different firing thresholds, then we should expect BatchTopK to underperform JumpReLU.
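To make the threshold difference concrete, here is a minimal PyTorch sketch of the two activation rules (shapes and values are illustrative; BatchTopK is shown in its training form, where the top k × batch_size activations across the whole batch are kept):

```python
import torch

def jumprelu(pre_acts: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """JumpReLU: latent i fires only if its pre-activation exceeds its own
    learned threshold theta[i]."""
    return pre_acts * (pre_acts > theta).float()

def batch_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """BatchTopK: keep the k * batch_size largest activations across the
    whole batch; the cutoff is a single value shared by every latent."""
    n_keep = k * pre_acts.shape[0]
    cutoff = pre_acts.flatten().topk(n_keep).values.min()
    return pre_acts * (pre_acts >= cutoff).float()

pre_acts = torch.relu(torch.randn(4, 8))   # 4 samples, 8 latents
theta = torch.full((8,), 0.5)              # per-latent thresholds (learned in practice)
sparse_jump = jumprelu(pre_acts, theta)    # per-latent cutoffs
sparse_topk = batch_topk(pre_acts, k=2)    # one global cutoff
```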

Anthropic's JumpReLU SAEs are easy to train!

For a recent paper, we wanted to evaluate JumpReLU SAEs against BatchTopK, so I decided to try Anthropic's SAE training method. I was very pleasantly surprised to find that it seems to solve all the training issues present in the original GDM implementation! Anthropic-style JumpReLU training "feels" like training a standard L1 SAE: if you change the sparsity coefficient, the L0 changes. It works without requiring a huge number of training tokens. It even works in toy models.

Below, we show some toy model results from the paper. One of the nice properties of L1 SAEs is that, in toy models, they tend to naturally "snap" to the correct L0 of the toy model as long as the L1 coefficient is set to any reasonably sane value. Anthropic-style JumpReLU SAEs also seem to do this (nowhere near as consistently as L1 SAEs, but still very nice to see).

In a toy model of superposition, we train an Anthropic-style JumpReLU SAE, vary the regularization coefficient (y-axis), and measure the resulting L0 of the SAE (x-axis) against the "true L0" of the toy model (vertical dotted line). The SAE "snaps" to the correct L0, similar to L1 SAEs!
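Here, the "true L0" is just the expected number of simultaneously active ground-truth features in the toy data. A minimal sketch of one such generator (the distribution and parameters are illustrative; the paper's exact setup may differ):

```python
import torch

def toy_sparse_features(batch_size=1024, n_features=64, p_active=0.05):
    """Sparse ground-truth features for a toy model of superposition:
    each feature fires independently with probability p_active, with a
    random magnitude in [0, 1) when it fires."""
    active = torch.rand(batch_size, n_features) < p_active
    return active.float() * torch.rand(batch_size, n_features)

feats = toy_sparse_features()
true_l0 = (feats > 0).float().sum(-1).mean()  # ~ 64 * 0.05 = 3.2
# An SAE trained on a compressed projection of `feats` "snaps" when its
# own measured L0 matches this value.
```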

We also find that these JumpReLU SAEs seem to outperform BatchTopK at sparse probing at high L0. We think this is due to the "snap" effect above, where the JumpReLU is able to keep each latent's threshold near the correct point even at high L0. We plot this below, using the k-sparse probing tasks from "Are Sparse Autoencoders Useful? A Case Study in Sparse Probing".

K=16 sparse probing results vs L0 for BatchTopK and Anthropic-style JumpReLU SAEs, Gemma-2-2b layer 12.
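For readers unfamiliar with the evaluation: a k-sparse probe is a classifier restricted to k SAE latents. A rough sketch (the mean-difference selection heuristic below is illustrative; see the cited paper for the actual protocol):

```python
import torch
from sklearn.linear_model import LogisticRegression

def k_sparse_probe(sae_acts: torch.Tensor, labels: torch.Tensor, k: int = 16):
    """Fit a probe on only k SAE latents, chosen by how much their mean
    activation differs between the two classes."""
    mean_pos = sae_acts[labels == 1].mean(dim=0)
    mean_neg = sae_acts[labels == 0].mean(dim=0)
    top_k = (mean_pos - mean_neg).abs().topk(k).indices
    probe = LogisticRegression(max_iter=1000)
    probe.fit(sae_acts[:, top_k].numpy(), labels.numpy())
    return probe, top_k
```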

Try it out!

We implemented Anthropic's JumpReLU training method in SAELens - give it a try! And thank you to Anthropic for sharing their training method with the community!
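Here is roughly what a training run might look like. This is a hypothetical sketch from memory of the SAELens runner API; the exact config fields and the architecture string for the Anthropic-style JumpReLU variant should be checked against the current SAELens docs:

```python
from sae_lens import LanguageModelSAERunnerConfig, SAETrainingRunner

# Hypothetical config sketch; verify field names and the JumpReLU
# architecture string against the SAELens documentation before running.
cfg = LanguageModelSAERunnerConfig(
    model_name="gemma-2-2b",
    hook_name="blocks.12.hook_resid_post",  # layer 12 residual stream
    d_in=2304,                              # Gemma-2-2b hidden size
    expansion_factor=16,
    architecture="jumprelu",                # assumed flag; check docs
    l1_coefficient=1e-3,                    # sparsity coefficient
    training_tokens=200_000_000,
)
sae = SAETrainingRunner(cfg).run()
```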

For more context on the plots in this post, check out our paper "Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders".