This work was done as part of MATS 7.1. I pointed Claude at our new synthetic Sparse Autoencoder benchmark, told it to improve Sparse Autoencoder (SAE) performance, and left it running overnight. By morning, it had boosted F1 score from 0.88 to 0.95. Within another day, with occasional input from...
Written as part of MATS 7.1. Math by Claude Opus 4.6. I know that models are able to represent exponentially more concepts than they have dimensions by engaging in superposition (representing each concept as a direction, and allowing those directions to overlap slightly), but what does this mean concretely? How...
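To make the "overlapping directions" intuition concrete, here is a minimal numerical sketch (not from the post itself; all numbers are illustrative): in d dimensions, far more than d random unit vectors can coexist while remaining nearly orthogonal, so each concept can get an almost-dedicated direction.

```python
import numpy as np

# Illustrative sketch of superposition: pack n >> d random unit vectors
# into d dimensions and check that pairwise overlaps stay small.
rng = np.random.default_rng(0)
d, n = 256, 4096  # 16x more "concepts" than dimensions

# Random directions on the unit sphere in R^d
vecs = rng.normal(size=(n, d))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Pairwise cosine similarities, excluding each vector with itself
sims = vecs @ vecs.T
off_diag = np.abs(sims[~np.eye(n, dtype=bool)])

# Typical overlap scales like 1/sqrt(d), so even the worst pair
# interferes only slightly.
print(f"mean |overlap|: {off_diag.mean():.3f}")
print(f"max  |overlap|: {off_diag.max():.3f}")
```

Typical overlaps concentrate around 1/√d ≈ 0.06 here, which is the sense in which the directions "overlap slightly" while still being distinguishable.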
This work was done as part of MATS 7.1. We recently added support for training and running Matching Pursuit SAEs (MP-SAEs) to SAELens, so I figured this was a good opportunity to train and open source some MP-SAEs, and share what I've learned along the way. Matching pursuit SAEs are...
This work was done as part of MATS 7.1. TL;DR: If you've given up on training JumpReLU SAEs, try out Anthropic's JumpReLU training method. It's now supported in SAELens! Back in January, Anthropic published some updates on how they train JumpReLU SAEs. The post didn't include any sample code or...
This work was done as part of MATS 7.1. For more details on the ideas presented here, check out our new workshop paper Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders. Nearly all work on Sparse Autoencoders (SAEs) includes a version of the classic "sparsity vs...
When we train Sparse Autoencoders (SAEs), the sparsity of the SAE, called L0 (the number of latents that fire on average), is treated as an arbitrary design choice. Papers introducing SAE architectures include plots of L0 vs reconstruction, as if any choice of L0 is equally valid. However, recent work that...
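The L0 statistic mentioned above is straightforward to measure. Here is a minimal sketch, using a toy ReLU encoder with random weights (the names and shapes are illustrative, not taken from any particular SAE library): L0 is just the average count of nonzero latents per input.

```python
import numpy as np

# Toy ReLU SAE encoder with random weights, purely for illustration.
rng = np.random.default_rng(0)
d_model, d_sae, batch = 64, 512, 1000

W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = -0.5 * np.ones(d_sae)  # negative bias pushes most latents below zero

x = rng.normal(size=(batch, d_model))       # a batch of model activations
latents = np.maximum(x @ W_enc + b_enc, 0)  # ReLU encoder

# L0 = average number of latents that fire (are nonzero) per example
l0 = (latents > 0).sum(axis=1).mean()
print(f"L0 = {l0:.1f} out of {d_sae} latents")
```

In a real SAE the sparsity penalty (or a TopK/JumpReLU constraint) sets this number, which is exactly the design choice the post argues is not arbitrary.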
Sparse Autoencoders (and other related feature extraction tools) often optimize for sparsity to extract human-interpretable latent representations from a model's activation space. We show analytically that sparsity naturally leads to feature absorption in a simplified untied SAE, and discuss how this makes SAEs less trustworthy to use for AI safety...