Showing SAE Latents Are Not Atomic Using Meta-SAEs
by Bart Bussmann, Michael Pearce, Patrick Leask, Joseph Bloom, Lee Sharkey, and Neel Nanda
Bart, Michael and Patrick are joint first authors. Research conducted as part of MATS 6.0 in Lee Sharkey and Neel Nanda’s streams. Thanks to Mckenna Fitzgerald and Robert Krzyzanowski for their feedback!

TL;DR:

* Sparse Autoencoder (SAE) latents have been shown to typically be monosemantic (i.e. correspond to an interpretable...
Aug 24, 2024