Sparse autoencoders find composed features in small toy models
by Evan Anders, Clement Neo, Jason Hoelscher-Obermaier, and Jessica N. Howard
Summary
* Context: Sparse Autoencoders (SAEs) reveal interpretable features in the activation spaces of language models. They achieve sparse, interpretable features by minimizing a loss function which includes an ℓ1 penalty on the SAE hidden layer activations.
* Problem & Hypothesis: While the SAE ℓ1 penalty achieves sparsity, it has...
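To make the loss described in the summary concrete, here is a minimal sketch of an SAE objective: a reconstruction term plus an ℓ1 penalty on the hidden layer activations. It is an illustration under assumptions, not the authors' implementation: the ReLU encoder, the summed squared error, the function name `sae_loss`, and the parameter names (`W_enc`, `b_enc`, `W_dec`, `b_dec`, `l1_coeff`) are all hypothetical choices made for this example.

```python
# Illustrative sketch only: names, the ReLU encoder, and the summed squared
# error are assumptions for this example, not taken from the post.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Return (total, reconstruction, l1) for a single input vector x."""
    # Encoder: hidden activations f = ReLU(W_enc x + b_enc)
    f = [max(0.0, h + b) for h, b in zip(matvec(W_enc, x), b_enc)]
    # Decoder: reconstruction x_hat = W_dec f + b_dec
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    # Reconstruction term: summed squared error between x and x_hat
    recon = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    # Sparsity term: l1 norm of the hidden activations
    l1 = sum(abs(a) for a in f)
    return recon + l1_coeff * l1, recon, l1

# Usage: identity encoder/decoder on a 2-d input (hypothetical toy values).
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
total, recon, l1 = sae_loss([1.0, 0.0], identity, zeros, identity, zeros, 0.1)
```

In this toy case the reconstruction is exact, so the only contribution to the total is the ℓ1 term, which grows with the magnitude of the hidden activations regardless of how well the input is reconstructed.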
Mar 14, 2024