Sparse autoencoders find composed features in small toy models
Summary

* Context: Sparse Autoencoders (SAEs) reveal interpretable features in the activation spaces of language models. They achieve sparse, interpretable features by minimizing a loss function that includes an ℓ1 penalty on the SAE's hidden-layer activations.
* Problem & Hypothesis: While the SAE ℓ1 penalty achieves sparsity, it has...
Mar 14, 2024
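To make the loss described in the summary concrete, here is a minimal sketch of an SAE trained with an ℓ1 penalty on its hidden activations. This is an illustrative PyTorch implementation under assumed details, not the authors' code: the class name `SparseAutoencoder`, the dimensions, and the `l1_coeff` hyperparameter are all assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal one-hidden-layer SAE sketch (illustrative, not the authors' code)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # hidden-layer activations ("features")
        x_hat = self.decoder(f)          # reconstruction of the input activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    """Reconstruction error plus an l1 penalty on hidden activations.

    l1_coeff is a hypothetical hyperparameter controlling sparsity strength.
    """
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()  # squared reconstruction error
    sparsity = f.abs().sum(dim=-1).mean()          # l1 penalty encouraging sparse features
    return recon + l1_coeff * sparsity
```

The `l1_coeff` term trades off reconstruction fidelity against sparsity: larger values drive more hidden activations toward zero.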