Jessica N. Howard — LessWrong

32

2y

Sparse autoencoders find composed features in small toy models

by Evan Anders, Clement Neo, Jason Hoelscher-Obermaier, and Jessica N. Howard

Summary

* Context: Sparse Autoencoders (SAEs) reveal interpretable features in the activation spaces of language models. They achieve sparse, interpretable features by minimizing a loss function which includes an ℓ1 penalty on the SAE hidden layer activations.
* Problem & Hypothesis: While the SAE ℓ1 penalty achieves sparsity, it has...

Mar 14, 2024 • 34