Sparse MLP Distillation
This is a research report about my attempt to extract interpretable features from a transformer MLP by distilling it into a larger student MLP, while encouraging sparsity by applying an L1 penalty to the activations, as depicted in Figure 1. I investigate the features learned by the distilled MLP, compare...
Jan 15, 2024
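The setup described above can be sketched as a single training objective: the student MLP matches the teacher MLP's outputs (a distillation MSE term) while an L1 penalty on its hidden activations encourages sparsity. The sketch below is illustrative, not the report's actual code; the dimensions, ReLU nonlinearity, and `l1_coeff` value are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions): in practice the teacher MLP is taken
# from a trained transformer layer, and the student hidden layer is wider.
d_model, d_teacher, d_student = 512, 2048, 8192

# Frozen teacher MLP (stand-in for the transformer's MLP block).
teacher = nn.Sequential(
    nn.Linear(d_model, d_teacher), nn.GELU(), nn.Linear(d_teacher, d_model)
)

# Larger student MLP whose hidden activations we want to be sparse.
student_in = nn.Linear(d_model, d_student)
student_out = nn.Linear(d_student, d_model)

def distill_loss(x: torch.Tensor, l1_coeff: float = 1e-3) -> torch.Tensor:
    """Distillation MSE plus L1 sparsity penalty on student activations."""
    with torch.no_grad():
        target = teacher(x)                 # teacher outputs are the targets
    acts = torch.relu(student_in(x))        # student hidden activations
    pred = student_out(acts)
    mse = (pred - target).pow(2).mean()     # match the teacher's outputs
    l1 = acts.abs().mean()                  # push activations toward zero
    return mse + l1_coeff * l1
```

In training, `distill_loss` would be minimized over activations `x` sampled from the transformer's residual stream; the `l1_coeff` trades off reconstruction fidelity against sparsity.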