This is a research report on my attempt to extract interpretable features from a transformer MLP by distilling it into a larger student MLP, while encouraging sparsity by applying an L1 penalty to the student's activations, as depicted in Figure 1. I investigate the features learned by the distilled MLP, compare them to those found by an autoencoder, and discuss the limitations of this approach.
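To make the setup concrete, here is a minimal sketch (not the report's actual code) of the training objective: a wider student MLP is trained to reproduce a frozen transformer MLP's outputs, with an L1 penalty on the student's hidden activations. All names, dimensions, and coefficients (e.g. `d_model`, `expansion`, `l1_coeff`) are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, d_teacher_hidden, expansion, l1_coeff = 512, 2048, 8, 1e-3

# Stand-in for the frozen transformer MLP being distilled.
teacher_mlp = nn.Sequential(
    nn.Linear(d_model, d_teacher_hidden), nn.GELU(), nn.Linear(d_teacher_hidden, d_model)
).eval()
for p in teacher_mlp.parameters():
    p.requires_grad_(False)

class StudentMLP(nn.Module):
    """Wider MLP whose hidden activations we hope become sparse and interpretable."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.fc_in = nn.Linear(d_model, d_hidden)
        self.act = nn.ReLU()
        self.fc_out = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        h = self.act(self.fc_in(x))
        return self.fc_out(h), h

student = StudentMLP(d_model, expansion * d_teacher_hidden)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

def distillation_step(x):
    with torch.no_grad():
        target = teacher_mlp(x)                  # teacher output on MLP inputs x
    pred, hidden = student(x)
    recon = (pred - target).pow(2).mean()        # match the teacher's outputs
    sparsity = hidden.abs().mean()               # L1 penalty on student activations
    loss = recon + l1_coeff * sparsity
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

The L1 term plays the same role as the sparsity penalty in a sparse autoencoder; the difference here is that the student is trained on the teacher's input-output behavior rather than on reconstructing its activations.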
I find that a subset of the distilled MLP's neurons act as 'neuron simulators', mimicking the activations of the original MLP, while the remaining features...