x
Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization — LessWrong