Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization
To those who are not new to mechanistic interpretability – describing features has been a main domain in this area in past years. In the past, researchers used to think that neurons inside the model are an atomic unit, but after some careful investigation they discovered that neurons are responsible...
Nov 25, 20251