x

LESSWRONG

LW

uri kialy — LessWrong

uri kialy

uri kialy

Message

1

7mo

uri kialy

7mo

Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization

To those who are not new to mechanistic interpretability – describing features has been a main domain in this area in past years. In the past, researchers used to think that neurons inside the model are an atomic unit, but after some careful investigation they discovered that neurons are responsible...

Nov 25, 2025•1