For readers who are not new to mechanistic interpretability: describing features has been a main line of work in this area over the past few years.
Researchers used to treat individual neurons as the model's atomic unit, but careful investigation showed that a single neuron is often responsible for more than one concept and that neurons frequently activate in groups. This led to a shift toward viewing "features" as directions (composed of groups of neurons).
Now that we've established how concepts are represented in the model, we can try to find them, play with them, and even enhance or suppress them (have a look at Anthropic’s blog on steering a model for a nice example of this: https://www.anthropic.com/news/mapping-mind-language-model).
Although much research digs into the attention part of transformers, the authors decided to search for important features inside the MLPs, relying on previous papers claiming that MLPs carry many important signals and on the fact that most of the model’s parameters lie in the MLP layers.
In the introduction and related work, the authors mention early supervised and unsupervised methods for detecting features in the model, but they mainly focus on SAEs as the baseline for comparison.
The famous SAE (sparse autoencoder), probably the most common unsupervised approach, has (according to the authors and other papers) three main issues:
- They are not actually part of the model.
- They often “catch” features that aren’t necessarily what the model itself uses.
- They often cannot steer model behavior as well as expected.
They propose a new method to decompose MLP activations into interpretable features: Semi-Nonnegative Matrix Factorization (SNMF), which is an adaptation of the classic NMF (an NP-hard problem).
The method is as follows:
They gather activations from the MLP across many tokens (let’s say $n$ of them) and stack them into a matrix $A \in \mathbb{R}^{d_a \times n}$, which is then decomposed as $A \approx ZY$ with $Z \in \mathbb{R}^{d_a \times k}$ and $Y \in \mathbb{R}_{\ge 0}^{k \times n}$. $Y$ is initialized from a uniform distribution (values in $(0,1)$) and $Z$ from a normal distribution, and both are optimized via multiplicative-update rules (SNMF is non-convex, but we won’t dive into those details).
Matrix Z learns neuron groups: SNMF groups co-firing neurons. Each feature is a sparse pattern of neurons that activate together.
Matrix Y links features to tokens: for each token’s activation $a_j$ (a column of $A$), it gives nonnegative feature weights. A large $y_{i,j}$ means feature $i$ was active for token $j$.
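To make the $A \approx ZY$ factorization concrete, here is a minimal NumPy sketch of semi-NMF using the classic multiplicative updates of Ding, Li & Jordan (2010); the initialization follows the description above, while the update rule, iteration count, and convergence handling are illustrative assumptions rather than the paper’s exact recipe.

```python
import numpy as np

def semi_nmf(A, k, n_iters=200, eps=1e-9, seed=0):
    """Minimal semi-NMF sketch: A (d_a x n) ~= Z (d_a x k) @ Y (k x n), with Y >= 0.

    Z is unconstrained (can be negative); Y is kept nonnegative via
    multiplicative updates (Ding, Li & Jordan, 2010). Illustrative only.
    """
    rng = np.random.default_rng(seed)
    d_a, n = A.shape
    Z = rng.standard_normal((d_a, k))          # unconstrained neuron-group matrix
    Y = rng.uniform(0.0, 1.0, size=(k, n))     # nonnegative feature-token weights

    def pos(M):
        return (np.abs(M) + M) / 2.0           # elementwise positive part

    def neg(M):
        return (np.abs(M) - M) / 2.0           # elementwise negative part

    for _ in range(n_iters):
        # Closed-form least-squares update for Z given the current Y
        Z = A @ Y.T @ np.linalg.pinv(Y @ Y.T)
        # Multiplicative update keeps Y nonnegative
        ZtA, ZtZ = Z.T @ A, Z.T @ Z
        numer = pos(ZtA) + neg(ZtZ) @ Y
        denom = neg(ZtA) + pos(ZtZ) @ Y + eps
        Y = Y * np.sqrt(numer / denom)

    return Z, Y
```

Columns of Z are the candidate feature directions (neuron groups), and rows of Y tell you how strongly each feature fires on each token.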
They test the method in three different experiments:
1. Concept Detection
They used GPT-4o-mini to generate concept labels and then sentences with and without each concept (called “activating” and “neutral”), defining which inputs should trigger each feature. Feature-input alignment was measured via cosine similarity and the log-ratio $\log(\bar{a}_{\text{activating}} / \bar{a}_{\text{neutral}})$, where $\bar{a}$ is the mean feature activation over each sentence set. The result: SNMF outperformed the two SAEs it was compared against, even though it is smaller and sparser.
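The exact evaluation pipeline lives in the paper; the sketch below only makes the two scores concrete. Using a dot product with the feature direction as the “activation”, and all variable names, are my assumptions.

```python
import numpy as np

def detection_scores(z_i, acts_activating, acts_neutral, eps=1e-9):
    """Hedged sketch of the two alignment scores: cosine similarity and the
    log-ratio of mean feature activations on activating vs. neutral inputs.

    z_i: (d_a,) a feature direction (a column of Z).
    acts_activating / acts_neutral: (n, d_a) MLP activations for the two sentence sets.
    """
    def mean_proj(acts):
        # Mean projection of the inputs onto the feature direction
        return float(np.mean(acts @ z_i))

    def mean_cos(acts):
        # Mean cosine similarity between inputs and the feature direction
        denom = np.linalg.norm(acts, axis=1) * np.linalg.norm(z_i) + eps
        return float(np.mean((acts @ z_i) / denom))

    a_act = mean_proj(acts_activating)
    a_neu = mean_proj(acts_neutral)
    log_ratio = np.log(max(a_act, eps) / max(a_neu, eps))
    return mean_cos(acts_activating), log_ratio
```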
2. Concept Steering
Features learned by SNMF were injected into the model’s activations while it generated from a prompt (e.g., “I think that…”), and an LLM judge evaluated how well the concept appeared in the generated text and how fluent the output was.
Here they not only surpassed the SAEs but were at least equal to, and sometimes better than, Diff-in-Means, which is a supervised method (meaning it “knows what to search for” explicitly).
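Mechanically, this kind of steering is typically implemented by adding a scaled feature direction to the relevant activations during generation. Below is a hedged PyTorch sketch; the hooked module, layer index, and steering strength alpha are assumptions, not the paper’s settings.

```python
import torch

def make_steering_hook(z_i: torch.Tensor, alpha: float = 8.0):
    """Returns a forward hook that adds alpha * z_i to an MLP's output.

    z_i: an SNMF feature direction with the MLP output dimension.
    alpha: illustrative steering strength (a negative value would suppress the concept).
    """
    def hook(module, inputs, output):
        # output: (batch, seq_len, d_a) MLP activations; steer every position
        return output + alpha * z_i.to(device=output.device, dtype=output.dtype)
    return hook

# Usage sketch -- the module path below is model-specific and hypothetical:
# handle = model.model.layers[15].mlp.register_forward_hook(
#     make_steering_hook(torch.tensor(Z[:, i], dtype=torch.float32)))
# ... generate from the prompt, then: handle.remove()
```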
3. Neuron Compositionality
This part goes a step further. Since SNMF extracts neuron groups as interpretable features, it raises the question of how neurons combine to form them. They examined this in two ways:
- Recursive SNMF: Since SNMF works, why not do it again? They recursively broke down the previous Z matrix (the feature matrix) into even fewer features, enforcing a hierarchy. They succeeded in showing a structure like days → weekdays → day-of-week, demonstrating a deep organization of time units inside the model (a minimal sketch of the recursive step appears after this list).
- Causal tests on time features: They again used the time-unit idea and checked how the model responds to amplifying different feature groups. When they amplified the “day” concept, all 7 days received stronger logit scores. When they amplified a specific day, that day’s logit increased and the others were suppressed. Finally, they showed (via a nice algebraic analysis) that weekdays and weekend days share overlapping neurons, meaning the model internally represents weekday vs. weekend as distinct but related structures.
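To make the recursive step from the first bullet concrete, here is a tiny sketch that reuses the semi_nmf function from earlier; the k values at each level and the reading of the hierarchy are illustrative assumptions, not the paper’s exact procedure.

```python
# First level: decompose MLP activations into fine-grained features.
Z1, Y1 = semi_nmf(A, k=64)

# Second level: decompose the feature matrix itself into fewer, coarser features.
# Z1 ~= Z2 @ H with H >= 0, so each fine feature (column of Z1) becomes a
# nonnegative mixture of the coarser features in Z2, yielding a hierarchy
# over, e.g., the time-unit features described above.
Z2, H = semi_nmf(Z1, k=8)
```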
To sum things up, the paper shows that MLPs can help us interpret models quite well, and that SNMF can actually surface structure in a way that matches the model’s real mechanisms. It doesn’t “solve” interpretability, but it does show that neuron groups aren’t chaos and that the right decomposition can reveal, steer, and analyze meaningful internal concepts.