I've noticed that an L0 above 100 (for the Pythia-70M model) is too high, resulting in mostly polysemantic features (though some single-token features were still monosemantic).

Agreed w/ Arthur on the norms of features being the cause of the higher MSE. Here are the L2 norms I got. Input is for residual stream, output is for MLP_out.

Sam Marks (2y)

I agree that the L0s for 0_8192 are too high in later layers, though I'll note that I think this is mainly due to the cluster of high-frequency features (see the spike in the histogram). Features outside of this spike look pretty decent, and without the spike our L0s would be much more reasonable.

Here are four random features from layer 3, at a range of frequencies outside of the spike.

Layer 3, 0_8192, feature 138 (frequency = 0.003) activates on the newline at the end of the "field of the invention" section in patent applications. I think it's very likely predicting that the next few tokens will be "2. Description of the Related Art" (which always comes next in patents).

Layer 3, 0_8192, feature 27 (frequency = 0.009) seems to activate on the "is" in the phrase "this is".

Layer 3, 0_8192, feature 4 (frequency = 0.026) looks messy at first, but on closer inspection seems to activate on the final token of multi-token words in informative file/variable names.

Layer 3, 0_8192, feature 56 (frequency = 0.035) looks very polysemantic: it's activating on certain terms in LaTeX expressions, words in between periods in URLs and code, and some other random-looking stuff.

Logan Riggs (2y)

If you removed the high-frequency features to achieve some L0 norm X, how much would loss recovered change?

If you instead increased the L1 penalty to achieve the same L0 norm X, how would loss recovered change?

Ideally, we can interpret the parts of the model that are doing things, which I'm grounding out as loss recovered in this case.
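For concreteness, here is a minimal sketch of one common way to compute loss recovered. The formula is an assumption (this thread doesn't spell it out): it measures how much of the gap between zero-ablating the activations and running the model unmodified is closed when the activations are replaced by the dictionary's reconstruction. The function and argument names are hypothetical.

```python
def fraction_loss_recovered(clean_loss, reconstructed_loss, zero_ablation_loss):
    """Fraction of the (zero-ablation - clean) loss gap closed by the reconstruction.

    Assumed definition, not taken from the thread:
        (zero_ablation_loss - reconstructed_loss) / (zero_ablation_loss - clean_loss)
    1.0 means the reconstruction is as good as the original activations;
    0.0 means it is no better than replacing the activations with zeros.
    """
    gap = zero_ablation_loss - clean_loss
    if gap <= 0:
        raise ValueError("zero-ablation loss must be larger than the clean loss")
    return (zero_ablation_loss - reconstructed_loss) / gap
```

Both interventions in the questions above (pruning features, or retraining with a larger L1 penalty) would change only `reconstructed_loss`, so the two can be compared at a matched L0.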

Sam Marks (2y)

Here's an experiment I'm about to do:

Remove high-frequency features from 0_8192 layer 3 until it has L0 < 40 (the same L0 as the 1_32768 layer 3 dictionary)
Recompute statistics for this modified dictionary.
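The feature-selection step can be sketched roughly as follows. This is an illustration, not the code actually used: it assumes the feature activations are available as a NumPy array `acts` of shape (n_tokens, n_features), and the function name is hypothetical. It relies on the fact that the mean L0 equals the sum of the per-feature firing frequencies, so dropping the most frequent features first lowers L0 as fast as possible per feature removed.

```python
import numpy as np

def prune_high_frequency(acts, target_l0):
    """Keep the rarest features such that the mean L0 stays strictly below target_l0.

    acts: (n_tokens, n_features) array of feature activations (active means > 0).
    Returns (keep_mask, new_l0).
    """
    active = acts > 0
    freq = active.mean(axis=0)           # firing frequency of each feature
    order = np.argsort(freq)             # rarest features first
    # Mean L0 is the sum of frequencies, so the cumulative sum gives the L0
    # obtained by keeping only the k rarest features.
    cum_l0 = np.cumsum(freq[order])
    n_keep = int(np.searchsorted(cum_l0, target_l0, side="left"))
    keep_mask = np.zeros(acts.shape[1], dtype=bool)
    keep_mask[order[:n_keep]] = True
    new_l0 = active[:, keep_mask].sum(axis=1).mean()
    return keep_mask, new_l0
```

This only selects which features to keep. Recomputing MSE and loss recovered still requires zeroing the pruned features in the dictionary and rerunning the model with the modified reconstruction.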

I predict the resulting dictionary will be "like 1_32768 but a bit worse." Concretely, I'm guessing that means % loss recovered around 72%.

Results:

I killed all features of frequency larger than 0.038. This was 2041 features, and resulted in an L0 just below 40. The stats:

MSE Loss: 0.27 (worse than 1_32768)

Percent loss recovered: 77.9% (a little bit better than 1_32768)

I was a bit surprised by this -- it suggests the high-frequency features are disproportionately likely to be useful for reconstructing activations in ways that don't actually matter to the model's computation. (Though then again, maybe this is what we expect for uninterpretable features.)

It also suggests that we might be better off training dictionaries with a too-low L1 penalty and then just pruning away high-frequency features (sort of the dual operation of "train with a high L1 penalty and resample low-frequency features"). I'd be interested for someone to explore if there's a version of this that helps.

Arthur Conmy (2y)

Do you apply LR warmup immediately after doing resampling (i.e. immediately reducing the LR, and then slowly increasing it back to the normal value)? In my GELU-1L blog post I found this pretty helpful (in addition to doing LR warmup at the start of training).
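A minimal sketch of the kind of schedule described here, assuming a linear ramp and a known list of resampling steps. The function name, parameters, and ramp shape are illustrative assumptions, not the implementation from the GELU-1L post or from dictionary_learning.

```python
def lr_multiplier(step, resample_steps, warmup_steps):
    """Multiplier applied to the base learning rate at a given training step.

    The LR drops to 1/warmup_steps of its normal value at step 0 and at every
    resampling step, then ramps linearly back to 1.0 over warmup_steps steps.
    """
    # The start of training counts as a restart at step 0.
    restarts = [0] + [s for s in resample_steps if s <= step]
    steps_since_restart = step - max(restarts)
    return min(1.0, (steps_since_restart + 1) / warmup_steps)
```

With PyTorch, a function like this could be passed as the `lr_lambda` of `torch.optim.lr_scheduler.LambdaLR` (for example via `functools.partial` to fix `resample_steps` and `warmup_steps`).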

Sam Marks (2y)

At the time that I made this post, no, but this has been implemented in dictionary_learning since I saw your suggestion to do so in your linked post.

mishka (2y)

Thank you!

Layer	MSE Loss	L1 Loss	L0	Fraction Alive	Fraction Loss Recovered
0	0.003	6.132	9.951	0.998	0.984
1	0.008	6.677	44.739	0.887	0.924
2	0.011	11.44	62.156	0.587	0.867
3	0.018	23.773	175.303	0.588	0.902
4	0.022	27.084	174.07	0.806	0.927
5	0.032	47.126	235.05	0.672	0.972

Layer	MSE Loss	L1 Loss	L0	Fraction Alive	Fraction Loss Recovered
0	0.0018	4.32	2.873	0.174	0.946
1	0.017	2.798	11.256	0.159	0.768
2	0.023	6.151	16.381	0.118	0.724
3	0.044	11.571	39.863	0.226	0.765
4	0.048	13.665	29.235	0.19	0.816
5	0.069	26.4	43.846	0.13	0.931

LESSWRONG

Some open-source dictionaries and dictionary learning infrastructure
