Noa Nabeshima


Comments

23andMe says I have more Neanderthal DNA than 96% of users, and my ancestry composition is half Japanese and half European. Your Neanderthal link doesn't work for me.

Sometimes FLOP/s isn't the bottleneck for training models; it can be memory bandwidth, for example. My impression from poking around with Nsight and some other observations is that wide SAEs might actually be FLOP/s-bottlenecked, but I don't trust that impression much. I'd be interested in someone comparing these SAE architectures in terms of H100-seconds (or something like that) in addition to FLOPs.
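As a rough sketch of the kind of wall-clock comparison I mean (names like `sae` and `acts` are stand-ins, not anyone's actual code, and the loss is a placeholder):

```python
import time
import torch

def time_sae_step(sae, acts, n_warmup=10, n_iters=100):
    """Rough average wall-clock seconds per training step on GPU."""
    opt = torch.optim.Adam(sae.parameters())

    def step():
        recon = sae(acts)
        loss = (recon - acts).pow(2).mean()  # stand-in loss, not the real objective
        loss.backward()
        opt.step()
        opt.zero_grad()

    for _ in range(n_warmup):   # warm up kernels/allocator before timing
        step()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        step()
    torch.cuda.synchronize()    # GPU work is async; wait for it before reading the clock
    return (time.perf_counter() - start) / n_iters
```

Running this for two architectures at matched FLOPs per step would show whether the extra structure costs you in wall-time even when it doesn't cost you in FLOPs.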

Did it seem to you like this architecture also trained faster in terms of wall-time?

Anyway, nice work! It's cool to see these results.

I wonder if multiple heads having the same activation pattern in a context is related to the limited rank per head: once the VO subspace of each head is saturated with meaningful directions/features, maybe the model uses multiple heads to write out features that can't fit in the subspace of any one head.
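To make the "limited rank per head" point concrete, here's a minimal sketch (hypothetical sizes, not from any particular model): whatever a single head writes to the residual stream lives in a subspace of dimension at most d_head, since it factors through W_V @ W_O.

```python
import torch

d_model, d_head = 512, 64  # hypothetical sizes

# Per-head value and output projections.
W_V = torch.randn(d_model, d_head)
W_O = torch.randn(d_head, d_model)

# A head's contribution to the residual stream is (attn-weighted x) @ W_V @ W_O,
# so it lies in the image of W_V @ W_O, which has rank at most d_head.
W_OV = W_V @ W_O
print(torch.linalg.matrix_rank(W_OV))  # <= 64, no matter how large d_model is
```

So if a head already "spends" its d_head output directions on meaningful features, any additional features with the same activation pattern would have to be written out by other heads.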

Do you have any updates on this? I'm interested.

I've trained some sparse MLPs with 20K neurons on a 4L TinyStories model with ReLU activations and no layernorm, and I took a look at them after reading this post. For varying integer S, I applied an L1 penalty of  on the average of the activations per token, which seems pretty close to doing an L1 of  on the sum of the activations per token (a rough sketch of the objective is below, after the links). Your L1 of  with 12K neurons is sort of like  in my setup. After reading your post, I checked out the cosine similarity between the encoder/decoder weights of the original MLP neurons and the sparse MLP neurons for varying values of S (make sure to scroll down once you click one of the links!):

S=3
https://plotly-figs.s3.amazonaws.com/sparse_mlp_L1_2exp3

S=4
https://plotly-figs.s3.amazonaws.com/sparse_mlp_L1_2exp4

S=5
https://plotly-figs.s3.amazonaws.com/sparse_mlp_L1_2exp5

S=6
https://plotly-figs.s3.amazonaws.com/sparse_mlp_L1_2exp6
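For concreteness, the objective is roughly the following (hypothetical names and d_model; `l1_coeff` stands in for the actual penalty value at a given S, and my real training code differs in details):

```python
import torch
import torch.nn.functional as F

d_model, n_neurons = 256, 20_000   # 20K sparse neurons, hypothetical d_model

enc = torch.nn.Linear(d_model, n_neurons)
dec = torch.nn.Linear(n_neurons, d_model)

def sparse_mlp_loss(x, mlp_out, l1_coeff):
    """x: MLP inputs; mlp_out: original MLP outputs to match. Both [n_tokens, d_model]."""
    acts = F.relu(enc(x))                      # [n_tokens, n_neurons]
    recon = dec(acts)
    # Penalizing the mean activation with coefficient l1_coeff is essentially the same as
    # penalizing the per-token sum with coefficient l1_coeff / n_neurons (= l1_coeff / 20,000).
    return F.mse_loss(recon, mlp_out) + l1_coeff * acts.mean()
```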

I think the behavior you're pointing at is clearly there at lower L1s on layers other than layer 0 (? what's up with that?) and sort of decreases with higher L1 values, to the point that the behavior is there a bit at S=5 and almost not there at S=6. I think the non-dead sparse neurons are almost all interpretable at S=5 and S=6.
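For reference, the quantity in those plots is roughly this (hypothetical names; my plotting code differs in details), computed analogously for decoder weights:

```python
import torch
import torch.nn.functional as F

def max_cos_sims(W_mlp_in, W_sparse_enc):
    """W_mlp_in: [d_model, d_mlp] original MLP input weights;
    W_sparse_enc: [d_model, n_sparse] sparse MLP encoder weights."""
    a = F.normalize(W_mlp_in, dim=0)      # unit-normalize each original neuron's input direction
    b = F.normalize(W_sparse_enc, dim=0)  # unit-normalize each sparse neuron's input direction
    sims = a.T @ b                        # [d_mlp, n_sparse] pairwise cosine similarities
    return sims.max(dim=0).values         # best-matching original neuron for each sparse neuron
```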

Original val loss of the model: 1.128 ≈ 1.13.
Zero-ablation-of-MLP loss values per layer: [3.72, 1.84, 1.56, 2.07].

S=6 loss recovered per layer

Layer 0: 1 - (1.24 - 1.13)/(3.72 - 1.13) ≈ 96% of loss recovered
Layer 1: 1 - (1.18 - 1.13)/(1.84 - 1.13) ≈ 93% of loss recovered
Layer 2: 1 - (1.21 - 1.13)/(1.56 - 1.13) ≈ 81% of loss recovered
Layer 3: 1 - (1.26 - 1.13)/(2.07 - 1.13) ≈ 86% of loss recovered

Compare to the 79% loss recovered by Anthropic's A/1 autoencoder, which has 4K features and a pretty different setup.
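(The numbers above are just the fraction of the zero-ablation loss gap closed when the sparse MLP is spliced in; sketch with a hypothetical function name:

```python
def frac_loss_recovered(spliced_loss, zero_abl_loss, clean_loss=1.13):
    """Fraction of the zero-ablation loss gap closed by splicing in the sparse MLP."""
    return 1 - (spliced_loss - clean_loss) / (zero_abl_loss - clean_loss)

frac_loss_recovered(1.24, 3.72)  # layer 0 at S=6 -> ~0.96
```
)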

(Also, I was going to focus on the S=5 MLPs for layers 1 and 2, but now I think I might stick with S=6 instead. This is a little tricky because I wouldn't be surprised if TinyStories MLP neurons are interpretable at higher rates than those of other models.)

Basically I think sparse MLPs aren't a dead end and that you probably just want a higher L1.

"[word] and [word]" can be thought of as "the previous token is ' and'."

It might just be one of a family of linear features (or some aspect of some other representation?) corresponding to what the previous token is, used at least by induction heads.

Maybe the reason you found ' and' first is that ' and' is an especially frequent word: if you train on the normal document distribution, you'll find the most frequent features first.

Consistency models are trained from scratch in the paper, in addition to being distilled from diffusion models. I think text-conditioned generation will probably just work, but it's unclear to me (without much thought) how to do the equivalent of classifier-free guidance.
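For reference, the diffusion-side recipe I'd want an analogue of is roughly this (a sketch; `eps_model` and its signature are hypothetical names, not from the paper):

```python
import torch

def guided_eps(eps_model, x_t, t, cond, guidance_scale=7.5):
    """Standard classifier-free guidance on a diffusion model's noise prediction."""
    eps_uncond = eps_model(x_t, t, cond=None)   # unconditional prediction
    eps_cond = eps_model(x_t, t, cond=cond)     # text-conditional prediction
    # Extrapolate from the unconditional prediction toward the conditional one.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

The awkward part is that a consistency model predicts the denoised sample directly in one (or a few) steps, so there isn't an obvious per-step noise prediction to extrapolate in this way.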
