LESSWRONG

Pavan Katta

Comments

Scaling Laws and Superposition
Pavan Katta · 1y

This might all be academic if d_ff ≫ −n(1−S) log(1−S) (i.e. the dimension of the feed-forward layer is big enough that you run out of meaningful features long before you run out of space to store them).

Thanks for the feedback, this is a great point! I haven't come across evidence in real models that points toward this. My default assumption was that they operate near the upper bound of superposition capacity. It would be great to know if they don't, as it affects how we estimate the number of features and, subsequently, the SAE expansion factor.
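For a rough sense of how that bound scales, here's a sketch (assumptions: S is read as the probability a given feature is inactive, and the feature count n = 1,000,000 is made up for illustration, not measured from any real model):

```python
import math

def feature_capacity_bound(n, S):
    """Right-hand side of the condition d_ff >> -n(1-S)log(1-S).

    n: assumed number of candidate features (illustrative, not measured)
    S: sparsity, read here as P(a given feature is inactive)
    """
    return -n * (1 - S) * math.log(1 - S)

# As sparsity rises, the bound shrinks fast, so a fixed d_ff
# covers far more candidate features at high sparsity.
for S in (0.9, 0.99, 0.999):
    print(S, feature_capacity_bound(1_000_000, S))
```

Comparing these values against a typical d_ff would say which regime a given model sits in, if this reading of the bound is right.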

Incidental polysemanticity
Pavan Katta · 2y

Great work! Love the push for intuitions especially in the working notes.

My understanding of the superposition hypothesis from the TMS paper has been (feel free to correct me!):

  • When there's no privileged basis, polysemanticity is the default, as there's no reason to expect interpretable neurons.
  • When there's a privileged basis, either because of a nonlinearity on the hidden layer or L1 regularisation, the default is monosemanticity, and superposition pushes towards polysemanticity when there's enough sparsity.

Is it possible that the features here are not sufficiently basis-aligned, and this is closer to case 1? As you already commented, demonstrating polysemanticity when the hidden layer has a nonlinearity and m > n would be principled, imo.
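A minimal sketch of the setup I have in mind, not the one from the paper: a tied-weight toy autoencoder with a ReLU on the hidden layer and m > n, trained on sparse features. All sizes, the learning rate, the L1 coefficient, and the final polysemanticity heuristic are made-up illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 8            # n features, m hidden neurons, m > n (hypothetical sizes)
S = 0.9                # per-feature probability of being zero
l1, lr = 1e-3, 0.02

W = rng.normal(scale=0.3, size=(m, n))
b = np.zeros(m)

for step in range(5000):
    # sparse synthetic feature vector
    x = rng.uniform(size=n) * (rng.uniform(size=n) > S)
    pre = W @ x + b
    h = np.maximum(0.0, pre)          # ReLU makes the hidden basis privileged
    xhat = W.T @ h                    # tied decoder
    e = xhat - x
    # manual gradients of ||e||^2 + l1 * ||h||_1
    g_h = 2 * (W @ e) + l1 * np.sign(h)
    g_pre = g_h * (pre > 0)
    gW = np.outer(g_pre, x) + np.outer(h, 2 * e)
    W -= lr * gW
    b -= lr * g_pre

# Crude heuristic: count how many features each neuron responds to
# (threshold at 30% of the max weight magnitude; entirely made up).
resp = np.abs(W) > 0.3 * np.abs(W).max()
print(resp.sum(axis=1))   # >1 in a row would suggest a polysemantic neuron
```

The interesting question is whether rows of W stay roughly feature-aligned here (the monosemantic default of case 2) or end up mixing features anyway.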
