Applying SVD to neural nets in general is not a new idea. It’s been used a fair amount in the field (e.g. Saxe, Olah), but mostly in relation to some input data: either you run SVD on the activations, or on some input-output correlation matrix.

You generally need some data to compare against in order to understand what each vector of your factorization represents exactly. What’s interesting about this technique (imo - and this is mostly Beren’s work so not trying to toot my own horn here) is twofold:

  1. You don’t have to run your model over a whole evaluation set - which can be very expensive - to do this sort of analysis. In fact, you don’t have to do a forward pass on your model at all. Instead you can project the weight matrix you want to analyse into the embedding space (as first noted in logit lens and https://arxiv.org/pdf/2209.02535.pdf) and factorize the resulting matrix (there’s a rough sketch of this at the end of this comment). Now you can analyse each SVD vector with respect to the model’s vocabulary, and get an idea at a glance of what kind of processing each layer is doing. This could prove useful in future scenarios where, e.g., we want computationally efficient interpretability methods that can be run during training to check for deception, or to otherwise debug a model’s behaviour.
  2. The degree of interpretability of these simple factorizations suggests that the matrices we’re analysing operate on largely* linear representations - which could be good news for the MI field in general, as we haven’t made much headway analysing non-linear features.

*As Peter mentions below - we should avoid over-updating on this. Linear features are almost certainly low-hanging fruit. Even if they represent “the majority” of the computation going on inside the network in some sense, it’s likely that understanding all of the linear features in a network will not give us the full story about its behaviour.
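
To make point 1 concrete, here’s a minimal sketch of the kind of data-free analysis I mean, using GPT-2 small via HuggingFace transformers. The layer index, the choice of matrix (the MLP output projection), and projecting the right singular vectors specifically are illustrative assumptions on my part, not the exact recipe from the paper:

```python
# Minimal sketch: SVD of a weight matrix projected onto the vocabulary,
# no forward passes required. Layer index and the choice of matrix
# (the MLP output projection) are arbitrary / illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# MLP output projection of block 10 (transformers' Conv1D stores this as
# [d_mlp, d_model], i.e. it already maps into the residual stream space).
W = model.transformer.h[10].mlp.c_proj.weight.detach()   # [3072, 768]
W_U = model.lm_head.weight.detach()                       # [vocab, 768]

# Factorize the weights directly - no data, no activations.
U, S, Vt = torch.linalg.svd(W, full_matrices=False)

# Each right singular vector is a direction in the residual stream that this
# matrix writes along. Project it through the unembedding and read off the
# tokens it pushes the logits towards.
for i in range(5):
    token_scores = W_U @ Vt[i]                            # [vocab]
    top_ids = torch.topk(token_scores, 10).indices
    top_tokens = [tokenizer.decode([int(t)]) for t in top_ids]
    print(f"singular vector {i} (sigma={float(S[i]):.1f}): {top_tokens}")
```

If the top tokens for a given singular vector form a coherent cluster (e.g. all related to a single topic or grammatical role), that’s the kind of “interpretable at a glance” behaviour point 2 is gesturing at.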