Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

8tailcalled

6Lucius Bushnaq

2Charlie Steiner

1Review Bot


I was thinking along similar lines, but eventually dropped it because I felt the gradients would likely miss something if, e.g., a saturated softmax prevents any gradient from going through. I find it interesting that the experiments also found that the interaction basis didn't work, and I wonder whether any of the failure here is due to saturated softmaxes.

I doubt it. Evaluating gradients along an entire trajectory from a baseline gave qualitatively similar results.

A saturated softmax also really *does* induce insensitivity to small changes. If two nodes are always connected by a saturated softmax, they can't be exchanging more than one bit of information. Though the importance of that bit can be large.
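To make the exchange above concrete, here is a minimal toy sketch (not taken from either paper) of a two-way softmax in its saturated regime. It compares the local gradient at the saturated point with a gradient accumulated along a straight-line trajectory from a baseline, in the style of integrated gradients. The function names, the baseline of 0, and the logit value of 20 are all assumptions chosen for illustration.

```python
import math

def softmax_first(z0, z1):
    """Probability that a two-way softmax assigns to index 0."""
    m = max(z0, z1)  # subtract the max for numerical stability
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1)

def grad_wrt_z0(z0, z1):
    """Analytic derivative d p0 / d z0 = p0 * (1 - p0)."""
    p0 = softmax_first(z0, z1)
    return p0 * (1.0 - p0)

def integrated_gradient(z0, z1, baseline=0.0, steps=1000):
    """Midpoint-rule estimate of the integral of d p0 / d z0 along the
    straight line from `baseline` to `z0` (with z1 held fixed)."""
    total = 0.0
    for k in range(steps):
        alpha = (k + 0.5) / steps
        z = baseline + alpha * (z0 - baseline)
        total += grad_wrt_z0(z, z1)
    return (z0 - baseline) * total / steps

# Hypothetical saturated input: logit 0 is far larger than logit 1.
z0, z1 = 20.0, 0.0
local = grad_wrt_z0(z0, z1)                                # gradient at the saturated point
ig = integrated_gradient(z0, z1)                           # gradient accumulated along the path
delta = softmax_first(z0, z1) - softmax_first(0.0, z1)     # actual change in output
```

In this toy case `local` is on the order of 1e-9, while `ig` matches `delta`, which is close to 0.5. The local gradient reflects the insensitivity to small changes, and the path-integrated quantity still records that the softmax flipped from undecided to decided.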

My best guess for why the Interaction Basis didn't work is that sparse, overcomplete representations really are a thing. So in general, you're not going to get a good decomposition of LMs from a Cartesian basis of activation space.

This was super interesting. I hadn't really thought about the tension between SLT and superposition before, but this is in the middle of it.

Like, there's nothing logically inconsistent with the best local basis for the *weights* being undercomplete while the best basis for the *activations* is overcomplete. But if both are true, it seems like the relationship to the data distribution has to be quite special (and potentially fragile).


This is a linkpost for our two recent papers:

1. An exploration of using degeneracy in the loss landscape for interpretability: https://arxiv.org/abs/2405.10927
2. An empirical test of an interpretability technique based on the loss landscape: https://arxiv.org/abs/2405.10928

This work was produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs), Avery Griffin, Joern Stoehler, Magdalena Wache and Cindy Wu. Not to be confused with Apollo's recent Sparse Dictionary Learning paper.

A key obstacle to mechanistic interpretability is finding the right representation of neural network internals. Optimally, we would like to derive our features from some high-level principle that holds across different architectures and use cases. At a minimum, we know two things:

Generalization has been linked to basin broadness in the loss landscape in several ways, most notably by singular learning theory. It introduces the learning coefficient as a measure of basin broadness, which doubles as a measure of generalization error and replaces the parameter count in Occam's razor.
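As a rough reminder of what "replaces the parameter count" means, the standard asymptotic result from singular learning theory can be compared with the classical Bayesian information criterion. The notation is an assumption made here rather than taken from the post: $F_n$ is the Bayesian free energy on $n$ datapoints, $L_n(w_0)$ is the average negative log-likelihood at an optimal parameter $w_0$, $\lambda$ is the learning coefficient, and $d$ is the number of parameters. Lower-order terms are omitted.

```latex
F_n \;\approx\; n\,L_n(w_0) + \lambda \log n
\qquad \text{compared with} \qquad
\mathrm{BIC} \;=\; n\,L_n(w_0) + \frac{d}{2} \log n
```

For regular models $\lambda = d/2$ and the two expressions agree. For singular models such as neural networks, $\lambda$ can be smaller than $d/2$, so a broader (more degenerate) basin incurs a smaller complexity penalty.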

Inspired by both of these ideas, the first paper explores using the structure of the loss landscape to find the most computationally natural representation of a network. We focus on identifying parts of the network that are *not* responsible for low loss (i.e. degeneracy), inspired by singular learning theory. These degeneracies are an obstacle for interpretability as they mean there exist parameters which do not affect the input-output behavior in the network (similar to the parameters of a Transformer's W_V and W_O matrices that do not affect the product W_OV).

We explore 3 different ways neural network parameterisations can be degenerate:

This investigation leads to the interaction basis, and eventually the local interaction basis (LIB) that we test in the second paper. This basis removes computationally irrelevant features and interactions, and sparsifies the remaining interactions between layers.
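The W_V / W_O example of degeneracy mentioned earlier can be checked numerically. The sketch below is an illustration under assumed toy dimensions, not code from the papers. It applies an invertible matrix `A` to the head space and its inverse on the other side, which changes the individual parameters but leaves the product W_OV unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4  # toy sizes, chosen arbitrarily

W_V = rng.normal(size=(d_head, d_model))  # residual stream -> head space
W_O = rng.normal(size=(d_model, d_head))  # head space -> residual stream

# Build a well-conditioned invertible matrix on the head space:
# a random orthogonal matrix times a diagonal rescaling.
Q, _ = np.linalg.qr(rng.normal(size=(d_head, d_head)))
A = Q @ np.diag([0.5, 1.0, 2.0, 4.0])

# Reparameterise: W_V absorbs A, W_O absorbs its inverse.
W_V_new = A @ W_V
W_O_new = W_O @ np.linalg.inv(A)

W_OV = W_O @ W_V
W_OV_new = W_O_new @ W_V_new

max_param_change = np.abs(W_V_new - W_V).max()    # parameters moved noticeably
max_output_change = np.abs(W_OV_new - W_OV).max() # product is unchanged up to rounding
```

Every invertible `A` gives such a reparameterisation, so this family alone contributes `d_head ** 2` parameter directions that have no effect on the input-output behavior.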

Finally, we analyse how modularity is connected to degeneracy in the loss landscape. We suggest a preliminary metric for finding the sorts of modules that the neural network prior is biased towards.

The second paper tests how useful the LIB is in toy and language models. In this new basis we calculate integrated-gradient-based interactions between features, and analyse the graph of all features in a network. We interpret strongly-interacting features, and identify modules in this graph using the modularity metric of the first paper.

To derive the LIB we coordinate-transform the activations of neural networks in two steps: Step 1 is a transformation into the PCA basis, removing activation space directions which don't explain any variance. Step 2 is a transformation of the activations to align the basis with the right singular vectors of the gradient vector dataset. The 2nd step is the key new ingredient which aims to make *interactions* between adjacent layers sparse, and removes directions which do not affect downstream computation.

We test LIB on two toy models (modular addition & CIFAR-10), and two language models (Tinystories-1M & GPT2-small). On toy models we successfully find a basis that is more sparsely interacting and contains only computationally-relevant features, and we can identify circuits based on the interaction graphs. See the interaction graph for the modular addition transformer below (cherry-picked result).
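The two-step transformation described above can be sketched on a toy example. This is a heavily simplified illustration under stated assumptions, not the procedure from the paper, which involves further details not covered in this post. Here the "gradient vector dataset" is taken to be the gradient of a single made-up scalar downstream function with respect to the activations, and the names `downstream_grad`, `kept_directions`, `u1` and `u2` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_latent = 500, 6, 4

# Toy "layer activations": 6 neurons, but only 4 directions ever vary.
X = rng.normal(size=(n, d_latent)) @ rng.normal(size=(d_latent, d))

# Toy "downstream computation" that reads only two directions u1, u2.
u1, u2 = rng.normal(size=d), rng.normal(size=d)

def downstream_grad(X):
    """Gradient of f(x) = tanh(x.u1) + (x.u2)^2 w.r.t. x, one row per datapoint."""
    a, b = X @ u1, X @ u2
    return (1.0 - np.tanh(a) ** 2)[:, None] * u1 + (2.0 * b)[:, None] * u2

def kept_directions(M, rel_tol=1e-8):
    """Right singular vectors of M (as columns) with non-negligible singular value."""
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[s > rel_tol * s.max()].T

# Step 1: PCA basis, dropping directions that explain no variance.
mean = X.mean(axis=0)
P = kept_directions(X - mean)   # shape (d, k1)
X_pca = (X - mean) @ P          # activations in PCA coordinates

# Step 2: align with the right singular vectors of the gradient dataset.
# Chain rule: x = mean + P a, so df/da = P^T df/dx.
G_pca = downstream_grad(X) @ P  # gradients in PCA coordinates
R = kept_directions(G_pca)      # shape (k1, k2)
X_final = X_pca @ R             # activations in the final toy basis
```

In this toy setup step 1 reduces 6 neurons to 4 directions, and step 2 keeps only the 2 directions that the downstream function is sensitive to. The gradients have no component along the discarded directions, which is the sense in which those directions do not affect downstream computation to first order.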

On language models, however, we find that LIB does not help us understand the networks. Compared to a PCA baseline, interaction sparsity increases only slightly (for Tinystories-1M) or not at all (for GPT2-small), and we can identify no modules or interpretable features.

While this is mostly a negative result, we think there is valuable future work to be done in developing loss-landscape-inspired techniques for interpretability that make fewer assumptions than those that went into the derivation of LIB. Most notably, in deriving LIB, we did not assume superposition to be true because we wanted to start with the simplest possible version of the theory, and because we wanted to make a bet that was decorrelated with other research in the field. However, recent advances in sparse dictionary learning suggest that work which relaxes the assumptions of LIB to allow for superposition may find more interpretable features.