The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks



by Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache, Marius Hobbhahn

20th May 2024

AI Alignment Forum


Tags: Interpretability (ML & AI), Apollo Research (org), AI


tailcalled, 2y

I was thinking along similar lines, but eventually dropped the idea because I felt the gradients would likely miss something if, e.g., a saturated softmax prevents any gradient from flowing through. I find it interesting that the experiments also found that the interaction basis didn't work, and I wonder whether any of the failure here is due to saturated softmaxes.


Lucius Bushnaq, 2y

I doubt it. Evaluating gradients along an entire trajectory from a baseline gave qualitatively similar results.
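To make the contrast concrete, here is a minimal sketch of why a path-based gradient is not blinded by saturation. It is an illustration in the style of integrated gradients, not necessarily the procedure used in the experiments. The two-logit softmax (a sigmoid), the zero baseline, and the function names are all assumptions chosen for the example.

```python
import math

def sigmoid(x):
    # First output of a two-way softmax over logits [x, 0].
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid; vanishes when |x| is large (saturation).
    s = sigmoid(x)
    return s * (1.0 - s)

def path_integrated_attribution(x, baseline=0.0, steps=1000):
    # Average the gradient along the straight line from baseline to x
    # (midpoint rule), then scale by the displacement. The baseline and
    # the straight-line path are assumptions of this sketch.
    total = 0.0
    for k in range(steps):
        alpha = (k + 0.5) / steps
        total += sigmoid_grad(baseline + alpha * (x - baseline))
    return (x - baseline) * total / steps

x = 20.0  # deep in the saturated regime
local = sigmoid_grad(x) * x                  # gradient-times-input at the endpoint only
integrated = path_integrated_attribution(x)  # gradient accumulated along the path
exact_change = sigmoid(x) - sigmoid(0.0)     # what the path integral should recover
```

In this toy case `local` is close to zero, while `integrated` approximates `exact_change`, the full change in output between the baseline and the saturated point. So a trajectory-based evaluation would register an effect that the endpoint gradient alone misses.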

A saturated softmax also really does induce insensitivity to small changes. If two nodes are always connected by a saturated softmax, they can't be exchanging more than one bit of information. Though the importance of that bit can be large.
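The insensitivity can be checked directly with the softmax Jacobian, whose entries are p_i (δ_ij − p_j). The sketch below uses hypothetical logit values chosen only to contrast an unsaturated and a saturated case.

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(logits):
    # J[i][j] = d softmax_i / d logit_j = p_i * (delta_ij - p_j)
    p = softmax(logits)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def max_abs_entry(matrix):
    # Largest sensitivity of any output to any input logit.
    return max(abs(v) for row in matrix for v in row)

unsaturated = [0.5, 0.0, -0.5]   # illustrative logits, comparable in size
saturated = [20.0, 0.0, -20.0]   # illustrative logits, one dominates

sens_unsat = max_abs_entry(softmax_jacobian(unsaturated))
sens_sat = max_abs_entry(softmax_jacobian(saturated))
```

With these values `sens_sat` is many orders of magnitude smaller than `sens_unsat`: small perturbations of the logits barely move the saturated output, which is the insensitivity described above.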

My best guess for why the Interaction Basis didn't work is that sparse, overcomplete representations really are a thing. So in general, you're not going to get a good decomposition of LMs from a Cartesian basis of activation space.



Charlie Steiner, 2y

This was super interesting. I hadn't really thought about the tension between SLT and superposition before, but this is in the middle of it.

Like, there's nothing logically inconsistent with the best local basis for the weights being undercomplete while the best basis for the activations is overcomplete. But if both are true, it seems like the relationship to the data distribution has to be quite special (and potentially fragile).
