Tokenized SAEs: Infusing per-token biases.
tl;dr: We introduce the notion of adding a per-token decoder bias to SAEs. Put differently, we add a lookup table indexed by the last seen token. This results in a Pareto improvement across existing architectures (TopK and ReLU) and models (GPT-2 small and Pythia 1.4B). Attaining the same...
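The core idea can be sketched in a few lines: the decoder output gains an extra bias term looked up by the current token id. This is a minimal NumPy sketch, not the authors' implementation; all names, dimensions, and the ReLU encoder are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, vocab_size = 16, 64, 100  # toy sizes, chosen for illustration

W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_dec = np.zeros(d_model)
# Per-token decoder bias: a lookup table with one bias vector per vocab entry.
token_bias = np.zeros((vocab_size, d_model))

def encode(x):
    # Plain ReLU SAE encoder (a TopK encoder would work the same way).
    return np.maximum(x @ W_enc, 0.0)

def decode(feats, token_ids):
    # Standard reconstruction plus the bias indexed by the last seen token.
    return feats @ W_dec + b_dec + token_bias[token_ids]

x = rng.normal(size=(4, d_model))        # a batch of 4 residual-stream vectors
toks = np.array([3, 7, 7, 42])           # the token id at each position
x_hat = decode(encode(x), toks)
print(x_hat.shape)  # (4, 16)
```

Because `token_bias` is just an embedding-style table, it trains alongside the other SAE parameters and adds negligible compute at inference.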
Aug 4, 2024