Overview: We train a tiny ReLU network to output sparse top-k distributions over a vocabulary much larger than its residual dimension. The trained network seems to converge to a mechanism closely resembling a Bloom filter: tokens are assigned sparse binary hashes, the hidden layer computes an approximate union indicator, and...
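To make the Bloom-filter analogy concrete, here is a minimal hand-written sketch in NumPy. It is not the trained network from the post: the function names, the random hash assignment, and the sizes are all assumptions chosen for illustration. Each token gets a sparse binary hash, the "hidden layer" takes the union of the hashes of the active tokens (written with ReLUs), and decoding keeps every token whose hash bits are all set.

```python
import numpy as np

def make_hashes(vocab_size, num_bits, bits_per_token, seed=0):
    """Assign each token a sparse binary hash (hypothetical random assignment)."""
    rng = np.random.default_rng(seed)
    hashes = np.zeros((vocab_size, num_bits), dtype=np.int64)
    for token in range(vocab_size):
        on = rng.choice(num_bits, size=bits_per_token, replace=False)
        hashes[token, on] = 1
    return hashes

def union_indicator(hashes, tokens):
    """Bitwise OR of the hashes of `tokens`, expressed with ReLUs."""
    counts = hashes[tokens].sum(axis=0)
    # For integer counts, relu(c) - relu(c - 1) equals min(c, 1).
    return np.maximum(counts, 0) - np.maximum(counts - 1, 0)

def decode(hashes, union, bits_per_token):
    """Return every token whose hash bits are all present in the union."""
    matches = hashes @ union
    return np.flatnonzero(matches == bits_per_token)

# Assumed sizes: vocabulary much larger than the number of hash bits.
hashes = make_hashes(vocab_size=1000, num_bits=64, bits_per_token=4)
active = [3, 17, 256]
union = union_indicator(hashes, active)
decoded = decode(hashes, union, bits_per_token=4)
```

As with any Bloom filter, the active tokens are always recovered, but `decoded` may also contain false positives: tokens whose hash bits happen to be covered by the union. Fewer active tokens or more hash bits make such collisions rarer.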
Overview: This post builds on Circuits in Superposition 2, using the same terminology. I focus on the z=1 case, meaning that exactly one circuit is active on each forward pass. This restriction simplifies the setting substantially and allows a construction with zero error, with T = D^2/d^2, L+2 layers, and width...
Introduction: Research using SAEs has shown that models contain a "recognition" feature, that is, a feature that activates when the current token is the last token of a known entity. However, while SAEs are incredibly helpful for finding interesting features and aiding hypothesis generation, they don't tell...
Edit: See Decomposing Attention to Find Context-Sensitive Neurons for a cleaner write-up of the same material. Decomposition of attention patterns: LayerNorm is a problem for understanding models, because it makes it hard to analyze content and position separately. Thankfully, the model wants to separate content and position, and so it...