I recently read Spectral Filters, Dark Signals, and Attention Sinks, an interesting paper on tracking down where excess attention in transformers gets dumped. The researchers found that transformers contain a "Dark Subspace" for storing information that isn't intended for the output layer. The attention sink is a specific manifestation of this: the model learns to dump the attention left over from the Softmax onto the first ([BOS]) token.
The authors used spectral filters to decompose the residual stream. The most interesting finding was the U-Dark subspace: the directions of the unembedding that barely map onto the vocabulary at all. By filtering on the tail end of the singular value decomposition (SVD) of the unembedding weights, they found that the model uses these "dark" signals to mark where attention should be dumped.
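To make the mechanics concrete, here is a minimal sketch of that kind of spectral filter. This is my own construction rather than the paper's code, and the dimensions and `k_dark` cutoff are arbitrary placeholders:

```python
import torch

def dark_tail_projector(W_U: torch.Tensor, k_dark: int) -> torch.Tensor:
    """Projector onto the k_dark smallest singular directions of the
    unembedding matrix W_U (shape: d_model x vocab_size)."""
    # SVD of the unembedding: W_U = U @ diag(S) @ Vh, singular values descending.
    U, S, Vh = torch.linalg.svd(W_U, full_matrices=False)
    # The last k_dark columns of U span the low-singular-value "dark" tail.
    U_dark = U[:, -k_dark:]              # (d_model, k_dark)
    return U_dark @ U_dark.T             # (d_model, d_model)

# Toy usage with random weights; a real W_U would come from a model checkpoint.
d_model, vocab_size = 768, 50257
W_U = torch.randn(d_model, vocab_size)
P_dark = dark_tail_projector(W_U, k_dark=64)

resid = torch.randn(d_model)             # one residual-stream vector
dark_part = P_dark @ resid               # component that is nearly invisible to the vocabulary
visible_part = resid - dark_part         # component that actually moves the logits
```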
When the "dark tail" was removed, the Negative Log-Likelihood (NLL) spiked because the attention mechanism lost its ability to stay "quiet" when it had nothing to say, leading to model confusion.
They created a "sink-preserving" filter that kept the "head" of the spectrum and the mechanical "dark tail," but deleted the middle. The model performed surprisingly well, proving that the "dark" tail is essential for stability and coherence, while the middle is not.
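A sink-preserving filter in that spirit could be sketched as below (again my own rough construction, not the authors' code; the head and tail sizes are made-up placeholders): keep the top singular directions and the dark tail, and zero out the residual-stream component that lives in the middle band.

```python
import torch

def sink_preserving_projector(W_U: torch.Tensor, k_head: int, k_dark: int) -> torch.Tensor:
    """Projector that keeps the spectral head and the dark tail of the
    unembedding W_U, discarding the middle band of singular directions."""
    U, _, _ = torch.linalg.svd(W_U, full_matrices=False)
    kept = torch.cat([U[:, :k_head], U[:, -k_dark:]], dim=1)
    return kept @ kept.T

# Filtering a residual-stream activation: head and dark tail pass through,
# the middle of the spectrum is removed.
d_model, vocab_size = 768, 50257
W_U = torch.randn(d_model, vocab_size)
P_keep = sink_preserving_projector(W_U, k_head=256, k_dark=64)

resid = torch.randn(d_model)
filtered_resid = P_keep @ resid
```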
I think an interesting future direction would be to see whether only a single token ends up carrying the dark-tail signal and serving as the sink, or whether there could be multiple sinks. Additionally, I would like to see whether we can manually designate a particular part of the input to act as the sink.
I suspect that in models trained with a "no-BOS" (no beginning-of-sequence token) constraint, we might find "distributed sinks" across high-frequency tokens like periods or "the". It would be worth investigating the spectral signature of these tokens to see whether they adopt the same "U-Dark" profile as the standard [BOS] sink.
After reading this paper, I realize that the concept of attention sinks is so cool! The fundamental reason attention sinks exist is the Softmax function. In a standard transformer, the attention mechanism is forced to distribute exactly one full unit of "attention" across the available tokens. If a specific attention head is looking for a pattern that doesn't exist in the current window (e.g., a head looking for "dates" in a poem), it cannot simply assign zero attention to everything. It needs a "trash bin."
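A toy example makes the "trash bin" pressure visible: when every content score is equally low, Softmax still spreads a full unit of attention over the content, but a single always-available sink position (the scores below are arbitrary numbers I picked) can soak almost all of it up.

```python
import torch

# A head that finds nothing relevant: all content scores are uniformly low.
content_scores = torch.tensor([-4.0, -4.0, -4.0, -4.0])
print(torch.softmax(content_scores, dim=0))
# tensor([0.25, 0.25, 0.25, 0.25]) -- the full 1.0 is still spread over content tokens.

# Add a sink position (e.g. the first token) with a modestly higher score.
scores_with_sink = torch.tensor([2.0, -4.0, -4.0, -4.0, -4.0])
print(torch.softmax(scores_with_sink, dim=0))
# roughly [0.990, 0.0025, 0.0025, 0.0025, 0.0025] -- the excess attention lands on the sink.
```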
It always surprises me to see how much there is to know about models. Transformers are a single type of ML model, yet they are deeply complex and mostly unexplored.
Here is a link to my (rough) code implementation using the nnsight library from NDIF:
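For a flavor of what nnsight makes easy, here is a minimal sketch along those lines; the model checkpoint, layer index, and prompt are placeholders I picked for illustration, not the exact code from the notebook. The idea is to trace a prompt, save the residual stream at one block, and compare per-token norms to see the first position stand out.

```python
from nnsight import LanguageModel

# Placeholder checkpoint and layer; the actual experiment may use a different setup.
model = LanguageModel("openai-community/gpt2", device_map="auto")

prompt = "The quick brown fox jumps over the lazy dog"

with model.trace(prompt):
    # Save the residual stream coming out of a mid-depth block: (batch, seq, d_model).
    resid = model.transformer.h[6].output[0].save()

# Sink-like positions tend to carry unusually large residual norms.
# (On older nnsight versions you may need resid.value instead of resid.)
per_token_norms = resid[0].norm(dim=-1)
print(per_token_norms)
```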