I recently read Spectral Filters, Dark Signals, and Attention Sinks, an interesting paper on tracking down where excess attention in transformers gets dumped. The researchers found that transformers contain a "Dark Subspace" for storing information that isn't intended for the output layer. The attention sink is a specific manifestation of this: the model learns to dump the attention left over from the Softmax onto the first ([BOS]) token.
The authors used spectral filters to decompose the residual stream. The most interesting finding was the U-Dark subspace: the directions of the unembedding that barely map onto the vocabulary at all. By filtering on the tail end of the singular value decomposition (SVD) of the unembedding weights, they found that the model uses these "dark" signals to mark where attention should be dumped.
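To make the mechanics concrete, here is a minimal sketch of that kind of spectral filter. This is my own construction rather than the paper's code, and the dimensions and `k_dark` cutoff are arbitrary placeholders:

```python
import torch

def dark_tail_projector(W_U: torch.Tensor, k_dark: int) -> torch.Tensor:
    """Projector onto the k_dark smallest singular directions of the
    unembedding matrix W_U (shape: d_model x vocab_size)."""
    # SVD of the unembedding: W_U = U @ diag(S) @ Vh, singular values descending.
    U, S, Vh = torch.linalg.svd(W_U, full_matrices=False)
    # The last k_dark columns of U span the low-singular-value "dark" tail.
    U_dark = U[:, -k_dark:]              # (d_model, k_dark)
    return U_dark @ U_dark.T             # (d_model, d_model)

# Toy usage with random weights; a real W_U would come from a model checkpoint.
d_model, vocab_size = 768, 50257
W_U = torch.randn(d_model, vocab_size)
P_dark = dark_tail_projector(W_U, k_dark=64)

resid = torch.randn(d_model)             # one residual-stream vector
dark_part = P_dark @ resid               # component that is nearly invisible to the vocabulary
visible_part = resid - dark_part         # component that actually moves the logits
```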
When the "dark tail" was removed, the Negative Log-Likelihood (NLL) spiked because the attention mechanism lost its ability to stay "quiet" when it had nothing to say, leading to model confusion.
They created a "sink-preserving" filter that kept the "head" of the spectrum and the mechanical "dark tail," but deleted the middle. The model performed surprisingly well, proving that the "dark" tail is essential for stability and coherence, while the middle is not.
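A sink-preserving filter in that spirit could be sketched as below (again my own rough construction, not the authors' code; the head and tail sizes are made-up placeholders): keep the top singular directions and the dark tail, and zero out the residual-stream component that lives in the middle band.

```python
import torch

def sink_preserving_projector(W_U: torch.Tensor, k_head: int, k_dark: int) -> torch.Tensor:
    """Projector that keeps the spectral head and the dark tail of the
    unembedding W_U, discarding the middle band of singular directions."""
    U, _, _ = torch.linalg.svd(W_U, full_matrices=False)
    kept = torch.cat([U[:, :k_head], U[:, -k_dark:]], dim=1)
    return kept @ kept.T

# Filtering a residual-stream activation: head and dark tail pass through,
# the middle of the spectrum is removed.
d_model, vocab_size = 768, 50257
W_U = torch.randn(d_model, vocab_size)
P_keep = sink_preserving_projector(W_U, k_head=256, k_dark=64)

resid = torch.randn(d_model)
filtered_resid = P_keep @ resid
```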
I think an interesting future direction would be to see whether only a single token ends up carrying the dark-tail signal and serving as the sink, or whether there could be multiple sinks. Additionally, I would like to see whether we can manually designate a particular part of the input to act as the sink.
I suspect that in models trained with a "no-BOS" (no beginning-of-sequence token) constraint, we might find "distributed sinks" across high-frequency tokens like periods or "the". It would be worth investigating the spectral signature of these tokens to see whether they adopt the same "U-Dark" profile as the standard [BOS] sink.
After reading this paper, I realize that the concept of attention sinks is so cool! The fundamental reason attention sinks exist is the Softmax function. In a standard transformer, the attention mechanism is forced to distribute exactly one full unit of "attention" across the available tokens. If a specific attention head is looking for a pattern that doesn't exist in the current window (e.g., a head looking for "dates" in a poem), it cannot simply assign zero attention to everything. It needs a "trash bin."
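A toy example makes the "trash bin" pressure visible: when every content score is equally low, Softmax still spreads a full unit of attention over the content, but a single always-available sink position (the scores below are arbitrary numbers I picked) can soak almost all of it up.

```python
import torch

# A head that finds nothing relevant: all content scores are uniformly low.
content_scores = torch.tensor([-4.0, -4.0, -4.0, -4.0])
print(torch.softmax(content_scores, dim=0))
# tensor([0.25, 0.25, 0.25, 0.25]) -- the full 1.0 is still spread over content tokens.

# Add a sink position (e.g. the first token) with a modestly higher score.
scores_with_sink = torch.tensor([2.0, -4.0, -4.0, -4.0, -4.0])
print(torch.softmax(scores_with_sink, dim=0))
# roughly [0.990, 0.0025, 0.0025, 0.0025, 0.0025] -- the excess attention lands on the sink.
```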
It always surprises me to see how much there is to know about models. Transformers are a single type of ML model, yet they are deeply complex and mostly unexplored.
Here is a link to my (rough) code implementation using the nnsight library from NDIF:
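For a flavor of what nnsight makes easy, here is a minimal sketch along those lines; the model checkpoint, layer index, and prompt are placeholders I picked for illustration, not the exact code from the notebook. The idea is to trace a prompt, save the residual stream at one block, and compare per-token norms to see the first position stand out.

```python
from nnsight import LanguageModel

# Placeholder checkpoint and layer; the actual experiment may use a different setup.
model = LanguageModel("openai-community/gpt2", device_map="auto")

prompt = "The quick brown fox jumps over the lazy dog"

with model.trace(prompt):
    # Save the residual stream coming out of a mid-depth block: (batch, seq, d_model).
    resid = model.transformer.h[6].output[0].save()

# Sink-like positions tend to carry unusually large residual norms.
# (On older nnsight versions you may need resid.value instead of resid.)
per_token_norms = resid[0].norm(dim=-1)
print(per_token_norms)
```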