TL;DR: In the last couple of years, there have been multiple hype moments of the form "<insert paper> figured out subquadratic/linear attention, this is a game changer!" However, every subquadratic attention mechanism I'm aware of either is still quadratic as implemented in practice (improving efficiency by only a constant factor) or underperforms quadratic attention on downstream capability benchmarks.
A central issue with attention is that its FLOP complexity is quadratic in the context length (the number of tokens in a sequence) and its memory complexity during inference is linear in the context length. In the last couple of years, there have been multiple claims, and hype around those claims, that new architectures solve one (often both) of these problems by replacing attention with alternatives whose FLOP complexity is linear and/or whose memory complexity during inference is constant. These are often called subquadratic/linear attention (as opposed to regular attention, which I’ll call quadratic attention). The ones I’m aware of are Kimi Linear, DeepSeek Sparse Attention (DSA), Mamba (and variants), RWKV (and variants), and text diffusion. If these claims were true, it would be a big deal because it would make transformer inference a lot more efficient at long contexts.
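To make the two scaling claims concrete, here is a rough back-of-the-envelope sketch of how attention FLOPs and KV-cache size grow with context length. It counts only the two attention matmuls (QK^T and the attention-weighted sum over V), ignores causal masking and the projection layers, and uses a made-up model configuration (32 layers, 32 heads, head dimension 128, 2 bytes per value) that is my own assumption for illustration, not any particular model.

```python
def attention_flops(n_tokens: int, d_head: int, n_heads: int, n_layers: int) -> int:
    """Approximate FLOPs for the attention matmuls over a full sequence.

    Counts one multiply and one add per term, ignores causal masking,
    softmax, and the Q/K/V/output projections.
    """
    qk_flops = 2 * n_tokens * n_tokens * d_head  # scores = Q @ K^T, per head
    av_flops = 2 * n_tokens * n_tokens * d_head  # output = softmax(scores) @ V, per head
    return n_layers * n_heads * (qk_flops + av_flops)


def kv_cache_bytes(n_tokens: int, d_head: int, n_kv_heads: int,
                   n_layers: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every past token."""
    values_per_token = 2 * d_head * n_kv_heads * n_layers  # 2 = one K and one V
    return n_tokens * values_per_token * bytes_per_value


if __name__ == "__main__":
    # Hypothetical configuration, chosen only to show the scaling behavior.
    cfg = dict(d_head=128, n_layers=32)
    for n in (1_000, 10_000, 100_000, 1_000_000):
        flops = attention_flops(n, n_heads=32, **cfg)
        cache = kv_cache_bytes(n, n_kv_heads=32, **cfg)
        print(f"n={n:>9,}  attention FLOPs={flops:.3e}  KV cache={cache / 2**30:.2f} GiB")
```

Under these assumptions, each 10x increase in context length multiplies the attention FLOPs by 100 and the KV cache by 10, which is the gap the subquadratic proposals are trying to close.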
In this blogpost, I argue that they are all better thought of as “incremental improvement number 93595 to the transformer architecture” than as “subquadratic attention, a more than incremental improvement to the transformer architecture”. This is because the implementations that work in practice are quadratic and improve on attention only by a constant factor, while the genuinely subquadratic implementations underperform quadratic attention on downstream benchmarks. I think some of them are still important and impressive; for instance, Kimi Linear’s 6.3x inference speedup at 1-million-token context lengths is a real gain. I just argue that they are not particularly special among incremental improvements to th