TL;DR: In the last couple years, there have been multiple hype moments of the form "<insert paper> figured out subquadratic/linear attention, this is a game changer!" However, all the subquadratic attention mechanisms I'm aware of either are quadratic the way they are implemented in practice (with efficiency improved by only a constant factor) or underperform quadratic attention on downstream capability benchmarks.
A central issue with attention is that its FLOP complexity is quadratic in the context length (number of tokens in a sequence) and its memory complexity during inference is linear in the context length. In the last couple years, there have been multiple claims, and hype around those claims, that new architectures solved some (often all) of those problems by making alternatives to attention whose FLOP complexity is linear and/or whose memory complexity during inference is constant. These are often called subquadratic/linear attention (as opposed to regular attention which I’ll call quadratic attention). The ones I’m aware of are Kimi Linear, DeepSeek Sparse Attention (DSA), Mamba (and variants), RWKV (and variants), and text diffusion. If this were true, it would be a big deal because it would make transformer inference a lot more efficient at long contexts.
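To make these complexity claims concrete, here is a minimal cost-model sketch for one attention layer processing a full sequence; the dimension names, the factor-of-2 multiply-accumulate convention, and FP16 KV entries are illustrative assumptions of mine, not taken from any particular model.

```python
# Rough per-layer cost model for standard (quadratic) attention over a full
# sequence. Illustrative only: ignores MLPs, projections, and implementation details.

def attention_flops_per_layer(context_length: int, d_model: int) -> int:
    # Scores: every query is compared against every key ->
    # context_length^2 dot products of length d_model.
    score_flops = 2 * context_length**2 * d_model
    # The weighted sum over the values has the same shape -> also quadratic.
    value_flops = 2 * context_length**2 * d_model
    return score_flops + value_flops  # O(context_length^2)

def kv_cache_bytes_per_layer(context_length: int, d_model: int,
                             bytes_per_entry: int = 2) -> int:
    # One key vector and one value vector cached per token -> linear in context.
    return 2 * context_length * d_model * bytes_per_entry  # O(context_length)

# Doubling the context length roughly quadruples attention FLOPs and doubles the KV cache.
assert attention_flops_per_layer(8192, 1024) == 4 * attention_flops_per_layer(4096, 1024)
assert kv_cache_bytes_per_layer(8192, 1024) == 2 * kv_cache_bytes_per_layer(4096, 1024)
```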
In this blogpost, I argue that they are all better thought of as “incremental improvement number 93595 to the transformer architecture” than as “subquadratic attention, a more than incremental improvement to the transformer architecture”. This is because the implementations that work in practice are quadratic and only improve attention by a constant factor, while the genuinely subquadratic implementations underperform quadratic attention on downstream benchmarks. I think some of them are still important and impressive - for instance, Kimi Linear’s 6.3x increased inference speed at 1 million token context lengths is impressive. I just argue that they are not particularly special among incremental improvements to the transformer architecture and are not game changers.
Kimi Linear and DeepSeek Sparse Attention (DSA) are actually quadratic as they are implemented in practice in the models that Kimi and DeepSeek trained using them. In Kimi Linear’s case, this is because they only use Kimi Linear on ¾ of the layers and use MLA, which is quadratic, on the remaining ¼ of the layers. They do not use Kimi Linear on all layers because it degrades downstream benchmark performance too much. In the setting where the improvement is biggest (inference with a context length of 1M tokens), the improvement is 4x in KV cache size (memory) and 6.3x in inference speed. There is also a modest improvement in downstream benchmark performance. DSA does not reduce KV cache size but decreases per-token cost by about 3x (prompt) and 7x (output) at its maximal context length of 128k tokens. It is still quadratic.
Kimi are very clear about this in the paper and say everything I said here in the abstract. However, some people (not from Kimi) still hype Kimi Linear as subquadratic attention, which is why I included it here. Kimi is not to blame here and wrote an excellent paper.
This is clear after a careful reading of DeepSeek’s paper, though DeepSeek emphasizes this less than Kimi.
Mamba and RWKV do actually have linear FLOP complexity and constant memory complexity during inference. However, while they perform comparably to attention in small to medium-sized models, they seem to underperform attention in terms of downstream benchmark performance at frontier scale and are not used in frontier LLMs. My main reason for believing this is that I do not know of any frontier LLM that uses them, except for Mamba-attention hybrid models - models that have Mamba on a fraction of layers and quadratic attention on the other layers (see the appendix for why this is still quadratic). Some papers on frontier Mamba-attention hybrid models do preliminary analysis comparing pure Mamba and Mamba-attention hybrid models. When they do, they usually say that pure Mamba models underperformed hybrids and that this is why they stuck to hybrid architectures. This provides empirical validation that pure Mamba underperforms hybrid architectures. A few 7B models, for example Codestral Mamba, do use pure Mamba, and their papers find that it is as good as or even a bit better than quadratic attention on downstream capability benchmarks. However, the overwhelming majority of 7B models still use quadratic attention.
While text diffusion models can greatly reduce memory usage by eliminating the need for KV caches entirely, they do not reduce the FLOP usage. In fact, they multiply the number of FLOPs needed for inference by a constant factor. Furthermore, same as for pure Mamba, no frontier model uses text diffusion and only a small number of sub-frontier models use it.
There exist many incremental improvements that reduce the FLOP and/or memory usage of attention by a constant factor and that are not derived from, or related to, subquadratic attention. A (probably non-exhaustive) list of such improvements that no one claims are subquadratic attention: flash attention, Grouped Query Attention (GQA), sliding window attention (on some but not all layers), sparse attention, Multi Latent Attention (MLA), and making MLPs wider and attention narrower.
Appendix: Short explanation of how each subquadratic attention mechanism works and why it is not actually subquadratic
RWKV and Mamba
These are mechanisms entirely different from attention that can be thought of as (much) better RNNs. They actually are subquadratic (in fact, linear), but they seem to underperform attention at frontier LLM scale, as argued above. Mamba-attention hybrids do scale but are quadratic, for the same reason explained below for Kimi Linear.
Kimi Linear
Similar to Mamba and RWKV, Kimi Linear can be thought of as a (much) better RNN, and it does actually have linear FLOP complexity and constant memory complexity during inference. However, as stated in the Kimi Linear paper, they use Kimi Linear on ¾ of the layers and Multi Latent Attention (which is quadratic) on the remaining ¼ of the layers. They say in the paper that when they tried using Kimi Linear on every layer, the hit to performance was too big:
Despite efficiency, pure Linear Attention still struggle with precise memory retrieval and exact copying. This deficiency hinders their adoption in industrial-scale LLMs where robust long-context recall (e.g., beyond 1M tokens) and reliable tool-use over extensive code repositories are critical.
And:
For Kimi Linear, we chose a layerwise approach (alternating entire layers) over a headwise one (mixing heads within layers) for its superior infrastructure simplicity and training stability. Empirically, a uniform 3:1 ratio, i.e., repeating 3 KDA layers to 1 full MLA layer, provided the best quality–throughput trade-off.
Thus, Kimi Linear as used in practice reduces the FLOPs and memory used by the attention mechanism to roughly a constant fraction - the fraction of layers that keep quadratic attention, in the paper’s case ¼, i.e. a roughly 4x reduction (the reduction is smaller at shorter context lengths).
(Note on why the speed improvement at a context length of 1 million tokens is 6.3x, which is bigger than 4x: in addition to making attention faster by a factor of almost 4x at long context, Kimi Linear also makes the KV cache smaller by a factor of almost 4x, which allows bigger batch sizes (by a factor of almost 4x) and thus faster inference beyond the roughly 4x improvement in attention FLOPs.)
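As a back-of-the-envelope sketch of the two paragraphs above: the only input taken from the paper is the 3:1 KDA:MLA layer ratio; treating the KDA layers’ attention cost and state as negligible at 1M-token context, and assuming batch size scales inversely with KV cache size, are simplifying assumptions of mine.

```python
# Back-of-the-envelope version of the constant-factor argument above.

mla_layer_fraction = 1 / 4  # 1 quadratic MLA layer for every 3 linear KDA layers

# At very long contexts, the quadratic MLA layers dominate both attention FLOPs
# and the KV cache, so both shrink to roughly the fraction of layers keeping MLA.
attention_flop_reduction = 1 / mla_layer_fraction  # ~4x fewer attention FLOPs
kv_cache_reduction = 1 / mla_layer_fraction        # ~4x smaller KV cache

# The smaller KV cache additionally allows ~4x bigger batches in the same GPU
# memory, which is why the measured end-to-end speedup (6.3x) can exceed the ~4x
# attention-FLOP reduction without reaching 4x * 4x = 16x (costs that don't
# shrink, like MLPs, cap the overall gain).
print(attention_flop_reduction, kv_cache_reduction)
```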
DeepSeek Sparse Attention (DSA)
DSA was introduced in the DeepSeek V3.2 paper and DeepSeek V3.2, a frontier model, uses it. It works in the following way:
1. At each layer, the lightning indexer, which is a modified attention mechanism, chooses 2048 positions.
2. A regular Multi Latent Attention (MLA) mechanism then attends only to those positions.
Thus, DSA’s FLOP complexity has two components: the lightning indexer has (up to a constant factor) the same complexity as regular MLA (which is quadratic), and the subsequent MLA has complexity min(context_length**2, 2048 * context_length), which is linear at big context lengths.
So if the lightning indexer were in practice hugely cheaper than regular MLA, the complexity would be effectively linear at the context lengths that matter; but if it is only cheaper by a small constant factor, the complexity is still quadratic, just with a smaller constant factor.
And the theoretical FLOP usage of the lightning indexer is only smaller than that of regular MLA by a factor of 8, so the complexity is still quadratic (at least in terms of theoretical FLOP usage). Here is the calculation that leads to 8: first, n_heads * d_head of the lightning indexer is half the n_heads * d_head of the subsequent MLA. This is not written in the paper, but can be seen by inspecting the model’s config on HuggingFace. Then, the lightning indexer only has keys and queries, no values and outputs, so that’s another factor of 2. Finally, the lightning indexer runs in FP8, not FP16, which is another factor of 2.
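Here is a toy version of this cost model, in arbitrary units where regular MLA costs 1 per attended position. It ignores all non-attention costs (MLPs, projections, memory traffic), so it will not reproduce DeepSeek’s measured dollar costs; it only shows why the improvement saturates at a constant factor rather than growing with context length.

```python
# Toy model of attention-related FLOPs for the token at position L.

TOP_K = 2048  # number of positions selected by the lightning indexer

def mla_cost(L: int) -> float:
    # Regular MLA attends to all L previous positions.
    return float(L)

def dsa_cost(L: int) -> float:
    # The indexer still scans all L positions, just ~8x cheaper per position
    # (the 2 * 2 * 2 factor derived above); the subsequent MLA then attends
    # to at most TOP_K positions.
    return L / 8 + min(L, TOP_K)

for L in [4_000, 32_000, 128_000, 1_000_000]:
    print(L, round(mla_cost(L) / dsa_cost(L), 1))
# Ratios: ~1.6x, ~5.3x, ~7.1x, ~7.9x -- approaching but never exceeding 8x,
# so total attention FLOPs remain quadratic in context length.
```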
For prefill (prompt) tokens, this calculation matches DeepSeek’s empirical findings: figure 3 in the DeepSeek V3.2 paper shows that, at big context lengths, the slope of cost (in dollars) per token as a function of position in the sequence is about 8x smaller than for MLA. For decoding (output) tokens, the slope is about 20x smaller, not 8x, but this is still a constant factor improvement. The improvements in per-token cost for the token at position 128k are 3.5x for prefill tokens and 9x for decoding tokens (if you look at the average token at context length 128k, and not only at the last one, these go down to about 3x and 7x). Note that in November 2025 (the latest date for which data is available as of writing this blogpost), OpenRouter processed 8x more prompt tokens than output tokens.
Furthermore, DSA does not reduce the KV cache size (because the 2048 tokens it attends to are different for every generated token and are only known once that token is generated). This matters because an important way in which subquadratic attention helps capabilities is by reducing KV cache size, which increases inference speed by allowing bigger batch sizes during inference (thus making inference cheaper) and allows longer context lengths by fitting the KV cache for more tokens per gigabyte of GPU memory.
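To make the batch-size point concrete, here is a toy memory budget; every number in it (free GPU memory, KV cache bytes per token) is made up purely for illustration, and only the shape of the relationship matters.

```python
# Toy memory budget: batch size is (roughly) inversely proportional to the
# KV cache per sequence, so halving the KV cache roughly doubles throughput.

def max_batch_size(free_gpu_memory_bytes: int, kv_cache_bytes_per_token: int,
                   context_length: int) -> int:
    kv_cache_bytes_per_sequence = kv_cache_bytes_per_token * context_length
    return free_gpu_memory_bytes // kv_cache_bytes_per_sequence

free_memory = 40 * 10**9  # hypothetical 40 GB left over after model weights

print(max_batch_size(free_memory, 100_000, 128_000))  # hypothetical baseline -> 3
print(max_batch_size(free_memory, 50_000, 128_000))   # KV cache halved -> 6, i.e. 2x the batch
```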
Text Diffusion
Autoregressive LLMs (that is, all LLMs except for text diffusion LLMs) generate output tokens one by one in sequence, doing one forward pass per output token. A text diffusion LLM generates all the tokens at once in a single forward pass, but leaves X% of tokens blank. Then, it generates tokens in place of Y% of the blank tokens, also in a single forward pass. It repeats this a fixed number of times, after which no blank tokens remain.
Thus, while text diffusion eliminates the need for KV caches, it multiplies the FLOP usage on output tokens by a constant factor - the number of forward passes needed until no blank tokens remain.
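Here is a minimal sketch of this generation loop, just to make the FLOP accounting concrete. `model` is a placeholder returning one predicted token per position, and the random choice of which blanks to fill is a simplification - real text diffusion models typically pick positions by model confidence.

```python
import random

def diffusion_generate(model, prompt_tokens, n_output_tokens, n_steps, mask_token):
    # Start with every output position blank (masked).
    tokens = list(prompt_tokens) + [mask_token] * n_output_tokens
    for step in range(n_steps):
        # One forward pass over *all* positions, masked and unmasked alike.
        predictions = model(tokens)  # placeholder: one predicted token per position
        blanks = [i for i, t in enumerate(tokens) if t == mask_token]
        # Fill a fraction of the remaining blanks; by the last step, all of them.
        n_to_fill = max(1, len(blanks) // (n_steps - step))
        for i in random.sample(blanks, min(n_to_fill, len(blanks))):
            tokens[i] = predictions[i]
    return tokens

# FLOP accounting: an autoregressive LLM does n_output_tokens forward passes of
# one new token each (~n_output_tokens token-passes in total), while this loop
# does n_steps forward passes of n_output_tokens positions each
# (~n_steps * n_output_tokens token-passes) -- a constant-factor multiplier.
```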
(But wait, don’t autoregressive LLMs do one forward pass per output token, thus using more FLOPs than text diffusion models if the number of output tokens is big enough? No. Autoregressive LLMs do indeed do one forward pass per output token and thus usually do more forward passes than diffusion models. But they do each forward pass on only one token, whereas text diffusion LLMs do each forward pass on all the output tokens at once. Thus, each forward pass of a text diffusion LLM requires as many FLOPs as all the forward passes of an autoregressive LLM combined. Text diffusion LLMs can be more efficient than autoregressive models in practice because it is usually more efficient on GPUs to do one big operation than many small operations in sequence, even when both require the same number of FLOPs[1]. However, these efficiency improvements only help until inference becomes bottlenecked by FLOPs.)
[1] This last sentence is oversimplified - another thing that matters here is the shapes of the matrices that GPUs multiply. But this is out of the scope of this blogpost.
There are architectures which have constant memory and compute usage per token, but they are not used in practice. Ultimately I expect something like this to work; the human brain is an existence proof.