A few points, none super confident.
- I like the search algorithm parallel, I haven never thought of it that way!
- Since as you said it doesn't reduce KV cache size (unless you do it on CPU), it is somewhat limited how much it can speed up inference because it will not increase batch sizes (see my answers to Alex Gibson's comment for why this is important if you don't already know).
- Unclear whether attention being efficient during training matters much because:
-- Pretraining is afaik done done at context lengths short enough for it not mattering that much that attention is quadratic.
-- Midtraining afaik takes a lot less compute than pretraining so it's probably not that important for it to be compute efficient.
-- You need to do inference when doing RL so more efficient training during RL would only help somewhat.
- Yeah, google seems to be good at efficient attention. Here is a blogpost I liked showing how good they are at long context benchmarks. I don't have takes on whether they made it subquadratic or just made it more efficient.
- Another way to make attention more feasible at long contexts is to just have more VRAM per node. Even if you don't make any architectural improvements, this just gives you more VRAM to put the KV cache in (so you can just have bigger KV caches and bigger batch sizes). Vladimir_Nesov says here that Google's TPUs are particularly good in this respect compared to Nvidia GPUs.
Yes, your model is correct. I wanted to make things as simple as possible when writing the blogpost but probably went too far with this one and ended up just making it confusing / partially innacurate. There are two reasons autoregressive LLM inference is inefficient at long contexts:
- You need to load the whole KV cache from VRAM at every forward pass.
- Since you need to store the whole KV cache in the VRAM for each sequence and KV caches are big, you can only store a small number of KV caches so you can only have small batch sizes. This makes inference inefficient because you have to load the weights from VRAM at every forward pass.
-- Explanation of why big batch sizes are important for making LLM inference efficient (skip if you already know): This is because GPUs have a lot more FLOPs than they have memory bandwidths. So if you multiply batch_size vectors of dimension d_model by a d_model x d_model (or d_model x d_mlp or whatever) matrix and batch size is small, you need O(d_model * d_model + batch_size * d_model) memory reads and O(batch_size * d_model * d_model) FLOPs, so this is bottlenecked by VRAM reads and most compute units just stay idle at small batch sizes, but is bottlenecked by FLOPs at big batch sizes.
I also am somewhat surprised that it's so hard to make attention more efficient.
Two animated show recommendations in the genre you are looking for:
I did a very quick check of whether the Russian GigaChat 3 LLM has pro-Russian bias in the same way Chinese models have pro-CCP bias.
Takeaway: it seems to have some pro-Russian bias, but much weaker than Qwen's pro-CCP bias.
Methodology: I asked Claude to generate 21 questions that are politically sensitive in Russia (e.g. "Who started the war in Ukraine?") and asked them to GigaChat both in English and in Russian. I did the same for Qwen/China. I used Claude to analyze the responses.
Response breakdown:
ai-sage/GigaChat3-702B-A36B-preview (served with vllm): English: 2 refusals, Russian: 3 refusals, 3 pro-Russia bias
Qwen/Qwen3-235B-A22B-Instruct-2507-tput with thinking disabled (Together AI API): English: 1 refusal, 19 pro-CCP Bias, Chinese: 2 refusals, 19 pro-CCP bias
gpt-4o (baseline, Russia-related questions): English: 1 pro-Russia bias, Russian: 3 pro-Russia bias
gpt-4o (baseline, China-related questions): English: 4 pro-CCP bias, Chinese: 9 pro-CCP bias
If you would like to buy Differin gel in a country where it is not over the counter such as the UK, you could buy it on iHerb. It is a US site which ships to other countries, I got some Differin gel from there shipped to the UK and it was less painful than figuring out how to get American Amazon or Walmart to ship abroad. But it is more expensive than the Amazon link OP provided, so I guess this is more of a thing if you want to try some out and see if it works for you with as few effort as possible (although beware that it doesn't work immediately, you will probably have to wait a couple months to start seeing results).
For those who also like cartoons:
I think you copy patsed the wrong link - the first link leads to a form one can use to add an example, not to the list of examples.
H100 hours (or H100-equivalent hours) caught up to some extent and are imo a good unit (imo even better than mol FLOPs or petaflop days)