The closest we can get is a little benchmark where the models are supposed to retrieve “a needle out of a haystack”. Stuff like a big story of 1 million tokens and they are supposed to retrieve a fact from it
This isn't "the closest we can get". Needle-in-a-haystack tests seem like a sensible starting point, but testing long-context utilization in general involves synthesis of information, e.g. looking at a novel or series of novels and answering reading comprehension questions. There are several benchmarks of this sort, e.g.:
https://epoch.ai/benchmarks/fictionlivebench
Interesting post. It was my understanding that Google's counterpart to Kimi Linear and Deepseek Sparse for efficient long-horizon attention was built around the Titans architecture, which is publicly available. Is there something that leads you to believe that this is not the case, i.e. that Google abandoned this route and went with something different?
In any case, it makes me happy to see companies evolving in different directions with their architectures. That period in which everyone spent large sums of money doing more or less the exact same thing in more or less the same way felt very inefficient.
It's a bit of a deepity but also a game-theoretic conclusion that "if DeepMind releases a paper, it is either something groundbreaking or something they will never use in production". The TITANS paper is about a year old now, and the MIRAS paper about 9 months old. You would think that some other frontier lab would have implemented it by now if it worked that well. I suspect a piece is missing here, or maybe the time between a pre-training run and deployment is just way longer than I think it is and all the frontier labs are looking at this.
To my understanding, TITANS requires you to do a backward pass during inference. That is probably a scaling disaster at inference time as well, but maybe less so, since they do say it can be done efficiently and in parallel. It's unclear to me!
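For concreteness, here is a toy sketch of what "a backward pass during inference" means under my reading of the paper. This is my own simplification, not the actual TITANS code, and the class and parameter names are made up: the long-term memory is itself a tiny network, and "writing" to it means taking a gradient step on a reconstruction loss for each new chunk, even at test time.

```python
import torch

# Toy illustration only (not the real TITANS implementation; names and sizes invented).
class ToyNeuralMemory(torch.nn.Module):
    def __init__(self, dim: int, lr: float = 0.01):
        super().__init__()
        self.mem = torch.nn.Linear(dim, dim, bias=False)  # the "memory" is a small network
        self.lr = lr

    @torch.enable_grad()  # autograd is needed even if the caller runs under no_grad
    def write(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # "Surprise" = how badly the current memory maps keys k to values v.
        loss = ((self.mem(k) - v) ** 2).mean()
        (grad,) = torch.autograd.grad(loss, self.mem.weight)
        with torch.no_grad():
            self.mem.weight -= self.lr * grad  # one inner SGD step per chunk of tokens

    def read(self, q: torch.Tensor) -> torch.Tensor:
        return self.mem(q)
```

Whether that inner update loop can actually be fused and parallelized efficiently at production scale is exactly the open question here.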
I mean, you may just be right, and TITANS+MIRAS could be in the latter category. On the other hand, Gemma 3 (which we know does not use TITANS) probably benefits from a lot of RL environments, yet it absolutely sucks at this task. So it is also possible that they are using it in production.
I guess, like all things, we will know for sure once the open Chinese labs start doing it.
Something I see a lot in high-end ML is papers that are very simple conceptually, but very tricky to get working properly. I'd imagine that having the dozen or so guys who know all the hyperparameter tweaks to get public algorithm X to run accounts for a lot of 'secret sauce'. Lots of people were wondering how DALL-E 2 worked, but diffusion for quality image synthesis had been around for a while at the time, for instance.
Also, off the top of my head, I can't think of an instance where a lab had an entire secret algorithm locked up that was the basis for their lead. Feels like it's always been public papers getting used to their full potential.
I guess, like all things, we will know for sure once the open Chinese labs start doing it.
What a timeline, eh?
[Epistemic status - speculative, but sort of grounded, might be wrong - don’t take too seriously]
Alternative title: why Gemini 3 pro gets to be so big
This is a bit more technical than usual, but I want to try my hand at writing technical articles. Thank you to @bycloud on twitter for giving me the idea to write about this; it's really his discovery. He makes excellent videos on the state of LLM progress, and I highly suggest checking him out.
Traditional attention in transformers is O(n^2) in sequence length, because every token attends to every other token. That doesn't scale well.
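To make the O(n^2) concrete, here's the textbook version (a bare-bones sketch, not anyone's production kernel): the score matrix alone is n × n.

```python
import torch

def naive_attention(q, k, v):
    """q, k, v: (seq_len, dim). Plain softmax attention."""
    # The scores matrix is (seq_len, seq_len): at 1 million tokens that is 10^12 entries,
    # which is why both compute and memory blow up quadratically.
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```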
So there's been a big race towards what we call subquadratic attention, or even linear attention. It's a very active area of research, with Kimi spearheading Kimi Linear attention and DeepSeek inventing DeepSeek Sparse Attention (DSA). Both of these (Chinese) labs publish their findings. This is not something OpenAI, Anthropic, or Google does; they would rather keep it a secret. More on that later.
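The "linear" trick, in its simplest textbook form (this is the classic kernel-feature-map version, not what Kimi Linear or DSA actually do), is to reorder the matrix products so the n × n matrix never gets built:

```python
import torch

def linear_attention(q, k, v, eps: float = 1e-6):
    """Non-causal kernel linear attention, in the style of Katharopoulos et al. (2020)."""
    phi = lambda x: torch.nn.functional.elu(x) + 1   # positive feature map
    q, k = phi(q), phi(k)
    # (phi(Q) phi(K)^T) V  ==  phi(Q) (phi(K)^T V): the right-hand grouping
    # only ever builds a (dim, dim) matrix, so cost grows linearly with sequence length.
    kv = k.T @ v                                     # (dim, dim)
    z = q @ k.sum(dim=0, keepdim=True).T             # (seq_len, 1) normalizer
    return (q @ kv) / (z + eps)
```

The catch, and the whole reason this is still a research area, is that squeezing the context into a fixed-size state tends to lose exactly the kind of precise recall the test below measures.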
So let's say we want to test this, right? Test how good your attention really is? The closest we can get is a little benchmark where the models are supposed to retrieve "a needle out of a haystack": stuff like a big story of 1 million tokens, and they are supposed to retrieve a fact from it.
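In case you haven't seen one, a toy version of such a test looks roughly like this. It's only a sketch: `query_model` is a placeholder for whatever API you actually call, and real benchmarks sweep needle depth and context length systematically.

```python
import random

def make_haystack_prompt(needle: str, filler_sentences: list[str], n_filler: int) -> str:
    # Bury one fact at a random position inside a long stretch of unrelated text,
    # then ask the model to retrieve it.
    haystack = [random.choice(filler_sentences) for _ in range(n_filler)]
    haystack.insert(random.randrange(len(haystack) + 1), needle)
    return " ".join(haystack) + "\n\nQuestion: what is the magic number mentioned above?"

# Hypothetical usage; `query_model` is a stand-in for whatever LLM API you use.
# prompt = make_haystack_prompt("The magic number is 7421.", boring_sentences, 50_000)
# answer = query_model(prompt)  # score: does the answer contain "7421"?
```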
So now, let's see how most SOTA models perform, shall we? The X-axis is cost, the Y-axis is success rate.
Reminder: top left is better!
Haha, what the fuck! Look at how good every Google model is! What the fuck are they doing? This is some black fucking magic! I strongly suspect they've sort of cracked good sub-quadratic attention. They're just mogging everyone!1
The exception to this is Kimi Linear, from the Chinese lab that is experimenting with linear attention. But the problem is that they found their linear attention to be notoriously bad at knowledge tasks, which Gemini 3 flash is super good at. Gemini 3 pro and flash perform spectacularly on benchmarks and are insanely cheap compared to the competition. For context, Gemini 3 flash's biggest competitor is likely opus 4.5, which is literally 10 times more expensive.2
Google does not have significantly more compute than their competitors, but I get more free Gemini 3 pro usage than I get paid opus 4.5 usage. This is kind of bizarre: you can subsidise a userbase, but by how much? Inference is super duper expensive. Need I remind you that OpenAI is literally losing money on $200/month subscriptions?
This is extra puzzling when you consider that our best available heuristics point to Gemini 3 pro being an absolutely huge model. So how the fuck is Google serving this to billions of people? For free?
Nothing seems to explain this other than Google having figured out subquadratic attention, or some other paradigm shift at the architectural level.
Except… maybe there is another explanation?
If you ask Google engineers "why is flash 3 so cracked?", they will give you a different answer: it's scaling RL.
And this definitely does play a role. If you look at that graph again, do you see the difference between 2.5 flash and 2.5 flash preview? Do you see how they get a better score for less money? That's very likely the exact same underlying architecture, so basically all of the improvement from 2.5 flash preview to 2.5 flash can be attributed to RL. It doesn't seem entirely impossible that they figured out a way to build really good RL environments and use basically the same attention mechanism that Kimi does.
But again, this does not answer how they got it to be so good at other tasks! Maybe Kimi just sucks at RL, and current linear attention can already do this? It's possible!
And Anthropic especially seems to be really behind here (to be fair, opus isn't on here), scoring horribly on these benchmarks and being notorious for "compacting the conversation" every X tokens.3 Square that with Opus 4.5 being the best model based on vibes for a lot of people, and my argument strengthens.
The full picture is admittedly more complex: a lot of models use some sort of alternating quadratic and subquadratic attention, so it doesn't have to be a single algorithm; it could just be a clever way of combining these.
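To give a feel for what "alternating" means here (this is an illustrative sketch, not any particular lab's recipe; the layer types and the ratio are made up), think of a stack where most layers only look at a local window and every few layers get full attention:

```python
import torch

class HybridAttentionStack(torch.nn.Module):
    """Illustrative only: alternate cheap windowed attention with occasional full attention."""

    def __init__(self, dim: int, n_layers: int, n_heads: int, window: int, global_every: int = 4):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            [torch.nn.MultiheadAttention(dim, n_heads, batch_first=True) for _ in range(n_layers)]
        )
        # Every `global_every`-th layer gets full (quadratic) attention, the rest are windowed.
        self.is_global = [(i + 1) % global_every == 0 for i in range(n_layers)]
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, dim)
        n = x.shape[1]
        idx = torch.arange(n, device=x.device)
        causal = idx[None, :] > idx[:, None]                           # can't see the future
        local = causal | (idx[:, None] - idx[None, :] >= self.window)  # ...or the far past
        for attn, is_global in zip(self.layers, self.is_global):
            mask = causal if is_global else local
            out, _ = attn(x, x, x, attn_mask=mask)
            x = x + out  # residual; real blocks also have norms and MLP layers
        return x
```

(A mask like this still materializes the full n × n matrix; a real implementation uses a kernel that skips the masked-out blocks, which is where the actual savings, and the engineering pain, live.)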
So yeah, Google is clearly cooking something.
Whatever it is, it really saddens me that I will never get to find out. Academic research had its problems with journals, publishing, and ease of access, but this is the downside of frontier research happening in industry. There is knowledge on the inside, knowledge that allows us to build things that more and more resemble some form of general intelligence. And I do not have access to that knowledge, and that really stings. This problem is only getting worse for frontier LLM research. Thank you, China!
So, to my friends who asked me at a party: why has OpenAI seemingly fallen off? Why do researchers literally get billion-dollar contracts? It's because of bullshit like this. If you can find a good subquadratic attention algorithm, or build the right RL environments, you can hold up the entire stock market.
This all ties into why I feel pretty bullish on the short-term future of LLMs:
If you think AI progress is going to stall, hard, then all three of these would have to suddenly stop being true, and I just do not see that happening. Go read Bentham's Bulldog or the much longer Forethought report on the subject if you're interested.
Strap in, the next few years are going to be wild.
Note that this is generally not a benchmark you can just train for and get good at, like you can with math etc… Models kind of intrinsically learn this behavior, and it's mostly up to your architecture whether they are even capable of doing it.
OpenRouter API input credits. (Anthropic is generally more expensive for this, but the point still stands: no one is getting free opus 4.5 usage, while everyone is getting nearly unlimited 3 flash usage.)
Note: maybe this is just because they notice performance degrading when the context window is full of junk. It isn't logically implied, but it is curious to note nonetheless.