There’s a lot of talk about “algorithmic progress” in LLMs, especially in the context of exponentially-improving algorithmic efficiency. For example:
It’s nice to see three independent sources reach almost exactly the same conclusion—halving times of 8 months, 6 months, and 7½ months respectively. Surely a sign that the conclusion is solid!
…Haha, just kidding! I’ll argue that these three bullet points are hiding three totally different stories. The first two bullets are about training efficiency, and I’ll argue that both are deeply misleading (each for a different reason!). The third is about inference efficiency, which I think is right, and mostly explained by distillation of ever-better frontier models into their “mini” cousins.
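(For orientation: a halving time of T months corresponds to an annual improvement factor of 2^(12/T), which is why a “6-month halving time” and “4×/year” are the same claim. A quick sketch of the arithmetic, nothing more:)

```python
import math  # not strictly needed here, but handy if you want logs instead

def annual_factor(halving_time_months: float) -> float:
    """Convert an efficiency-halving time into an 'X times better per year' multiplier."""
    return 2 ** (12 / halving_time_months)

# The three halving times quoted above:
for months in (8, 6, 7.5):
    print(f"{months}-month halving time -> ~{annual_factor(months):.1f}x per year")

# 8-month halving time -> ~2.8x per year
# 6-month halving time -> ~4.0x per year
# 7.5-month halving time -> ~3.0x per year
```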
Tl;dr / outline
Status of this post
I wrote this very quickly, on a topic where I am not remotely an expert. I’m hoping for feedback and opinions!
1. The big picture of LLM algorithmic progress, as I understand it right now
1.1 Stereotypical “algorithmic efficiency improvements”: there’s the Transformer itself, and … well, actually, not much else to speak of
(This part is heavily reliant on @Hans Gundlach et al. 2025b “On the origin of algorithmic progress in AI”. So I hope it’s right, and that I’m understanding it properly! Please let me know if you have reasons to doubt either of those.)
1.2 “Optimizations”: Let’s say up to 20×, but there’s a ceiling
This category is stuff that wouldn’t show up in a FLOP metric, but it’s equally important for cost. It includes everything specific to a particular hardware configuration and training setup—details about quantization, parallelization, FlashAttention, KV cache, other CUDA black magic, and so on, along with system-level optimizations like speculative decoding.
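To make this bucket concrete, here’s a toy sketch of one item on that list, the KV cache: during autoregressive decoding you store each token’s keys and values, so generating the next token only requires attending over cached vectors rather than re-encoding the whole prefix from scratch. (Single-head, numpy, invented purely for illustration; not anyone’s production code.)

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class SingleHeadDecoder:
    """Toy single-head attention decoder with a KV cache (illustration only)."""
    def __init__(self, d_model, rng):
        scale = 1 / np.sqrt(d_model)
        self.Wq = rng.standard_normal((d_model, d_model)) * scale
        self.Wk = rng.standard_normal((d_model, d_model)) * scale
        self.Wv = rng.standard_normal((d_model, d_model)) * scale
        self.k_cache = []  # one key vector per previously seen token
        self.v_cache = []  # one value vector per previously seen token

    def step(self, x):
        """Take one new token embedding x and attend over everything seen so far."""
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        self.k_cache.append(k)            # O(1) cache extension per new token...
        self.v_cache.append(v)            # ...instead of re-encoding the whole prefix
        K = np.stack(self.k_cache)        # (seq_len_so_far, d_model)
        V = np.stack(self.v_cache)
        scores = K @ q / np.sqrt(len(q))  # new token's attention over the prefix
        return softmax(scores) @ V        # attention output for the new token

rng = np.random.default_rng(0)
decoder = SingleHeadDecoder(d_model=16, rng=rng)
for _ in range(5):                        # decode 5 tokens, one step each
    out = decoder.step(rng.standard_normal(16))
print(out.shape)                          # (16,)
```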
It seems very plausible to me that, after an initial rushed setup of a new configuration, many further months of configuration-specific optimization will yield another factor of, I dunno, up to 20×. For example, the original FlashAttention alone apparently sped up some training setups by 3×.
What I don’t believe is that “optimizations” can contribute to an ever-growing exponential, where it’s 10× after two years, 100× after four years, 1000× after six years, etc. These kinds of optimizations have a ceiling, where you’re doing the best you can with the training approach and hardware configuration you’ve got.
1.3 Data-related improvements
As discussed in “Most Algorithmic Progress is Data Progress” by Beren Millidge (@beren), a lot of LLM improvement has come from more and/or better training data,[4] including:
What are the impacts of these data-related improvements? Are they relevant to those exponentials at the top? My current take is:
1.4 Algorithmic changes that are not really quantifiable as “efficiency”
If we put aside the “3×/year” and related quotes at the top, and take a broader view of what LLM algorithmic progress can look like, then of course we find many more items. These include:
2. Explaining away the two training-efficiency exponential claims
At the top, I cited Epoch AI and Dario Amodei as claiming that algorithmic improvements constitute a rapid exponential that’s been going on for years. I don’t currently believe either of them. Here’s why.
2.1 The Epoch “8-month halving time” claim is a weird artifact of their methodology
(The Epoch AI claim in question is at blog, paper, and my response here is entirely based on Gundlach et al. 2025b.)
Some algorithmic changes matter more and more as the model scale gets bigger. Specifically, there were two such changes: the switch from LSTMs to Transformers, and Chinchilla-optimal training.
For example, let’s suppose that the Transformer is N× more efficient than LSTMs with 2018-scale LLMs, and 10N× more efficient than LSTMs with 2025-scale LLMs.
Now let’s put aside everything else, and imagine a world where we switch from LSTMs to Transformers in 2018, and then scale up the transformers from 2018 to 2025 with no additional algorithmic change at all. In the funny Epoch methodology, they would say that we got an “extra” 10× algorithmic improvement (50%/year!) in the 2018-2025 period, because we’re kinda milking ever more advantage from the one-time LSTM-to-Transformer switch.
OK, but … that’s super confusing! Right? By assumption, the algorithms weren’t actually getting better during that 2018-2025 period, at all!
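Here’s a toy simulation of that hypothetical, with numbers I invented (not Epoch’s actual data), just to show how a “compute to reach fixed performance” methodology books a scale-dependent advantage as year-after-year algorithmic progress even when the algorithm is frozen:

```python
# Invented numbers mirroring the hypothetical above: the Transformer's efficiency
# advantage over an LSTM baseline is N× at 2018 scale and grows smoothly to 10N×
# at 2025 scale (take N = 3 for concreteness; the exact value doesn't matter).
N = 3.0

for year in range(2018, 2026):
    progress = (year - 2018) / (2025 - 2018)  # how far along the 2018->2025 scale-up we are
    advantage = N * 10 ** progress            # N× in 2018, growing to 10N× in 2025
    print(f"{year}: measured efficiency vs. LSTM baseline ~ {advantage:.0f}x")

# No algorithmic change happens after 2018, yet the measured advantage grows by
# another 10x over the period -- which this methodology records as several more
# years of steady "algorithmic progress".
```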
Anyway, the important point is: the actual Epoch analysis seems to be fully compatible with the claims I made in §1.1 above.
2.2 The Dario “4×/year” claim is, I think, just confused
I quoted Dario Amodei 2025 at the top. Here’s a longer version of that quote:
At first, I found this quote baffling:
So what the heck is Dario talking about??
…But I think I got it now. So here’s my current take—the only way I can get everything to fit together and make sense:
So overall, I think Dario is somewhat confused, and giving a misleading (or at least very confusing) description of the situation.
[Boy, I sure feel weird about lecturing Dario Amodei on the big picture of LLM training! He knows more about LLM training than almost anyone on Earth, and I have (checks notes) no LLM training experience whatsoever. So if anyone has a different proposal for how to make sense of Dario’s quote above, I’m all ears!]
3. Sanity-check: nanochat
There were seven years between GPT-2 (trained in early 2019) and the new nanochat, which matches GPT-2 performance on the “CORE” metric (“a diverse set of reasoning and knowledge tasks from the DCLM benchmark suite”).
Remarkably, Karpathy says here that nanochat training costs $73 (“3 hours on a single 8XH100 node”), whereas “GPT-2 was trained by OpenAI on 32 TPU v3 chips for 168 hours (7 days), with $8/hour/TPUv3 back then, for a total cost of approx. $43K”.
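(A quick check of that arithmetic, using Karpathy’s dollar figures:)

```python
import math

gpt2_cost = 32 * 168 * 8          # 32 TPUv3 chips × 168 hours × $8/hour ≈ $43K
nanochat_cost = 73                # 3 hours on one 8XH100 node, per Karpathy
ratio = gpt2_cost / nanochat_cost
halving_time_months = 7 * 12 * math.log(2) / math.log(ratio)

print(f"cost ratio ~{ratio:.0f}x over 7 years, halving time ~{halving_time_months:.0f} months")
# cost ratio ~589x over 7 years, halving time ~9 months
```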
So that would be a factor of 600 in 7 years, i.e. a halving time of 9 months. Is that consistent with my story in §1? I think so! For example, it might be something like:
For that last one: GPT-2 used “webtext”, which was generated by scraping URLs linked from Reddit. By contrast, nanochat trains on “fineweb-EDU”, a dataset of educational materials crafted and curated with incomparably more effort and care. Remember, we’re comparing nanochat to GPT-2 on “reasoning and knowledge tasks”, not perplexity; I would be shocked if this better data was not playing a major role.
So anyway, my take in §1 seems at least plausibly consistent with the nanochat thing, AFAICT at a glance. To be clear, I didn’t check things in detail or scrutinize it very hard. If anyone wants to really check, you could just download nanochat and have at it!
4. Optional bonus section: why does this matter?
Well, if you’re like 99% of the people reading this post, this topic matters to you because you’re trying to forecast future LLM progress. But that’s not my interest, so I won’t talk about it. I’ll leave that to others!
I’m actually interested in a rather different debate, related to arguments about takeoff speeds for a hypothetical future AI paradigm—see Foom & Doom 1: “Brain in a box in a basement”, e.g. comment on that post by @ryan_greenblatt. Here’s the debate:
One school of thought (that I vaguely associate with Paul Christiano[5]) says: When people are trying to do something in ML, they very rapidly get to near the ceiling of how efficiently they can do that thing, given the available data and hardware situation (but perhaps leaving aside paradigm shifts, which are not about doing the same thing more efficiently, but rather about trying to do something different instead).
A different school of thought says: No, that’s wrong, instead when people are trying to do something in ML, there will be a very large collection of algorithmic improvements that could make it work more efficiently, and these improvements will take many thousands of person-years to discover, and they will collectively amount to orders of magnitude of efficiency difference.
I’m generally in the first school of thought, which of course goes along with my belief that a future AI paradigm shift could lead to a remarkably sudden emergence of AGI and ASI.
…OK, if that’s the debate, then what lesson do we take away from this LLM case-study? My answer: If I’m right (a big “if”!), then the history of LLMs seems to mostly support the first school of thought.
To be clear, I don’t think this kind of analogizing is terribly strong evidence either way; and note also that there are other case-studies, like this ImageNet analysis, that might or might not paint a different picture (I didn’t check).
In fact, there are two key disanalogies between LLMs and the future AI paradigm I’m expecting (see Foom & Doom 1), and they make my case for next-paradigm sudden takeoff even stronger: the future paradigm I’m expecting would (1) not rely on training data for its capabilities (unlike LLMs), making §1.3 basically moot; and (2) require very little compute to get from random initialization to AGI (if efficiently implemented), which would allow for much more rapid iteration and testing than we’re used to from LLMs.
Thanks Hans Gundlach, Seth Herd, plex, algon, and ishaan for critical comments on earlier drafts.
Gundlach links to this paper on tokenizers, and describes it as a claim that “inefficient tokenization” can cause up to a 68% performance hit. But I think that 68% number comes from comparing best practices to the extraordinarily dumb idea of building a tokenizer using English-only text and then running inference on a multilingual corpus. As for real tokenizer improvements, I think everyone’s been using BPE since before the Transformer, and different flavors of BPE seem quite similar, if I’m reading the paper right. As an example, this page benchmarks the tokenizer of the recently-released nanochat (§3) against the tokenizer used by GPT-2 in 2019, and finds a 0-15% compression difference, depending on the data type. The difference probably comes from using better training data to set up the tokenizer.
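(If you want to run this kind of comparison yourself, the usual metric is bytes of raw text per token on some sample corpus. The nanochat tokenizer isn’t bundled with tiktoken, so here’s a stand-in sketch comparing GPT-2’s tokenizer against a newer OpenAI encoding, on placeholder text I made up:)

```python
import tiktoken

sample = (
    "The quick brown fox jumps over the lazy dog. "
    "Training data quality matters more than most people think. "
) * 100  # placeholder corpus; swap in whatever data type you want to benchmark

for name in ("gpt2", "cl100k_base"):   # GPT-2's 2019 tokenizer vs. a newer encoding
    enc = tiktoken.get_encoding(name)
    n_tokens = len(enc.encode(sample))
    print(f"{name}: {len(sample.encode('utf-8')) / n_tokens:.2f} bytes per token")
```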
FYI: They note that if you revert one of these things at a time, it has a bigger deleterious impact than if you revert the whole package at once. In other words, the “Retro Transformer” and the “Modern Transformer” were each a package of components that worked particularly well together.
For example, Karpathy here mentions the “muon optimizer … residual pathways and skip connections gated by learnable scalars, and value embeddings”, none of which seem to be studied by Gundlach et al. 2025b, unless I missed it.
This of course fits perfectly with my belief that we should think of LLMs as getting their impressive powers almost entirely via imitation learning from their training corpus.
For example, Paul Christiano 2018: “A more precise version [of] my claim: if you gave smart grad students from 1990 access to all of the non-AI technology of 2017 (esp. software tools + hardware + data) and a big budget, it would not take them long to reach nearly state of the art performance on supervised learning and RL. For example, I think it's pretty plausible that 20 good grad students could do it in 3 years if they were motivated and reasonably well managed.” (Actually, Paul is much further in the direction of the first school of thought than I am, because I defined it with a carve-out for possible hard-to-discover paradigm shifts, and he’s not even conceding that.)