I originally wrote this just for myself but thought I'd share it in case others find it useful. Sharing rough notes in the spirit of not letting perfect be the enemy of the good. This was written in early May 2025.
In this post I collect evidence and forecasts regarding recent and future trends in software and compute progress. I also make my own forecasts which inform an updated version of AI 2027’s timelines forecast (more to come on these and further updates). These forecasts are rough and uncertain.
I especially build on this post.
Evidence and forecasts regarding the recent rate of software progress

| Evidence | Estimate | Caveats, reasoning, etc. |
| --- | --- | --- |
| Compute efficiency increase at fixed performance | | |
| Epoch paper: analyzing a dataset of LLM performance on Wikitext and Penn Treebank (source) | 2.5-2.8x/year. Average efficiency increase reported in the paper (the compute needed to reach the same performance halves every 8-9 months). | |
| Epoch's analysis of trends in inference cost to achieve a fixed performance level (source). The median rate of cost decrease is 40x/year, with trends lasting 0.5-3 years. The trends after Jan 2024 seem faster (but I think this might be because trends are generally faster right after a benchmark's release). | <<1,600x/year. If cost were proportional to parameter count and all models were Chinchilla-trained, this would imply a compute efficiency increase of 40*40 = 1,600x/year. But it's actually far less; see the next column for why. | This is obviously too large but still interesting. Most of it comes from (a) distillation, which we should disallow[1], (b) a trend toward increased overtraining of smaller models, and (c) overfitting to benchmarks; in particular, some or all of the benchmarks might be narrower than we'd like. However, even for MMLU there's a 10x/year trend lasting 3 years, which is interesting. |
| GPT-4 vs. DeepSeek-V3 / R1 on MMLU etc. (source: Ryan comment, Epoch data) | >2.8x/year. GPT-4 was released in Mar 2023 with 2e25 training compute; DeepSeek-V3 was released in Dec 2024 with 3.4e24 training compute. DeepSeek-V3 is significantly better at MMLU and other benchmarks. This corresponds to a >2.8x/year decrease (arithmetic sketched below the table). | Peter claims this isn't a good comparison because GPT-4 wasn't trained compute-optimally. It's also possible that DeepSeek-V3 is distilled. |
| Chinchilla vs. Qwen on MMLU (sources: Epoch models database, Chinchilla paper) | ~2.8x/year. Mar 2022: Chinchilla has 5.76e23 training compute and got 67.6% on 5-shot MMLU. Sep 2023: Qwen-14B gets 66.3% on 5-shot MMLU with 2.5e23 training compute. | The lowest training compute estimate in the Epoch database is 1e23, which is pretty rough for the exercise of looking for efficiency improvements over Chinchilla. Also, MMLU is the least dated benchmark used in the Chinchilla paper. Given the lack of models with such little training compute and the old benchmarks, unfortunately I could only find one model-and-benchmark combination that worked. Some models that I unsuccessfully tried to use are in a footnote.[2] |
| Comparing Qwen models against each other (source: calculations at Qwen models algorithmic progress) | >10x/year for math and coding; 3-10x/year for MMLU-Pro and GPQA | See the "Efficiency" section of Qwen models algorithmic progress; I then converted the 7-month efficiency increase to a yearly rate. It's possible that distillation is inflating these results, but I couldn't find any mention of distillation in the Qwen 1-2.5 blog posts or papers (it was mentioned for Qwen 3, which I didn't use in these calculations). |
| Epoch's MATH Level 5 data (source) | ~30x/year. Nov 2023, GPT-4: 40%, 2.1e25. Jun 2024, Qwen2-72B: 39%, 3e24. (Arithmetic sketched below the table.) | This seemed like the most sensible comparison I could find. One could collect a bunch more, but a lot of the recent ones would quickly get blown out by R1, which currently ranks 1st despite a large compute deficit (of course that is itself some interesting evidence). |
| Something on an agentic benchmark (ideally agentic SWE) | N/A as of yet; unfortunately I can't find good data. I suspect the trend would be at least a bit faster than on MMLU because it would better take into account advances in post-training. | IMO the benchmarks measuring the most relevant skills at the moment are (off the top of my head) METR's HCAST and RE-Bench, SWE-Lancer, and SWE-Bench. Of these, SWE-Bench has by far the most submissions we can look at.[3] But it still seems too hard to find data on efficiency increases. |
| Capability improvements at fixed compute levels, measured in compute-equivalent gains (CEG), i.e. an improvement equivalent to scaling compute by X | | |
| Improving at the same compute level within the Qwen model family (source: EL data collected at Qwen models algorithmic progress) | Data points range from ~1-10x/year, with those over longer time periods giving higher results. See the "CEGs" section of Qwen models algorithmic progress for the data points collected. | My best-guess interpretation of these results is ~4x/year (with high uncertainty, given the very low sample size). Qwen 1/1.5/2/2.5/3 is by far the best model family I could find for this exercise, because there are lots of data points and they generally look close to SOTA compute efficiency upon release. It's possible that distillation is inflating these results, but I couldn't find any mention of distillation in the Qwen 1-2.5 blog posts or papers (it was mentioned for Qwen 3, which I didn't use). Model families I didn't look into are in a footnote.[4] |
| PTEs: Looking at the various post-training enhancements between 2020-2024 and estimating how they might stack (source) | ~10x/year | But with very serious limitations, which are discussed in the paper (including limitations of CEG as a metric in general). I think we should take this very lightly, but we should keep in mind that PTEs should be accounted for to the extent they aren't in other estimates. |
| Fancy methods | | |
| Epoch paper: Shapley value estimation techniques on a dataset of LLM performance on Wikitext and Penn Treebank (source) | 5-40% software progress share, decreasing to more like 5-15% since the Transformer | Anson Ho (first author) says that the decrease after the Transformer isn't robust and wouldn't put too much stock in it. He also says it's not really only efficiency-at-fixed-performance style, but it is closer to that than to taking into account all software improvements. |
| Overall estimates | | |
| Epoch website, "Contribution of algorithmic innovation" (source) | 35% software progress share | Anson says this is likely higher than the Shapley paper due to using a slower compute scaling reference point. |
| Anson Ho (first author on the Epoch paper) informal estimate | ~50% software progress share | Anson says this is higher than the Epoch numbers due to post-training enhancements (this is my understanding of the main reason for adjusting; I'm not 100% sure). |
| Ryan Greenblatt | 4.5x/year (a bit >50% progress share, as he thinks recent compute scaling has been lower) | "though I'd be sympathetic to substantially higher estimates"; "loss doesn't accurately capture downstream performance"; "I'd also guess that the pure loss algorithmic progress over the last 2-3 years is somewhat lower than 3x"; "More minimally, I'd say that most progress in the last 2 years is down to algorithmic progress rather than due to scaling up the overall amount of compute used for training." |
| My (Eli's) overall best guess estimate | ~6.3x/year (i.e. a 55% progress share, given 4.5x/year from compute) | To sum up the above: mashing these together gives me ~4x, with substantial uncertainty. I'm particularly uncertain about how to weight general vs. math/coding improvements; I'm mostly paying attention to general improvements here, with a little weight on math/coding, especially given that the timelines supplement focuses on coding. A final consideration is that none of the above accounts for algorithmic progress which has increasingly bigger multipliers at higher compute scales (like LSTMs->Transformers (paper)). Perhaps this increases the rate of progress by 33% in log terms, from 4x to 6.3x.[5] I'm highly uncertain about how big this adjustment should be.[6] Both Anson and Ryan gave overall estimates of a ~50% software progress share, so ~4.5x/year, which is pretty similar to where I ended up. (The arithmetic for this estimate is sketched below the table.) |
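Here is a minimal sketch (in Python) of how I reconstruct the annualization arithmetic behind several rows above, assuming rates compound geometrically over the elapsed time; the input numbers are the ones quoted in the table, and the `annualized` helper is just my own convenience function for that conversion.

```python
import math

def annualized(ratio, months):
    """Convert a total efficiency gain achieved over `months` into a per-year rate."""
    return ratio ** (12 / months)

# Halving time of 8-9 months (Epoch Wikitext/Penn Treebank row) -> yearly rate
print(annualized(2, 9), annualized(2, 8))     # ~2.5x and ~2.8x/year

# Inference-cost row: if cost were proportional to parameter count and models were
# Chinchilla-trained (training compute scaling with params^2), a 40x/year cost drop
# would naively imply a 40^2 compute efficiency gain.
print(40 ** 2)                                # 1,600x/year (clearly an overestimate)

# GPT-4 (Mar 2023, ~2e25 FLOP) vs. DeepSeek-V3 (Dec 2024, ~3.4e24 FLOP), ~21 months apart
print(annualized(2e25 / 3.4e24, 21))          # ~2.75x/year; the table's >2.8x reflects that V3 also scores higher

# MATH Level 5: GPT-4 (Nov 2023, 2.1e25 FLOP) vs. Qwen2-72B (Jun 2024, 3e24 FLOP), ~7 months apart
print(annualized(2.1e25 / 3e24, 7))           # ~28x/year, i.e. roughly the ~30x quoted

# Overall estimate: ~4x/year boosted by a third in log terms (footnote 5: 4 * cube_root(4))
software = 4 * 4 ** (1 / 3)
print(software)                               # ~6.3x/year
compute = 4.5
print(math.log(software) / (math.log(software) + math.log(compute)))  # ~0.55 software share
```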
Evidence and forecasts regarding the future rate of software progress

| Evidence | Estimate | Caveats, reasoning, etc. |
| --- | --- | --- |
| Trends in recent software progress | | |
| Epoch paper: Shapley value estimation techniques on a dataset of LLM performance on Wikitext and Penn Treebank (source) | The share of software progress appears to be decreasing since the Transformer. | Anson (first author) says that the decrease after the Transformer isn't robust and wouldn't put too much stock in it. |
| Various other efficiency trends described above in the table on the recent rate of software progress | Efficiency trends look relatively straight overall, as far as I can tell. | |
| Labor growth | | |
| Recent quality-adjusted growth in frontier AI researchers, Eli's rough guess | 50%/year | This is just a quite rough guess based on how fast frontier AI companies have been growing, adjusted for differences in quality. I believe OpenAI and Anthropic have been growing faster than this, but a lot of the expansion has been product, sales, etc. rather than capabilities researchers. |
| Growth in frontier AI researchers starting in ~2029, after hitting investment limits | 10%/year (~4x slower; see the note below the table) | Rough guess informed by Claude's estimate of the growth rate of mid-size and large companies. |
| Overall estimates and implications | | |
| Decrease in the rate of human-driven algorithmic progress | Starting in 2028, I'll decrease growth from 50%/year to 10%/year over the course of a few years. | At current labor growth rates, the trend looks pretty stable. |
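A quick check on the "~4x slower" figure, assuming (my reading) that it compares growth rates in log terms rather than the simple ratio 50/10:

```python
import math

# 10%/year headcount growth vs. 50%/year, compared as log growth rates
print(math.log(1.5) / math.log(1.1))   # ~4.3, i.e. roughly 4x slower compounding
```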
Evidence and forecasts regarding the recent compute trend

| Evidence | Estimate | EL notes |
| --- | --- | --- |
| Compute trend measurements | | |
| Epoch data (source; source with data and justification) | 4.7x/year on the Epoch frontpage; 4.1x/year for "notable models"; 4.2x/year for frontier models; 5x/year for frontier LLMs; 4.9x/year for GDM and 5.3x/year for OpenAI | 5x/year for frontier LLMs seems like the most relevant estimate |
| AI 2027 compute forecast, GPT-4 to May 2025 | 4x/year. Between Sep 2022 (GPT-4) and May 2025 the best internal model goes from ~2e25 to ~8e26 FLOP, which gives 4x/year (the 2.67th root of 40; calculation sketched below the table). | Interesting that this is lower than the Epoch data |
| Vladimir post, 2022-2025 | 3.55x/year | |
| Overall estimates | | |
| Ryan Greenblatt's estimate | <4.5x/year | I thought he endorsed 4.5x/year in the post, but in comments on this document he said he thinks it's been slower. |
| My (Eli's) overall best guess estimate | 4.5x/year | In between the Epoch frontier-LLM trend and the AI 2027 forecast; lines up with Ryan's estimate |
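A minimal sketch of the AI 2027 row's annualization (the "2.67th root of 40"), using the FLOP figures quoted above:

```python
# Best internal model: ~2e25 FLOP around Sep 2022 (GPT-4) -> ~8e26 FLOP by May 2025 (~2.67 years)
ratio = 8e26 / 2e25              # = 40
print(ratio ** (1 / 2.67))       # ~4.0x/year
```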
Evidence and forecasts regarding future compute trends

| Evidence | Estimate | Caveats, reasoning, etc. |
| --- | --- | --- |
| Considerations | | |
| While Grok 3 is basically on the 4.5x/year trend, other AI companies seemingly haven't released an AI with much more compute than GPT-4 (Ryan post) | Lower than recent trends | |
| Training run duration will stop increasing (Ryan post) | Lower than recent trends | |
| More compute will move into RL, which is harder to scale up (Ryan post) | Lower than recent trends | |
| Overall estimates | | |
| Ryan Greenblatt's estimate for the trend going forward | 3.5x/year (or possibly much lower due to pretraining hitting a wall and RL being hard to scale) | |
| Vladimir post, 2025-2028 | 3.55x/year | |
| AI 2027 compute forecast, 2025 to 2027, as reported for the leading AI company's compute | 3.4x/year | |
| AI 2027 compute forecast, scaling pace from 2025 to 2027 based on internal training FLOP numbers | 5x/year. Best internal model as of May 2025: ~8e26 FLOP. Best internal model as of May 2027: ~2e28 FLOP. (Arithmetic sketched below the table.) | This is higher than the row above due to the training run being 2x longer. |
| Pace of compute increase after ~2028, as forecasted by Vladimir + Romeo | 1.55x/year | This is ~3.4x slower than recent trends (4.5x/year); see the sketch below the table. |
| My (Eli's) overall estimate, 2025-2027 | 4.5x/year | The AI 2027 internal FLOP numbers seem like the most important data point, but various others are lower, which especially makes sense if we don't see AGI very soon. Also, the AI 2027 numbers rely on training time increasing, and it's unclear whether that assumption will hold. |
| My (Eli's) overall estimate, 2028 and onward | Decreases from 4.5x/year to 1.55x/year over the course of 2028 to 2031 | |
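A minimal sketch of the arithmetic in the AI 2027 internal-FLOP row and the post-2028 row above; the "~3.4x slower" comparison assumes (my reading) that the growth rates are compared in log terms:

```python
import math

# AI 2027 internal training FLOP: ~8e26 (May 2025) -> ~2e28 (May 2027), over 2 years
print((2e28 / 8e26) ** (1 / 2))        # = 5.0x/year

# 1.55x/year vs. the recent 4.5x/year trend, compared as log growth rates
print(math.log(4.5) / math.log(1.55))  # ~3.4, i.e. ~3.4x slower
```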
Thanks to Ryan Greenblatt and Anson Ho for comments, and Peter Johnson for the nudge to spend more time on this.
[1] We're aiming to forecast the contributions of compute + algorithms to frontier models, so distillation seems like it shouldn't be allowed. It seems like Epoch agrees, because they exclude distillation.

[2] Mar 2023: GPT-3.5 gets 70% few-shot (Epoch estimates 2.58e24 training compute). May 2024: Gemini 1.5 Flash gets 79% (5-shot); Epoch doesn't have a compute estimate. Oct 2024: Gemini 1.5 Flash 8B gets 75%; Epoch doesn't have a compute estimate.

[3] HCAST has a decent number of submissions as well, but not from non-frontier models, which we need in order to look at efficiency improvements.

[5] 4 * cube_root(4) ≈ 6.3.

[6] Epoch's algorithmic progress paper contains some weak evidence on this in footnote 8 and appendix B.