Recent and forecasted rates of software and hardware progress

by elifland · 26th Jun 2025 · AI Alignment Forum · 9 min read
I originally wrote this just for myself but thought I'd share it in case others find it useful. Sharing rough notes in the spirit of not letting perfect be the enemy of the good. This was written in early May 2025.

In this post I collect evidence and forecasts regarding recent and future trends in software and compute progress. I also make my own forecasts which inform an updated version of AI 2027’s timelines forecast (more to come on these and further updates). These forecasts are rough and uncertain.

I especially build on this post.

Evidence and forecasts regarding the recent rate of software progress (and share vs. compute)

Compute efficiency increase at fixed performance

| Evidence | Estimate | Caveats, reasoning, etc. |
| --- | --- | --- |
| Epoch paper: Analyzing a dataset of LLM performance on Wikitext and Penn Treebank (source) | 2.5-2.8x/year | Average efficiency increase reported in the paper (the compute needed to reach the same performance halves every 8-9 months). |
| Epoch's analysis of trends in inference cost to achieve a fixed performance level (source). The median rate of cost decrease is 40x/year, with trends lasting 0.5-3 years. Trends after Jan 2024 seem faster (but I think this might be because trends are generally faster right after a benchmark's release). | <<1,600x/year. If cost were proportional to parameter count and all models were Chinchilla-trained, this would imply a compute efficiency increase of 40*40 = 1,600x/year. But it's actually far less; see the next column for why. | This is obviously too large but still interesting. Most of it comes from (a) distillation, which we should disallow,[1] (b) a trend toward increased overtraining of smaller models, and (c) overfitting to benchmarks; in particular, some or all of the benchmarks might be narrower than we'd like. However, even for MMLU there's a 10x/year trend lasting 3 years, which is interesting. |
| GPT-4 vs. DeepSeek-V3 / R1 on MMLU etc. (source: Ryan comment, Epoch data) | >2.8x/year | GPT-4 was released in Mar 2023 with 2e25 training compute; DeepSeek-V3 was released in Dec 2024 with 3.4e24 training compute. DeepSeek-V3 is significantly better at MMLU and other benchmarks, which corresponds to a >2.8x/year decrease. Peter claims this isn't a good comparison because GPT-4 wasn't trained compute-optimally. It's also possible that DeepSeek-V3 is distilled. |
| Chinchilla vs. Qwen on MMLU (sources: Epoch models database, Chinchilla paper) | ~2.8x/year | Mar 2022: Chinchilla has 5.76e23 training compute and got 67.6% on 5-shot MMLU. Sep 2023: Qwen-14B gets 66.3% on 5-shot MMLU with 2.5e23 training compute. The lowest training compute estimate in the Epoch database is 1e23, which is pretty rough for the exercise of looking for efficiency improvements over Chinchilla; also, MMLU is the least dated benchmark used in the Chinchilla paper. Given the lack of models with so little training compute, and the old benchmarks, I could unfortunately only find one model-benchmark combination that worked. Models I unsuccessfully tried to use are in a footnote.[2] |
| Comparing Qwen models against each other (source: calculations at Qwen models algorithmic progress) | >10x/year for math and coding; 3-10x/year for MMLU-Pro and GPQA | See the "Efficiency" section of Qwen models algorithmic progress; I converted the 7-month efficiency increases to yearly rates. It's possible that distillation is inflating these results, but I couldn't find any mention of distillation in the Qwen 1-2.5 blog posts or papers (it was mentioned for Qwen 3, which I didn't use in these calculations). |
| Epoch's MATH Level 5 data (source) | ~30x/year | Nov 2023 GPT-4: 40%, 2.1e25. Jun 2024 Qwen2-72B: 39%, 3e24. This seemed like the most sensible comparison I could find; one could collect a bunch more, but a lot of the recent ones would quickly get blown out by R1, which currently ranks 1st despite a large compute deficit (of course, that is itself some interesting evidence). |
| Something on an agentic benchmark (ideally agentic SWE) | N/A as of yet; I unfortunately can't find good data. I suspect the trend would be at least a bit faster than on MMLU because it would better take into account advances in post-training. | IMO the benchmarks measuring the most relevant skills at the moment are (off the top of my head) METR's HCAST and RE-Bench, SWE-Lancer, and SWE-Bench. Of these, SWE-Bench has by far the most submissions we can look at.[3] But it still seems too hard to find data on efficiency increases. |
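Several rows above annualize an efficiency gain from a pair of (release date, training compute) points at roughly matched benchmark scores. Here's a minimal sketch of that arithmetic; the exact day-level dates are my own approximations of the releases named above:

```python
from datetime import date

def efficiency_per_year(c_old: float, c_new: float, d_old: date, d_new: date) -> float:
    """Annualized compute-efficiency multiplier implied by two models reaching
    roughly the same benchmark score with different training compute."""
    years = (d_new - d_old).days / 365.25
    return (c_old / c_new) ** (1 / years)

# GPT-4 (Mar 2023, 2e25 FLOP) vs. DeepSeek-V3 (Dec 2024, 3.4e24 FLOP):
# about 2.7x/year before crediting DeepSeek-V3's higher scores (hence ">2.8x/year" above).
print(efficiency_per_year(2e25, 3.4e24, date(2023, 3, 14), date(2024, 12, 26)))

# GPT-4 (Nov 2023, 2.1e25) vs. Qwen2-72B (Jun 2024, 3e24) on MATH Level 5:
# about 28x/year, i.e. the ~30x/year row above.
print(efficiency_per_year(2.1e25, 3e24, date(2023, 11, 6), date(2024, 6, 6)))

# The <<1,600x/year row: under Chinchilla scaling, training compute is ~6*N*D with
# D proportional to N, so compute scales as N^2. If inference cost is proportional
# to N, a 40x/year cost decline naively implies a 40**2 = 1,600x/year compute
# efficiency gain; the table explains why the true number is far smaller.
print(40**2)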

Capability improvements at fixed compute levels, measured in compute-equivalent gains (CEG), i.e. an improvement equal to scaling compute by X

| Evidence | Estimate | Caveats, reasoning, etc. |
| --- | --- | --- |
| Improving at the same compute level within the Qwen model family (source: EL data collected at Qwen models algorithmic progress) | Data points range from ~1-10x/year, with those over longer time periods giving higher results. The data points collected are: (1) 7 months to get almost a CEG of 4 (i.e. equivalent to a 4x scaleup); (2) two correlated instances of seeing little improvement over 3 months; (3) 7 months plus 1/4 of a 2x scaleup gets significantly more than a CEG of 2. My best-guess interpretation of these results is ~4x/year (with high uncertainty, given the very low sample size). | See more at the "CEGs" section of Qwen models algorithmic progress. Qwen 1/1.5/2/2.5/3 is by far the best model family I could find for this exercise, because there are lots of data points and the models generally look close to SOTA compute efficiency upon release. It's possible that distillation is inflating these results, but I couldn't find any mention of distillation in the Qwen 1-2.5 blog posts or papers (it was mentioned for Qwen 3, which I didn't use). Model families I didn't look into are in a footnote.[4] |
| PTEs: Looking at the various post-training enhancements between 2020-2024 and estimating how they might stack (source) | ~10x/year | But with very serious limitations, which are discussed in the paper (including limitations of CEG as a metric in general). I think we should take this very lightly, but keep in mind that PTEs should be accounted for to the extent they aren't in the other estimates. |
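Converting the per-period CEGs above to yearly rates uses the same exponent trick as before; a quick sketch (the specific inputs are my illustrative readings of the data points above):

```python
def annualized_ceg(ceg: float, months: float) -> float:
    """Convert a compute-equivalent gain observed over `months` into a yearly rate."""
    return ceg ** (12 / months)

print(annualized_ceg(4, 7))  # ~10.7x/year: a CEG of 4 in 7 months (the high end above)
print(annualized_ceg(2, 7))  # ~3.3x/year: a CEG of 2 in 7 months (near my ~4x best guess)
```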

Fancy methods

| Evidence | Estimate | Caveats, reasoning, etc. |
| --- | --- | --- |
| Epoch paper: Shapley value estimation techniques on a dataset of LLM performance on Wikitext and Penn Treebank (source) | 5-40% software progress share, decreasing to more like 5-15% since the Transformer | Anson Ho (first author) says the decrease after the Transformer isn't robust and he wouldn't put too much stock in it. He also says it isn't purely efficiency-at-fixed-performance style, but is closer to that than to accounting for all software improvements. |

Overall estimates

| Evidence | Estimate | Caveats, reasoning, etc. |
| --- | --- | --- |
| Epoch website, "Contribution of algorithmic innovation" (source) | 35% software progress share | Anson says this is likely higher than the Shapley paper due to using a slower compute-scaling reference point. |
| Anson Ho (first author on the Epoch paper), informal estimate | ~50% software progress share | Anson says this is higher than the Epoch numbers due to post-training enhancements (this is my understanding of the main reason for adjusting; I'm not 100% sure). |
| Ryan Greenblatt | 4.5x/year (a bit >50% progress share, as he thinks recent compute scaling has been lower) | "though I'd be sympathetic to substantially higher estimates"; "loss doesn't accurately capture downstream performance"; "I'd also guess that the pure loss algorithmic progress over the last 2-3 years is somewhat lower than 3x"; "More minimally, I'd say that most progress in the last 2 years is down to algorithmic progress rather than due to scaling up the overall amount of compute used for training." |
| My (Eli's) overall best guess estimate | ~6.3x/year (i.e. a 55% progress share, given 4.5x/year from compute) | |
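The conversions between a rate like 6.3x/year and a share like 55% assume the share is measured in log space, i.e. as software's fraction of total effective-compute growth. A minimal sketch under that assumption:

```python
import math

def software_share(software_rate: float, compute_rate: float) -> float:
    """Software's share of total effective-compute progress, measured in log space."""
    return math.log(software_rate) / (math.log(software_rate) + math.log(compute_rate))

def rate_from_share(share: float, compute_rate: float) -> float:
    """Invert: the software rate implied by a given progress share and compute rate."""
    return compute_rate ** (share / (1 - share))

print(software_share(6.3, 4.5))   # ~0.55, matching the ~55% share above
print(rate_from_share(0.5, 4.5))  # 4.5, i.e. a ~50% share corresponds to ~4.5x/year
```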

 

 

 

 

To sum up the above:

  1. Compute efficiency improvement rates at fixed performance are usually roughly 2.5-3.5x/year for perplexity/MMLU/GPQA, with one method giving 3-10x. For math/coding they appear to be >10x.
  2. Looking at improvements at a fixed compute level gives ~4x/year compute-equivalent gains (with a very small sample size).
  3. A fancy Shapley method applied to perplexity gives a 5-40% share for software progress, appearing to decrease in recent times, but the first author (Anson) doesn't put much stock in this decrease.

 

Mashing these together gives me ~4x, with substantial uncertainty. I'm particularly uncertain about how to weight general vs. math/coding improvements. I'm mostly paying attention to general improvements here, with a little weight on math/coding, especially given that the timelines supplement focuses on coding.

 

A final consideration is that none of the above accounts for algorithmic progress whose multipliers are bigger at higher compute scales (like LSTMs -> Transformers (paper)). Perhaps this increases the rate of progress by 33% in log space, taking 4x/year to ~6.3x/year.[5] I'm highly uncertain about how big this adjustment should be.[6]
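Spelling out the footnoted arithmetic: the adjustment multiplies the 4x/year base rate by its cube root, i.e. raises it to the 4/3 power, a 33% increase in log space:

```python
base = 4.0                          # pre-adjustment software progress rate
adjusted = base * base ** (1 / 3)   # = 4**(4/3), about 6.3x/year
print(adjusted)
```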

 

Both Anson and Ryan gave overall estimates of ~50% share of software progress so ~4.5x/year, which is pretty similar to where I ended up.

 

 

Evidence and forecasts regarding the future rate of software progress (and share vs. compute)

Trends in recent software progress

| Evidence | Estimate | Caveats, reasoning, etc. |
| --- | --- | --- |
| Epoch paper: Shapley value estimation techniques on a dataset of LLM performance on Wikitext and Penn Treebank (source) | The share of software progress appears to be decreasing since the Transformer. | Anson (first author) says the decrease after the Transformer isn't robust and he wouldn't put too much stock in it. |
| Various other efficiency trends described above in the "Evidence regarding the recent rate of software progress (and share vs. compute)" table | Efficiency trends look relatively straight overall, as far as I can tell. | |

Labor growth

| Evidence | Estimate | Caveats, reasoning, etc. |
| --- | --- | --- |
| Recent quality-adjusted growth in frontier AI researchers (Eli's rough guess) | 50%/year | This is just a quite rough guess based on how fast frontier AI companies have been growing, adjusted for differences in quality. I believe OpenAI and Anthropic have been growing faster than this, but a lot of the expansion has been product, sales, etc. rather than capabilities researchers. |
| Growth in frontier AI researchers starting in ~2029, after hitting investment limits | 10%/year (~4x slower) | Rough guess informed by Claude's estimate of the growth rate of mid-size and large companies. |

Overall estimates and implications

| Evidence | Estimate | Caveats, reasoning, etc. |
| --- | --- | --- |
| Decrease in the rate of human-driven algorithmic progress | Starting in 2028, I'll decrease growth from 50%/year to 10%/year over the course of a few years. | At current labor growth rates, the trend looks pretty stable. |
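As a concrete sketch of that decline as a schedule: the geometric easing and the exact 2028-2031 window below are my own assumptions (the table only says the decline starts in 2028 and plays out over a few years):

```python
def researcher_growth_rate(year: int) -> float:
    """Annual growth in quality-adjusted frontier AI researchers:
    50%/year through 2027, easing to 10%/year by 2031.
    The easing shape and window are my assumptions."""
    start, end = 2028, 2031
    if year < start:
        return 0.50
    if year >= end:
        return 0.10
    frac = (year - start) / (end - start)
    return 0.50 * (0.10 / 0.50) ** frac  # geometric interpolation

for y in range(2026, 2033):
    print(y, f"{researcher_growth_rate(y):.0%}")
```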

 

 

Evidence and forecasts regarding the recent compute trend

Compute trend measurements

| Evidence | Estimate | EL notes |
| --- | --- | --- |
| Epoch data (source; source w/ data + justification) | 4.7x/year on the Epoch frontpage; 4.1x/year for "notable models"; 4.2x/year for frontier models; 5x/year for frontier LLMs; 4.9x/year for GDM and 5.3x/year for OpenAI | 5x/year for frontier LLMs seems like the most relevant estimate. |
| AI 2027 compute forecast, GPT-4 to May 2025 | 4x/year | Between Sep 2022 (GPT-4) and May 2025, the best internal model goes from ~2e25 to ~8e26 FLOP, which gives 4x/year (the 2.67th root of 40). Interesting that this is lower than the Epoch data. |
| Vladimir post, 2022-2025 | 3.55x/year | |

Overall estimates

| Evidence | Estimate | EL notes |
| --- | --- | --- |
| Ryan Greenblatt's estimate | <4.5x/year | I thought he endorsed 4.5x/year in the post, but in comments on this document he said he thinks it's been slower. |
| My (Eli's) overall best guess estimate | 4.5x/year | In between the Epoch frontier-LLM trend and the AI 2027 forecast; lines up with Ryan's estimate. |
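The "2.67th root of 40" in the AI 2027 row is just the annualization of a 40x compute increase over the ~2.67 years from Sep 2022 to May 2025:

```python
# ~2e25 FLOP (GPT-4, ~Sep 2022) to ~8e26 FLOP (best internal model, May 2025)
rate = (8e26 / 2e25) ** (1 / 2.67)
print(rate)  # ~4.0x/year
```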

 

Evidence and forecasts regarding future compute trends

Considerations

| Evidence | Estimate | Caveats, reasoning, etc. |
| --- | --- | --- |
| While Grok 3 is basically on trend for the 4.5x/year trend, other AI companies seemingly haven't released an AI with much more compute than GPT-4 (Ryan post) | Lower than recent trends | |
| Training run duration will stop increasing (Ryan post) | Lower than recent trends | |
| More compute will move into RL, which is harder to scale up (Ryan post) | Lower than recent trends | |

Overall estimates

| Evidence | Estimate | Caveats, reasoning, etc. |
| --- | --- | --- |
| Ryan Greenblatt's estimate for the trend going forward | 3.5x/year (or possibly much lower due to pretraining hitting a wall and RL being hard to scale) | |
| Vladimir post, 2025-2028 | 3.55x/year | |
| AI 2027 compute forecast, 2025 to 2027, as reported for leading AI company compute | 3.4x/year | |
| AI 2027 compute forecast, scaling pace from 2025 to 2027 from internal training FLOP numbers | 5x/year | Best internal model as of May 2025: ~8e26 FLOP. Best internal model as of May 2027: ~2e28 FLOP. This is higher than the above due to the training run being 2x longer. |
| Pace of compute increase after ~2028, as forecasted by Vladimir + Romeo | 1.55x/year | This is ~3.4x slower than recent trends (4.5x/year). |
| My (Eli's) overall estimate, 2025-2027 | 4.5x/year | The AI 2027 internal FLOP numbers seem like the most important data point, but various others are lower, which especially makes sense if we don't see AGI very soon. Also, the AI 2027 numbers rely on training time increasing, and it's unclear whether that assumption will hold. |
| My (Eli's) overall estimate, 2028 and onward | Decreases from 4.5x to 1.55x/year over the course of 2028 to 2031 | |
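Putting my two overall estimates together, here is an illustrative trajectory for frontier training compute. The year-by-year easing from 4.5x to 1.55x is my own geometric interpolation (the same style as the researcher-growth sketch above), not something specified in the table; note it lands a bit under the ~2e28 internal figure for May 2027, since it uses my 4.5x/year estimate rather than the 5x/year internal pace:

```python
def compute_multiplier(year: int) -> float:
    """Yearly multiplier on frontier training compute: 4.5x/year through 2027,
    easing to 1.55x/year by 2031 (the easing shape is my assumption)."""
    if year < 2028:
        return 4.5
    if year >= 2031:
        return 1.55
    return 4.5 * (1.55 / 4.5) ** ((year - 2028) / 3)

flop = 8e26  # best internal model as of May 2025, per the AI 2027 numbers above
for year in range(2025, 2032):
    flop *= compute_multiplier(year)
    print(f"May {year + 1}: ~{flop:.1e} FLOP")
```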

 

Thanks to Ryan Greenblatt and Anson Ho for comments, and Peter Johnson for the nudge to spend more time on this.

  1. ^

We're aiming to forecast the contributions of compute + algorithms to frontier models, so distillation seems like it shouldn't be allowed. It seems like Epoch agrees, because they exclude distillation.

  2. ^

    Mar 2023: GPT-3.5 gets 70% few-shot. Epoch estimates 2.58e24 training compute.

    May 2024: Gemini 1.5 Flash gets 79% (5-shot). Epoch doesn’t have compute estimates.

    Oct 2024: Gemini 1.5 Flash 8B gets 75%. Epoch doesn’t have compute estimates.

  3. ^

HCAST has a decent number of submissions as well, but not from non-frontier models, which we'd need in order to look at efficiency improvements.

  4. ^
    • Yi-1.5, due to lack of time (May 2024: Yi-1.5-9B (2e23) and 34B (7e23))
    • Gemini 1.0, due to lack of time (Pro and Ultra, 2e24-5e25)
    • Llama 2, 3, and 3.1, because they're not SOTA in their compute class and within-family comparisons didn't seem particularly fruitful
    • Reka Flash + Core, because they're not SOTA in their compute class
  5. ^

    4 * cube_root(4) = 4^(4/3) ≈ 6.3

  6. ^

    Epoch’s algorithmic progress paper contains some weak evidence on this in footnote 8 and appendix B.
