Catch-Up Algorithmic Progress Might Actually be 60× per Year

Aaron_Scher

Epistemic status: This is a quick analysis that might have major mistakes. I currently think there is something real and important here. I’m sharing to elicit feedback and update others insofar as an update is in order, and to learn that I am wrong insofar as that’s the case.

Summary

The canonical paper about Algorithmic Progress is by Ho et al. (2024) who find that, historically, the pre-training compute used to reach a particular level of AI capabilities decreases by about 3× each year. Their data covers 2012-2023 and is focused on pre-training.

In this post I look at AI models from 2023-2025 and find that, based on what I think is the most intuitive analysis, catch-up algorithmic progress (including post-training) over this period is something like 16×–60× each year.

This intuitive analysis involves drawing the best-fit line through models that are on the frontier of training-compute efficiency over time, i.e., those that use the least training compute of any model yet to reach or exceed some capability level. I combine Epoch AI’s estimates of training compute with model capability scores from Artificial Analysis’s Intelligence Index. Each capability level thus yields a slope from its fit line, and these slopes can be aggregated in various ways to determine an overall rate of progress. One way to do this aggregation is to assign subjective weights to each capability level and take a weighted mean of the capability level slopes (in log-compute), yielding an overall estimate of algorithmic progress: 1.76 orders of magnitude per year, or a ~60× improvement in compute efficiency, or a 2 month halving time in the training compute needed to reach a particular capability level. Looking at the median of the slopes yields 16× or a halving time of 2.9 months.

Based on this evidence and existing literature, my overall expectation of catch-up algorithmic progress in the next year is maybe 20× with an 80% confidence interval of [2×–200×], considerably higher than I initially thought.

The body of this post explains catch-up vs. frontier algorithmic progress, discusses the data analysis and results, compares two Qwen models as a sanity check, discusses existing estimates of algorithmic progress, and covers several related topics in the appendices.

What do I mean by ‘algorithmic progress’?

First, let me differentiate between two distinct things people care about when they discuss “algorithmic progress”: the rate of catch-up, and algorithmic efficiency improvement at the frontier.

Catch-up: when a capability is first reached using X amount of compute, how long does it take until that capability can be reached with [some amount less than X] compute? Conveniently, catch-up is directly measurable using relatively simple measures: release date, benchmark scores, and an estimate of training compute. Catch-up rates affect the proliferation/diffusion of AI capabilities and indirectly reflect the second kind of algorithmic progress.

Algorithmic progress at the frontier is less clearly defined. It asks: for a given set of assumptions about compute growth, how quickly will the frontier of AI capabilities improve due to better algorithms? Frontier efficiency or “effective compute” informs predictions about the automation of AI research or an intelligence explosion; if compute remains constant while the amount of research effort surges, how much will capabilities improve?

Hernandez & Brown define effective compute as follows:

The conception we find most useful is if we imagine how much more efficient it is to train models of interest in 2018 in terms of floating-point operations than it would have been to “scale up” training of 2012 models until they got to current capability levels. By "scale up," we mean more compute, the additional parameters that come with that increased compute, the additional data required to avoid overfitting, and some tuning, but nothing more clever than that.

Unfortunately, this is not easily measured. It invokes a counterfactual in which somebody in 2012 massively scales up training compute. (If they had actually done that, then, looking back, we would be measuring catch-up instead!) The common workaround is empirical scaling laws: train a family of models in 2012 using different amounts of compute but the same dataset and algorithms, and compare their training compute and performance, extrapolating to estimate how they would likely perform with more training compute.

Several factors affect the relative speed of these two measures. Catch-up might be faster due to distillation or synthetic data: once an AI model reaches a given capability level, it can be used to generate high-quality data for smaller models. Catch-up has a fast-follower or proof-of-concept effect: one company or project achieving a new frontier of intelligence lets everybody else know that this is possible and inspires efforts to follow suit (and the specific methods used might also be disseminated). On the other hand, the returns to performance from compute might diminish rapidly at the frontier. Without better algorithms, capabilities progress at the frontier may require vast compute budgets, rendering algorithmic efficiency a particularly large progress multiplier. However, it’s not clear to me how strongly these returns diminish on downstream tasks (vs. language modeling loss where they diminish steeply). See e.g., Owen 2024, Pimpale 2025, or the Llama-3.1 paper.

This post is about catch-up algorithmic progress, not algorithmic progress at the frontier.

Methods and Results

The intuitive way to measure catch-up algorithmic progress is to look at how much compute was used to train models of similar capability, over time, and then look at the slope of the compute frontier. That is, look at how fast “smallest amount of compute needed to reach this capability level” has changed over time, for different capability levels.

So I did that, with substantial help from Claude^[1]. I use Epoch’s database of AI models for compute estimates (though I make a few edits to fix what I believe to be errors), and for capabilities, I use Artificial Analysis’s Intelligence Index, an average of 10 widely used benchmarks. Here’s the most important graph:

And the accompanying table:

Results table

Capability_threshold	slope_log10_yr	efficiency_factor_yr	multiplier_yr	subjective_weight	n_models	n_models_on_frontier	first_date	last_date
5	1.20	15.72	0.064	5	80	7	2023-03-15	2025-08-14
10	1.85	70.02	0.014	5	67	3	2023-03-15	2024-04-23
15	1.04	10.92	0.092	8	57	6	2023-03-15	2025-07-15
20	0.93	8.60	0.116	9	51	8	2023-03-15	2025-07-15
25	2.32	210.24	0.005	9	39	7	2024-07-23	2025-07-15
30	1.22	16.61	0.060	8	31	5	2024-12-24	2025-09-10
35	1.41	25.91	0.039	8	30	5	2025-01-20	2025-09-10
40	1.22	16.50	0.061	8	22	6	2025-01-20	2025-09-10
45	4.48	29984.61	0.000	8	15	6	2025-02-17	2025-09-10
50	0.32	2.11	0.473	0	10	2	2025-08-05	2025-09-10
55	1.05	11.25	0.089	0	6	2	2025-08-05	2025-09-22
60	0.754	5.68	0.176	0	3	2	2025-08-05	2025-09-29
65				0	2	1	2025-09-29	2025-09-29
Mean	1.48	2531.51	0.099
Weighted	1.76	3571.10	0.051	68

Weighted log conversion	1.76	57.10	0.018
Median	1.21	16.10	0.062

The headline result: By a reasonable analysis, catch-up algorithmic progress is 57× (call it 60×) per year in the last two years. By another reasonable analysis, it’s merely 16×.

These correspond to compute halving times of 2 months and 2.9 months.

There were only three capability levels in this dataset that experienced less than one order of magnitude per year of catch-up.

There are a bunch of reasonable ways to filter/clean the data. For example, I choose to focus only on models with “Confident” or “Likely” compute estimates. Historically, I’ve found the methodology for compute estimates shaky in general, and less confident compute estimates seem pretty poor. To aggregate across the different capability bins, I put down some subjective weightings.^[2]

Other ways of looking at the data, such as considering all models with compute estimates or only those with Confident estimates, produce catch-up rates mostly in the range of 10×–100× per year. I’ve put various other analyses in this Appendix.

Sanity check: Qwen2.5-72B vs. Qwen3-30B-A3B

As a sanity check, let’s look at progress between Qwen2.5 and Qwen3. For simplicity, I’ll just look at the comparison between Qwen2.5-72B-Instruct and Qwen3-30B-A3B (thinking)^[3]. I picked these models because they’re both very capable models that were near the frontier of compute efficiency at their release, among other reasons^[4]. I manually calculated the approximate training compute for both of these models^[5].

Model	Qwen2.5-72B-Instruct	Qwen3-30B-A3B (thinking)
Release date	September 18 2024	April 29 2025
Training compute (FLOP)	8.6e24	7.8e23
Artificial Analysis Intelligence Index	29	36.7
Approximate cost of running AAII ($)^[6]	3.4	38

So these models were released about 7.5 months apart, the latter is trained with an order of magnitude less compute, and it exceeds the former’s capabilities—for full eval results see this Appendix. The 60×/yr trend given above would imply that reaching the capabilities of Qwen2.5-72B-Instruct with 7.8e23 FLOP would take 7.1 months^[7]. Meanwhile, Qwen3-30B-A3B (thinking) exceeded this capability after 7.5 months. (I’m not going to attempt to answer whether the amount of capability-improvement over 2.5 is consistent with the trend.) So the sanity check passes: from Qwen2.5 to Qwen3 we have seen training compute efficiency improve significantly. (I’m not going to analyze the inference cost differences, though it is interesting that the smaller model is more expensive due to costing a similar amount per token and using many more tokens in its answers!)

Discussion

How does this compare to the recent analysis in A Rosetta Stone for AI Benchmarks?

There are a bunch of existing estimates of algorithmic progress. One of the most recent and relevant is that from Ho et al. 2025, who use the Epoch Capabilities Index (ECI) to estimate algorithmic progress in various ways. I’ll focus on this paper and then briefly discuss other previous estimates in the next section.

Their Appendix C.2 “Directly estimating algorithmic progress” performs basically the same methodology as in this post, but they relegate it to an appendix because they do not consider it to be the most relevant. They write: “This gives us a way of sanity-checking our core results, although we consider these estimates less reliable overall — hence we place them in the appendix rather than in the main paper.” and later “Like the estimates using our primary method in Section 3.2.2, the range of values is very wide. In particular, we find training compute reductions from 2× to 400×! The median estimate across these is around 10× per year, but unfortunately we do not have much data and consider this method quite unreliable.”

I find this reasoning unconvincing because their appendix analysis (like that in this blog post) is based on more AI models than their primary analysis! The primary analysis in the paper relates a model’s capabilities (Cm) to its training compute (Fm) as follows: Cm = k*log(Fm) + b, where b is the algorithmic quality of a model. Then solving for algorithmic progress is a multi-step process, using specific model families^[8]to estimate k, and then using k to estimate b for all models. The change in b over time is algorithmic progress. The crucial data bottleneck here is on step one, where you use a particular model family to estimate k. They only have 12 models in the primary analysis, coming from the Llama, Llama 2, and Llama 3.1 families. The overall results are highly sensitive to these models, as they discuss: “Much of this uncertainty comes from the uncertainty in the estimate of k.” I would consider relying on just 3 model families to be a worse case of “we do not have much data”, and thus not a good argument against using the intuitive approach.

There are various other differences between this post and Ho et al. 2025 where I think I have made a better choice.

In their primary analysis of algorithmic progress they exclude “distilled” models. They write “We drop distilled models from the dataset since we are interested in capturing the relationship between model capabilities and training compute for the final training run. This relationship might be heavily influenced by additional compute sources, such as from distillation or substantial quantities of synthetic data generation (Somala, Ho, and Krier 2025).” In an appendix, they correctly explain that publicly available information doesn’t tell us whether many models are distilled, making this difficult to do in practice.

I also think it’s unprincipled. When thinking about catch-up algorithmic progress, it’s totally fine for existing models to influence the training of future models, for instance, via creating synthetic data, being used for logit distillation, or even doing research and engineering to train future AIs more efficiently. I don’t see the principled reason to exclude distilled models, given that existing models simply will, by default, be used to help train future models. But note that this isn’t universally true. For example, it was reported that Anthropic cut off access to Claude for OpenAI employees, and broadly there are many access-levels of AI that would prevent certain kinds of “use of existing models to help train future models”. Interestingly, their appendix results show similar results to the main paper even when including distilled models.

Edit: the following paragraph and the histogram (now deleted) were based on me looking at the wrong data. According to Epoch, ECI scores are based on at least 4 benchmarks scores for each model.

I am also unconvinced that ECI is a better metric to use than AAII. One issue with ECI scores is that they are often calculated using just 2 benchmark scores for a particular model. I expect this introduces significant noise. By comparison, the Artificial Analysis Intelligence Index includes 10 benchmark scores for each model (at least most of the time, see ~~this limitation). As we can see, the ECI score for many models is based on just 2 or 3 different benchmark scores:~~

How does this compare to other previous estimates of algorithmic progress

For the sake of time, I’m just discussing headline results. I’m not going to discuss the methodological differences between these works or whether they focus on catch-up or algorithmic progress at the frontier. This is more of a pointer to the literature than an actual literature review:

Hernandez & Brown (2020) find that algorithmic progress in computer vision is 1.7× per year.
Erdil & Besiroglu (2023) find that algorithmic progress in computer vision is 2.5× per year.
Ho et al. (2024) find that algorithmic progress in language models is progressing at around 2.7× (often rounded to 3×), 95% CI [1.8×–6.3×], per year.
Ho et al. (2025) (discussed above) find that algorithmic progress in language models is 6× per year. Other estimates they arrive at with various methods include 5×, 3.5×, 10×, and 20×.
Gundlach et al. (2025) account for specific algorithmic improvements that net out to around a 2.2× rate of algorithmic progress.
Whitfill et al. (2025) estimate algorithmic progress in language models at 3.5× per year.
MacAskill & Moorhouse (2025) combine existing estimates of pre-training algorithmic progress (3× from Ho et al. 2024) with an informal guess at post-training enhancements from Anthropic (3×) to arrive at a combined rate of overall algorithmic progress of around 9× per year.
In January 2025, Dario Amodei wrote the following: “In 2020, my team published a paper suggesting that the shift in the curve due to algorithmic progress is ~1.68×/year [Hernandez & Brown, 2020]. That has probably sped up significantly since; it also doesn't take efficiency and hardware into account. I'd guess the number today is maybe ~4×/year.”

As discussed in an Appendix, the rate of inference cost reduction is also relevant to one’s overall estimate of algorithmic progress.

Gundlach et al. (2025) estimate a rate of algorithmic progress relevant to inference prices at 3× per year, or 5× to 10× before taking hardware efficiency into account.

Other related work includes:

How should we update on this analysis?

I think we should update on this analysis, even though there are various methodological concerns—see this Appendix for limitations. This analysis was about using the most intuitive approach to estimate the rate of catch-up algorithmic progress. As somebody who doesn’t love math, I think intuitive approaches, where they are available, should be preferred to complicated modeling.

How should we update? Well, if you are me and you previously thought that algorithmic progress was 3× per year, you should update toward thinking it is higher, e.g., 60× or 20× or somewhere between your previous view and those numbers. The data from the last 2 years is not consistent with 3× per year algorithmic progress (to be clear and fair to Ho et al. 2024, their work focused on pre-training only). Due to the combination of pre-training improvements and post-training improvements, one probably should have expected overall algorithmic progress to be greater than 3× even before seeing these results. But also remember that catch-up algorithmic progress is not the same as algorithmic progress at the frontier!

Based on this analysis and the existing literature, my current all-things-considered view is that catch-up algorithmic progress in the last couple of years and for the next year is likely 20× with an 80% confidence interval of [2×–200×], considerably higher than I initially thought.

Here is a concrete and falsifiable prediction from that estimate^[9]:
DeepSeek-V3.2-Exp is estimated by Epoch to be trained with 3.8e24 FLOP. It reached an AAII index score of 65.9 and was released on September 29, 2025. It is on the compute-efficiency frontier. I predict that by September 29, 2026, the least-compute-used-to-train model that reaches a score of 65 will be trained with around 1.9e23 FLOP, with the 80% CI covering 1.9e22–1.9e24 FLOP.

There are various implications of this update for one’s beliefs about AI governance, but I won’t discuss them for the sake of time.

The analysis here should be largely replicable using this data^[10]and this colab notebook^[11]. The various tables in this post are available in spreadsheet format here.

Appendices

Appendix: Filtering by different confidence levels of compute estimates

All models

Results table

Capability_threshold	slope_log10_yr	efficiency_factor_yr	multiplier_yr	subjective_weight	n_models	n_models_on_frontier	first_date	last_date
5	1.242982673	17.49776877	0.05715014372	5	89	8	2023-03-15	2025-08-14
10	1.845247949	70.02416666	0.01428078402	5	75	3	2023-03-15	2024-04-23
15	1.038225438	10.9200704	0.09157450124	8	63	6	2023-03-15	2025-07-15
20	0.9343559857	8.597179326	0.1163172201	9	57	8	2023-03-15	2025-07-15
25	2.073184863	118.3545239	0.008449191184	9	45	7	2024-06-20	2025-07-15
30	1.220342999	16.6089814	0.06020838821	8	35	5	2024-12-24	2025-09-10
35	1.533072741	34.12500632	0.02930402388	8	33	6	2025-01-20	2025-09-10
40	1.217481885	16.49992177	0.06060634794	8	24	6	2025-01-20	2025-09-10
45	4.476898444	29984.61273	3.34E-05	8	17	6	2025-02-17	2025-09-10
50	16.42308809	2.65E+16	3.77E-17	0	12	3	2025-07-09	2025-09-10
55	9.345398436	2215126006	4.51E-10	0	8	3	2025-07-09	2025-09-22
60	8.161341043	1.45E+08	6.90E-09	0	5	3	2025-07-09	2025-09-29
65	9.327678293	2.13E+09	4.70E-10	0	4	3	2025-07-09	2025-09-29
Mean	4.53	2037721399413230.00	0.034
Weighted	1.74	3560.03	0.050	68

Weighted log conversion	1.74	55.10	0.018
Median	1.85	70.02	0.014

Confident compute estimates

Results table

Capability_threshold	slope_log10_yr	efficiency_factor_yr	multiplier_yr	subjective_weight	n_models	n_models_on_frontier	first_date	last_date
5	0.505	3.201	0.312	5	61	5	2023-07-18	2025-08-14
10	0.062	1.154	0.866	5	49	2	2023-07-18	2024-04-23
15	1.516	32.845	0.030	8	40	5	2024-06-07	2025-07-15
20	1.605	40.244	0.025	9	36	7	2024-07-23	2025-07-15
25	2.425	266.051	0.004	9	26	6	2024-07-23	2025-07-15
30	1.006	10.131	0.099	8	19	7	2024-12-24	2025-09-10
35	1.275	18.823	0.053	8	18	7	2025-01-20	2025-09-10
40	1.217	16.500	0.061	8	14	6	2025-01-20	2025-09-10
45	4.288	19399.837	0.000	8	9	3	2025-07-11	2025-09-10
50	0.325	2.112	0.473	0	7	2	2025-08-05	2025-09-10
55	0.754	5.675	0.176	0	3	2	2025-08-05	2025-09-29
60	0.754	5.675	0.176	0	2	2	2025-08-05	2025-09-29
65				0	1	0
Mean	1.31	1650.19	0.190
Weighted	1.67	2332.40	0.119	68

Weighted log conversion	1.67	46.71	0.021
Median	1.11	12.93	0.077

Appendix: How fast is the cost of AI inference falling?

We might ask whether AI inference costs are also falling very fast. It’s really easy to look at per-token costs, so that’s what I do here. It would be more principled to look at “Cost to Run Artificial Analysis Intelligence Index”.

Fortunately, that token-adjusted analysis has already been done by Gundlach et al. 2025. They find “the price for a given level of benchmark performance has decreased remarkably fast, around 5× to 10× per year, for frontier models on knowledge, reasoning, math, and software engineering benchmarks.” They also write, “Isolating out open models to control for competition effects and dividing by hardware price declines, we estimate that algorithmic efficiency progress is around 3× per year.” I will defer to them on the token-quantity adjusted numbers.

But let’s look at per-token numbers briefly.

Results table

Capability_threshold	slope_log10_yr	price_reduction_factor_yr	price_multiplier_yr	subjective_weight	n_models	n_models_on_frontier	first_date	last_date
5	0.5308772781	3.395293156	0.2945253779	5	139	5	2022-11-30	2025-05-20
10	1.101835417	12.64257147	0.07909783247	5	133	4	2023-03-15	2025-05-20
15	1.487149892	30.70081409	0.03257242616	8	121	7	2023-03-15	2025-05-20
20	1.30426336	20.14945761	0.04962912745	9	113	7	2023-03-15	2025-08-18
25	1.573068449	37.41695563	0.02672585151	9	97	7	2024-05-13	2025-08-18
30	2.634093754	430.6195613	0.002322235425	8	74	8	2024-09-12	2025-08-18
35	2.903967451	801.6179829	0.001247477004	8	67	9	2024-09-12	2025-08-18
40	2.98712387	970.7868174	0.001030092274	8	59	8	2024-09-12	2025-08-05
45	2.720627924	525.5668007	0.001902707703	8	46	6	2024-12-05	2025-08-05
50	2.330820045	214.2002853	0.004668527863	8	36	4	2024-12-20	2025-08-05
55	1.476940885	29.98754309	0.03334718009	8	23	3	2024-12-20	2025-08-05
60	1.800798316	63.2118231	0.01581982533	5	15	2	2024-12-20	2025-08-05
65	0.9487721533	8.887347329	0.1125195138	5	10	3	2024-12-20	2025-09-29
Mean	1.83	242.24	0.050
Weighted	1.92	265.82	0.041	89

Weighted log conversion	1.92	82.47	0.012
Median	1.57	37.42	0.027

So by my weighting, the cost per 1m tokens is falling at around 82× per year. To modify this to be a true estimate of algorithmic efficiency, one would need to adjust for other factors that affect prices, including improvements in hardware price-performance. Note that Artificial Analysis has made a similar graph here, and that others have estimated similar quantities for the falling cost of inference. This recent OpenAI blog post says “the cost per unit of a given level of intelligence has fallen steeply; 40× per year is a reasonable estimate over the last few years!”. This data insight from Epoch finds rates of 9×, 40×, and 900× for three different capability levels. Similar analysis has appeared from Dan Hendrycks, and in the State of AI report for 2024.

Prior work here generally uses per-token costs, and, again, a more relevant analysis would look at the cost to run benchmarks (cost per token * number of tokens), as in Gundlach et al. 2025 (who find 5× to 10× per year price decreases before accounting for hardware efficiency) or Erol et al. 2025. Gundlach et al. 2025 and Cottier et al. 2025 find that progress appears to be faster for higher capability levels.

Overall I think trends in inference costs provide a small update against “20×–60×” rates of catch-up algorithmic progress for training and point toward lower rates, even though they are not directly comparable.

Appendix: Histogram of 1 point buckets

A natural question is to ask what the distribution of the slopes of catch-up are across the different capability buckets. This shows us that it’s not just the high-capability buckets that are driving high rates of progress, even though they seem to have higher rates of progress.

Appendix: Qwen2.5 and Qwen3 benchmark performance

For those interested, here’s a more thorough comparison of the models’ capabilities, adapted from the Qwen3 paper. First, Instruct vs. Thinking, where the newer, small model dominates:

Task/Metric	Qwen2.5-72B-Instruct	Qwen3-30B-A3B (thinking)
Architecture	Dense	MoE
# Activated Params	72B	3B
# Total Params	72B	30B
MMLU-Redux	86.8	89.5
GPQA-Diamond	49	65.8
C-Eval	84.7	86.6
LiveBench 2024-11-25	51.4	74.3
IFEval strict prompt	84.1	86.5
Arena-Hard	81.2	91
AlignBench v1.1	7.89	8.7
Creative Writing v3	61.8	79.1
WritingBench	7.06	7.7
MATH-500	83.6	98
AIME’24	18.9	80.4
AIME’25	15	70.9
ZebraLogic	26.6	89.5
AutoLogi	66.1	88.7
BFCL v3	63.4	69.1
LiveCodeBench v5	30.7	62.6
CodeForces (Rating / Percentile)	859 / 35.0%	1974 / 97.7%
Multi-IF	65.3	72.2
INCLUDE	69.6	71.9
MMMLU 14 languages	76.9	78.4
MT-AIME2024	12.7	73.9
PolyMath	16.9	46.1
MLogiQA	59.3	70.1
Average	50.2	72.1

I was also curious to compare the base models which turn out to be very close in their capabilities (note these are different benchmarks than for thinking/instruct):

Metric	Qwen2.5-72B	Qwen3-30B-A3B
Variant	Base	Base
Architecture	Dense	MoE
# Total Params	72B	30B
# Activated Params	72B	3B
MMLU	86.06	81.38
MMLU-Redux	83.91	81.17
MMLU-Pro	58.07	61.49
SuperGPQA	36.2	35.72
BBH	86.3	81.54
GPQA	45.88	43.94
GSM8K	91.5	91.81
MATH	62.12	59.04
EvalPlus	65.93	71.45
MultiPL-E	58.7	66.53
MBPP	76	74.4
CRUX-O	66.2	67.2
MGSM	82.4	79.11
MMMLU	84.4	81.46
INCLUDE	69.05	67
Average	70.2	69.5

Appendix Leave-One-Out analysis

The methodology in this post is sensitive to outlier models, but it’s unclear how bad the problem is. To understand whether these outliers might be throwing things off substantially, we can recompute the slope of each bucket while excluding one of the efficiency-frontier models, iterating through each efficiency-frontier model one at a time. A naive way to do this would be to remove the model and calculate the slope of the remaining efficiency-frontier models, but we first have to recalculate the efficiency-frontier after removing the model, because other models could be added to the frontier when this happens.

Then we can examine the distribution of slopes produced in that process for each capability threshold. Looking at the slope_range_min and slope_range_max, gives us (in log-compute) what the slowest and fastest rates of reduction are when doing leave-one-out. If it were the case that particular models were problematic, then this range would be very wide. If outliers were often inflating the slope estimates, then slope_range_min would be pretty small compared to the baseline_slope (all models included).

What we actually see is a moderate range in the slopes and that slope_range_min is often still quite high. Therefore, I do not think that outlier models are a primary driver of the rapid rate of algorithmic progress documented in this post.

Leave-One-Out (loo) for Confident and Likely compute estimates:

Results table

threshold	n_frontier	baseline_slope	slope_range_min	slope_range_max	range_width	most_influential_model	max_influence	weight
5	7	1.196	0.958	1.349	0.390	GPT-4 (Mar 2023)	0.238	5
10	3	1.845	0.062	7.007	6.944	phi-3-mini 3.8B	-5.162	5
15	6	1.038	0.901	1.101	0.200	Phi-4 Mini	0.138	8
20	8	0.942	0.890	1.518	0.628	GPT-4 (Mar 2023)	-0.576	9
25	7	2.323	1.702	2.502	0.801	EXAONE 4.0 (1.2B)	0.621	9
30	5	1.220	1.132	1.413	0.282	DeepSeek-V3	-0.193	8
35	5	1.413	1.330	3.976	2.646	DeepSeek-R1	-2.563	8
40	6	1.217	0.930	4.223	3.293	DeepSeek-R1	-3.006	8
45	6	4.477	3.518	5.319	1.801	Grok 3	0.958	8
50	2	0.325				N/A (n<3)		0
55	2	1.051				N/A (n<3)		0
60	2	0.754				N/A (n<3)		0
65	1					N/A (n<3)		0

Appendix: Limitations

Outlier models

One major limitation of this methodology is that it is highly sensitive to specific outlier models.

On one hand, outlier models that are highly capable but developed with small amounts of compute pose an issue. For instance, early versions of this analysis that directly used all of the compute estimates from Epoch resulted in much larger rates of algorithmic progress, such as 200×, because of a couple outlier models that had (what I now realize are) incorrect compute estimates in the Epoch database, including the Qwen-3-Omni-30B-A3B model and the Aya Expanse 8B model. I investigated some of the models that were greatly affecting the trend lines and manually confirmed/modified some of their compute estimates. I believe that clearly erroneous FLOP estimates are no longer setting the trend lines. However, compute estimates can still be noisy in ways that are not clearly an error.

Noisy estimates are especially a problem for this methodology because the method selects the most compute-efficient model at each capability level. If there is lots of noise in compute estimates, extremely low-compute models will set the trend. Meanwhile, extremely high-compute models don’t affect the efficiency-frontier trend at all unless they were the first model to reach some capability level (less likely due to fewer models setting new capability records, specifically one per capability level). This issue can be partially mitigated by up-weighting the capability levels that have many models setting their frontier (as I do) (single outliers do less to set the trend line for these series), but it is still a major limitation.

This methodology is also sensitive to early models being trained with a large amount of compute and setting the trend line too high. For example, Grok-3 starts the frontier for the AAII ≥45 bucket, but then Claude 3.7 Sonnet was released about a week later, is in the same bucket, and is estimated to use much less compute. Now, it turns out that the slope for the 45 series is still very steep if Grok-3 is removed, but this data point shows how the methodology could lead us astray—there wasn’t actually an order of magnitude of compute worth of algorithmic improvements that happened in that week. One way to mitigate this issue is to investigate leave-one-out bootstrapped analysis, as I do in this Appendix. This analysis makes me think that outlier models are not a primary driver of the rapid trends reported here.

Lack of early, weak models

There is a lack of weak models in the dataset before 2023. GPT-4 scores 21.5, but in this dataset, it is the first model to score above the thresholds of 5, 10, 15, and 20. In reality, it is probably the first model to score above 20 and maybe the first model to score above 10 and 15, but the relevant comparison models either do not have compute estimates, or do not have AAII scores, and thus are not in this dataset. For example, GPT-3.5 Turbo (2022-11-30) scores 8.3 but has no compute estimate. This issue is partially mitigated by weighting the 5, 10, and 15 buckets lower, but also the overall results are not very sensitive to the weighting of these particular buckets.

Post-training compute excluded

The compute data in this analysis is generally only for pre-training compute, not post-training compute. This is probably fine because post-training compute is likely a small fraction of the compute used to train the vast majority of these models, but it is frustrating that it is not being captured. Some models, such as Grok-4, use a lot of post-training compute. I currently believe (but will not justify here) that the amount of post-training compute used in the vast majority of models in this dataset is less than 10% of their pre-training compute and therefore ignorable, and I do not think counting post-training compute would substantially change the results.

Inference-time compute excluded

When looking at “reasoning” models, this analysis uses their highest-reasoning-effort performance. This makes recent models seem more training-compute efficient because, in some sense, they are trading off training compute for inference compute, compared to earlier models. I don’t think this is a major concern because I think inference costs are mostly not that big of a deal when thinking about AI capabilities improvements and catastrophic risk. I won’t fully explain my reasoning here, but as a general intuition, the absolute cost to accomplish some task is usually quite small. For example, this paper that uses LLMs to develop cybersecurity exploits arrives at an empirical cost of about $24 per successful exploit. Because inference costs are, empirically, fairly small compared to the budget of many bad actors, it is more relevant whether an AI model can accomplish a task at all rather than whether it takes a bunch of inference tokens to do so.

Some AAII scores are estimates

For some models (especially older ones), the Artificial Analysis Intelligence Index score is labeled as “Estimate (independent evaluation forthcoming)”. It is unclear how these scores are determined, and they may not be a reliable estimate. The Artificial Analysis API does not clearly label such estimates and I did not manually remove them for secondary analysis. Ideally the capability levels that have these models (probably typically lower levels) would be weighted less, but I don’t do this due to uncertainty about which models have Estimates vs. independently tested scores.

Comparing old and new models on the same benchmark

There are various potential problems with using the Artificial Analysis Intelligence Index (AAII) instead of, say, the recent ECI score from Epoch. Overall, I think AAII is a reasonable choice.

One problem is that AAII assigns equal weight to 10 benchmarks, but this is unprincipled and might distort progress (e.g., because getting 10 percentage points higher of a score on tau bench is easier than doing the same on MMLU—frontier models have probably scored approximately as high as they ever will on MMLU).

Relatedly, AAII likely favors recent models due to heavy influence of agentic tasks and recent benchmarks. Basically nobody tried to train for agentic tool use 2 years ago, nor did they try to optimize performance on a benchmark that didn’t exist yet. I’m not sure there is a satisfactory answer to this problem. But I’m also not sure it’s that big of a problem! It is an important fact that AI use cases are changing over time, largely because the AIs are getting capable enough to do more things. It’s good that we’re not still evaluating models on whether they can identify what word a pronoun refers to! Evaluating yesterday’s models by today’s standards of excellence does rig the game against them, but I’m not sure it’s worse than evaluating today’s models on stale and irrelevant benchmarks.

I expect the makeup of AAII to change over time, and that’s okay. If I want to predict, “how cheap will it be in late 2026 to train a model that is as good as GPT-5.2-thinking on the tasks that are relevant to me in late 2026?” then the AAII approach makes a lot of sense! I don’t anticipate my late 2026 self caring all that much about the current (late 2025) benchmarks compared to the late 2026 benchmarks. But this is a different question from “how much compute will be needed a year from now to reach the current models’ capabilities, full stop”.

It’s good to pursue directions like ECI that try to compare across different benchmarks better, but I’m skeptical of it for various reasons. One reason is that I have tried to keep this analysis as intuitive and simplistic as possible. Raw benchmark scores are intuitive, they tell you the likelihood of a model getting questions in [some distribution sufficiently close to the test questions] correct. AAII is slightly less intuitive as it’s an average of 10 such benchmarks, but the score still means something to me. In general, I am pretty worried about over-analysis leading us astray due to introducing more places for mistakes in reasoning and more suspect assumptions. That’s why the analysis in this post takes the most simple and intuitive approach (by my lights) and why I choose to use AAII as the capabilities metric.

Claude did all the coding, I reviewed the final code. I take credit for any mistakes. ↩︎
The weightings are, roughly, based on the following reasoning (some of these ideas are repeated elsewhere in this post): ↩︎
Not to be confused with Qwen3-30B-A3B-Thinking-2507 (July 29 2025), Qwen-3-Next-80B-A3B-Thinking (September 9 2025), Qwen3-Omni-30B-A3B (Sept 15 2025), or Qwen3-VL-30B-A3B-Thinking (Sept 30 2025). ↩︎
These two models are a good fit for this analysis because: ↩︎
The compute estimate for Qwen2.5-72B is based on the paper: the model has 72B active parameters and is trained on “18 trillion tokens”. There is then some post-training, seemingly for tens or hundreds of billions of tokens. For simplicity we’ll do a 10% bump to account for post-training, even though the true amount is probably less (note this is not consistent with how FLOP calculations are done in the Epoch database, typically post-training is ignored). So the calculation is 1.1 * (6*72e9*18e12) = 8.6e24 FLOP. ↩︎
While Artificial Analysis reports “Cost to run Artificial Analysis Intelligence Index” for many models, it does not directly do this for the 72B model. The cost of running AAII for Qwen3-30B-A3B (Reasoning) is reported as $151. This is around 60M output tokens and uses the pricing from Alibaba Cloud ($2.4/M output); using Fireworks pricing ($0.6/M output) would cost around $38, which I think is a better estimate. For Qwen2.5-72B we have 8.5M output tokens; at a current cost of $0.4/M output tokens (the median of three providers), this would cost $3.4 (input tokens are a small fraction of the cost so we’ll ignore them). Note that there is large variance in price between providers, and I expect the cost-to-a-provider of serving Qwen3-30B-A3B is actually lower per-token than the 72B model, though considering it uses ~10× the tokens, the overall cost might be higher as it is here. ↩︎
(ln((8.6e24)/(7.8e23))/ln(57.10))*12 = 7.12 ↩︎
A model family is a series of models that are trained with very similar algorithms and data, differing only/primarily in their training compute. ↩︎
I would be willing to make some bets with reputable counterparties if we can work out more specifics. ↩︎
This data was last updated from Epoch and Artificial Analysis on 17 Dec 2025. ↩︎
There may be some small discrepancies between the results reported here and those replicated with the notebook due to me making various small fixes in the final version of the data/code compared to the results presented here. ↩︎

~~Your results are primarily driven by the inclusion of Llama 3.1-405B and Grok 3 (and Grok 4 when you include it).~~ Those models are widely believed to be cases where a ton of compute was poured in to make up for poor algorithmic efficiency. If you remove those I expect your methodology would produce similar results as prior work (which is usually trying to estimate progress at the frontier of algorithmic efficiency, rather than efficiency progress at the frontier of capabilities).

I could imagine a reply that says "well, it's a real fact that when you start with a model like Grok 3, the next models to reach a similar capability level will be much more efficient". And this is true! But if you care about that fact, I think you should instead have two stylized facts, one about what happens when you are catching up to Grok or Llama, and one about what happens when you are catching up to GPT, Claude, or Gemini, rather than trying to combine these into a single estimate that doesn't describe either case.

Your detailed results are also screaming at you that your method is not reliable. It is really not a good sign when your analysis that by construction has to give numbers in produces results that on the low end include 1.154, 2.112, 3.201 and on the high end include 19,399.837 and even (if you include Grok 4) 2.13E+09 and 2.65E+16 (!!)

I find this reasoning unconvincing because their appendix analysis (like that in this blog post) is based on more AI models than their primary analysis!

The primary evidence that the method is unreliable is not that the dataset is too small, it's that the results span such a wide interval, and it seems very sensitive to choices that shouldn't matter much.

Your results are primarily driven by the inclusion of Llama 3.1-405B and Grok 3

I'm fairly sure this is not the case. In this appendix when I systematically drop one frontier model at a time and recalculate the slope for each bucket, Llama 3.1-405B isn't even the most influential model for the >=25 bucket (the only bucket it's frontier for)! And looking at the graph, that's not surprising, it looks right on trend. Grok 3 also looks surprisingly on trend, and looking at that leave-one-out analysis, it is pretty influential, but even without it, the slope for that capability bucket is -3.5 order of magnitude per year. Another reason to think these models are not the main driver of the results is that there are high slopes in capability buckets that don't include these models, such as 30, 35, 40 (log10 slopes of 1.22, 1.41, 1.22).

For thoroughness, I also just reran the analysis and totally excluded these data points and the results are basically the same, for confident and likely compute estimates (main result in the post) we get a weighted log10 mean of 1.64 (44×) and median of 1.21 (16×). I consider these to be quite in line with the main results (1.76, 1.21).

There's a related point, which is maybe what you're getting at, which is that these results suffer from the exclusion of proprietary models for which we don't have good compute estimates. For example, o1 would have been the first models in Grok 3's performance tier and plausibly used less compute—if we had a better compute estimate for it and it was less than Grok 3, Grok 3 wouldn't have made the frontier. By definition the slope for that capability bucket would be less steep. I thought about trying to make my own compute estimates for such models but decided not to for the sake of project scope.

Thanks, I hadn't looked at the leave-one-out results carefully enough. I agree this (and your follow-up analysis rerun) means my claim is incorrect. Looking more closely at the graphs, in the case of Llama 3.1, I should have noticed that EXAONE 4.0 (1.2B) was also a pretty key data point for that line. No idea what's going on with that model.

(That said, I do think going from 1.76 to 1.64 after dropping just two data points is a pretty significant change, also I assume that this is really just attributable to Grok 3 so it's really more like one data point. Of course the median won't change, and I do prefer the median estimate because it is more robust to these outliers.)

There's a related point, which is maybe what you're getting at, which is that these results suffer from the exclusion of proprietary models for which we don't have good compute estimates.

I agree this is a weakness but I don't care about it too much (except inasmuch as it causes us to estimate algorithmic progress by starting with models like Grok). I'd usually expect it to cause estimates to be biased downwards (that is, the true number is higher than estimated).

Another reason to think these models are not the main driver of the results is that there are high slopes in capability buckets that don't include these models, such as 30, 35, 40 (log10 slopes of 1.22, 1.41, 1.22).

This corresponds to 16-26x drop in cost per year? Those estimates seem reasonable (maybe slightly high) given you're measuring drop in cost to achieve benchmark scores.

I do think that this is an overestimate of catch-up algorithmic progress for a variety of reasons:

Later models are more likely to be benchmaxxed
(Probably not a big factor, but who knows) Benchmarks get more contaminated over time
Later models are more likely to have reasoning training

None of these apply to the pretraining based analysis, though of course it is biased in the other direction (if you care about catch-up algorithmic progress) by not taking into account distillation or post-training.

I do think 3x is too low as an estimate for catch-up algorithmic progress, inasmuch as your main claim is "it's a lot bigger than 3x" I'm on board with that.

This corresponds to 16-26x drop in cost per year?

Yep.

I do think that this is an overestimate of catch-up algorithmic progress for a variety of reasons:
Later models are more likely to be benchmaxxed
(Probably not a big factor, but who knows) Benchmarks get more contaminated over time

These are important limitations, thanks for bringing them up!

Later models are more likely to have reasoning training

Can you say more about why this is a limitation / issue? Is this different from a 2008-2015 analysis saying "later models are more likely to use the transformer architecture," where my response is "that's algorithmic progress for ya". One reason it may be different is that inference-time compute might be trading off against training compute in a way that we think make the comparison improper between low and high inference-compute models.

Can you say more about why this is a limitation / issue? Is this different from a 2008-2015 analysis saying "later models are more likely to use the transformer architecture," where my response is "that's algorithmic progress for ya". One reason it may be different is that inference-time compute might be trading off against training compute in a way that we think make the comparison improper between low and high inference-compute models.

Yeah it's just the reason you give, though I'd frame it slightly differently. I'd say that the point of "catch-up algorithmic progress" was to look at costs paid to get a certain level of benefit, and while historically "training compute" was a good proxy for cost, reasoning models change that since inference compute becomes decoupled from training compute.

I reread the section you linked. I agree that the tasks that models do today have a very small absolute cost such that, if they were catastrophically risky, it wouldn't really matter how much inference compute they used. However, models are far enough from that point that I think you are better off focusing on the frontier of currently-economically-useful-tasks. In those cases, assuming you are using a good scaffold, my sense is that the absolute costs do in fact matter.

Your detailed results are also screaming at you that your method is not reliable

I seems to me that they are screaming that we can't be confident in the particular number output by these methods. And I'm not. I tried to be clear in this post that what I would consider the results from this method (16×–60× per year) are not my all-things-considered view (20×, with an 80% CI from 2×–200×).

Speaking colloquially, I might say "these results indicate to me that catch-up algorithmic progress is on the order of one or 1.5 orders of magnitude per year rather than half an order of magnitude per year as I used to think". And again, my previous belief of 3× per year was a belief that I should have known was incorrect because it's based only on pre-training.

The primary evidence that the method is unreliable is not that the dataset is too small, it's that the results span such a wide interval, and it seems very sensitive to choices that shouldn't matter much.

This was helpful clarification, thanks. In the present analysis, the results span a wide interval, but the lower end of that interval is still generally higher than my prior!

As I said in footnote 9, I am willing to make bets about my all-things-considered beliefs. You think I'm updating too much based on unreliable methods? Okay come take my money.

Speaking colloquially, I might say "these results indicate to me that catch-up algorithmic progress is on the order of one or 1.5 orders of magnitude per year rather than half an order of magnitude per year as I used to think". And again, my previous belief of 3× per year was a belief that I should have known was incorrect because it's based only on pre-training.

Okay fair enough, I agree with that.

As I said in footnote 9, I am willing to make bets about my all-things-considered beliefs. You think I'm updating too much based on unreliable methods? Okay come take my money.

I find this attitude weird. It takes a lot of time to actually make and settle a bet. (E.g. I don't pay attention to Artificial Analysis and would want to know something about how they compute their numbers.) I value my time quite highly; I think one of us would have to be betting seven figures, maybe six figures if the disagreement was big enough, before it looked good even in expectation (ie no risk aversion) as a way for me to turn time into money.

I think it's more reasonable as a matter of group rationality to ask that an interlocutor say what they believe, so in that spirit here's my version of your prediction, where I'll take your data at face value without checking:

[DeepSeek-V3.2-Exp is estimated by Epoch to be trained with 3.8e24 FLOP. It reached an AAII index score of 65.9 and was released on September 29, 2025. It is on the compute-efficiency frontier.] I predict that by September 29, 2026, the least-compute-used-to-train model that reaches a score of 65 will be trained with around 3e23 FLOP, with the 80% CI covering 6e22–1e24 FLOP.

Note that I'm implicitly doing a bunch of deference to you here (e.g. that this is a reasonable model to choose, that AAII will behave reasonably regularly and predictably over the next year), though tbc I'm also using other not-in-post heuristics (e.g. expecting that DeepSeek models will be more compute efficient than most). So, I wouldn't exactly consider this equivalent to a bet, but I do think it's something where people can and should use it to judge track records.

I think it's more reasonable as a matter of group rationality to ask that an interlocutor say what they believe

Super fair. I probably should have just asked what you anticipate observing that might differ from my expectation. I appreciate you writing your own version of the prediction, that's basically what I wanted. And it sounds like I don't even have enough money to make a bet you would consider worth your time!

As to our actual predictions, they seem quite similar to me, which is clarifying. I was under the impression you expected slower catch-up progress. A main prediction of 3e23 FLOP implies 1/(3e23/3.8e24) = 12.7× reduction in FLOP over a year, which I also consider quite likely!

Thanks for your engagement!

I was under the impression you expected slower catch-up progress.

Note that I think the target we're making quantitative forecasts about will tend to overestimate that-which-I-consider-to-be "catch-up algorithmic progress" so I do expect slower catch-up progress than the naive inference from my forecast (ofc maybe you already factored that in).

Some observations (not particularly constructive):

Training compute is relevant for the most compute-hungry models, because there it can be a taut constraint (and even then only when there would be inference hardware to serve the model once it's trained, which isn't always the case). For smaller and catch-up models, other things become more relevant as constraints, and even the relevant compute needed to make them well is no longer training compute, but research compute, or the compute that went into earlier or larger models that contributed to making the smaller model possible.
Inference cost/speed doesn't have this issue for the smaller models, it remains both relevant and legible. Research compute and compute for implicit predecessor dependencies needed to develop smaller models are still relevant for the reproduce-much-cheaper hypothetical, but it's far less legible.
Measuring parity based on benchmarks is suspect (even as pragmatically it's hard to use anything else), a big confused pre-RLVR model and a small competition-minded post-RLVR model will be doing things differently.

I am also unconvinced that ECI is a better metric to use than AAII. One issue with ECI scores is that they are often calculated using just 2 benchmark scores for a particular model

We use a minimum of 4 benchmark scores, not sure where the 2 is coming from?

Thanks for pointing this out and for our discussion elsewhere. This was an error in the post and I have updated the text. The 2 came from me just looking at the "Epoch AI internal runs" table but not also the "External runs" table.

For some models (especially older ones), the Artificial Analysis Intelligence Index score is labeled as “Estimate (independent evaluation forthcoming)”. It is unclear how these scores are determined, and they may not be a reliable estimate. The Artificial Analysis API does not clearly label such estimates and I did not manually remove them for secondary analysis. Ideally the capability levels that have these models (probably typically lower levels) would be weighted less, but I don’t do this due to uncertainty about which models have Estimates vs. independently tested scores.

IMO this is a potentially significant issue that this post should have spent more time addressing, since it means that the earlier sections of the trend lines are coming from a source we know nothing about.

I agree it's potentially a significant issue. One reason I'm relatively less concerned with it is that the AAII scores for these models seem generally pretty reasonable. Another reason is that the results look pretty similar if we only look at more recent models (which by and large have AAII-run benchmarks). E.g., starting July 2024 yields median 1.22 OOMs and weighted 1.85 OOMs.

There are many places for additional and follow-up work and this is one of them, but I don't think it invalidates the overall results.

Thank you for this excellent analysis! However, it also makes me wonder whether mankind is close to exhausting the algorithmic insights usable in CoT-based models (think of my post with a less credible analysis written in October 2025) and/or mankind has already found a really cheap way to distill models into smaller ones (think of my most recent quick take and ARC-AGI-1 performance of Gemini 3 Flash, GPT-5-mini, GPT-5.2 and Grok 4 Fast Reasoning along with the cluster of o3, o4-mini, GPT-5, GPT-5.1 and the three Claudes 4.5).

The cheap way to distill models into smaller ones would mean that the implications for governance are not so dire. For example, Kokotajlo predicted in May that the creation of GPT-5 would require a dose of elicitation techniques applied to GPT-4.5, meaning that GPT-5's creation was impossible without having spent ~2E26 compute on making GPT-4.5 beforehand. Similarly, unlike Qwen 3 Next 80B A3B, GPT-oss-20b could have been distilled from another model. Alas, it doesn't tell us anything about DeepSeek v. 3.2 and the potential to create a cheaper analogue...

Exhausting the insights would mean that the prediction related to frontier models continuing the trend is falsified unless mankind dares to do something beyond the CoT, like making the models neuralese. For example, Claude 3.7 Sonnet displays different results (50 points for reasoning model, 41 pt for non-reasoning model; why wasn't it placed into the AA>= 50 list? It could also make the slope less steep) depending on whether it uses reasoning or not. But the shift to reasoning models is a known technique which increases the AA index and was already used for models like DeepSeek, meaning that anyone who tries to cheapen the creation of models with AA>=65 will have to discover a new technique.

why wasn't it placed into the AA>= 50 list?

It's in this appendix section as a lower confidence compute estimate and is in the >=45 AAII score bucket. Looking at the data, the reason it is not in the >=50 bucket is that it's AAII score, pulled from the Artificial Analysis API, is 49.9. I see that they round to 50 on the main webpage. I just used the raw scores from the API without any rounding. Thanks for the check!

it also makes me wonder whether mankind is close to exhausting the algorithmic insights usable in CoT-based models (think of my post with a less credible analysis written in October 2025) and/or mankind has already found a really cheap way to distill models into smaller ones

To be clear about my position, I don't think the analysis I presented here points at all toward humanity exhausting algorithmic insights. Separate lines of reasoning might lead somebody to that conclusion, but this analysis either has little bearing on the hypothesis or points toward us not running out of insights (on account of the rate of downstream progress being so rapid).

I find this reasoning unconvincing because their appendix analysis (like that in this blog post) is based on more AI models than their primary analysis!

Your results are primarily driven by the inclusion of Llama 3.1-405B and Grok 3

There's a related point, which is maybe what you're getting at, which is that these results suffer from the exclusion of proprietary models for which we don't have good compute estimates.

Another reason to think these models are not the main driver of the results is that there are high slopes in capability buckets that don't include these models, such as 30, 35, 40 (log10 slopes of 1.22, 1.41, 1.22).

This corresponds to 16-26x drop in cost per year? Those estimates seem reasonable (maybe slightly high) given you're measuring drop in cost to achieve benchmark scores.

I do think that this is an overestimate of catch-up algorithmic progress for a variety of reasons:

Later models are more likely to be benchmaxxed
(Probably not a big factor, but who knows) Benchmarks get more contaminated over time
Later models are more likely to have reasoning training

I do think 3x is too low as an estimate for catch-up algorithmic progress, inasmuch as your main claim is "it's a lot bigger than 3x" I'm on board with that.

This corresponds to 16-26x drop in cost per year?

Yep.

I do think that this is an overestimate of catch-up algorithmic progress for a variety of reasons:
Later models are more likely to be benchmaxxed
(Probably not a big factor, but who knows) Benchmarks get more contaminated over time

These are important limitations, thanks for bringing them up!

Later models are more likely to have reasoning training

Can you say more about why this is a limitation / issue? Is this different from a 2008-2015 analysis saying "later models are more likely to use the transformer architecture," where my response is "that's algorithmic progress for ya". One reason it may be different is that inference-time compute might be trading off against training compute in a way that we think make the comparison improper between low and high inference-compute models.

Your detailed results are also screaming at you that your method is not reliable

The primary evidence that the method is unreliable is not that the dataset is too small, it's that the results span such a wide interval, and it seems very sensitive to choices that shouldn't matter much.

This was helpful clarification, thanks. In the present analysis, the results span a wide interval, but the lower end of that interval is still generally higher than my prior!

As I said in footnote 9, I am willing to make bets about my all-things-considered beliefs. You think I'm updating too much based on unreliable methods? Okay come take my money.

Speaking colloquially, I might say "these results indicate to me that catch-up algorithmic progress is on the order of one or 1.5 orders of magnitude per year rather than half an order of magnitude per year as I used to think". And again, my previous belief of 3× per year was a belief that I should have known was incorrect because it's based only on pre-training.

Okay fair enough, I agree with that.

As I said in footnote 9, I am willing to make bets about my all-things-considered beliefs. You think I'm updating too much based on unreliable methods? Okay come take my money.

[DeepSeek-V3.2-Exp is estimated by Epoch to be trained with 3.8e24 FLOP. It reached an AAII index score of 65.9 and was released on September 29, 2025. It is on the compute-efficiency frontier.] I predict that by September 29, 2026, the least-compute-used-to-train model that reaches a score of 65 will be trained with around 3e23 FLOP, with the 80% CI covering 6e22–1e24 FLOP.

I think it's more reasonable as a matter of group rationality to ask that an interlocutor say what they believe

Thanks for your engagement!

I was under the impression you expected slower catch-up progress.

Some observations (not particularly constructive):

Training compute is relevant for the most compute-hungry models, because there it can be a taut constraint (and even then only when there would be inference hardware to serve the model once it's trained, which isn't always the case). For smaller and catch-up models, other things become more relevant as constraints, and even the relevant compute needed to make them well is no longer training compute, but research compute, or the compute that went into earlier or larger models that contributed to making the smaller model possible.
Inference cost/speed doesn't have this issue for the smaller models, it remains both relevant and legible. Research compute and compute for implicit predecessor dependencies needed to develop smaller models are still relevant for the reproduce-much-cheaper hypothetical, but it's far less legible.
Measuring parity based on benchmarks is suspect (even as pragmatically it's hard to use anything else), a big confused pre-RLVR model and a small competition-minded post-RLVR model will be doing things differently.

I am also unconvinced that ECI is a better metric to use than AAII. One issue with ECI scores is that they are often calculated using just 2 benchmark scores for a particular model

We use a minimum of 4 benchmark scores, not sure where the 2 is coming from?

For some models (especially older ones), the Artificial Analysis Intelligence Index score is labeled as “Estimate (independent evaluation forthcoming)”. It is unclear how these scores are determined, and they may not be a reliable estimate. The Artificial Analysis API does not clearly label such estimates and I did not manually remove them for secondary analysis. Ideally the capability levels that have these models (probably typically lower levels) would be weighted less, but I don’t do this due to uncertainty about which models have Estimates vs. independently tested scores.

There are many places for additional and follow-up work and this is one of them, but I don't think it invalidates the overall results.

why wasn't it placed into the AA>= 50 list?

it also makes me wonder whether mankind is close to exhausting the algorithmic insights usable in CoT-based models (think of my post with a less credible analysis written in October 2025) and/or mankind has already found a really cheap way to distill models into smaller ones

LESSWRONG
LW

LESSWRONG
LW

92

Catch-Up Algorithmic Progress Might Actually be 60× per Year

92

Summary

What do I mean by ‘algorithmic progress’?

Methods and Results

Sanity check: Qwen2.5-72B vs. Qwen3-30B-A3B

Discussion

How does this compare to the recent analysis in A Rosetta Stone for AI Benchmarks?

How does this compare to other previous estimates of algorithmic progress

How should we update on this analysis?

Appendices

Appendix: Filtering by different confidence levels of compute estimates

All models

Confident compute estimates

Appendix: How fast is the cost of AI inference falling?

Appendix: Histogram of 1 point buckets

Appendix: Qwen2.5 and Qwen3 benchmark performance

Appendix Leave-One-Out analysis

Appendix: Limitations

Outlier models

Lack of early, weak models

Post-training compute excluded

Inference-time compute excluded

Some AAII scores are estimates

Comparing old and new models on the same benchmark

92

92