I agree it's potentially a significant issue. One reason I'm relatively less concerned about it is that the Artificial Analysis Intelligence Index (AAII) scores for these models seem generally pretty reasonable. Another reason is that the results look pretty similar if we only look at more recent models (which by and large have benchmarks run by Artificial Analysis itself). E.g., starting in July 2024 yields a median of 1.22 OOMs and a weighted mean of 1.85 OOMs.
There are many places for additional and follow-up work and this is one of them, but I don't think it invalidates the overall results.
Thanks for pointing this out and for our discussion elsewhere. This was an error in the post and I have updated the text. The 2 came from me just looking at the "Epoch AI internal runs" table but not also the "External runs" table.
I think it's more reasonable as a matter of group rationality to ask that an interlocutor say what they believe
Super fair. I probably should have just asked what you anticipate observing that might differ from my expectation. I appreciate you writing your own version of the prediction; that's basically what I wanted. And it sounds like I don't even have enough money to make a bet you would consider worth your time!
As to our actual predictions, they seem quite similar to me, which is clarifying. I was under the impression you expected slower catch-up progress. A main prediction of 3e23 FLOP implies a 3.8e24 / 3e23 ≈ 12.7× reduction in FLOP over a year, which I also consider quite likely!
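For concreteness, a minimal sketch of that arithmetic (both FLOP figures are just the ones quoted above):

```python
# Sketch of the FLOP-reduction arithmetic above; figures are taken from the discussion
# (3.8e24 FLOP for the current frontier model, a predicted 3e23 FLOP to match its
# capability a year later).
flop_current = 3.8e24
flop_predicted_next_year = 3e23
reduction = flop_current / flop_predicted_next_year
print(f"{reduction:.1f}x reduction in training FLOP over a year")  # ~12.7x
```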
Thanks for your engagement!
This corresponds to 16-26x drop in cost per year?
Yep.
I do think that this is an overestimate of catch-up algorithmic progress for a variety of reasons:
- Later models are more likely to be benchmaxxed
- (Probably not a big factor, but who knows) Benchmarks get more contaminated over time
These are important limitations, thanks for bringing them up!
- Later models are more likely to have reasoning training
Can you say more about why this is a limitation/issue? Is this different from an analysis of an earlier period saying "later models are more likely to use the transformer architecture," where my response is "that's algorithmic progress for ya"? One reason it may be different is that inference-time compute might be trading off against training compute in a way that makes the comparison improper between low and high inference-compute models.
Your detailed results are also screaming at you that your method is not reliable
It seems to me that they are screaming that we can't be confident in the particular number output by these methods. And I'm not. I tried to be clear in this post that what I consider the results from this method (16×–60× per year) are not my all-things-considered view (20× per year, with an 80% CI from 2×–200×).
Speaking colloquially, I might say "these results indicate to me that catch-up algorithmic progress is on the order of 1 to 1.5 orders of magnitude per year, rather than half an order of magnitude per year as I used to think". And again, my previous belief of 3× per year was a belief that I should have known was incorrect because it was based only on pre-training.
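For readers switching between the "× per year" and "orders of magnitude per year" framings, the conversion is just a power of ten; a minimal sketch using the figures mentioned above:

```python
# Convert a rate of compute-efficiency improvement from orders of magnitude (OOMs)
# per year into a multiplicative factor per year. Example values are the ones discussed
# above (0.5 OOM is roughly the old ~3x/year prior; 1-1.5 OOM is what these results suggest).
def ooms_to_factor(ooms_per_year: float) -> float:
    return 10 ** ooms_per_year

for ooms in (0.5, 1.0, 1.5):
    print(f"{ooms} OOM/year -> {ooms_to_factor(ooms):.1f}x per year")
# 0.5 OOM/year -> 3.2x per year
# 1.0 OOM/year -> 10.0x per year
# 1.5 OOM/year -> 31.6x per year
```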
The primary evidence that the method is unreliable is not that the dataset is too small, it's that the results span such a wide interval, and it seems very sensitive to choices that shouldn't matter much.
This was helpful clarification, thanks. In the present analysis, the results span a wide interval, but the lower end of that interval is still generally higher than my prior!
As I said in footnote 9, I am willing to make bets about my all-things-considered beliefs. You think I'm updating too much based on unreliable methods? Okay come take my money.
Your results are primarily driven by the inclusion of Llama 3.1-405B and Grok 3
I'm fairly sure this is not the case. In this appendix, when I systematically drop one frontier model at a time and recalculate the slope for each bucket, Llama 3.1-405B isn't even the most influential model for the >=25 bucket (the only bucket it's on the frontier for)! And looking at the graph, that's not surprising; it looks right on trend. Grok 3 also looks surprisingly on trend, and looking at that leave-one-out analysis, it is pretty influential, but even without it, the slope for that capability bucket is -3.5 orders of magnitude per year. Another reason to think these models are not the main driver of the results is that there are high slopes in capability buckets that don't include these models, such as >=30, >=35, and >=40 (log10 slopes of 1.22, 1.41, and 1.22, respectively).
For thoroughness, I also just reran the analysis with these data points totally excluded, and the results are basically the same: for confident and likely compute estimates (the main result in the post) we get a weighted log10 mean of 1.64 (44×) and a median of 1.21 (16×). I consider these to be quite in line with the main results (1.76, 1.21).
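In case it's useful, here is a toy sketch of what that leave-one-out check looks like mechanically, under the assumption that each bucket's trend is a linear fit of log10(training FLOP) of frontier models against release date. The data points below are made up for illustration, and the exact fitting details in the appendix may differ.

```python
import numpy as np

# Hypothetical frontier models for one capability bucket: (release date as a year
# fraction, log10 of training FLOP). These values are invented for illustration only.
frontier = [
    (2023.5, 25.6),
    (2024.0, 25.2),
    (2024.4, 24.9),
    (2024.9, 24.4),
    (2025.3, 24.1),
]

def slope_ooms_per_year(points):
    """Linear-fit slope of log10(FLOP) vs. year; negative means required compute is falling."""
    years, log_flop = zip(*points)
    return np.polyfit(years, log_flop, 1)[0]

full_slope = slope_ooms_per_year(frontier)
for i in range(len(frontier)):
    loo_slope = slope_ooms_per_year(frontier[:i] + frontier[i + 1:])
    print(f"dropping model {i}: slope {loo_slope:+.2f} OOM/yr (full fit: {full_slope:+.2f})")
```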
There's a related point, which is maybe what you're getting at, which is that these results suffer from the exclusion of proprietary models for which we don't have good compute estimates. For example, o1 would have been the first model in Grok 3's performance tier and plausibly used less compute; if we had a better compute estimate for it and it was lower than Grok 3's, Grok 3 wouldn't have made the frontier, and by definition the slope for that capability bucket would be less steep. I thought about trying to make my own compute estimates for such models but decided not to for the sake of project scope.
why wasn't it placed into the AA>= 50 list?
It's in this appendix section as a lower-confidence compute estimate and is in the >=45 AAII score bucket. Looking at the data, the reason it is not in the >=50 bucket is that its AAII score, pulled from the Artificial Analysis API, is 49.9. I see that they round to 50 on the main webpage. I just used the raw scores from the API without any rounding. Thanks for the check!
it also makes me wonder whether mankind is close to exhausting the algorithmic insights usable in CoT-based models (think of my post with a less credible analysis written in October 2025) and/or mankind has already found a really cheap way to distill models into smaller ones
To be clear about my position, I don't think the analysis I presented here points at all toward humanity exhausting algorithmic insights. Separate lines of reasoning might lead somebody to that conclusion, but this analysis either has little bearing on the hypothesis or points toward us not running out of insights (on account of the rate of downstream progress being so rapid).
Thanks for the suggestion. We considered this but decided against it for various reasons (though we did cut down the app length from our first draft). I agree that it's frustrating that application time costs are high. One consideration is that we often find ourselves relying on free-response questions for app review, even in an initial screen, and without at least some of those it would be considerably harder to do initial screening.
I don't think it helps support the idea that it's data and not algorithms
Agreed. Gundlach et al. are able to find and categorize specific algorithmic advances (non-data) that they claim explain 6,930× of gains, out of a total gain of 22,000× estimated ("naively extrapolating") by Ho et al. That is, they explain all but another factor of ~3 with algorithms. Quoting from the paper:
Though our experiments do not claim to be exhaustive, we compare our findings with estimates from the literature. Namely, between 2012 to 2023, Ho et al. [2024] found a doubling time of 8 months, or 2.83× per year, for a total efficiency gain of 22,000×. In contrast, the growth rate of our CEG multiplier is approximately 2.23× annually, for a total of 6,930×, of which 2,700× (89%) is due to scale-dependent changes. This leaves a gap of 3.18× from our estimates, which could be from data selection, tokenizer advancements, or a long tail of innovations not captured in our analysis.
First off, making up for all but 3× is very good (frankly, I think too good, and it should be taken with a grain of salt). Second, naively reading this would imply data has contributed at most a factor of 3 over 11 years.
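As a quick sanity check on that gap (the figures are the ones in the quote; the small discrepancy from 3.18× is presumably rounding in their reported numbers):

```python
# Ratio of Ho et al.'s total estimated efficiency gain to the gains Gundlach et al.
# attribute to identified algorithmic changes; both figures are from the quote above.
total_gain = 22_000
explained_by_algorithms = 6_930
print(total_gain / explained_by_algorithms)  # ~3.17, consistent with the ~3.18x gap they report
```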
But I think the experiments in this paper use validation loss on a pretraining dataset, whereas performance on downstream tasks seems especially likely to be affected by better data (i.e., the 22,000× they are trying to account for might not even be influenced much by better data, as it too is based on loss).
(This comment is not meant to take a stand on the overall question of how much data vs. non-data algorithmic innovation has contributed, just the bearing of Gundlach et al. on this question.)
(Adding my takes in case they are useful for MATS fellows deciding what to do.) I have seen many MATS projects via attending the MATS symposia, but am relying on my memory of them. I would probably consider each of Ryan's posts to be around a 60th-70th percentile MATS project. But I expect that a strong MATS scholar could do 2-5 mini projects like this over the course of MATS.