I was under the impression you expected slower catch-up progress.
Note that I think the target we're making quantitative forecasts about will tend to overestimate that-which-I-consider-to-be "catch-up algorithmic progress", so I do expect slower catch-up progress than the naive inference from my forecast (ofc maybe you already factored that in).
Thanks, I hadn't looked at the leave-one-out results carefully enough. I agree this (and your follow-up analysis rerun) means my claim is incorrect. Looking more closely at the graphs, in the case of Llama 3.1, I should have noticed that EXAONE 4.0 (1.2B) was also a pretty key data point for that line. No idea what's going on with that model.
(That said, I do think going from 1.76 to 1.64 after dropping just two data points is a pretty significant change; also, I assume this is really just attributable to Grok 3, so it's really more like one data point. Of course the median won't change, and I do prefer the median estimate because it is more robust to these outliers.)
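A toy illustration of the mean-vs-median point (the numbers here are made up, not the actual per-model estimates):

```python
import statistics

# Hypothetical per-model efficiency multipliers: a few moderate values plus one Grok-like outlier.
values = [3, 4, 5, 6, 2000]
print(statistics.mean(values))    # ~403.6 -- the mean is dragged way up by the single outlier
print(statistics.median(values))  # 5 -- the median barely cares whether the outlier is there or not
```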
There's a related point, which is maybe what you're getting at, which is that these results suffer from the exclusion of proprietary models for which we don't have good compute estimates.
I agree this is a weakness but I don't care about it too much (except inasmuch as it causes us to estimate algorithmic progress by starting with models like Grok). I'd usually expect it to cause estimates to be biased downwards (that is, the true number is higher than estimated).
Another reason to think these models are not the main driver of the results is that there are high slopes in capability buckets that don't include these models, such as 30, 35, 40 (log10 slopes of 1.22, 1.41, 1.22).
This corresponds to a 16-26x drop in cost per year? Those estimates seem reasonable (maybe slightly high) given you're measuring the drop in cost to achieve benchmark scores.
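To spell out the arithmetic behind that conversion (a quick sketch, assuming those slopes are log10 reductions in cost per year):

```python
# Convert log10(cost)-per-year slopes into multiplicative cost drops per year.
slopes = [1.22, 1.41, 1.22]  # the capability-bucket slopes mentioned above
for s in slopes:
    print(f"log10 slope {s} -> {10**s:.1f}x cheaper per year")
# ~16.6x, ~25.7x, ~16.6x, i.e. the 16-26x range above
```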
I do think that this is an overestimate of catch-up algorithmic progress for a variety of reasons:
None of these apply to the pretraining-based analysis, though of course it is biased in the other direction (if you care about catch-up algorithmic progress) by not taking into account distillation or post-training.
I do think 3x is too low as an estimate for catch-up algorithmic progress; inasmuch as your main claim is "it's a lot bigger than 3x", I'm on board with that.
Speaking colloquially, I might say "these results indicate to me that catch-up algorithmic progress is on the order of 1 to 1.5 orders of magnitude per year, rather than half an order of magnitude per year as I used to think". And again, my previous belief of 3x per year was a belief that I should have known was incorrect because it's based only on pre-training.
Okay fair enough, I agree with that.
As I said in footnote 9, I am willing to make bets about my all-things-considered beliefs. You think I'm updating too much based on unreliable methods? Okay come take my money.
I find this attitude weird. It takes a lot of time to actually make and settle a bet. (E.g. I don't pay attention to Artificial Analysis and would want to know something about how they compute their numbers.) I value my time quite highly; I think one of us would have to be betting seven figures, maybe six figures if the disagreement was big enough, before it looked good even in expectation (ie no risk aversion) as a way for me to turn time into money.
I think it's more reasonable as a matter of group rationality to ask that an interlocutor say what they believe, so in that spirit here's my version of your prediction, where I'll take your data at face value without checking:
[DeepSeek-V3.2-Exp is estimated by Epoch to be trained with 3.8e24 FLOP. It reached an AAII index score of 65.9 and was released on September 29, 2025. It is on the compute-efficiency frontier.] I predict that by September 29, 2026, the least-compute-used-to-train model that reaches a score of 65 will be trained with around 3e23 FLOP, with the 80% CI covering 6e22–1e24 FLOP.
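To make the rate that prediction implies explicit (a small sketch, taking the FLOP figures above at face value):

```python
import math

baseline = 3.8e24             # DeepSeek-V3.2-Exp training compute (Epoch estimate, taken at face value)
point = 3e23                  # my point estimate for the least-compute model reaching 65 a year later
ci_low, ci_high = 6e22, 1e24  # 80% CI on that training compute

for label, flop in [("point estimate", point), ("CI low FLOP", ci_low), ("CI high FLOP", ci_high)]:
    factor = baseline / flop
    print(f"{label}: {factor:.1f}x less compute ~= {math.log10(factor):.2f} OOM in one year")
# point: ~12.7x (~1.1 OOM); 6e22 would be ~63x (~1.8 OOM); 1e24 would be ~3.8x (~0.6 OOM)
```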
Note that I'm implicitly doing a bunch of deference to you here (e.g. that this is a reasonable model to choose, that AAII will behave reasonably regularly and predictably over the next year), though tbc I'm also using other not-in-post heuristics (e.g. expecting that DeepSeek models will be more compute efficient than most). So, I wouldn't exactly consider this equivalent to a bet, but I do think it's something where people can and should use it to judge track records.
Your results are primarily driven by the inclusion of Llama 3.1-405B and Grok 3 (and Grok 4 when you include it). Those models are widely believed to be cases where a ton of compute was poured in to make up for poor algorithmic efficiency. If you remove those I expect your methodology would produce similar results as prior work (which is usually trying to estimate progress at the frontier of algorithmic efficiency, rather than efficiency progress at the frontier of capabilities).
I could imagine a reply that says "well, it's a real fact that when you start with a model like Grok 3, the next models to reach a similar capability level will be much more efficient". And this is true! But if you care about that fact, I think you should instead have two stylized facts, one about what happens when you are catching up to Grok or Llama, and one about what happens when you are catching up to GPT, Claude, or Gemini, rather than trying to combine these into a single estimate that doesn't describe either case.
Your detailed results are also screaming at you that your method is not reliable. It is really not a good sign when your analysis that by construction has to give numbers in [1, ∞) produces results that on the low end include 1.154, 2.112, 3.201 and on the high end include 19,399.837 and even (if you include Grok 4) 2.13E+09 and 2.65E+16 (!!).
I find this reasoning unconvincing because their appendix analysis (like that in this blog post) is based on more AI models than their primary analysis!
The primary evidence that the method is unreliable is not that the dataset is too small, it's that the results span such a wide interval, and it seems very sensitive to choices that shouldn't matter much.
Cool result!
these results demonstrate a case where LLMs can do (very basic) meta-cognition without CoT
Why do you believe this is meta-cognition? (Or maybe the question is, what do you mean by meta-cognition?)
It seems like it could easily be something else. For example, probably when solving problems the model looks at the past strategies it has used and tries some other strategy to increase the likelihood of solving the problem. It does this primarily in the token space (looking at past reasoning and trying new stuff) but this also generalizes somewhat to the activation space (looking at what past forward passes did and trying something else). So when you have filler tokens the latter effect still happens, giving a slight best-of-N type boost, producing your observed results.
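To give a feel for the size of effect this story would produce, here's a toy best-of-N calculation; both the base success rate and the effective N are made-up numbers, and the independence assumption is generous, so treat it purely as an illustration of the shape of the effect:

```python
# Toy model: treat the extra variation from filler-token forward passes as giving the model a
# slightly larger "effective number of attempts" N, each succeeding independently with
# probability p. Best-of-N success is then 1 - (1 - p)**N. All numbers are illustrative.
p = 0.40
for n_effective in [1.0, 1.2, 1.5, 2.0]:
    success = 1 - (1 - p) ** n_effective
    print(f"effective N = {n_effective}: success ~= {success:.3f}")
# Going from N=1.0 to N=1.2 already adds a few points of accuracy -- the kind of modest,
# CoT-free boost the explanation above would predict.
```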
Filler tokens don't allow for serially deeper cognition than what architectural limits allow
This depends on your definition of serial cognition, under the definitions I like most the serial depth scales logarithmically with the number of tokens. This is because as you increase parallelism (in the sense you use above), that also increases serial depth logarithmically.
The basic intuitions for this are:
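One way to see where the log comes from (a minimal sketch of the general point, not a claim about the exact definitions in the post): an attention layer aggregates over all n positions, and an n-way aggregation is itself a depth-~log2(n) computation if you build it out of pairwise combinations, so each layer's effective serial depth grows like log(n) as you add tokens.

```python
import math

def tree_reduce_depth(n: int) -> int:
    """Serial rounds needed to combine n partial results when each round halves their number."""
    depth = 0
    while n > 1:
        n = math.ceil(n / 2)  # one fully-parallel round of pairwise combinations
        depth += 1
    return depth

for n in [2, 8, 64, 1024]:
    print(f"{n} tokens -> {tree_reduce_depth(n)} serial combination rounds")
# 2 -> 1, 8 -> 3, 64 -> 6, 1024 -> 10
```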
Ah, I realized there was something else I should have highlighted. You mention you care about pre-ChatGPT takes towards shorter timelines -- while compute-centric takeoff was published two months after ChatGPT, I expect that the basic argument structure and conclusions were present well before the release of ChatGPT.
While I didn't observe that report in particular, in general Open Phil worldview investigations took > 1 year of serial time and involved a pretty significant and time-consuming "last mile" step where they get a bunch of expert review before publication. (You probably observed this "last mile" step with Joe Carlsmith's report, iirc Nate was one of the expert reviewers for that report.) Also, Tom Davidson's previous publications were in March 2021 and June 2021, so I expect he was working on the topic for some of 2021 and ~all of 2022.
I suppose a sufficiently cynical observer might say "ah, clearly Open Phil was averse to publishing this report that suggests short timelines and intelligence explosions until after the ChatGPT moment". I don't buy it, based on my observations of the worldview investigations team (I realize that it might not have been up to the worldview investigations team, but I still don't buy it).
I guess one legible argument I could make to the cynic would be that on the cynical viewpoint, it should have taken Open Phil a lot longer to realize they should publish the compute-centric takeoff post. Does the cynic really think that, in just two months, a big broken org would be able to:
That's just so incredibly fast for big broken orgs to move.
I think I agree with all of that under the definitions you're using (and I too prefer the bounded rationality version). I think in practice I was using words somewhat differently than you.
(The rest of this comment is at the object level and is mostly for other readers, not for you)
Saying it's "crazy" means it's low probability of being (part of) the right world-description.
The "right" world-description is a very high bar (all models are wrong but some are useful), but if I go with the spirit of what you're saying, I think I might not endorse calling bio anchors "crazy" by this definition; I'd say it's more like a "medium" probability of being a generally good framework for thinking about the domain, plus an expectation that lots of the specific details would change with more investigation.
Honestly I didn't have any really precise meaning by "crazy" in my original comment, I was mainly using it as a shorthand to gesture at the fact that the claim is in tension with reductionist intuitions, and also that the legibly written support for the claim is weak in an absolute sense.
Saying it's "the best we have" means it's the clearest model we have--the most fleshed-out hypothesis.
I meant a higher bar than this; more like "the most informative and relevant thing for informing your views on the topic" (beyond extremely basic stuff like observing that humanity can do science at all, or things like reference class priors). Like, I also claim it is better than "query your intuitions about how close we are to AGI, and how fast we are going, to come up with a time until we get to AGI". So it's not just the clearest / most fleshed-out, it's also the one that should move you the most, even including various illegible or intuition-driven arguments. (Obviously scoped only to the arguments I know about; for all I know other people have better arguments that I haven't seen.)
If it were merely the clearest model or most fleshed-out hypothesis, I agree it would usually be a mistake to make a large belief update or take big consequential actions on that basis.
I also want to qualify / explain my statement about it being a crazy argument. The specific part that worries me (and Eliezer, iiuc) is the claim that, at a given point in time, the delta between natural and artificial artifacts will tend to be approximately constant across different domains. This is quite intuition-bending from a mechanistic / reductionist viewpoint, and the current support for it seems very small and fragile (this 8-page doc). However, I can see a path where I would believe in it much more, which would involve things like:
I anticipate someone asking the followup "why didn't Open Phil do that, then?" I don't know what Open Phil was thinking, but I don't think I'd have made a very different decision. It's a lot of work, not many people can do it, and many of those people had better things to do, e.g. imo the compute-centric takeoff work was indeed more important and caused bigger updates than I think the work above would have done (and was probably easier to do).
Yeah it's just the reason you give, though I'd frame it slightly differently. I'd say that the point of "catch-up algorithmic progress" was to look at costs paid to get a certain level of benefit, and while historically "training compute" was a good proxy for cost, reasoning models change that since inference compute becomes decoupled from training compute.
I reread the section you linked. I agree that the tasks that models do today have a very small absolute cost such that, if they were catastrophically risky, it wouldn't really matter how much inference compute they used. However, models are far enough from that point that I think you are better off focusing on the frontier of currently-economically-useful-tasks. In those cases, assuming you are using a good scaffold, my sense is that the absolute costs do in fact matter.