GPT-4.5 is going to be quickly deprecated
It's still a data point saying that OpenAI chose to do a large training run, though, right? Even if they're currently not planning to make sustained use of the resulting model in deployment. (Also, my shaky understanding is that expectations are for a GPT-5 to be released in the coming months and that it may be a distilled + post-trained derivative of GPT-4.5, meaning GPT-5 would be downstream of a large-compute-budget training process?)
This succinct summary is highly useful, thanks!
Just to quibble a little: I don't think it's wise to estimate scaffolding improvements for general capabilities as near zero. By scaffolding, I mean prompt engineering and structuring systems of prompts to create systems in which calls to LLMs are component operations, roughly in line with the original definition. I've been surprised that scaffolding didn't make more of a difference faster and would agree that it's been near zero for general capabilities. I think this changed when Perplexity partly replicated OAI's Deep Research accomplishments in both subjective usefulness and performance on Humanity's Last Exam (~21% vs ~27%), and they did it in around two weeks, indicating little model training. These capabilities are fairly general, and this seems pretty clearly to be a success primarily of scaffolding, showing that Deep Research was partly based on o3's smarts but also heavily helped by the scaffold for iterative search, consolidation, and refinement.
This technically falls into the algorithmic improvement category, but it's a reason to think the pace increases or even accelerates, since it provides a largely separate route to improvement. Or maybe this is more like a one-time boost - but it could be a large one, and in particularly important directions, like solving entirely novel problems.
DeepMind's co-scientist also seems like a genuine accomplishment in scaffolding. It used Gemini Pro 2.0, which just wasn't that good, yet seemed to do something extremely impressive: predict cutting-edge research hypotheses in one shot by reviewing the relevant literature and "thinking" about it deeply. I haven't looked deeply enough to know whether that story is exaggerated. If not, similar scaffolds running 2.5 Pro or o3+ might be ready to genuinely accelerate research in ML and elsewhere.
Ege Erdil's argument for long timelines to Dwarkesh was that algorithmic progress is driven by compute progress, so we should expect it to slow once we run out of compute to just buy. I think this is probably partly true, but the opposite trend also probably holds: more effort will go into algorithms if/when compute is a less promising route.
So compute expansion will probably slow down, and it may well produce smaller gains going forward. But there is little reason to expect algorithmic progress to slow, and it may even speed up somewhat as more effort focuses on it and we work out how scaffolding can be useful (some money that can't buy compute may instead go toward work on algorithms of various types).
I'm not sure I particularly disagree. My exact claim is just:
my sense is that scaffolding has yielded extremely minimal gains on general purpose autonomous software engineering to date
So, I just think no one has really outperformed baselines for very general domains. In more specific domains people have outperformed baselines, e.g. for writing kernels.
I also think it's probably possible to do a bunch of scaffolding improvements on general purpose autonomous software engineering (potentially scaffolds that use a bunch more runtime compute), if by no other mechanism than by writing a bunch of more specialized scaffolds that the model sometimes chooses to apply.
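To illustrate that last mechanism, here's a minimal sketch of what "specialized scaffolds the model sometimes chooses to apply" could look like (all of the names and the scaffold library here are hypothetical placeholders, not anyone's actual system):

```python
# Minimal sketch of "more specialized scaffolds that the model sometimes chooses
# to apply". All names (call_model, the scaffold library) are hypothetical.

SCAFFOLDS = {
    "debug_failing_test": "Run the test, read the traceback, localize the fault, patch, re-run.",
    "large_refactor": "Map the call graph, plan edits file by file, apply them, run the full suite.",
    "write_kernel": "Draft a kernel, benchmark it, iterate on the hot loop.",
}

def call_model(prompt: str) -> str:
    """Placeholder for an LLM API call."""
    raise NotImplementedError

def solve(task_description: str) -> str:
    # The model itself picks which specialized scaffold (if any) fits the task.
    menu = "\n".join(f"- {name}: {desc}" for name, desc in SCAFFOLDS.items())
    choice = call_model(
        f"Task: {task_description}\nAvailable scaffolds:\n{menu}\n"
        "Reply with the best scaffold name, or 'none'."
    ).strip()
    # Then it follows that scaffold's procedure (folded into the prompt here for brevity).
    procedure = SCAFFOLDS.get(choice, "Work on the task directly, step by step.")
    return call_model(f"Task: {task_description}\nFollow this procedure:\n{procedure}")
```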
That said my guess is that it's pretty likely that scaffolding doesn't matter that much in practice in the future, at least until AIs are writing a ton of scaffolds autonomously. This is despite it being possible for scaffolding to improve performance: you can get much/all of the benefits of extensive scaffolding via giving AIs a small number of general purpose tools and doing a bunch of RL, and it looks like this is the direction people are going in for better or worse.
Ah yes. I actually missed that you'd scoped that statement to general purpose software engineering. That is indeed one of the most relevant capabilities. I was thinking of general purpose problem-solving, another of the most critical capabilities for AI to become really dangerous.
I agree that even if scaffolding could work, RL on long CoT does something similar, and that's where the effort and momentum is going.
AIs writing and testing scaffolds semi-autonomously is something I hadn't considered. There might be a pretty tight loop that could make that effective.
quantity of useful environments that AI companies have
Meaning, the number of distinct types of environments they've built (e.g. one to train on coding tasks, one on math tasks, etc.)? Or the number of instances of those environments they can run (e.g. how much coding data they can generate)?
Number of distinct tasks (with verification etc). The term "environment" is often used for this, but maybe the way I'm using the term is confusing. As in, I'd count each different SWE-bench task as a distinct environment even though they all have the same basic setup (but a different objective + test cases). That said, for an environment/task to be decently useful, it's important that it be sufficiently distinct from other tasks, so you could hit diminishing returns. Like, even though it's trivial for OpenAI to generate grade-school math problems, the marginal problem is worthless for RL.
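To make the way I'm using "environment" concrete, here's a minimal sketch (the field names and the run_tests helper are made up for illustration): each task bundles an objective with programmatic verification, and two tasks sharing the same harness still count as two environments.

```python
# Minimal sketch of a "task/environment" in the sense above: an objective plus
# programmatic verification. Field names and run_tests are made up for illustration.
from dataclasses import dataclass
from typing import Callable

def run_tests(patch: str, tests: list[str]) -> bool:
    """Hypothetical stand-in for applying a patch to a repo and running the named tests."""
    raise NotImplementedError

@dataclass
class Task:
    objective: str                 # natural-language description of what to do
    verify: Callable[[str], bool]  # returns True iff a submitted solution passes

# Two SWE-bench-style tasks share the same basic setup (repo + test harness) but
# have different objectives and test cases, so they count as two distinct environments.
task_a = Task(
    objective="Fix the off-by-one bug in pagination so test_pagination passes.",
    verify=lambda patch: run_tests(patch, ["test_pagination"]),
)
task_b = Task(
    objective="Make date parsing timezone-aware so test_timezones passes.",
    verify=lambda patch: run_tests(patch, ["test_timezones"]),
)
```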
AI progress is driven by improved algorithms and additional compute for training runs. Understanding what is going on with these trends and how they are currently driving progress is helpful for understanding the future of AI. In this post, I'll share a wide range of general takes on this topic as well as open questions. Be warned that I'm quite uncertain about a bunch of this!
This post will assume some familiarity with what is driving AI progress; specifically, it will assume you understand the following concepts: pre-training, RL, scaling laws, and effective compute.
Training compute trends
Epoch reports a trend of frontier training compute increasing by 4.5x per year. My best guess is that the future trend will be slower, maybe more like 3.5x per year (or possibly much lower) for a few reasons:
What's going on with pretraining runs much larger than GPT-4 and why might we not be seeing them?
There have been reports that pretraining has delivered unsatisfying returns or at least has been difficult to scale up. Mostly, people have reported that this is due to running out of (reasonable quality) data, but the public evidence is also consistent with this being due to the returns to further pretraining on reasonable quality data being low.[1]
Grok 3 and GPT-4.5 (the models with the largest pretraining runs we're aware of) don't seem qualitatively that much better than models trained with much less compute. That said, GPT-4.5 is actually substantially better than other OpenAI models in many respects and it probably hasn't been RL'd very effectively (due to this being inconvenient on a model of this scale). It's also worth noting that Grok 3 is trained by xAI, so it probably has somewhat worse algorithms than models trained by OpenAI, Anthropic, and GDM. Additionally, Grok 3 appears to be substantially smaller than we'd expect a compute-optimal model trained with that much compute to be; in particular, it appears much smaller than GPT-4.5 (based on inference speeds[2]), which I'd guess is roughly compute-optimal. This is probably so that Grok 3 is easier and cheaper to use for RL and inference, but it shaves some off the gain from scaled-up pretraining. It's also possible that public reports about Grok 3's compute are overestimates. Another candidate model which might have used much more pretraining compute than GPT-4 is Gemini 2.5 Pro; it currently seems unlikely this model used 10x more (pretraining) compute than GPT-4, but this isn't inconceivable.
Regardless of exactly what is going on with pretraining, it minimally seems like returns to scaling up RL currently look better than returns to scaling up pretraining. For instance, I'd guess that the improvement from o1 to o3 was mostly more (and possibly better) RL, with OpenAI reporting that o3 used 10x more RL compute than o1.[3] (o3 probably used GPT-4.1 as a base model, and this model isn't that much better than the base model used for o1, which is GPT-4o. Additionally, OpenAI seems to imply most of the improvement relative to o1 is from RL.) Overall, rumor and other sources seem to indicate that at least OpenAI is mostly focused on scaling up RL. My guess is that the RL run for o3 probably used an amount of compute similar to what was used for pretraining GPT-4 (including RL inefficiency relative to pretraining, as in, measured in H100-equivalent GPU time), so there should be room for substantial scaling with the compute AI companies currently have, probably to around 30x more RL compute than o3 used (or enough for o4.5 if the 10x compute per reasoning model generation trend continues). My understanding is that right now RL scaling is limited by the quantity of useful environments that AI companies have (a type of data constraint), but AI companies are scaling this up rapidly and we should naively expect that RL scaling catches up to the available compute pretty soon. While RL scaling might soon eat the available compute overhang for RL, I'd expect that far more algorithmic progress is still possible.
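To spell out the headroom arithmetic (a back-of-the-envelope sketch; every input is one of the rough guesses above, not a measurement):

```python
import math

# Back-of-the-envelope for the RL headroom above (all inputs are rough guesses from the text).
available_headroom = 30   # ~30x more RL compute available than o3's RL run used
per_generation = 10       # ~10x more RL compute per reasoning-model generation

generations_of_headroom = math.log(available_headroom, per_generation)
print(f"~{generations_of_headroom:.1f} generations of headroom")  # ~1.5, i.e. roughly "o4.5"
```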
While RL scaling can continue, it's unclear whether it will hit diminishing returns. My guess is that relevant downstream metrics (e.g., performance on agentic software engineering tasks) will have reasonably smooth log linear scaling (at least with well optimized RL algorithms) for at least the next 2 or 3 orders of magnitude. Of course, progress may still qualitatively slow once RL compute catches up to the available compute.
Another important aspect of training compute scaling trends is that we expect compute scaling to slow down in a few years (maybe by around 2029) due to limited investment or limited fab capacity. See here for discussion of the investment slowdown. I think we're currently at maybe ~15% of fab capacity being used for ML chips, which only permits a small number of additional doublings before fabs will mostly be producing ML chips and further production will be bottlenecked on building more fabs. That said, note that there will probably be a lag (maybe 2-3 years?) between compute scaling slowing and AI progress slowing, as I discuss here.
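As a rough sanity check on the fab capacity claim (a sketch using the ~15% guess above, not a forecast):

```python
import math

# Sanity check on "a small number of additional doublings": if ~15% of fab capacity
# is already going to ML chips, production can only double a few more times before
# fabs are mostly producing ML chips. The 15% figure is the rough guess from the text.
ml_share_of_fab_capacity = 0.15
doublings_left = math.log2(1 / ml_share_of_fab_capacity)
print(f"~{doublings_left:.1f} doublings left")  # ~2.7
```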
This yields some important open questions:
Algorithmic progress
Epoch reports that we see a 3x effective compute increase per year due to algorithmic progress in LLMs. More precisely, Epoch finds that the compute required to achieve a given level of loss at next token prediction decreased by around 3x per year.
Effective compute is a way of converting algorithmic progress into an equivalent amount of bare metal compute. It corresponds to "how much compute would get you the same level of performance without these algorithmic improvements". While conversion into a single effective compute number is missing some nuance, I still think this is a useful baseline model for understanding the situation.[4]
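Concretely, the baseline model looks something like this (a sketch using Epoch's ~3x/year loss-based estimate; see footnote 4 for the nuance it omits):

```python
# Baseline model of effective compute: algorithmic progress acts as a multiplier
# on raw training compute. The 3x/year figure is Epoch's loss-based estimate above.
def effective_compute(raw_compute: float, years_of_algorithmic_progress: float,
                      algo_multiplier_per_year: float = 3.0) -> float:
    """Compute that would be needed, with old algorithms, to match current performance."""
    return raw_compute * algo_multiplier_per_year ** years_of_algorithmic_progress

# On this simple model, a run using X FLOP today matches a run from two years ago
# that used ~9X FLOP.
print(effective_compute(raw_compute=1.0, years_of_algorithmic_progress=2))  # 9.0
```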
However, loss doesn't accurately capture downstream performance and many aspects of the algorithms (data mix, RL, training objectives other than next token prediction, inference code, scaffolding) can improve downstream performance without improving loss. Given that there are these additional ways to improve downstream performance (particularly RL), my sense is that overall algorithmic progress on the downstream tasks we care about (e.g., autonomous software engineering given some high inference budget) has actually substantially outpaced 3x effective compute per year in the last 2 years, and probably overall algorithmic progress looks more like 4.5x effective compute per year[5], though I'd be sympathetic to substantially higher estimates. I'd also guess that the pure loss algorithmic progress over the last 2-3 years is somewhat lower than 3x, partially due to focus shifting away from pretraining loss towards downstream performance through RL and better data (which might improve downstream performance much more than it improves loss). I don't currently have a good sense of the breakdown in recent years between loss improvements and purely downstream improvements, particularly because these might funge, so AI companies might shift focus without this being super obvious.

However, it's more confusing to reason about effective compute increases in the regime where pretraining scaling appears to no longer work. I don't have a particularly great estimation methodology here; I'm just taking a best guess by eyeballing the gains from additional compute in the current regime and comparing that to progress that I think is algorithmic. This will be very sensitive to what benchmark mixture you use when assessing the improvements: there have been much larger algorithmic improvements in some domains like solving competition programming problems.
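Putting these guesses together (this is just the arithmetic behind footnote 5, not an independent estimate):

```python
# The arithmetic behind footnote 5: overall (downstream) algorithmic progress per year
# times the guessed training compute growth per year. Both inputs are rough estimates.
algorithmic_progress_per_year = 4.5
training_compute_growth_per_year = 3.5

print(algorithmic_progress_per_year * training_compute_growth_per_year)  # ~16x effective compute per year
```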
More minimally, I'd say that most progress in the last 2 years is down to algorithmic progress rather than due to scaling up the overall amount of compute used for training. (As in, if you used GPT-4 level compute to train a new AI system with the latest algorithms, it wouldn't be that much worse than current frontier systems like o3.)
Progress has been somewhat bumpy in recent years due to a small paradigm shift to increased RL and reasoning models with more inference compute usage in chain of thought. It's worth keeping in mind that trends might not be super smooth and progress could go much faster (or slower) at times. That said, the resulting performance on METR's horizon length metric looks surprisingly smooth over this period despite the underlying source of performance gains likely shifting some over time. So, we might expect that while new approaches emerge, underlying trends in downstream metrics broadly continue (we see something similar in a bunch of industries, likely due to these approaches and improvements generally averaging out).
An open question here is what fraction of the progress from o1 to o3 was algorithmic progress. My sense is that this question is somewhat a mess to answer as scaling up to o3 probably required building a bunch of additional environments. Should we count work on figuring out strategies for getting AIs to generate environments as algorithmic progress or something else? What about when humans write code for purely programmatically generating the environments? (If generating more environments just involves scaling up inference compute on environment generation it seems reasonably natural to count this under training compute scaling as you can just consider RL environment construction as a particular type of self-play training.) Either way, my sense is that large scale human labor on generating environments or buying data is unnatural to include in the algorithmic progress category as I'll discuss in the data section below.
Given that there was a bunch of algorithmic research on achieving lower loss for pretraining, a natural question is whether this transfers to RL (by helping achieve lower loss on the RL objective faster). My overall sense is that it mostly will transfer, but it's very unclear as the learning regime for RL is quite different.
To get a better sense of what's going on with algorithmic progress, I find it slightly helpful to break things down into improvements that help with next token prediction loss vs improvements that don't.
Improvements which help with next token prediction loss:
Basically all types of improvements could in principle help with downstream performance without helping with next token prediction loss. So, I'll discuss the ones which I expect to differentially have this property.
Data
When I talk about data, I'm including both RL environments and pretraining data, though these might have somewhat different properties. Above I mentioned how data can be unnatural to lump under algorithmic progress. This depends a lot on what the data construction process looks like. In the case where humans are just writing better web scraping or data filtering code, it seems to clearly have similar properties as other algorithmic research. However, if it looks more like a large scale human labor project on building RL environments (rather than writing a smaller amount of code which makes AIs more effectively generate RL environments) or like buying data, then this has properties which are quite different from algorithmic research.
In particular, acquiring data in this way doesn't cleanly scale with effective compute in the same way (it isn't a multiplier on the training compute), you can keep accumulating this data in a capital intensive way (more similar to compute than algorithms), and it doesn't reduce down to a relatively small amount of code or a small number of ideas.
An important open question around data is where the RL environments for models like o3 mostly come from. If these basically come from paying humans to build RL environments largely by hand, then we might expect that RL scaling will slow some due to a human labor bottleneck, while if environments are mostly AI-generated or programmatically generated in a relatively scalable way, then RL scaling can more easily continue.
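To illustrate the contrast (a toy sketch; the generator and verifier are made up), programmatic generation gives you many verifiable tasks nearly for free, but the marginal generated task can be close to worthless if it isn't sufficiently distinct or difficult:

```python
import random

# Toy contrast for the open question above: programmatically generated tasks are
# cheap to scale, but the marginal task may add little if it isn't sufficiently
# distinct or difficult. The generator and verifier here are made up for illustration.
def generate_arithmetic_task(rng: random.Random) -> dict:
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return {
        "prompt": f"What is {a} * {b}?",
        "verify": lambda answer, expected=a * b: answer.strip() == str(expected),
    }

rng = random.Random(0)
tasks = [generate_arithmetic_task(rng) for _ in range(1000)]  # trivially scalable...
# ...but the 1000th such problem teaches a frontier model roughly nothing, whereas a
# manually built software engineering environment with real test cases costs human
# labor per task but is far more distinct.
```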
Part of the story could be that data filtering and identification have improved over time: as we've identified the higher quality data, training on marginal tokens has become less valuable, since those tokens are no longer getting us a bit more of the high quality data that we were previously unable to isolate. (This sort of consideration implies that the returns to data filtering aren't scale invariant.) ↩︎
My understanding is that GPT-4.5 generates tokens substantially slower than Grok 3; I get around 2x slower in my low effort testing. This would naively imply GPT-4.5 has more than 2x more (active) params, all else equal. ↩︎
Source: the graph they briefly showed in the YouTube video introducing o3. ↩︎
Examples of missing nuance include: some algorithmic improvements aren't scale invariant and help more and more as more training compute is used, algorithmic improvement is at least a bit discrete historically and larger paradigm shifts than what we've seen recently are possible, algorithmic improvements might change frontier effective compute quite differently from improving efficiency at a given level of capability (I think efficiency is initially easier, but probably has lower absolute limits for improvement), and algorithmic improvements are sometimes much larger in some domains than in others (e.g., we've recently seen very large algorithmic improvements in math and competition programming problems that seemingly don't transfer as well to other domains). ↩︎
Multiplying this with the trend for 3.5x more training compute per year, you get 16x effective compute per year. ↩︎