It'll take until ~2050 to repeat the amount of scaling that pretraining compute is undergoing this decade, since increasing funding can't sustain the current pace beyond ~2029 unless AI delivers a transformative commercial success by then. Natural text data will also run out around that time, and there are signs that current methods of reasoning training might be mostly eliciting capabilities already present in the base model.

If scaling of reasoning training doesn't turn out to create genuinely new capabilities that are sufficiently general, and pretraining at ~2030 levels of compute together with the low-hanging fruit of scaffolding doesn't bring AI to crucial capability thresholds, then further progress might take a while. Possibly decades, since training compute will be growing 3x-4x slower after 2027-2029 than it does now: the ~6 years of scaling since the ChatGPT moment stretch into 20-25 subsequent years, without access to any more natural text data than the pretraining of this decade is already putting to use.

Training Compute Slowdown

GPT-4 was pretrained in 2022 using ~24K A100 chips (0.3e15 BF16 FLOP/s per chip, 7.2e18 FLOP/s total) for ~2e25 FLOPs[1]. The current batch of frontier models (Grok 3, GPT-4.5) were pretrained in 2024 using ~100K H100 chips (1e15 BF16 FLOP/s per chip, 1e20 FLOP/s total), possibly for about 3e26 BF16 FLOPs[2] (or ~2x more in FP8). The Abilene site of Crusoe/Stargate/OpenAI will have 400K-500K Blackwell chips in NVL72 racks in 2026 (2.5e15 BF16 FLOP/s per 2-die chip/package, ~1.1e21 FLOP/s total), enough to pretrain models for about 4e27 BF16 FLOPs.

Thus raw compute of a frontier training system is increasing about 160x in 4 years, or about 3.55x per year. Epoch AI estimates the trend of increasing price-performance of compute at 1.39x per year (adjusted for inflation). The 100K H100 training systems cost about $4-5bn (all-in, not just chips), which is ~$45K per 1e15 FLOP/s chip, and GB200 NVL72 racks seem to cost about $4M each (again, all-in), which is ~$55K per 2.5e15 FLOP/s chip. Counting this as 2 years between the training systems, we get a 1.42x per year increase in price-performance, which fits the trend.
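
As a quick sanity check, here is the arithmetic behind these rates as a short Python sketch (all inputs are the rough estimates above, not precise figures):

```python
# Back-of-the-envelope check of the scaling rates above; all inputs are
# the rough estimates from the text, not precise figures.

gpt4_flops    = 24_000  * 0.3e15   # ~24K A100s, 2022: ~7.2e18 FLOP/s
h100_flops    = 100_000 * 1e15     # ~100K H100s, 2024: ~1e20 FLOP/s
abilene_flops = 450_000 * 2.5e15   # ~400K-500K Blackwell chips, 2026: ~1.1e21 FLOP/s

growth_4yr = abilene_flops / gpt4_flops       # ~156x over 4 years
growth_per_year = growth_4yr ** (1 / 4)       # ~3.5x per year

# Price-performance: all-in dollars per FLOP/s for the 2024 and 2026 systems
h100_dollars_per_flops  = 45_000 / 1e15       # ~$45K per 1e15 FLOP/s chip
gb200_dollars_per_flops = 55_000 / 2.5e15     # ~$4M per 72-chip rack
pp_per_year = (h100_dollars_per_flops / gb200_dollars_per_flops) ** (1 / 2)  # over 2 years

print(f"~{growth_4yr:.0f}x in 4 years, ~{growth_per_year:.2f}x/year, "
      f"price-performance ~{pp_per_year:.2f}x/year")
```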

At $4M per rack (all-in), compute at the Abilene site will cost about $22-28bn to build, which is an increase in cost of 2.35x per year. At this pace, a 2028 training system would need to cost $140bn, which is still borderline plausible. But by 2030 we would get to $770bn, which probably can't actually happen if AI doesn't cross enough capability thresholds by then.
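
The same extrapolation as a sketch (anchoring to ~$4.5bn in 2024 and ~$25bn in 2026; illustrative only):

```python
# Extrapolating all-in training system cost at the ~2.35x/year implied by
# ~$4.5bn (2024, 100K H100s) -> ~$25bn (2026, Abilene).

cost_2024, cost_2026 = 4.5e9, 25e9
growth = (cost_2026 / cost_2024) ** (1 / 2)   # ~2.36x per year

for year in (2028, 2030):
    cost = cost_2026 * growth ** (year - 2026)
    print(year, f"~${cost / 1e9:.0f}bn")      # 2028: ~$139bn, 2030: ~$772bn
```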

When funding stops increasing, the current pace of 3.55x per year (fueled by increasing funding) regresses to the pace of improvement in price-performance of compute of 1.4x per year, which is 3.7x slower in the sense that reaching any given multiple of compute takes about 3.7x longer. If the $140bn training systems of 2028 do get built, they'll each produce about 1.5e22 BF16 FLOP/s of compute, enough to train models for about 5e28 BF16 FLOPs.

Thus from 2022 to 2028, training systems (and pretrained models) might scale about 2,000x in FLOP/s. When increasing funding further is no longer feasible, the rate of scaling of 1.4x per year will take another 22 years to match this feat, increasing about 2,000x by year 2050.
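
A sketch of the timeline arithmetic (the ratio of the two durations is where the ~3.7x figure above comes from):

```python
import math

# How long a ~2,000x scale-up takes at the two growth rates from the text.
factor = 2_000
fast, slow = 3.55, 1.4

years_fast = math.log(factor) / math.log(fast)   # ~6 years   (2022-2028)
years_slow = math.log(factor) / math.log(slow)   # ~22.6 years (to ~2050)

print(f"{years_fast:.1f} years at {fast}x/year, {years_slow:.1f} years at {slow}x/year, "
      f"ratio ~{years_slow / years_fast:.1f}x")   # the ~3.7x slowdown
```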

Bounded Potential of Thinking Training

The current publicly known method of long-horizon reasoning training was revealed in the DeepSeek-R1 report (GRPO, with multiple incremental improvements since then such as DAPO). A collective term for these methods that seems to be catching on is RLVR, "RL with Verifiable Rewards".

So far, there is no public evidence that RLVR usefully scales beyond a very modest amount of training, or that the training generalizes far beyond competition math and coding. Most thinking models have only had a chance to demonstrate a first success in applying these methods, not yet sustained advancement beyond previous successes. Even for OpenAI's o1 and o3, the base models are very likely different (GPT-4o and GPT-4.1 respectively), so o3 improving on o1 is not clean evidence that scaling up RLVR is what drove the improvement. Only time will tell, or a paper about scaling laws that shows how well RLVR converts non-scarce inputs (such as compute) into capabilities. By 2028, this crucial uncertainty should mostly resolve.

If RLVR is mostly elicitation, its potential is bounded by capabilities of the pretrained model. In the s1 paper, the authors construct a training set of 1K reasoning traces for supervised finetuning sufficient to lift the capabilities of instruct models to a level comparable to those directly trained with RLVR/GRPO.

There is a clear possibility that with more compute poured into RLVR, it'll start creating capabilities rather than only eliciting them, like RL did with AlphaZero. But this hasn't been publicly demonstrated to happen yet, and another recent paper gives some indication that it might fail to work. The authors measure pass@k performance[3] of models before and after RLVR: the RL-trained models consistently do better at low k, but worse at high k, so that the base model starts outperforming them somewhere between 8 and 400 attempts on the benchmarks the paper uses (Figure 2).

The point at which the pass@k curves before and after RLVR training intersect seems remarkably stable for any given type of tasks (benchmark). It barely moves across multiple variations on GRPO (some of which mitigate loss of entropy that it suffers from), or from applying training in the range from 150 to 450 steps (Figure 7). The point of intersection even moves lower with more training, suggesting that performance of the base model at the intersection point with a weak RLVR model might remain an upper bound for performance of a much stronger RLVR model. Since the reliability of the base model is not yet very high even at pass@400 for many important tasks, this kind of bound on capabilities would be crippling for RLVR's potential.
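
For reference, pass@k is normally computed with the standard unbiased estimator (from the Codex/HumanEval evaluation): draw n ≥ k samples per task, count the c correct ones, and estimate pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples per task, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset of the samples contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustration: a task the model solves on ~3% of attempts still counts as
# solved at pass@400, which is why base models look strong at high k.
print(pass_at_k(400, 12, 1))    # ~0.03
print(pass_at_k(400, 12, 400))  # 1.0
```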

Data Inefficiency of MoE

A paper from Jan 2025 finally gave clear measurements of the scaling laws for MoE models at various levels of sparsity, directly comparing them to dense models. It turns out that even though MoE models are more compute efficient (it's possible to train a better model with less training compute), training them compute optimally[4] needs a lot more data than for dense models. Namely, at 1:8 (87%) sparsity, a compute optimal run would use ~3x more tokens per active param than a dense model, and at 1:32 (97%) sparsity it'll use ~6x more tokens per active param (Figure 12, left).

Famously, Chinchilla showed that about 20 tokens per param was compute optimal at 6e23 FLOPs, and more recently the Llama 3 405B report measured 40 tokens per param to be optimal with their dataset at 4e25 FLOPs. Larger datasets might continue this trend, since average data quality is going down, and for a 1:32 sparse MoE model (like DeepSeek-V3) this suggests 240 tokens per param might well turn out to be optimal.

This means that the natural text data will be running out much faster than Chinchilla's 20 tokens per param used to suggest. At 240 tokens per param, even a 4e27 FLOPs model trained in 2026 would want 400T tokens, which exceeds even the unfiltered Common Crawl dataset (estimated to contain about 240T tokens prior to 2023, see Section 3.1). To some extent this can be mitigated by training for multiple epochs, repeating the same data, and a May 2023 paper shows that repeating the data up to about 5 times is not meaningfully worse than using 5 times more unique data from the same distribution (Figure 4), with up to 15 repetitions remaining somewhat useful.

The recent Qwen 3 and Llama 4 Scout releases disclose using 36T and 40T token datasets in pretraining. Repeated 5 times, that gives up to 200T tokens, which would almost bridge the gap to the compute optimal 400T tokens for the hypothetical 1:32 sparse 4e27 FLOPs MoE model of 2026. At 1:8 sparsity instead, the compute optimal ratio might be about 120 tokens per param, asking for 280T tokens, which would be a little more manageable and only need 46T tokens repeated 6 times.

This completely breaks down with a 5e28 FLOPs model of 2028-2029 trained on a $140bn training system, which at 1:32 sparsity would be asking for 1,400T tokens. This might somewhat work as 100T tokens repeated 14 times, but probably not that well, and the compute will likely be put to other uses, such as training on video (assuming RLVR didn't scale to the point where it needs most of the 5e28 FLOPs of compute). It only gets worse as the scaling continues beyond 2030, so even with the same 2,000x increase in compute by ~2050, its straightforward impact on pretraining capabilities will be lower than what we are observing in 2022-2028, since there won't be any more natural text data to work with.
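
A sketch of the token-count arithmetic behind these estimates, using the usual C ≈ 6 · (active params) · (tokens) approximation for training compute; the tokens-per-param ratios are the rough guesses discussed above:

```python
# Token counts implied by the tokens-per-param ratios above, using the
# standard C ~ 6 * N_active * D approximation for training compute.
# The ratios (120, 240 tokens/param) are rough guesses from the text.

def compute_optimal_tokens(compute_flops: float, tokens_per_param: float) -> float:
    n_active = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return tokens_per_param * n_active

for compute, ratio, label in [
    (4e27, 240, "4e27 FLOPs, 1:32 sparse"),   # ~400T tokens
    (4e27, 120, "4e27 FLOPs, 1:8 sparse"),    # ~280T tokens
    (5e28, 240, "5e28 FLOPs, 1:32 sparse"),   # ~1,400T tokens
]:
    print(f"{label}: ~{compute_optimal_tokens(compute, ratio) / 1e12:.0f}T tokens")
```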


  1. Various rumors and estimates give the range between 20K and 25K A100 chips, SemiAnalysis gives the 24K figure. ↩︎

  2. The estimates assume 40% compute utilization and training for 3.5 months. ↩︎

  3. A model solves a task at pass@k if at least one of k attempts succeeds (perhaps by chance). This is not very useful in practice, because outside an eval there is often no way to tell which solutions were correct. ↩︎

  4. Pretraining is compute optimal if it achieves as low loss/perplexity as possible using a fixed amount of training compute. In practice it's often useful to overtrain models, which wastes some compute but makes models smaller and thus cheaper to inference. ↩︎

Comments

Excellent analysis as always.

a May 2023 paper shows that repeating the data up to about 5 times is not meaningfully worse than using 5 times more of unique data from the same distribution (Figure 4), with up to 15 repetitions remaining somewhat useful

Do we know that this doesn't effectively already happen in the biggest training datasets? If I'm reading it right, the paper only experimented with sub-1T datasets, compared to 40T+ datasets to which you try to generalize its findings. My guess would be that these larger datasets already effectively repeat the data: not in the sense of having copies of same texts (which I expect there are efforts to avoid), but in the sense of exposing models to the same underlying features of the world. (Twenty diversely phrased explanations of the Schrödinger Equation may shed some new light on how to best phrase things, but as far as understanding the SE itself is concerned, this is probably isomorphic to training on the first ten twice.)

Further, the situation there might be even worse. Inasmuch as AGI Labs want LLMs to learn deep, generalizable patterns, even bigger sets of seemingly distinct datapoints might be effectively repeats of the same datapoint. (Descriptions of 10,000 different churches may be "diverse training data" in a shallow sense, but if we want the model to learn an "analysis of the building's purpose" function, this is just 10,000 repeats of "analyze a church".)

Which, if true, would mean that you can't squeeze nontrivially more performance out of repeating trillions-sized datasets several times. In fact, datasets of those sizes may actually amount to much fewer "actually useful" tokens than the raw count suggests.

That's what the usual scaling laws are already estimating, marginal capability improvements for exponentially more data. Repetition makes it even worse, but the baseline is already logarithmic in data and compute. The power of scaling is that with real unique data, however unoriginal, the logarithmic progress doesn't falter, it still continues its logarithmic slog at an exponential expense rather than genuinely plateauing.

(Of course, humans are vastly more data efficient, so much more data is not truly needed, but that depends on discovering new methods, which is not something that absolutely has to happen by 2027 or 2035 or whatever. Only the current stretch of breakneck scaling will be predictably continuing for another 2-4 years, after that we are back to unknown unknowns.)

The power of scaling is that with real unique data, however unoriginal, the logarithmic progress doesn't falter, it still continues its logarithmic slog at an exponential expense rather than genuinely plateauing.

How to make sense of this? If the additional training data is mostly low quality (AI labs must have used the highest quality data first?) or repetitive (contains no new ideas/knowledge), perplexity might go down but what is the LLM really learning?

AI labs must have used the highest quality data first

The usual scaling laws are about IID samples from a fixed data distribution, so they don't capture this kind of effect.

But even with IID samples, we'd expect to get diminishing marginal returns, and we do.  And you're asking: why, then, do we keep getting returns indefinitely (even diminishing ones)?

I think the standard answer is that reality (and hence, the data) contains a huge number of individually rare "types of thing," following some long-tailed distribution.  So even when the LLM has seen trillions of tokens, there are probably some elements of the world that it still has a shaky grasp of because they only appear in the data once every hundred billion tokens; and once it has mastered those, it will have to chase down the even rarer stuff that only comes up once every trillion tokens; and so on.

Even if it were true that the additional data literally "contained no new ideas/knowledge" relative to the earlier data, its inclusion would still boost the total occurrence count of the rarest "ideas" – the ones which are still so infrequent that the LLM's acquisition of them is constrained by their rarity, and which the LLM becomes meaningfully stronger at modeling when more occurrences are supplied to it.
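
A toy numerical illustration of this picture (all numbers are invented: a Zipf-like long tail of "features", each counted as acquired once it is expected to appear at least 100 times in training):

```python
import numpy as np

# Toy model: 10M "features of the world" with Zipf-distributed frequencies.
# A feature counts as acquired once it is expected to appear >= 100 times.
n_features = 10_000_000
ranks = np.arange(1, n_features + 1)
freqs = (1.0 / ranks) / np.sum(1.0 / ranks)

for tokens in (1e8, 1e9, 1e10):
    acquired = freqs * tokens >= 100
    coverage = freqs[acquired].sum()  # share of text "explained" by acquired features
    print(f"{tokens:.0e} tokens: {acquired.sum():>9,} features acquired, "
          f"{coverage:.0%} of token mass covered")
```

Each 10x of data unlocks roughly 10x more (ever rarer) features, but each new batch of features explains a smaller additional slice of the text, so returns keep diminishing without ever hitting a hard plateau.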

You may find Beren's post on this illuminating.

The usual scaling laws are about IID samples from a fixed data distribution, so they don’t capture this kind of effect.

Doesn't this seem like a key flaw in the usual scaling laws? Why haven't I seen this discussed more? The OP did mention declining average data quality but didn't emphasize it much. This 2023 post trying to forecast AI timeline based on scaling laws did not mention the issue at all, and I received no response when I made this point in its comments section.

Even if it were true that the additional data literally “contained no new ideas/knowledge” relative to the earlier data, its inclusion would still boost the total occurrence count of the rarest “ideas” – the ones which are still so infrequent that the LLM's acquisition of them is constrained by their rarity, and which the LLM becomes meaningfully stronger at modeling when more occurrences are supplied to it.

I guess this is related to the fact that LLMs are very data inefficient relative to humans, which implies that a LLM needs to be trained on each idea/knowledge multiple times in multiple forms before it "learns" it. It's still hard for me to understand this on an intuitive level, but I guess if we did understand it, the problem of data inefficiency would be close to being solved, and we'd be much closer to AGI.


I guess this is related to the fact that LLMs are very data inefficient relative to humans, which implies that a LLM needs to be trained on each idea/knowledge multiple times in multiple forms before it "learns" it. It's still hard for me to understand this on an intuitive level, but I guess if we did understand it, the problem of data inefficiency would be close to being solved, and we'd be much closer to AGI.

LeCun has written about this. Humans are already pretrained on large amounts of sensory data before they learn language, while language models are trained from scratch on language. The current pretraining paradigm only works well with text, as text data is relatively low-dimensional (e.g. a vocabulary of 2^16 = 65,536 for a 16-bit tokenizer), but not with audio or video, as the dimensionality explodes. Predicting a video frame is much harder than predicting a text token, as the former is orders of magnitude larger.

From a blog post:

To better understand this challenge, we first need to understand the prediction uncertainty and the way it’s modeled in NLP [Natural Language Processing] compared with CV [Computer Vision]. In NLP, predicting the missing words involves computing a prediction score for every possible word in the vocabulary. While the vocabulary itself is large and predicting a missing word involves some uncertainty, it’s possible to produce a list of all the possible words in the vocabulary together with a probability estimate of the words’ appearance at that location. Typical machine learning systems do so by treating the prediction problem as a classification problem and computing scores for each outcome using a giant so-called softmax layer, which transforms raw scores into a probability distribution over words. With this technique, the uncertainty of the prediction is represented by a probability distribution over all possible outcomes, provided that there is a finite number of possible outcomes.

In CV, on the other hand, the analogous task of predicting “missing” frames in a video, missing patches in an image, or missing segment in a speech signal involves a prediction of high-dimensional continuous objects rather than discrete outcomes. There are an infinite number of possible video frames that can plausibly follow a given video clip. It is not possible to explicitly represent all the possible video frames and associate a prediction score to them. In fact, we may never have techniques to represent suitable probability distributions over high-dimensional continuous spaces, such as the set of all possible video frames.

This seems like an intractable problem.

LeCun says that humans or animals, when doing "predictive coding", predict mainly latent embeddings rather than precise sensory data. It's currently not clear how this can be done efficiently with machine learning.

I guess what I'm hypothesizing is that there's a relationship between the amount of data that goes into pretraining and how many times it's productive to repeat that data. For sub-1T datasets, the paper shows that repeating it 5 times is similar to using 5x as much data and repeating it 10 more times is useful but introduces diminishing returns. But perhaps for 10T-sized datasets, you could only repeat the data once before running into diminishing returns, and then just two more repeats exhaust all usefulness.

This should be easy to test. I don't think the May 2023 paper tests this particular hypothesis, though.

One problem is that with a different dataset/distribution, the baseline is also different: adding bad unique data to a mix can damage the outcome more than repeating the good data too many times. In the paper, there is a measurement of what happens if you replace a dataset of unique data with its perplexity-filtered half (the half most similar to Wikipedia) repeated two times, and it turns out that the repeated data does better (Figure 6, right, yellow star at 50%). So a more developed methodology would be repeating different individual pieces of data in the dataset different numbers of times, rather than repeating everything equally.

In principle, even 15 repetitions is only a soft cap: at a compute optimal tokens-per-param ratio, perplexity keeps decreasing for up to 60 repetitions (Figure 3, Figure 10). Though that's C4, and perhaps the location of the minimum changes for a different dataset. And then after hundreds of repetitions there is double descent (Figure 9). The slop you'd need to add to feed a trillion dollar training system (perhaps synthetic data) could be even worse than the 60th repetition of the best tokens you have, or the 10,000th repetition of Wikipedia.

Thanks for the post; the discussion about compute scaling slowing seems right to me.[1]

Some important additional things to note:

  • AI progress trails behind compute scaling because it takes a while to figure out how to effectively run much bigger training runs (this includes infrastructure, generally not messing up the training run, and things like figuring out where the data will come from) and because algorithmic progress will more slowly hit diminishing returns on the same compute base to use for experiments. So, I expect that if compute scaling slows greatly by 2029, it will take until around 2031 or 2032 before progress is obviously much slower. (This is putting aside the evidential update about AI progress based on the fact that compute scaling is slowing.) My low confidence guess is that if we paused all compute scaling and hiring now, it would take around 2 or 3 years before the rate of algorithmic progress halved (part of this is that AI companies can adapt to more effectively utilizing the employees and compute they have now if they aren't relentlessly scaling compute and employees).
  • We'll be running into fab capacity issues around 2028 as well. (I don't know the exact numbers.) So, even if investment continues due to impressive capabilities, progress will slow unless AIs can substantially accelerate AI R&D or building fabs.

  1. That said, I think we might see atypically high improvements in FLOP/s / $ between now and 2030 due to increased R&D spending on ML accelerators paying off and Nvidia's margins mostly going away (which would reduce the costs of datacenters with Nvidia GPUs by 30-60% depending on the fraction of the cost which is GPUs). ↩︎


When funding stops increasing, the current pace of 3.55x per year (fueled by increasing funding) regresses to the pace of improvement in price-performance of compute of 1.4x per year, which is 3.7x slower. If the $140bn training systems of 2028 do get built, they'll each produce about 1.5e22 BF16 FLOP/s of compute, enough to train models for about 5e28 BF16 FLOPs.


This is a nice way to break it down, but I think it might have weird dependencies, e.g. on chip designer profit margins.

Instead of: 

training run investment ($) x hardware price performance (FLOP/$) = training compute (FLOP)

Another possible breakdown is: 

hardware efficiency per unit area (FLOP/s/mm^2) x global chip production (mm^2) x global share of chips used in training run (%) x training time (s) = training compute (FLOP)

This gets directly at the supply side of compute. It's basically 'Moore's law x AI chip production x share of chips used'. In my model the factors for the next three years are 1.35 x 1.65 x 1.5 ~= 3.4x, so it matches your 3.55x/year pretty closely. Where we differ slightly, I think, would be in the later compute slowdown. 

Under my model there are also one-time gains happening in AI chip production and share of chips used (as a result of the one-time spending gains in your model). Chip production has one-time gains because AI only uses 5-10% of TSMC leading nodes and is using up spare capacity as fast as packaging/memory can be produced. Once this caps out, I think the 1.65x will default to being something like 1.2-1.4x as it gets bottlenecked on fab expansion (assuming, like you said, an investment slowdown). 'Share of chips used' growth goes to 1x by definition. 

Even taking the lower end of that estimate would mean that 'Moore's law' hardware gains would have to slow down ~2x to 1.16x/year to match your 1.4x number. I do think hardware gains will slow somewhat, but 1.16x is below what I would bet. Taking my actual medians, I think I'm at like 1.3x (production) x 1.2x (hardware) = 1.56x/year, so more like a 2.8x slowdown, not a 3.7x slowdown.
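
Putting the two decompositions side by side as a sketch (every factor is a rough guess from this thread, not a measurement):

```python
import math

# The two decompositions of frontier-training-compute growth side by side.

boom = 1.35 * 1.65 * 1.5   # hardware x chip production x share of chips, ~3.3x/year
later = 1.3 * 1.2          # production x hardware efficiency, ~1.56x/year

# "2.8x slowdown": time to reach a fixed multiple of compute, later vs boom trend
slowdown = math.log(3.55) / math.log(later)
print(f"boom ~{boom:.2f}x/year, later ~{later:.2f}x/year, slowdown ~{slowdown:.1f}x")
```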

So resolving the discrepancy, it seems like my model is basically saying that your model overestimates the slowdown because it assumes profit margins stay fixed, but instead under slowing investment growth these should collapse? That feels like it doesn't fully explain it though since it seems like it should be a one time fall (albeit a big one). Maybe in the longer term (i.e. post-2035) I agree with you more and my 1.3x production number is too bullish. 

Avoidable chip margins are maybe 60% at most (of all-in training system cost)[1], which saves 2.8 years out of the 22.5 years needed to reach 2000x at 1.4x per year (and Google with their TPUs is already somewhat past this, so it should be concerning if this tries to become a crux of an argument). If the cost reduction happens quickly, it instead concentrates the faster scaling into 2022-2029, making the contrast with 2030-2050 starker.

The wafers are getting more complicated to produce, so the cost per unit area probably won't be going down, and I don't see changes in total wafer supply contributing to the size of individual training systems constrained by cost.

Introduction of packaging plausibly helps short term with non-chip costs that scale with the number of chips (we get fewer chips per FLOP/s), similarly for higher power and chip density within a rack (Rubin Ultra is 600 KW per rack). But this also precedes 2030-2050.

(I worry about accounting for any specific "one-time" factors that affect Moore's law, because a lot of it is probably built out of such factors. Different people will be aware of different factors at different times, therefore considering them "one-time".)


  1. The reference design for a 1024-chip HGX H100 system has $33.8K per chip for the actual servers, and $8.2K per chip for networking. So if various non-IT expenses on a 100K H100 campus are on the order of $500M (land, buildings, cooling, power, construction), and networking 100K H100s costs a bit more than letting the 1K parts remain unconnected, let's say another $2K per chip, we get about $49K per chip (which is to say, $4.9bn per 100K H100s). If 90% (!) of the H100 server is gross margin, it saves us $30.9K per chip, or 62%. ↩︎

To be clear I don't think the profit margin is the only thing that explains the discrepancy. 

I think the relevant question is more like: under my method, is 1.3x (production) x 1.2x (hardware) = 1.56x/year realistic over 2-3 decades, or am I being too bullish? You could ask an analogous thing about your method (i.e., is 1x investment and 1.4x price performance realistic over the next 2-3 decades?) Two different ways of looking at it that should converge. 

If I'm not being too bullish with my numbers (which is very plausible, e.g., it could easily be 1.2x, 1.2x), then I'd guess the discrepancy with your method comes from it being muddled with economic factors (not just chip designer profit margins but supply/demand factors affecting costs across the entire supply chain, e.g., down to how much random equipment costs and salaries for engineers). Maybe 1x investment is too low, maybe it should be multiplied with inflation and GDP growth?

The largest capex projects (credible-at-the-time announcements / money raised / assets minus debts) of the dot-com bubble were about $10-14bn nominal around 2000 (Global Crossing, Level 3 Communications), or $19-26bn in 2025 dollars. Now that it's 25 years later, we have hyperscalers that can afford ~$80bn capex per year, which is 3x-4x more than $19-26bn.

Similarly, a 2029-2030 slowdown would turn a lot of AI companies or AI cloud companies irrelevant or bankrupt, and the largest projects of 2027-2029 won't be the baseline for gradual growth towards 2050 for the most powerful companies found in 2050 (except possibly for Google). I'm looking at 22 years instead of 25, it's unclear how well AI profits can be captured by a single company, and because of inference the cost of revenue is higher than for software. So let's say capex per year (CPI-adjusted, in 2025 dollars) for the largest AI companies in 2050 is 3x the largest projects of 2027-2029, which I was estimating at $140bn. Anchoring to Nvidia's cadence of 2 years per major hardware refresh, a training system can eat two years' worth of capex, so in total we get $840bn for a 2050 training system (in 2025 dollars).

This is 5 years ahead of the zero change in cost of training system projection from the post (at 1.4x per year for CPI-adjusted price-performance), so maybe the second 2,000x of scaling should be reached by 2045 instead.
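
The arithmetic behind that estimate, as a sketch (all inputs are the guesses from this comment):

```python
import math

# Rough arithmetic for the 2050 training system estimate above.

largest_2028_system = 140e9                     # largest 2027-2029 project, $
capex_2050_per_year = 3 * largest_2028_system   # ~3x that, per year, in 2025 dollars
system_2050 = 2 * capex_2050_per_year           # two years of capex -> ~$840bn

# Years of 1.4x/year price-performance growth that a 6x larger budget buys
years_ahead = math.log(system_2050 / largest_2028_system) / math.log(1.4)
print(f"~${system_2050 / 1e9:.0f}bn system, ~{years_ahead:.1f} years ahead "
      f"of the flat-cost projection")   # ~5 years: ~2045 instead of ~2050
```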

so maybe the second 2,000x of scaling should be reached by 2045 instead.

Yeah sounds reasonable, that would match up with my 1.56x/year number, so to summarize, we both think this is roughly plausible for 2028-2045?

1.3x/year (compute production) x 1.2x/year (compute efficiency) ~= 1.55x/year (compute available)

1.1x/year (investment) x 1.4x/year (price performance) ~= 1.55x/year (compute available)

So a 3x slowdown compared to the 2022-2028 trend (~3.5x/year).

With the caveat that I'm not expecting a gradual increase in investment at 1.1x/year, but instead wailing and gnashing of teeth in the early 2030s, followed by a faster increase in investment from a lower baseline. And this is all happening within the hypothetical where RLVR doesn't work out at scale and there are no other transformative advancements all the way through 2045. I don't particularly expect this hypothetical to become reality, but can't confidently rule it out (hence this post).

I don't have a sense of the supply-side picture; it seems more relevant for the global inference buildout, and I don't see how it anchors the capex of an individual company. The fraction an individual company contributes to the global chip market doesn't seem like a meaningful number to me, more like a ratio of two unrelated numbers (as long as it's not running towards the extremes, which it isn't in this case).

Are there any papers on current efforts to tokenize video, and estimates of the size of the available data for that?

The point at which the pass@k curves before and after RLVR training intersect seems remarkably stable for any given type of tasks (benchmark). It barely moves across multiple variations on GRPO (some of which mitigate loss of entropy that it suffers from), or from applying training in the range from 150 to 450 steps (Figure 7). The point of intersection even moves lower with more training, suggesting that performance of the base model at the intersection point with a weak RLVR model might remain an upper bound for performance of a much stronger RLVR model. Since the reliability of the base model is not yet very high even at pass@400 for many important tasks, this kind of bound on capabilities would be crippling for RLVR's potential.


DeepSeek-Prover-V2 on MiniF2F improves from 86.6% (pass@1024) to 88.9% (pass@8192). Kimina-Prover also reports its best performance at pass@8192. What makes proving so special? This seems to contradict the claim that the intersection point is stable for any given type of task. Does it imply that proving is actually under-trained in base models, so that RLVR can consistently improve performance?

To contrast with findings of the Yue et al. paper, we still need a comparison with pass@k of the base model, which needs to already be able to code in Lean, and most of the subsequent training needs to be just RLVR. DeepSeek-Prover-V2's description in the paper seems more involved than that, compared to the summary in github's README.md. Also with long reasoning traces, it might be necessary to use something like the "Wait" trick of s1 (Figure 3) to force the token budget of the base model (otherwise the RLVR model might start winning simply because it prefers to output longer sequences). And there is a question of where the intersection point with pass@k after RLVR for only 50-150 steps is: is it already far off from the outset, or does it move up with more training?

This is similar to a question about pass@10K performance of a variant of OpenAI o1 in this paper comparing it with o3 (Figure 7): does it actually follow from good performance of o1-ioi at pass@10K that the base model is doing worse at pass@10K? The language monkeys paper demonstrates that pass@k improves a lot at 10K samples (Figure 3, left), to a degree that's intuitively completely implausible if you've only seen pass@1 performance as a baseline. Or alternatively, training is not just RLVR and this is essential to how it breaks through the upper bound of base model's pass@k, making the result in the Yue et al. paper inapplicable.

But by 2030 we would get to $770bn

The cost estimates seem to include the large margins that Nvidia is currently charging? They could build their own data centers with their own chips (like Google is doing) and presumably achieve the same for much less.

This is a story about a trend in total spending and the financial constraints it runs into. Training systems with cheaper chips would offer more compute (using even more chips), not ask for less money. There still won't be a $770bn training system in 2030, but a $140bn training system might hold more compute, concentrating even more scaling in 2022-2029 and leaving less for 2030-2050 (where the cost of chip manufacturing eventually dominates). Google probably already has this advantage, and AWS is getting there with Trainium.

(More carefully, the 2030-2050 scaling by another 2,000x is also an oversimplification, since AI will be integrating into the economy, and so even at the current level of AI capabilities the largest AI companies will be gradually getting wealthier. Also when training system growth is slower and the training methodology more settled, so too training runs will get longer than ~3.5 months, increasing training compute per model.)

"or that training generalizes far beyond competition math and coding": well, it does generalize beyond math and coding, at least if you mean benchmark performance increasing on non-STEM fields from RL on math and/or coding. But if you mean setting up a good environment and rewards for RLVR on other tasks, that's indeed hard, though there has been some interesting research. I think scaling will continue, whether it's pretraining or inference or RL. And I think funding will still flow: yes, capabilities will need to get better, and there's no particular capability threshold they would need to reach for funding to continue (maybe there is, but nobody knows yet), but models are already getting better; whether they continue to do so is another question, since RL needs compute and data, and how to do RL on non-verifiable tasks is still in research. But I'm kind of optimistic we'll get to very good capabilities with long context, continuous reasoning, tool calling and more agentic and vision stuff.

As I wrote in LLMs May Find It Hard to FOOM, sooner or later we're going to need to use LLMs to generate large amounts of new synthetic training data. We already know that doing that naively without using inference-time scaling leads to model collapse. The interesting question is whether some kind of inference-time training approach can allow an LLM to think long and hard and generate large amounts of higher-quality training data that can be used to train a model that's better than it is. That's not theoretically impossible: science and mathematics are real things, truth can be discovered with enough work; but it's also not trivial: you can't simply have an LLM generate 1000T tokens, train on that, and get a better LLM.

Even if all RLVR does is raise the pass@1 towards the pass@100 of the base model, if that trained model can generate enough training data to train a new base model with a similar pass@1 (before applying RLVR to it), the pass@100 of that new model must be higher than its pass@1, and RLVR should be able to elicit that into an improved pass@1, so you've made forward progress. The question then becomes whether you can repeat this cycle and keep making progress without it plateauing. At least in areas like Mathematics and Coding, where verifiable truth exists and can be discovered with enough effort, this seems plausible. AlphaGo certainly did it (though I gather its superhuman skills also have some blind spots, corresponding to tactics it apparently never thought of, suggesting that involving humans, or at least multiple different LLM models, in the training data generation cycle might be useful here to avoid comparable blind spots). Doing the same in other STEM topics would require your AI to be interacting with the world while generating training data, running experiments and designing and testing products. But then, that's rather the point of having a nation of researchers in a data-center.
