How Well Does RL Scale?

Toby_Ord

LESSWRONG
LW

How Well Does RL Scale? — LessWrong

132 How Well Does RL Scale?

by Toby_Ord

22nd Oct 2025

Linkpost from www.tobyord.com

9 min read

132

Summary: RL-training for LLMs scales surprisingly poorly. Most of its gains are from allowing LLMs to productively use longer chains of thought, allowing them to think longer about a problem. There is some improvement for a fixed length of answer, but not enough to drive AI progress. Given the scaling up of pre-training compute also stalled, we'll see less AI progress via compute scaling than you might have thought, and more of it will come from inference scaling (which has different effects on the world). That lengthens timelines and affects strategies for AI governance and safety.

The current era of improving AI capabilities using reinforcement learning (from verifiable rewards) involves two key types of scaling:

Scaling the amount of compute used for RL during training
Scaling the amount of compute used for inference during deployment

We can see (1) as training the AI in more effective reasoning techniques and (2) as allowing the model to think for longer. I’ll call the first RL-scaling, and the second inference-scaling. Both new kinds of scaling were present all the way back in OpenAI’s announcement of their first reasoning model, o1, when they showed this famous chart:

I’ve previously shown that in the initial move from a base-model to a reasoning model, most of the performance gain came from unlocking the inference-scaling. The RL training did provide a notable boost to performance, even holding the number of tokens in the chain of thought fixed. You can see this RL boost in the chart below as the small blue arrow on the left that takes the base model up to the trend-line for the reasoning model. But this RL also unlocked the ability to productively use much longer chains of thought (~30x longer in this example). And these longer chains of thought contributed a much larger boost.

The question of where these capability gains come from is important because scaling up the inference compute has very different implications than scaling up the training compute. In this first round of reasoning models, they were trained with a very small amount of RL compute compared to the compute used in pre-training, meaning that the total cost of training was something like 1.01x higher than the base-model. But if most of the headline performance results require 30x as much inference compute, then the costs of deploying the those capabilities is 30x higher. Since frontier AI developers are already spending more money deploying their models than they did training them, multiplying those costs by 30x is a big deal. Moreover, these are costs that have to be paid every time you want to use the model at this level of capability, so can’t be made up in volume.

But that was just the initial application of RL to LLMs. What happens as companies create more advanced reasoning models, using more RL?

The seeds of the answer can be found all the way back in that original o1 chart.

The chart shows steady improvements for both RL-scaling and inference-scaling, but they are not the same. Both graphs have the same y-axis and (despite the numbers being removed from the x-axis) we can see that they are both on a logarithmic x-axis covering almost exactly two orders of magnitude of scaling (100x). In both cases, the datapoints lie on a relatively straight line, which is presumably the central part of a larger S-curve. However, the slope of the RL-scaling graph (on the left) is almost exactly half that of the slope of the inference-scaling graph (on the right). When the x-axis is logarithmic, this has dramatic consequences.

The graph on the right shows that scaling inference-compute by 100x is enough to drive performance from roughly 20% to 80% on the AIME benchmark. This is pretty typical for inference scaling, where quite a variety of different models and benchmarks see performance improve from 20% to 80% when inference is scaled by 100x.

For instance, this is what was found with Anthropic’s first reasoning model (Sonnet 3.7) on another AIME benchmark, with almost exactly the same scaling behaviour:

And ability on the ARC-AGI 1 benchmark also scales in a similar way for many of OpenAI’s different reasoning models:

We don’t always see this scaling behaviour for inference: some combinations of LLM, inference-scaling technique, and benchmark see the performance plateau below 80% or exhibit a different slope (often worse). But this climb from 20 to 80 with 100x more inference compute is pretty common (especially for reasoning-intensive benchmarks) and almost certainly what is happening on that original o1 graph.

In contrast, the slope of the RL-scaling trend is half as large, which means that it requires twice as many orders of magnitude to achieve the exact same improvement in capabilities. Increasing the RL training compute by 100x as shown in the o1 chart only improved performance from about 33% to 66%. At that rate, going from 20 to 80 would require scaling up the RL training compute by 10,000x.

We can confirm this trend — and that it continued beyond o1 — by looking at the following graph from the o3 launch video (with a line added showing the slope corresponding to going from 20 to 80 in 10,000x):

Using another version of the AIME benchmark, this shows o1’s training progress over 3 orders of magnitude and o3’s training over a further order of magnitude. In total, we see that scaling up the RL-training by 4 orders of magnitude takes the model from about 26% to 88%. This provides some confirmation for the rule-of-thumb that a 10,000x scale-up in RL training compute is required to improve this benchmark performance from 20 to 80.

To my knowledge, OpenAI hasn’t provided RL-training curves for other benchmarks, but they do have charts comparing o1 with o3 and o3 with GPT-5 at different inference-scaling levels on several benchmarks. Given that o3 used about 10x as much RL training as o1, we’d expect the RL boost going from o1 to o3 to be worth about the same as the inference boost of giving o1 just half an order of magnitude more inference (~3x as many tokens). And this is indeed what one sees on their performance/token graph comparing the two:

Similarly, o3 also requires about 3x as many tokens to match GPT-5 on the SWE-bench and GPQA Diamond benchmarks. This would fit the expected pattern of GPT-5 having been trained with a further 10x as much RL training compute as o3:

It is hard to verify that this trend holds for models from other companies, as this data on training curves for cutting-edge models is often treated as confidential. But the fact that other leading labs’ base models and reasoning models are roughly on par with OpenAI’s suggests none of them are scaling notably better than this.

So the evidence on RL-scaling and inference-scaling supports a general pattern:

a 10x scaling of RL is required to get the same performance boost as a 3x scaling of inference
a 10,000x scaling of RL is required to get the same performance boost as a 100x scaling of inference

In general, to get the same benefit from RL-scaling as from inference-scaling required twice as many orders of magnitude. That’s not good.

How do these compare to pre-training scaling?

The jumps from GPT-1 to 2 to 3 to 4 each involved scaling up the pre-training compute by about 100x. How much of the RL-scaling or inference-scaling would be required to give a similar boost? While I can’t say for sure, we can put together the clues we have and take an educated guess.

Jones (2021) and EpochAI both estimate that you need to scale-up inference by roughly 1,000x to reach the same capability you’d get from a 100x scale-up of training. And since the evidence from o1 and o3 suggests we need about twice as many orders of magnitude of RL-scaling compared with inference-scaling, this implies we need something like a 1,000,000x scale-up of total RL compute to give a boost similar to a GPT level.

This is breathtakingly inefficient scaling. But it fits with the extreme information inefficiency of RL training, which (compared to next-token-prediction) receives less than a ten-thousandth as much information to learn from per FLOP of training compute.

Yet despite the poor scaling behaviour, RL training has so far been a good deal. This is solely because the scaling of RL compute began from such a small base compared with the massive amount of pre-training compute invested in today’s models. While AI labs are reticent to share information about how much compute has actually been spent on RL (witness the removal of all numbers from the twin o1 scaling graphs), it is widely believed that even the 10,000x RL-scaling we saw for o3’s training still ended up using much less compute than the ~ FLOP spent on pre-training. This means that OpenAI (and their competitors) have effectively got those early gains from RL-training for free.

For example, if the 10x scaling of RL compute from o1 to o3 took them from a total of 1.01x the pre-training compute to 1.1x, then the 10x scale-up came at the price of a 1.1x scale-up in overall training costs. If that gives the same performance boost as using 3x as many reasoning tokens (which would multiply all deployment costs of reasoning models by 3) then it is a great deal for a company that deploys its model so widely.

But this changes dramatically once RL-training reaches and then exceeds the size of the pre-training compute. In July 2025, xAI’s Grok 4 launch video included a chart suggesting that they had reached this level (where pre-training compute is shown in white and RL-training compute in orange):

Scaling RL by another 10x beyond this point increases the total training compute by 5.5x, and beyond that it is basically the full 10x increase to all training costs. So this is the point where the fact that they get much less for a 10x scale-up of RL compute compared with 10x scale-ups in pre-training or inference really bites. I estimate that at the time of writing (Oct 2025), we’ve already seen something like a 1,000,000x scale-up in RL training and it required ≤2x the total training cost. But the next 1,000,000x scale-up would require 1,000,000x the total training cost, which is not possible in the foreseeable future.

Grok 4 was trained on 200,000 GPUs located in xAI’s vast Colossus datacenter. To achieve the equivalent of a GPT-level jump through RL would (according to the rough scaling relationships above) require 1,000,000x the total training compute. To put that in perspective, it would require replacing every GPU in their datacenter with 5 entirely new datacenters of the same size, then using 5 years worth of the entire world’s electricity production to train the model. So it looks infeasible for further scaling of RL-training compute to give even a single GPT-level boost.

I don’t think OpenAI, Google, or Anthropic have quite reached the point where RL training compute matches the pre-training compute. But they are probably not far off. So while we may see another jump in reasoning ability beyond GPT-5 by scaling RL training a further 10x, I think that is the end of the line for cheap RL-scaling.

Conclusion

The shift towards RL allowed the scaling era to continue even after pre-training scaling had stalled. It did so via two different mechanisms: scaling up the RL training compute and scaling up the inference compute.

Scaling RL training allowed the model to learn for itself how to achieve better performance. Unlike the imitation learning of next-token-prediction, RL training has a track record of allowing systems to burst through the human level — finding new ways of solving problems that go beyond its training data. But in the context of LLMs, it scales poorly. We’ve seen impressive gains, but these were only viable when starting from such a low base. We have reached the point where it is too expensive to go much further.

This leaves us with inference-scaling as the remaining form of compute-scaling. RL helped enable inference-scaling via longer chain of thought and, when it comes to LLMs, that may be its most important legacy. But inference-scaling has very different dynamics to scaling up the training compute. For one thing, it scales up the flow of ongoing costs instead of scaling the one-off training cost. This has many consequences for AI deployment, AI risk, and AI governance.

But perhaps more importantly, inference-scaling is really a way of improving capabilities by allowing the model more time to solve the problem, rather than by increasing its intelligence. Now that RL-training is nearing its effective limit, we may have lost the ability to effectively turn more compute into more intelligence.

First published on 20 October 2025

Frontpage

132

New Comment

23 comments, sorted by

top scoring

Click to highlight new comments since: Today at 1:00 PM

[-]Daniel Kokotajlo4mo3613

My hot take, thinking step by step, expecting to be wrong about things & hoping to be corrected:

What you basically doing is looking at the part of the s-curve prior to plateauing (the exponential growth part) and noticing that, in that regime, scaling up inference compute buys you more performance than scaling up training compute.

However, afaict, scaling up training compute lets you push the plateau part of the inference scaling curve out/higher. GPT5 pumped up with loads of inference compute is significantly better than GPT4 pumped up with loads of inference compute. Not just a little better. They aren't asymptoting to the same level.

So I think you are missing the important reason to do RL training. Obviously you shouldn't do RL training for a use-case that you can already achieve by just spending more inference compute with existing models! (Well, I mean, you still should, depending on how much you are spending on inference. The economics still works out depending on the details.) But the point of RL training is to unlock new levels of capability that you simply couldn't get by massively scaling up inference on current models.

Now, all that being said, I don't think it's actually super clear how much unlock you get. If the answer is "not much" then yeah RL scaling is doomed for exactly the reasons you mention. But there seems to have at the very least been a zero to one effect, where a little bit of RL scaling resulted in an increase in the level at which the inference scaling curve plateaus. Right?

Like, you say:

So the evidence on RL-scaling and inference-scaling supports a general pattern:
a 10x scaling of RL is required to get the same performance boost as a 3x scaling of inference
a 10,000x scaling of RL is required to get the same performance boost as a 100x scaling of inference

Grok 4 probably had something like 10,000x or more RL compared to the pure pretrained version of Grok 4. So would you predict therefore that xAI could take the pure pretrained version of Grok 4, pump it up with 100x inference compute (so, let it run 100x longer for example, or 10x longer and 10x in parallel) and get the same performance? (Or I'd run the same argument with the chatbot-finetuned version of Grok 4 as well. The point is, there was some earlier version that had 10,000x less RL.)

[-]Toby_Ord4mo132

I agree that separately from its direct boost to performance at the same inference-compute, RL training also helps enable more inference scaling. I talk about that above when I say "this RL also unlocked the ability to productively use much longer chains of thought (~30x longer in this example). And these longer chains of thought contributed a much larger boost."

A key thing I'm trying to get across is that I think this is where most of the benefit from RL is coming from. i.e. that while you pay the RL scaling costs at training time, you also need to pay the inference scaling costs at deployment time in order to get the benefit. Thus, RL is not an alternative to a paradigm where we cost-per-use is going to 10x, 100x, 1000x etc in order to keep seeing benefits.

Many people have said the opposite. The reason I looked into this is that people such as Dan Hendrycks and Josh You pushed back on my earlier statements that most of the gain has been in enabling longer CoT and the scaling of inference, saying respectively that it makes models much better at the same token-budget and that we're witnessing a one-off scale up of token budget but further gains will come from RL scaling. I think I've delivered clear evidence against those takes.

You'd probably enjoy my post on Inference scaling reshapes AI governance and the whole series on my website. I think they paint a picture where compute scaling is becoming a smaller tailwind for AI progress and where it is changing in character.

[-]Daniel Kokotajlo4mo71

That's reasonable, but it seems to be different from what these quotes imply:

So while we may see another jump in reasoning ability beyond GPT-5 by scaling RL training a further 10x, I think that is the end of the line for cheap RL-scaling.
... Now that RL-training is nearing its effective limit, we may have lost the ability to effectively turn more compute into more intelligence.

There are a bunch of quotes like the above that make it sound like you are predicting progress will slow down in a few years. But instead you are saying that progress will continue, and AIs will become capable of doing more and more impressive tasks thanks to RL scaling, but they'll require longer and longer CoTs to do those more and more impressive tasks? That's very reasonable and less spicy / contrarian, I think most people would already agree with that.

I like your post on inference scaling reshaping AI governance. I think I agree with all the conclusions on the margin, but think that the magnitude of the effect will be small in every case and thus not change the basic strategic situation.

My own cached thought, based on an analysis I did in '22, is that even though inference costs will increase they'll continue to be lower than the cost of hiring a human to do the task. I suppose I should revisit those estimates...

[-]Toby_Ord4mo60

I do think that progress will slow down, though its not my main claim. My main claim is to do with the tailwind of compute scaling will become weaker (unless some new scaling paradigm appears or a breakthrough saves this one). That is a piece in the puzzle of whether overall AI progress will accelerate or decelerate and I'd ideally let people form their own judgments about the other pieces (e.g. whether recursive self improvement will work, or whether funding will collapse in a market correction, taking away another tailwind of progress). But having a major boost to AI progress (compute scaling) become less of a boost is definitely some kind of an update towards lower AI progress than you were otherwise expecting.

Part of the issue with inference scaling as the main surviving form of scaling depends on how many more OOMs are needed. If it is 100x, there isn't so much impact. If we need to 1,000x or 1,000,000x it from here, it is more of an issue.

In that prior piece I talked about inference-scaling as a flow of costs, but it also scales with things beyond time:

costs grow in proportion to time (can’t make up the costs by longer use before the new model)
costs grow in proportion to number of users (can’t make up the costs through market expansion)
costs grow in proportion to the amount of use by each user (can’t make up costs through intensity of use)

This is a big deal. If you want to 100x the price of inference going into each query, how can you make that up and still be profitable? I think you need to 100x the willingness-to-pay from each user for each query. That is very hard. My guess is that the WTP doesn't scale with inference compute in this way, and thus that inference can only be 10x-ed when algorithmic efficiency gains and falling chip costs have divided the cost per token by 10. So I think that while previous rounds of training compute scaling could pay for themselves in the marketplace, I think that will stop for most users soon, and for specialist users a bit later.

The idea here is that the changing character of scaling affects the business model, making it so that it is no longer self-propelling to keep scaling, and that this will mean the compute scaling basically stops.

PS
Thanks for pointing out that second quote "Now that RL-training…" — I think that does come across a bit stronger than I intended.

[-]Daniel Kokotajlo4mo7-2

"inference scaling as the main surviving form of scaling " --> But it isn't though, RL is still a very important form of scaling. Yes, it'll become harder to scale up RL in the near future (recently they could just allocate more of their existing compute budget to RL, but soon they'll need to grow their compute budget) so there'll be a slowdown from that effect, but it seems to me that the next three OOMs of RL scaling will bring at least as much benefit as the previous three OOMs of RL scaling, which was substantial as you say (largely because it 'unlocked' more inference compute scaling. The next 3 OOMs of RL scaling will 'unlock' even more.)

Re: Willingness to pay going up: Yes, that's what I expect. I don't think it's hard at all. If you do a bunch of RL scaling that 'unlocks' more inference scaling -- e.g. by extending METR-measured horizon length -- then boom, now your models can do significantly longer, more complex tasks than before. Those tasks are significantly more valuable and people will be willing to pay significantly more for them.

[-]Toby_Ord4mo50

I'm a bit confused here. Your first paragraph seems to end up agreeing with me? i.e. that RL scaling derives most of its importance from enabling inference-scaling and is dependent on it. I'm not sure we really have any disagreement there — I'm not saying people will stop doing any RL.

Re WTP, I do think it is quite hard to scale. For example consider consumer use. Many people are paying ~$1 per day for AI access (the $20/month subscriptions). If companies need to 1000x inference in order to get the equivalent of a GPT level, then consumers would need to pay ~$1000 per day, which most people won't do (and can't do). Indeed, I think $10 per day is about the upper limit of what we could see for most people in the nearish future (=$3,650 per year, which is much more than they pay for their computer plus phone). Maybe $30 per day, if it reaches the total cost of owning a car (still only 1.5 OOM above current prices). But I can't really imagine it reaching that level for just the current amount of use (at higher intelligence) — I think that would only be reached if there were much more use too. Therefore, I see only 1 OOM increase in cost per query being possible here for consumer use, which means an initial 1 OOM of inference scaling after which the inference used could increase at the speed of efficiency gains (0.5 OOM per year) keeping a constant price (and meaning it absorbs the efficiency gains).

But it is different for non-consumer use-cases. Maybe there are industrial areas where it is more plausible to be willing to pay 100x or 1000x as much for the same number of queries to a somewhat more intelligent system (e.g. coding). I'm a bit skeptical though. I really think current scaling paying for itself was driven by being able to scale up the number of users and the amount of queries per API user, and these stop working here, which is a big deal.

[-]Wei Dai4mo166

While I appreciate this work being done, it seems a very bad sign for our world/timeline that the very few people with both philosophy training and an interest in AI x-safety are using their time/talent to do forecasting (or other) work instead of solving philosophical problems in AI x-safety, with Daniel Kokotajlo being another prominent example.

This implies one of two things: Either they are miscalculating the best way to spend their time, which indicates bad reasoning or intuitions even among humanity's top philosophers (i.e., those who have at least realized the importance of AI x-risk and are trying to do something about it). Or they actually are the best people (in a comparative advantage sense) available to work on these other problems, in which case the world must be on fire, and they're having to delay working on extremely urgent problems that they were trained for, to put out even bigger fires.

(Cross-posted to LW and EAF.)

[-]Jacob_Hilton4mo14-2

Thanks for this insightful analysis!

But it fits with the extreme information inefficiency of RL training, which (compared to next-token-prediction) receives less than a ten-thousandth as much information to learn from per FLOP of training compute.

If I am interpreting this correctly, there is a subtle mathematical error here: if RL requires a constant factor of 10,000 more compute than pretraining, this only shifts the graph of performance against log(compute), it doesn't change its slope. For RL to have a shallower slope, the information efficiency would have to decrease more quickly over the course of training for RL than for pretraining.

I think there are few potential reasons why information efficiency might decrease more quickly over the course of training for RL than for pretraining, but it is not so clear-cut:

Increased accuracy: you get fewer bits of information from a more biased coin flip than a fairer one, so information efficiency decreases as you approach 100% accuracy. But it's not clear whether this applies more to pretraining or to RL. Note also that in both cases the effect can potentially be alleviated by a curriculum.
Longer episodes: assuming RL just has a single binary reward at the end of each episode, information density decreases as episodes get longer. Since harder tasks require longer chains of thought, this one seems to clearly count against RL.
Overfitting: if there is a mismatch between the training distribution used for RL and the distribution used to benchmark the model, one might expect the density of information relevant to the benchmark to decrease as the model overfits to the training distribution. I think this one also counts against RL right now, but can be alleviated by improving data quality and quantity.

In particular, I think the fact that overfitting can be mitigated with better data cuts against your empirical observations. Since, as you correctly note, RL compute started from a very small base, it was initially much cheaper to scale up compute than to scale up data. But as RL compute becomes more expensive, it will become comparatively more cost-effective to scale up data. Once spending on both is being scaled up at a similar rate (as is economically inevitable as long as spending continues to increase), we should expect to see some regression towards the pretraining slope in my opinion.

Overall, I think the effect you spotted is real (due to things like episode length), but ultimately won't turn out to be as extreme as you estimated here. Quantitatively, I would guess that RL will look more like a power of 1.5-2 worse than pretraining rather a power of 3 worse, and there could be certain training regimes (e.g. fixed episode length) where they are closer than that.

[-]Toby_Ord4mo60

Thanks Jacob. It is less of a mathematical mistake and more me trying to make a qualitative connection between the observed poor scaling of RL training and theoretical mechanism I'd just written about of poor information efficiency, both of which look very big. I agree that the theoretical explanation doesn't seem to be quite the right shape to explain the empirical issue.

Of your potential reasons, I do think longer episodes is part of it. The R1 paper has a chart on page 8 showing that without trying to affect episode lengths, they increased linearly from 500 tokens to ~9000 tokens over 8000 episodes, suggesting pretty much 1 token increase per episode on average. Thus the information efficiency was going down linearly with episodes during training. It is a bit tricky to compare this with the o1 chart, whose x-axis is both logarithmic and also measuring training compute rather than episode number. I think this means it should be declining as 1/episodes = 1/root(compute) — since training compute would be growing as the square of the number of episodes. And I think that just accounts for the power being 0.5 lower than pretraining, rather than the 3 that I'm claiming. (But its late, and I haven't checked this through on paper.)

I do think you are right that the information inefficiency can't explain the whole issue, but it might be able to explain part of it. i.e. a shift to the left can't explain a line with a different slope, but the slope changing part of the way there could be part of the explanation.

[-]Toby_Ord4mo61

Actually, here is a slightly simpler way to think about it. How many more training steps do you do with RL when you 100x the compute? Given the linear episode length growth, you only do root(100) = 10x the number of training steps. So if capability gain were linear in the log of the number of training steps, it would grow as log(root(compute)) = log(compute)/2, whereas for pretraining it would grow as log(compute). So if inference-scaling were going as well as pre-training scaling (contra the 3/2 estimate I appealed to in my piece) then the information inefficiency theoretical explanation could exactly account for the observed scaling behaviour.

I'm not sure this is right (there were a couple of biggish assumptions there) but it does feel closer to being able to be a larger part of the actual explanation.

[-]Jacob_Hilton4mo20

Nice observation, and I agree with your calculation that linear episode length growth would account for a worse scaling exponent by a factor of 2 (or more generally, episode length growing with exponent k would account for a worse scaling exponent by a factor of k+1).

Note also that this suggests a potential remedy, namely controlling episode length, but there is less incentive to apply this when data is more of a constraint than compute.

[-]Vladimir_Nesov4mo20

In the long run, if (contribution to the quality of result from) RL scales slower than pretraining and both are used at a similar scale, that just means that RL doesn't improve the overall speed of scaling (in the model quality with compute) compared to pretraining-only scaling, and it wouldn't matter how much slower RL scaling is. But also, pretraining might face a scaling ceiling due to training data running out, while RL likely won't, in which case slower scaling of RL predicts slower scaling overall compared to pretraining-only scaling, once pretraining can no longer be usefully scaled.

I would guess that RL will look more like a power of 1.5-2 worse than pretraining rather a power of 3 worse

There's some compute optimal ratio of pretraining compute to RL compute (describing the tradeoff within a fixed budget of total compute or GPU-time), which depends on the amount of total compute. If usefulness of RL and pretraining scale differently, then that ratio will tend either up or down without bound (so that you'd want almost all compute to go to pretraining, or almost all compute to go to RL, if you have enough compute to extremize the ratio).

What matters in practice is then where that ratio is in the near future (at 1e26-1e29 FLOPs of total compute). Also, there's going to be some lower bound where at least 10-30% will always be spent on either as long as they remain scalable and enable that much in some way, because they are doing different things and one of them will always have an outsized impact on some aspects of the resulting models. In particular, RL enables training in task-specific RL environments, giving models competence in things they just can't learn from pretraining (on natural data), so there's going to be a growing collection of RL environments that teach models more and more skills, which in practice might end up consuming the majority of the compute budget.

So even if for capabilities usefully trainable with both pretraining and RL it turns out that allocating 5% to RL is compute optimal at 1e28 FLOPs, in practice 70% of compute (or GPU-time) might still go to RL, because the capabilities that are only trainable with RL end up being more important than doing a bit better on the capabilities trainable with either (by navigating the compute optimal tradeoff between the two). Also, natural text data for pretraining is running out (at around 1e27-1e28 FLOPs), while RL is likely to remain capable of making use of more compute, which also counts towards allocating more compute for RL training.

[-]Toby_Ord4mo30

Yes, you would get an optimal allocation with non-zero amounts to each. A simple calculation suggests 1:2 ratio of RL-OOMs : Inference-OOMs. e.g. scaling up RL by 100x and inference by 10,000x. So it could easily lead to RL compute becoming an ever-smaller fraction of FLOPs. But there are additional complications from the fact that inference is a flow of costs and also increases with the number of users, while RL is a fixed cost.

On the simple model and with my scaling numbers, the contribution of RL to capabilities (keeping token-use fixed) would be 20% — a 1:4 ratio with inference because half as many OOMs and half the effect per OOM.

The main relevance of all this to me is that even if people keep doing RL, RL alone won't contribute much to benchmark performance. I think it would need to 100,000x current total training compute to gain the equivalent of just 100x on pretraining in the early years. So if pre-training is slowing, AI companies lack any current method of effective compute scaling based solely around training compute and one-off costs.

[-]henryaj4mo3-2

Nit: scaling up RL by 100x and inference by 10,000x would be a 1:3 OOM ratio I think

[-]Vladimir_Nesov4mo2-1

RL can develop particular skills, and given that IMO has fallen this year, it's unclear that further general capability improvement is essential at this point. If RL can help cobble together enough specialized skills to enable automated adaptation (where the AI itself will become able to prepare datasets or RL environments etc. for specific jobs or sources of tasks), that might be enough. If RL enables longer contexts that can serve the role of continual learning, that also might be enough. Currently, there is a lot of low hanging fruit, and little things continue to stack.

So if pre-training is slowing, AI companies lack any current method of effective compute scaling based solely around training compute and one-off costs.

It's compute that's slowing, not specifically pre-training, because the financing/industry can't scale much longer. The costs of training were increasing about 6x every 2 years, resulting in 12x increase in training compute every 2 years in 2022-2026. Possibly another 2x on top of that every 2 years from adoption of reduced floating point precision in training, going from BF16 to FP8 and soon possibly to NVFP4 (likely it won't go any further). A 1 GW system of 2026 costs an AI company about $10bn a year. There's maybe 2-3 more years at this pace in principle, but more likely the slowdown will be gradually starting sooner, and then it's Moore's law (of price-performance) again, to the extent that it's still real (which is somewhat unclear).

[-]williawa4mo20

I'm getting somewhat confused about information-theoretic arguments around RL scaling. What makes sense to me is that: information density is constant per token in pre-training, no matter how long you make contexts, but decrease 1/n as you make RL trajectories longer. This means that if you look at just scaling context length, RL should get asymptotically less efficient.

What's not clear to me is the relationship between "bits getting into the weights" and capabilities. Using the information-theoretic argument above, you'd probably get that in o3, one millionth of the information in the weights comes from RL, or something like that, I'm not sure. But o3's advance in capabilities over 4o seem clearly far more than a millionth factor improvement. I think this would be true even if you work to disentangle inference time scaling and RL scaling. Eg ratio of bits in o1 vs o3. Number of bits in o3 over o1 is very small, but thinking for the same time, the difference is very noticeable.

[-]Vladimir_Nesov4mo51

The clues about the compute optimal pretraining-to-RL ratio are interesting. RLVR still has a long way to go with even longer reasoning traces (sequential inference scaling), since currently it's still mostly 20k-50k tokens, while 1M contexts are supported even for current models, and soon the suddenly increasing HBM capacity of scale-up worlds for newer hardware^[1] will enable much longer contexts, even for larger models. Contexts this long don't really work after pretraining, but RLVR might be able to make them work.

An AGI-relevant implication of longer contexts (as opposed to pretraining or RL scaling) is that continual learning (the crucial capability of automated adaptation) currently only really works within a context, as an aspect of in-context learning. If contexts were to increase to millions of tokens, they could in principle hold the equivalent of years of experience (3 years at 40k tokens per day is 44M tokens). If scaling and RLVR make such tokens do a good enough job at implementing continual learning, algorithmic innovations that go about it in a more reasonable way wouldn't be necessary to overcome this obstruction.

The change is from about 1 TB for 8-chip Nvidia servers to 14-20 TB for GB200/GB300 NVL72 getting online this year, and there's also 50 TB for a TPUv7 Ironwood pod. Thus in 2026-2027, it suddenly becomes possible to serve much larger models than in 2023-2025, as a step change rather than gradually. And for the same reason the models will be able to work with much longer contexts, at least in principle. ↩︎

[-]StanislavKrym4mo10

Except that the METR evaluation had GPT-5 use a budget of more than a hundred dollars per run and, as far as I understand, a context window of 400K tokens. The price of usage of GPT-5 is $10/M output tokens.

As for big contexts and their utility, I tried to cover it in my analysis^[1] of the ARC-AGI benchmark in the high-cost regime where Claude Sonnet 4.5, Grok 4, GPT-5-pro form another straight line with low inclination.

^{^}
However, it could have been invalidated by the appearance of Grok 4 Fast on the ARC-AGI-1 benchmark. While Grok 4 Fast is a low-cost model, it significantly surpassed the Pareto frontier.

[-]Vladimir_Nesov4mo30

I'm distinguishing sequential inference scaling (the length of a single reasoning trace, specifically the output tokens) from parallel scaling (or more generally agentic multi-trace interaction processes). GPT-5 Pro and such will work with more tokens per request by running things in parallel, agentic scaffolds will generate more tokens by spinning up new instances for subtasks as the situation develops, and often there are many more input tokens in a context than there are output tokens generated within a single reasoning trace.

[-]StanislavKrym4mo30

Now that RL-training is nearing its effective limit, we may have lost the ability to effectively turn more compute into more intelligence.

The problem is that we haven't tried many new architectures and don't know what exactly is the key to building a capable architecture. However, we are likely on track to finding that CoT-based AIs don't scale to superhuman coders which could be necessary for AI alignment.

[-]david reinstein1mo10

The Unjournal (unjournal.org) is considering commissioning this post for expert evaluation in our applied stream.

Looking for any feedback (here or privately) on whether this would be high-value, how to go about it (what particular issues/expertise), whether other research in this domain is higher value, etc.

[-]david reinstein2mo10

I asked some LLM models/agents to consider this post, in preparation for considering this for some form of Unjournal.org evaluation. FWIW:

1. Conversation started with GPTPro

2. RoastmyPoast.org "epistemic audit" (result: C+, 68/100, which is a bit below average iirc)

3. Claude 4.5 Opus

My take: they saw this post as plausible, no major errors, and generating some useful insights, but with some important limitations, and the main claims are not all 'obviously demonstrated'

Below. Some overall syntheses/pulled quotes that seemed relevant to me. All folded content is LLM

GPTPro

What holds up (probability ~0.6–0.8):

Public evidence supports that inference budgets buy big gains on current reasoning benchmarks, and RL post-training scaling appears meaningfully less compute-efficient (often by ~2 extra decades to cover similar 20→80 improvements).

RL post-training is now plausibly reaching “pretraining-scale” at least at xAI (and maybe elsewhere soon), so “RL is no longer a trivially cheap add-on” is real.

What’s uncertain / overconfident (probability ~0.2–0.5)
The specific conversion “100× training ≈ 1,000× inference” as a general rule, and thus the specific “1,000,000× RL for a GPT-level jump.” This rests on a non-robust mapping and then exponentiates it.

The implication that we’re “near the effective limit” of RL training gains, given recent public RL-scaling work emphasizing recipe dependence and improved efficiency/asymptotes.

...[Verdict] Ord is on solid ground that current reasoning improvements rely heavily on inference budgets ... he is on weak ground when he turns that into a near-term “end of scaling” claim via a brittle 1,000,000× extrapolation.

I asked what aspects were missing in the comments on LW and EA Forum; it noted a lack of discussion of ...

lack of discussion of ...

how sensitive the RL-vs-inference scaling gap is to model size, data quality, reuse, training recipe, domain/task type;
how recent empirical RLHF / RL‑post‑training research (on open, small-scale, or controlled setups) might affect that gap;
the analogy of “inefficiency gap = fundamental ceiling” vs. “inefficiency may be engineering‑level problem, solvable with better algorithms/research”;
the degree of uncertainty involved in extrapolating over many orders of magnitude;
the possibility that RL‑post-training inefficiency might be significantly reduced in the future (with better methodology).

So in short: the public conversation has touched some of the major “skeptical” themes, but not with the depth, technical framing, or caution that a more expert‑oriented review might use.how sensitive the RL-vs-inference scaling gap is to model size, data quality, reuse, training recipe, domain/task type;

how recent empirical RLHF / RL‑post‑training research (on open, small-scale, or controlled setups) might affect that gap;
the analogy of “inefficiency gap = fundamental ceiling” vs. “inefficiency may be engineering‑level problem, solvable with better algorithms/research”;
the degree of uncertainty involved in extrapolating over many orders of magnitude;
the possibility that RL‑post-training inefficiency might be significantly reduced in the future (with better methodology).

So in short: the public conversation has touched some of the major “skeptical” themes, but not with the depth, technical framing, or caution that a more expert‑oriented review might use

Claude 4.5 Opus

Bottom Line Assessment
Aspect Assessment
Methodological rigor Decent for available data; not peer-reviewed ML research
Alignment with expert consensus Broadly consistent with Sutskever, partially at odds with Epoch AI's optimism
Potential blind spots Algorithmic improvements, insider knowledge, IDA possibilities
Originality Useful synthesis, but not breakthrough technical analysis
Should you trust it? Trust it as informed policy analysis, not as definitive ML research
The honest answer: Ord is probably directionally correct that RL scaling is less efficient than pre-training was, and that we're approaching limits. But the specific numbers (10,000x, 1,000,000x) should be held loosely. Actual ML researchers at frontier labs know things that aren't public, and algorithmic breakthroughs could change the picture.

RoastMyPoast Epistemic Audit (C+, 68/100)

Uses agents and claude-sonnet-4-5-20250929

Noted overconfidence

Noted "Overconfidence" about

Causal claims about what RL "unlocks" or "allows" without establishing mechanism

.... Long-range extrapolations presented as reliable estimates

And "Single points of failure":

The assumption that observed slopes continue far beyond measured ranges
The causal interpretation of RL "unlocking" inference rather than teaching token-intensive strategies
The representativeness of OpenAI's published data

RE Unjournal.org potentially commissioning this for an evaluation of some form, we might consider

Is this post highly influential on its own (are funders and labs using this to guide important policy choices)?
Is there further expertise we could unlock that is not reflected in these comments? (The LLMs suggested some evaluators, but we sometimes find it hard to get people to accept the assignment and follow through)
Is there a more formal research output that covers this same ground, coming from ML researchers, scaling experts, etc.?

[-]StanislavKrym4mo10

Jones (2021) and EpochAI both estimate that you need to scale-up inference by roughly 1,000x to reach the same capability you’d get from a 100x scale-up of training.

This also is confusing to me. Suppose that we scaled up training a hundred times. Then we are either overtraining the model (which does not increase performance beyond a level determined by the model's size!) or are working with a different model which has about a hundred times more parameters. Then what does scaling up inference mean, a tenfold increase in the number of tokens used?

Moderation Log

Aspect	Assessment
Methodological rigor	Decent for available data; not peer-reviewed ML research
Alignment with expert consensus	Broadly consistent with Sutskever, partially at odds with Epoch AI's optimism
Potential blind spots	Algorithmic improvements, insider knowledge, IDA possibilities
Originality	Useful synthesis, but not breakthrough technical analysis
Should you trust it?	Trust it as informed policy analysis, not as definitive ML research