ProRL is a pretty hard counter to the "RLVR only elicits existing capabilities" argument:
Furthermore, Nemotron-Research-Reasoning-Qwen-1.5B offers surprising new insights —RL can
indeed discover genuinely new solution pathways entirely absent in base models, when given sufficient
training time and applied to novel reasoning tasks. Through comprehensive analysis, we show that
our model generates novel insights and performs exceptionally well on tasks with increasingly
difficult and out-of-domain tasks, suggesting a genuine expansion of reasoning capabilities beyond
its initial training. Most strikingly, we identify many tasks where the base model fails to produce any
correct solutions regardless of the amount of sampling, while our RL-trained model achieves 100%
pass rates (Figure 4).
Conflicting results from other papers probably come from:
Another counter-argument is classic RL experiments like AlphaZero and Atari gameplay. These obviously elicit new capabilities, since the models reach superhuman performance after starting with no knowledge at all. In theory, there's no reason that RLVR couldn't teach untrained LLMs to do math from scratch (but in practice sparse rewards mean that the sun would burn out before the model figures it out).
I’ll take a look at the ProRL paper later today, thanks for the second.
I agree that RL inefficiency is one problem but I think this can be reasonably factored out in experiments if not in production. The “RL Razor” paper does an experiment where they do SFT on a KL budget and show they get the same ‘reduced forgetting’ effect - I think of this reduced forgetting, in light of the ‘off the principals’ paper, as being something like inertia of learning new representations, or inability to pass through regions of high curvature. That is, there are definitely still qualitative differences at the per-batch level; perhaps this is just an efficiency thing, but it’s plausible to me that it might be more like GD vs SGD in finding different types of solutions with different generalization properties because of the curvature bias. On priors this would surprise me for RL, but I guess I like these papers because they updated me away from that a bit.
As for the milestone self-play results, you’re right that they’ve no place in this story - my semi-cope pro-tem guess is that LLMs operate in a ‘different regime’. Two intuitions for this:
For the board games especially, there is no ‘curriculum of representations’ - perhaps learning strategic action is easier for RL than learning to have good ontic chunkings of the world. E.g. for Go, there is no need to learn complex hierarchical representations as table-stakes. The harder Atari games (e.g. Montezuma's Revenge) are counter-evidence to this, except perhaps insofar as their hardness was due to them needing more complex representations? This is an even poorer explanation for something like AlphaStar. But nonetheless, we ended up switching to the pre-training paradigm instead of riding A3C to ASI.
The relatively poor performance of things like process supervision for LLM training is still surprising to me and I can’t account for it. If value-learning methods ‘don’t work’ in the LLM regime, is this because there is a structural difference in the data? On priors this just has to be a skill issue, but presumably had someone really solved this and gotten 3-5 OOMs of RLVR efficiency, wouldn’t we be done already? And presumably enough money-effort has been expended that were dramatic success achievable, it’d’ve been done? (To be clear here, by ‘work’ I mean “work so well that per-token gradients aren’t more than one or two OOMs worse than for pre-training”)
My intuition: RLVR typically follows the path of least resistance. This often results in it tweaking small behavioral knobs: improving reliability of methods, downweighting unreliable and upweighting reliable methods, encouraging adaptive behaviors like backtracking and discouraging maladaptive behaviors like error self-consistency and self-amplification. Upwards from there is old circuits being used in novel ways: i.e. existing "error detectors" rewired to feed into self-check/backtracking triggers.
But nothing prevents RLVR from burning in new behaviors from scratch, assuming there is no way to reach the reward without them and sufficient RLVR pressure is applied.
The path-of-least-resistance tendency is driven in part by the low-KL properties of RLVR setups (both explicit and downstream from RLVR being on-policy), in part by how plentiful the "low-hanging fruit" is (base model behavior that is prediction-optimal but task-suboptimal), and in part by how "expensive" RLVR pressure is - few setups apply enough of it to get truly novel behaviors burned in.
Anyone have a good intuition for why RL post-training updates should be low rank? Is this just a symptom of picking low-hanging fruit, such that when there's a low-dimensional activation subspace that impacts reward a lot, the weight update will be approximately low rank, and the update will be higher rank after the low-hanging fruit is picked?
This post was produced as part of MATS 9.1 under the mentorship of Richard Ngo. It is not part of my main research project, but the ideas have been an important conceptual anchor to me. Epistemically, treat this as watercooler talk. Please feel free to share additional or contradictory work in the comments.
Low-fidelity 5-word summary:
RLVR changes propensity, not lability
Tl;dr is that RL acts on the weights of LLMs in a qualitatively different way from pre-training / SFT. [1] I give a mental model of how and why, and draw a speculative connection to 'emergent misalignment' and 'subliminal learning'.
Most of the papers below I heard of via these two youtube videos by Bycloud:
'The LLM's RL Revelation We Didn't See Coming'
'The RL Irony in LLMs (and its insane new meta)'
1. Weight-level
1.a The Path Not Taken: RLVR Provably Learns Off the Principals (This is the most important one in this section)
(Figure 1c)
1.b Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
(Figure 1)
1.c RLVR updates are sparse / lower-rank than SFT
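To make the "naive rank versus effective rank" distinction concrete, here is a minimal sketch that is not taken from any of the papers. It assumes a toy weight update `delta_w` built as a rank-2 matrix plus small dense noise, and it uses the entropy-based effective rank (the exponential of the Shannon entropy of the normalized singular values). All names and sizes are illustrative assumptions.

```python
import numpy as np

def effective_rank(matrix):
    """Entropy-based effective rank: exp of the Shannon entropy of normalized singular values."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so log is defined
    return float(np.exp(-np.sum(p * np.log(p))))

rng = np.random.default_rng(0)
d = 64  # toy layer width (assumed)

# Hypothetical update: a rank-2 signal plus tiny dense noise on every entry.
low_rank_part = rng.normal(size=(d, 2)) @ rng.normal(size=(2, d))
delta_w = low_rank_part + 1e-4 * rng.normal(size=(d, d))

naive = int(np.linalg.matrix_rank(delta_w))  # counts every singular value above a tiny tolerance
eff = effective_rank(delta_w)                # weights singular values by their relative size

print(f"naive rank: {naive}, effective rank: {eff:.2f}")
```

In this toy case the naive rank reports a full-rank update because the noise lifts every singular value off zero, while the effective rank stays close to the two directions that carry almost all of the update.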
2. Behavioural-level
2.a Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (This is the most important one in this section)
(Figure 1)
(Figure 2)
(Figure 7)
2.b The Invisible Leash: Why RLVR May or May Not Escape Its Origin
2.c RL's Razor: Why Online Reinforcement Learning Forgets Less
2.d Spurious Rewards: Rethinking Training Signals in RLVR
(Figure 1)
Further reading (unread)
Some more papers which ChatGPT suggests are important and which I have not yet read nor internalized:
3. My mental model
RL changes propensity, not lability (and I think that propensity changes correspond to changing singular values of the eNTK and lability corresponds to rotating eNTK eigenvectors)
3.a Brush-up on Hessian / eNTK eigenvectors
TLDR: Hessian, Gauss-Newton, and eNTK eigenvectors are all morally the same thing, and I quote the eNTK because it's loss-independent.
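The Hessian of the loss decomposes (Gauss-Newton) as

$$H = \nabla_\theta^2 L = J^\top H_L J + \sum_i \frac{\partial L}{\partial f_i}\, \nabla_\theta^2 f_i$$

where $J = \partial f / \partial \theta$ is the output Jacobian and $H_L = \nabla_f^2 L$ is the loss curvature in output space. Near an interpolating minimum the residuals $\partial L / \partial f_i$ vanish, so the second term drops out and $H \approx G = J^\top H_L J$. The eNTK $K = J J^\top$ shares its nonzero spectrum with $J^\top J$, with eigenvectors related by $v_i = J^\top u_i / \sqrt{\lambda_i}$, so Hessian $\approx$ Gauss-Newton $\approx$ eNTK eigenvectors.
I quote the eNTK because $G$ carries the loss-dependent factor $H_L$, whereas $K$ depends only on the model's input-output Jacobian $J$ -- so "RL doesn't rotate the eigenvectors" is a statement about the model, not a particular loss $L$. The 'principal subspaces' of Off the Principals are a cheap proxy for the top of this spectrum.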
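The shared-spectrum claim can be checked numerically. The sketch below is a toy illustration only: a random matrix `J` stands in for the output Jacobian (outputs by parameters), the loss curvature is assumed to be the identity, and the sizes are arbitrary. It compares the spectrum of the kernel `J @ J.T` with that of `J.T @ J` and maps a kernel eigenvector into parameter space.

```python
import numpy as np

rng = np.random.default_rng(0)
n_outputs, n_params = 5, 12                   # toy sizes (assumed)
J = rng.normal(size=(n_outputs, n_params))    # stand-in for the output Jacobian

K = J @ J.T   # eNTK-like kernel in output space (n_outputs x n_outputs)
G = J.T @ J   # Gauss-Newton matrix with identity loss curvature (n_params x n_params)

k_vals, k_vecs = np.linalg.eigh(K)   # eigenvalues in ascending order
g_vals, g_vecs = np.linalg.eigh(G)

# G has n_params - n_outputs (near-)zero eigenvalues; the rest match K's spectrum.
spectra_match = np.allclose(k_vals, g_vals[-n_outputs:])

# Map the top kernel eigenvector u to parameter space: v = J^T u / sqrt(lambda).
u, lam = k_vecs[:, -1], k_vals[-1]
v = J.T @ u / np.sqrt(lam)
is_unit = np.isclose(np.linalg.norm(v), 1.0)
is_eigvec = np.allclose(G @ v, lam * v)

print(spectra_match, is_unit, is_eigvec)
```

With a non-identity loss curvature the two spectra no longer coincide exactly, which is the reason given above for quoting the loss-independent kernel.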
Everything comes down to kernel eigenvector rotations: RL re-fits the magnitudes assigned to ~fixed eigenvectors, while SFT rotates the eigenvectors themselves.
I think this is likely true over short periods of training, but I'm not sure if this is true over longer periods of training; I am fairly sure that this does not come from merely having fewer rollouts / samples - it seems true on a per-batch basis. Whether or not this proves true in weight-space, I expect a looser version of this story holds in 'trait-space' - replace the network-output Jacobian with the Jacobian of a suite of behavioural/eval-losses, for example ("alignment", "SWE-talent", "similarity to Winston Churchill") etc. This intuition comes from observations in biology that a similar object called the G-matrix tends to follow roughly this story.
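One way to operationalize "rescale versus rotate" is to compare the top-k eigenspace of a kernel before and after an update. The following is a toy sketch under stated assumptions, not a measurement on any real model: `K0` is a synthetic symmetric kernel, the "propensity-like" update rescales its eigenvalues while preserving their order, and the "lability-like" update swaps in a fresh random eigenbasis. The overlap score is the mean squared cosine of the principal angles between the two subspaces.

```python
import numpy as np

def top_subspace(matrix, k):
    """Orthonormal basis for the top-k eigenvectors of a symmetric matrix."""
    _, vecs = np.linalg.eigh(matrix)   # eigenvalues ascending, so take the last k columns
    return vecs[:, -k:]

def subspace_overlap(a, b):
    """Mean squared cosine of principal angles between two subspaces (1.0 means identical)."""
    cosines = np.linalg.svd(a.T @ b, compute_uv=False)
    return float(np.mean(cosines ** 2))

rng = np.random.default_rng(0)
d, k = 30, 3                               # toy sizes (assumed)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
vals = np.linspace(10.0, 1.0, d)           # distinct, descending eigenvalues
K0 = Q @ np.diag(vals) @ Q.T

# Propensity-like: same eigenvectors, rescaled magnitudes (ordering preserved).
K_rescaled = Q @ np.diag(vals * np.linspace(3.0, 1.0, d)) @ Q.T

# Lability-like: same magnitudes, eigenvectors replaced by a new random basis.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
K_rotated = R @ np.diag(vals) @ R.T

base = top_subspace(K0, k)
overlap_rescaled = subspace_overlap(base, top_subspace(K_rescaled, k))
overlap_rotated = subspace_overlap(base, top_subspace(K_rotated, k))
print(f"rescaled: {overlap_rescaled:.3f}, rotated: {overlap_rotated:.3f}")
```

The same overlap score could in principle be applied in 'trait-space' by building the kernel from gradients of eval losses instead of network outputs, though that is an extrapolation rather than something tested here.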
3.b Speculative Implications
This model might retrodict emergent misalignment and subliminal learning. I'll give specific stories below; low confidence in these narratives, but I expect the general propensity/lability division to be important regardless.
Note that both stories below are about SFT, not RL -- they really just illustrate what lability (eigenvector rotation) buys you, since the dramatic generalization comes from rotating eigenvectors rather than rescaling them.
Subliminal learning. Decompose the teacher's output, and hence the student's gradient, into a task-nominal part and a subliminal part:

$$g = g_{\text{nominal}} + g_{\text{subliminal}}$$
IIUC subliminal learning only works between models of the same class (for some value of 'the same'). The nominal gradient is shared across models -- a number sequence is a number sequence -- so it transfers however normal training does. The subliminal gradient is model-specific: against a different model's eigenbasis it is incoherent, with ~zero projection, so there is nothing to train on; against the same model's eigenbasis it is exactly aligned, and a small-but-consistent push along an existing eigendirection accumulates over many SFT steps.
This makes subliminal learning a case where SFT behaves propensity-like: because the data is self-generated by that same model, the update is already aligned with the eigenbasis, so it rescales magnitudes along fixed directions rather than rotating them. Like an antenna which achieves high gain by being extremely directional.
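The "incoherent against a different eigenbasis, aligned against the same one" argument can be illustrated with a toy simulation. Everything here is an assumption made for illustration: random orthonormal bases stand in for model eigenbases, the subliminal signal is a small fixed vector along one teacher direction, and per-step noise is Gaussian. The sketch sums the component of each step's gradient that lands inside the student's eigenbasis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, steps = 2000, 5, 200   # toy sizes, chosen for illustration only

def random_basis(dim, rank):
    """Orthonormal columns standing in for a model's top eigendirections."""
    q, _ = np.linalg.qr(rng.normal(size=(dim, rank)))
    return q

teacher_basis = random_basis(d, k)   # shared by the teacher and a same-class student
other_basis = random_basis(d, k)     # an unrelated model's eigendirections

# Hypothetical subliminal signal: a small, consistent push along one teacher direction.
signal = 0.1 * teacher_basis[:, 0]

def accumulated_push(student_basis):
    """Norm of the summed per-step gradient components inside the student's eigenbasis."""
    total = np.zeros(k)
    for _ in range(steps):
        grad = signal + rng.normal(scale=0.02, size=d)   # signal plus per-step noise
        total += student_basis.T @ grad
    return float(np.linalg.norm(total))

same_model = accumulated_push(teacher_basis)
different_model = accumulated_push(other_basis)
print(f"same eigenbasis: {same_model:.2f}, different eigenbasis: {different_model:.2f}")
```

In high dimension a fixed direction has only a small projection onto an unrelated low-dimensional subspace, so the push accumulates in the matched case and mostly fails to register in the mismatched one.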
Emergent misalignment. Here the story does lean on lability. Suppose 'alignment' is carried by a low-dimensional -- plausibly near one-dimensional -- shared subspace. Then narrow misaligned SFT data, whatever its surface topic, has large overlap with that whole subspace, so a single push shifts the global alignment representation and the misalignment generalizes broadly. How far it generalizes would track the size of that subspace. (At least consistent with the empirical EM finding that a single direction can mediate and steer the effect.)
Most of the research here compares RLVR vs SFT; I'm not sure if there are important differences between SFT vs. pre-training. Likewise for RLVR vs RLHF, respectively. ↩︎
They attribute this behaviour largely to the KL penalty, but other papers disagree. I don't have a strong view on which is right. ↩︎
The authors explicitly claim that RL updates are also full-rank, but afaict, they use the naive calculation for sparsity and rank - meaning I don't trust the rank number, and the sparsity number is all the more surprising. 'Off the Principals' talks about this; I think rank should use entropy-rank or effective-rank for robustness reasons. ↩︎
For alignment: consider the differential effect of RLHF versus e.g. SFT persona fine-tuning. ↩︎