Hey, I think I relate to this! I didn't expect to see this phenomenon described anywhere, and I'm happy that you took the time to describe it.
I think I was able to improve on this (or at least the aspects of it that were an issue for me) by finding new ways to express what I think and feel in lossier, less precise, more vibe-driven ways.
Does anyone have a sense of whether, qualitatively, RL stability has been solved for any practical domains?
This question is at least in part asking for qualitative speculation about how post-training RL works at the big labs, but I'm interested in any partial answers people can come up with.
My impression of RL is that there are a lot of tricks to "improve stability", but performance is path-dependent in pretty much any realistic/practical setting (where the state space is huge and the action space may be huge or continuous). Even on larger toy problems, my sense is that the various RL algorithms only work maybe 70% of the time, and the other 30% of the time the reward randomly declines partway through training.
One obvious way of getting around this is to just resample: rerun training until you get a good run. If there are no more principled/reliable methods, this would be the default way of getting a good result from RL, and it would follow that that's just what the big labs do. Of course, they may have secretly solved some of this, but it's hard to imagine what form that would take.
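To make "just resample" concrete, here's a minimal sketch of what I have in mind (every name here is hypothetical; `train_rl_run` stands in for a full training job, not anyone's actual pipeline):

```python
import random


def train_rl_run(seed: int) -> float:
    """Hypothetical stand-in for one full RL training run.

    Returns a final evaluation reward; in practice this would launch a
    training job with the given seed and read back its eval metric.
    """
    rng = random.Random(seed)
    # Toy model of the failure mode described above: ~30% of runs collapse,
    # the rest end up somewhere decent.
    return rng.uniform(0.0, 0.3) if rng.random() < 0.3 else rng.uniform(0.7, 1.0)


def best_of_k(k: int = 5) -> tuple[int, float]:
    """Resampling as a selection strategy: run k seeds, keep the best."""
    results = {seed: train_rl_run(seed) for seed in range(k)}
    best_seed = max(results, key=results.get)
    return best_seed, results[best_seed]


if __name__ == "__main__":
    seed, reward = best_of_k()
    print(f"kept seed {seed} with eval reward {reward:.2f}")
```

Obviously this just trades compute for reliability, which is why I'd expect it to be the fallback rather than a secret trick.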
followup: after looking at the appendix i'm pretty sure the biggest distinction is that the SFT'd models in this paper are SFT'd on data that comes from entirely different models/datasets. so not only does the data not come from a policy which adapts during training, it comes from policies very different from the model's own. i think this by itself is enough to explain the results of the paper; it's a useful result, but not the one i imagined upon reading the title.
i would still like to know the answer to the original question.
it's hard to find definitive information about this basic aspect of how modern RL on LLMs works:
are there any particularly clever ways of doing credit assignment for the tokens in a sequence S that resulted in high reward?
moreover, if you adopt the naive strategy of asserting that all of the tokens are equally responsible for the reward, is the actual gradient update to the model parameters mathematically equivalent to the one you'd get from SFTing the model on S (weighted by the reward, and possibly adjusted as in GRPO)?
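for concreteness, here's the naive picture as i understand it (a sketch of the textbook REINFORCE case, not a claim about what any lab actually runs): for an on-policy sample $S = (s_1, \dots, s_T)$ with scalar reward $R(S)$, spreading credit uniformly over tokens gives the gradient

$$\nabla_\theta J = \mathbb{E}_{S \sim \pi_\theta}\left[ R(S) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(s_t \mid s_{<t}) \right],$$

while the SFT cross-entropy loss on the same $S$ has gradient

$$\nabla_\theta \mathcal{L}_{\text{SFT}}(S) = -\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(s_t \mid s_{<t}),$$

so a single reward-weighted SFT step on $S$ looks identical to me (with GRPO swapping $R(S)$ for a group-normalized advantage, plus its clipping/KL terms), and any difference would seem to have to come from where $S$ is sampled from.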
the followup is this: in this paper they claim that SFT'd models perform badly at something and RL'd models don't. i can't imagine what the difference between the two training procedures would even be, except that the RL'd models are trained on samples which are on-policy for them.
at the end of the recent, somewhat famous blogpost about llm nondeterminism (https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/) they assert that determinism is enough to make an rlvr run more stable, without needing an importance sampling correction.
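(for reference, a sketch of what i understand the importance-sampling correction to be: if the sampler's numerics give a slightly different policy $\pi_{\text{sample}}$ than the trainer's $\pi_{\text{train}}$, the generated sequences are technically off-policy, and the unbiased fix reweights each sample by the likelihood ratio, roughly

$$\nabla_\theta J \approx \mathbb{E}_{S \sim \pi_{\text{sample}}}\left[ \frac{\pi_{\text{train}}(S)}{\pi_{\text{sample}}(S)} \, R(S) \, \nabla_\theta \log \pi_{\text{train}}(S) \right],$$

with bitwise-identical sampling and training the ratio is exactly 1 and the correction vanishes.)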
is there something i'm missing here? my strong impression is that the scale of the nondeterminism is quite small and random in direction, so it isn't likely to affect something aggregate-scale like the qualitative effect of an entire gradient update. (i can imagine that the accumulation of many small random errors does bias the policy towards being generally less stable, which would be qualitatively worse, yes...)
without something that undercuts the point above, my prior is instead that the graph is cherry-picked, intentionally or not, to increase the perceived importance of llm determinism.
Yeah, whenever a result is sensational and comes from a less-than-absolutely-huge name, my prior is that the result is due to mistakes (something like 60-95%, depending on how surprising it is), and de facto this means I just don't update on papers like this one anymore until significant followup work is done.
I wonder if you're referring to the "spurious rewards" paper. If so, I wonder if you're aware of [this critique](https://safe-lip-9a8.notion.site/Incorrect-Baseline-Evaluations-Call-into-Question-Recent-LLM-RL-Claims-2012f1fbf0ee8094ab8ded1953c15a37) of its methodology, which might be enough to void the result.
I think the critique generalizes if it's made a little more focused. If a huge number of papers appeared that just demonstrated EM arising in a bunch of superficially varied settings, without a clear theory of why, this post would be a good critique of that phenomenon.
I don't necessarily disagree with this way of looking at things.
Serious question: how do you calibrate the standard by which you judge that something is good enough to warrant respect, or self-respect?
To illustrate what I mean, in one of your examples you judge your project teammates negatively for not having had the broad awareness to seriously learn ML ahead of time, in the absence of other obvious external stimuli to do so (like classes that would be hard enough to actually require that). The root of the negative judgment is that a better objective didn't occur to them.
How can you ever be sure that there isn't a better objective that simply isn't occurring to you at any given time? More broadly, how can you be sure there isn't a generally better way of living that you're messing up by not currently following?
If, hypothetically, you encountered a better version of yourself who presented you with a better objective and better ways of living, would you retroactively judge your life up to the present moment as worse and less worthy of respect? (Perhaps, based on your answer to the previous question, the answer is "yes", but you think this is an unlikely scenario.)