My pet theory of this is that you get two big benefits from RLVR:
1. A model learns to write sentences in a way that does not confuse itself (for example, markdown files written by an AI tend to context-poison a model far less than the same amount of human-written text or error-message output).
2. A model learns how to do "business processes" - for example, that to write code, it first needs to read the documentation, then write the code, then run the tests (sketched below).
These are things that RL, done right, will improve, and they do feel like they explain much of the difference between, say, GPT-4 and GPT-5.
I expect these effects can have a fairly "general" impact (for example, an AI learning how to work with its own notes), but the biggest improvements would be completely non-generalizable (for example, heuristics about where to place functions in code).
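To make point 2 concrete, here is a minimal sketch of what a verifiable reward can look like for the coding case, assuming pytest as the test runner and a per-task working directory (both are my assumptions, not any specific lab's setup). The point is that the model only gets rewarded when the full process - read the docs, write the code, run the tests - actually ends in passing tests, which is exactly the pressure that pushes it to internalize the workflow.

```python
import subprocess

def verifiable_reward(workdir: str) -> float:
    """Reward 1.0 iff the project's test suite passes in `workdir`.

    `pytest -q` is an assumed test command; a real RLVR setup
    plugs in whatever verifier the task defines.
    """
    try:
        result = subprocess.run(
            ["pytest", "-q"],
            cwd=workdir,
            capture_output=True,
            timeout=300,  # don't let a hung test run stall the rollout
        )
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0
```

Because the reward is binary and only fires at the end, any intermediate habit that raises the pass rate (reading docs first, keeping notes that don't confuse the model later) gets reinforced as a side effect.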