My pet theory of this is that you get 2 big benefits from RLVR:
1. A model learns how to write sentences in a way that does not confuse itself (for example, markdown files written by an AI tend to context-poison an AI far less than the same amount of text written by a human or by error messages).
2. A model learns how to do "business processes" - for example, that in order to write code, it needs to first read documentation, then write the code, and then run tests.
These are things that RL, if done right, is going to improve, and they definitely fe...