The point at which the pass@k curves before and after RLVR training intersect seems remarkably stable for any given type of task (benchmark). It barely moves across multiple variations on GRPO (some of which mitigate the loss of entropy it suffers from), or across training runs ranging from 150 to 450 steps (Figure 7). If anything, the intersection point moves lower with more training, suggesting that the base model's performance at the point where it crosses a weak RLVR model may remain an upper bound on the performance of a much stronger RLVR model. Since the reliability of the base model is not yet very high even at pass@400 for many important tasks, this kind of bound on capabilities would be crippling for RLVR's potential.
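For reference, pass@k here is the usual unbiased estimator from the Codex paper: given n ≥ k samples of which c pass, it estimates the probability that at least one of k samples passes. A minimal sketch (the function name and example counts are my own, purely illustrative):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    computed as a numerically stable product."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 400 samples with 25 correct: pass@1 vs pass@400
print(pass_at_k(400, 25, 1))    # ~0.0625
print(pass_at_k(400, 25, 400))  # 1.0
```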
DeepSeek-Prover-V2 on MiniF2F improves from 86.6% (pass@1024) to 88.9% (pass@8192). Kimina-Prover also reports its best performance at pass@8192. What makes proving so special? This seems to contradict the claim that the crossover point is stable for any given type of task. Does it imply that proving is actually under-trained in base models, so that RLVR can consistently improve performance?
I've found the original paper behind this chart: https://arxiv.org/pdf/2503.11926v1
> We use prompted GPT-4o models to monitor a frontier reasoning agent, an agent in the same family as OpenAI o1 and o3-mini. During training, the agent discovered two hacks affecting nearly all training environments:
The model is in the same family as o1 and o3-mini. Maybe o3, but that's not confirmed.