In open RLVR, “improvement” depends on the instrument — a small GRPO testbed separating what training optimizes, measures, and teaches
This post shows that the same open RLVR run can look like a success, a failure, or a reversal depending on the measurement instrument. It uses a small GRPO testbed that makes these discrepancies cheap and easy to inspect. Epistemic status: single-seed exploratory study on Qwen2.5-0.5B-Instruct / GSM8K with small held-out evals,...
Jun 157