JulesRoussel01 — LessWrong

In open RLVR, “improvement” depends on the instrument — a small GRPO testbed separating what training optimizes, measures, and teaches

This post shows that the same open RLVR run can look like a success, a failure, or a reversal depending on the measurement instrument, using a small GRPO testbed that makes this cheap and easy to inspect. Epistemic status: single-seed exploratory study on Qwen2.5-0.5B-Instruct / GSM8K with small held-out evals,...