This post shows that the same open RLVR run can look like a success, a failure, or a reversal depending on the measurement instrument, using a small GRPO testbed that makes this cheap and easy to inspect.
Epistemic status: single-seed exploratory study on Qwen2.5-0.5B-Instruct / GSM8K with small held-out evals. I am confident in the measurement failures and tentative on the rankings.
In open RLVR, whether training "improved" the model depends on which instrument you measure it with (the reward channel, the extractor, or the decoding regime), and changing the instrument can make the same run look like a success, a failure, or a reversal. This happens because in most open GRPO pipelines the reward, the metric, and the extractor are one function, so "accuracy went up" is partly a fact about the instrument, not only about the model. My approach was therefore to build a small open testbed that separates these instruments and audits each one. None of these phenomena are new; the contribution is making them visible and cheap to reproduce in one place. Two results stand out. First, a format-only reward raised format compliance from 0.438 to 1.000 while destroying accuracy, which fell from 0.228 to 0.025: a clean instance of the known reward-hacking failure mode. Second, and less obviously, the most faithful extractor, last number (F1 of 0.938, against 0.813 for lenient and 0.473 for strict tag), turned out to be the worst reward for training accuracy, with a judge accuracy of 0.320 against 0.460 and 0.480.
Reinforcement learning can produce unintended behavior such as reward hacking, and reward hacking is not only a quality issue but can also be a safety issue. This is what MacDiarmid et al. claim and show in their paper "Natural Emergent Misalignment from Reward Hacking in Production RL". They seeded Claude models with concrete test-bypass hacks such as AlwaysEqual and sys.exit(0) and used "test pass" as the training reward. The model chose the cheat path to collect the reward, so it appeared to be getting better at passing the tests when it was in fact only getting around them. This is a clear demonstration of the phenomenon. However, the result presupposes that reward hacking can be detected, which requires a measurement channel that is independent of the reward channel. That measurement layer is what this post studies.
Reward hacking came up several times in my results, but in a very different context from MacDiarmid et al.'s paper. Their diagnostics target broader behaviors such as sabotage or deception, whereas my setting sits at an earlier layer. In their setup, the model was first given knowledge of concrete hacking strategies, and RL then selected those strategies because they produced reward. In my project I did not seed any explicit hacking strategy: the model simply optimized the reward channel it was given, and the proxy-gaming behavior emerged from the training setup itself. This work therefore studies the measurement layer and makes no claim about the downstream misalignment generalization that MacDiarmid et al. study. The open-stack reproduction from AISI, by Golechha, Black and Bloom, does not isolate this earlier measurement layer either: it reproduced MacDiarmid et al.'s result with fully open models and tooling, but its focus is still whether reward hacking generalizes into downstream misalignment on open models. The goal here is therefore not to reproduce either the Anthropic result or this open reproduction, but to make the earlier measurement and extractor problem explicit, controlled, and easier to audit before trusting the result.
Can a reward destroy existing competence?
In this first step, the answer is yes. The format-only run reached a perfect format reward, with format compliance rising from 0.438 to 1.000, while honest accuracy collapsed from 0.228 to 0.025, far below the original baseline. This is a clean, controlled instance of the known reward-hacking failure mode: the model learned exactly what the reward taught it, but the other metric did not stay stable. It was destroyed by the training, leaving a model that is worse at the actual task.
To answer this question, I tracked two complementary metrics, format compliance and accuracy. The extractor question is left to the next step, where I audit whether lenient extraction or last-number extraction is the more faithful way to measure the answer. This first step consists of three experiments: one with format compliance as the reward, one with lenient-extraction accuracy as the reward, and one combining both. Because the first two experiments showed that format was learned much faster than accuracy, I weighted accuracy more heavily than format in the combined reward, 0.7 against 0.3. In all three experiments, the evaluation logged the mean reward, the format compliance, the lenient accuracy, and the strict accuracy (which here means last-number accuracy). My main hypothesis was that the metric used as the reward would improve while the other metrics stayed stable, and that in the combined experiment the easier reward would dominate and suppress the other, so that only one metric would improve. The second part turned out to be wrong: under the combined reward, both metrics improved.
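The three reward channels can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the function names, the regular expressions, and the exact lenient rule (credit whenever the gold number appears anywhere in the completion) are assumptions. Only the `<answer>` tag convention and the 0.7/0.3 weighting come from the setup described in this post.

```python
import re

# Hypothetical patterns for this sketch; the real pipeline may differ.
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def format_reward(completion: str) -> float:
    """1.0 if the completion contains a non-empty <answer>...</answer> block."""
    match = ANSWER_TAG.search(completion)
    return 1.0 if match and match.group(1).strip() else 0.0


def lenient_accuracy_reward(completion: str, gold: float) -> float:
    """Assumed lenient rule: 1.0 if the gold number appears anywhere."""
    numbers = [float(m.replace(",", "")) for m in NUMBER.findall(completion)]
    return 1.0 if gold in numbers else 0.0


def combined_reward(completion: str, gold: float,
                    w_acc: float = 0.7, w_fmt: float = 0.3) -> float:
    """Weighted sum for the combined experiment (0.7 accuracy, 0.3 format)."""
    return (w_acc * lenient_accuracy_reward(completion, gold)
            + w_fmt * format_reward(completion))


if __name__ == "__main__":
    tagged_wrong = "I am not sure. <answer>7</answer>"
    untagged_right = "2 + 2 = 4, so the total is 4"
    # A tagged but wrong answer still earns the format share of the reward.
    print(format_reward(tagged_wrong), combined_reward(tagged_wrong, 4.0))
    # An untagged correct answer earns only the accuracy share.
    print(format_reward(untagged_right), combined_reward(untagged_right, 4.0))
```

Keeping the two components as separate functions is what makes it possible to log each one as its own metric, whichever one is used as the reward.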
After a preliminary parameter sweep to find a good trade-off between cost, time, and performance, I settled on the following configuration: 100 training problems from GSM8K, an evaluation on 50 examples every 20 steps during training, 8 generations per prompt, the Qwen2.5-0.5B-Instruct model, and a maximum completion length of 512 tokens. Completions are generated with a simple prompt: "Solve the math problem step by step. Wrap your final answer in <answer> tags, e.g. <answer>42</answer>." This prompt lets both accuracy and format compliance be evaluated on the same completion. The results of the first step are in Table 1:
Table 1 — Step 1: reward channel vs. the other metrics. Sampled eval, every 20 steps. The field logged as strict_accuracy is the last-number extraction.
| Experiment (reward) | Metric | Step 0 | Step 20 | Step 40 | Step 60 | Step 80 | Step 100 |
|---|---|---|---|---|---|---|---|
| Exp 1: Format | Mean reward (= format) | 0.438 | 0.967 | 0.998 | 1.000 | 1.000 | 1.000 |
| Exp 1: Format | Format compliance | 0.438 | 0.968 | 0.998 | 1.000 | 1.000 | 1.000 |
| Exp 1: Format | Last-number accuracy | 0.228 | 0.060 | 0.037 | 0.040 | 0.037 | 0.025 |
| Exp 2: Lenient | Mean reward (= lenient acc) | 0.377 | 0.468 | 0.417 | 0.450 | 0.442 | 0.498 |
| Exp 2: Lenient | Format compliance | 0.390 | 0.155 | 0.113 | 0.085 | 0.050 | 0.062 |
| Exp 2: Lenient | Last-number accuracy | 0.258 | 0.335 | 0.273 | 0.318 | 0.312 | 0.335 |
| Exp 3: Combined | Mean reward (combined) | 0.381 | 0.435 | 0.491 | 0.541 | 0.563 | 0.570 |
| Exp 3: Combined | Format compliance | 0.390 | 0.645 | 0.743 | 0.792 | 0.855 | 0.838 |
| Exp 3: Combined | Last-number accuracy | 0.258 | 0.223 | 0.268 | 0.275 | 0.280 | 0.325 |
As hypothesized, the metric used as the reward is the one that is learned. With format compliance as the reward (experiment 1), it grows from 0.438 to 1.000; with lenient-extraction accuracy as the reward (experiment 2), the reward grows from 0.377 to 0.498. The strict (last-number) accuracy in experiment 2 increases less, but it still rises from 0.258 to 0.335. The numbers also confirm that format is learned much better and faster than accuracy: by the step-20 evaluation the format reward has already jumped from 0.438 to 0.967, whereas accuracy improves slowly. What I did not expect is that in experiments 1 and 2 the metric left out of the reward function is destroyed: accuracy falls from 0.228 to 0.025 in experiment 1, and format compliance falls from 0.390 to 0.062 in experiment 2.
Experiment 3 combines both rewards, and here both metrics increase: last-number accuracy goes from 0.258 to 0.325 and format compliance from 0.390 to 0.838. This contradicts my hypothesis even more directly, because the two metrics do not merely both increase, they increase in much the same way as when each is trained alone. So no reward crowds out the other as I had expected. Instead, in this setting, the combined reward prevented the model from focusing on one metric and damaging the other, which is what happened in the two separate runs.
You might notice that experiments 2 and 3 have identical step-0 results, which follows from the reproducible way the experiments were built, while experiment 1 starts from slightly different values. The reason is hardware: experiment 1 was run on two T4 GPUs (T4x2), whereas the other two were run on an A100.
It is also important to note that the conclusions from these three experiments, like those that follow, have gaps: I evaluated only two metrics, and every experiment in this post was run with a single seed. These are real gaps, but the results remain interpretable for reasons I explain in the limitations section; my confidence is therefore not perfect, but it is sufficient to draw conclusions within this scope. The next logical step was to check whether what I called "strict accuracy" in this step, the last-number extraction that I treated as the more honest accuracy, really was the most faithful extraction, or whether that informal check, and with it my initial hypothesis, was wrong.
How do you know your extractor is honest?
To answer this question, I first needed to define the extractors to compare. The first is lenient, the one usually used in GRPO. The second is last number, which suggested itself when I analysed the completions from the first step. The third is strict tag, which seemed intuitively worth including: it takes the number inside the "answer" tag that the prompt asks for. My hypothesis before running the experiment was that last number would be by far the most honest, mainly because of what the first step showed. With lenient, the correct answer can appear anywhere in the model's intermediate computations without being its final answer, and strict tag finds no answer at all when the format is not respected, which the first step showed does not happen spontaneously and needs training.
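The three extractors described above can be sketched as follows. This is an illustrative reading of the descriptions in this post, not the repository code: the function names and the number pattern are assumptions, and the example completions are made up.

```python
import re
from typing import List, Optional

# Hypothetical patterns; they cover forms like 1800, 1800.00 and 18,000 only.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_numbers(text: str) -> List[float]:
    """Return every number found in the text, in order of appearance."""
    return [float(m.replace(",", "")) for m in NUMBER.findall(text)]


def lenient_correct(completion: str, gold: float) -> bool:
    """Lenient: correct if the gold number appears anywhere in the completion."""
    return gold in parse_numbers(completion)


def last_number(completion: str) -> Optional[float]:
    """Last number: the final number in the completion, or None."""
    numbers = parse_numbers(completion)
    return numbers[-1] if numbers else None


def strict_tag(completion: str) -> Optional[float]:
    """Strict tag: the number inside <answer>...</answer>, or None if absent."""
    match = ANSWER_TAG.search(completion)
    if match is None:
        return None
    numbers = parse_numbers(match.group(1))
    return numbers[-1] if numbers else None


if __name__ == "__main__":
    # Final answer is 3, but the sentence ends on another number.
    restated = "21 / 7 = 3 per day. Therefore, Mike is paying $3 every 7 days."
    # Final answer is 42, but 12 appears in the scratch work.
    scratch = "First 4 * 3 = 12, then 12 + 30 = 42. <answer>42</answer>"
    print(last_number(restated), strict_tag(restated), lenient_correct(restated, 3.0))
    print(last_number(scratch), strict_tag(scratch), lenient_correct(scratch, 12.0))
```

The first example is the kind of case where last number misses a correct answer and strict tag returns nothing; the second is the kind of case where lenient credits a number that was only scratch work.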
Evaluating these extractors is straightforward in principle: compare each extracted answer with the final answer the completion actually gives. This can be done manually by reading every completion, but using an LLM judge, here Claude Haiku 4.5, saves a great deal of time. The obvious objection is that the judge is also an LLM, so I did not trust it without checking it first. I manually labeled 50 completions in a CSV file with the final answer the model actually chose (None if no final answer was given, for example when 512 tokens were not enough and the reasoning was truncated), and asked Claude to label the same 50 completions. A simple comparison of the two sets of labels gave 96% agreement, which I considered sufficient for this simple labeling task, especially after inspecting the disagreements. For each disagreement I printed the completion, both labels, and the explanation the judge had written while labeling. It turned out that the two disagreements were attributable to my labels rather than to clear judge errors, for example because I had typed 3546 instead of 3456.
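The agreement check can be sketched as a simple label comparison that also returns the indices to inspect by hand. The function name and the toy labels below are hypothetical; the toy data is built only to reproduce the 48/50 count, and the second disagreement in it is invented for illustration.

```python
from typing import List, Optional, Tuple


def agreement_report(human: List[Optional[str]],
                     judge: List[Optional[str]]) -> Tuple[float, List[int]]:
    """Return the agreement rate and the indices where the labels differ.

    None stands for "no final answer given", e.g. a truncated completion.
    """
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("label lists must not be empty")
    disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
    rate = (len(human) - len(disagreements)) / len(human)
    return rate, disagreements


if __name__ == "__main__":
    # Toy labels: 50 items, two of which differ.
    human_labels: List[Optional[str]] = ["42"] * 50
    judge_labels: List[Optional[str]] = ["42"] * 50
    human_labels[7], judge_labels[7] = "3546", "3456"   # a typo-style mismatch
    human_labels[30], judge_labels[30] = None, "12"     # invented mismatch
    rate, to_inspect = agreement_report(human_labels, judge_labels)
    print(f"agreement = {rate:.2%}, inspect items {to_inspect}")
```

Returning the disagreement indices matters as much as the rate: it is the manual inspection of those items that tells you whether the judge or the human label was wrong.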
With the judge validated for this specific task, I generated 500 completions using the same configuration as in the first step. The judge identified the real final answer once per completion, and the three extractor outputs were then compared against this judged answer, giving 1500 extractor judgments in total. From these comparisons I computed precision, recall, and F1 for each extractor. Here are the results:
Table 2 — Step 2: extractor faithfulness. 500 completions → 1500 judgments. Judge = Claude Haiku 4.5. Judge validation (separate, n=50): agreement 48/50 = 96%; both disagreements traced to my own labels.
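| Extractor | Precision | Recall | F1 | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|
| lenient | 0.703 | 0.964 | 0.813 | 135 | 57 | 5 | 303 |
| last_number | 0.962 | 0.914 | 0.938 | 128 | 5 | 12 | 355 |
| strict_tag | 0.957 | 0.314 | 0.473 | 44 | 2 | 96 | 358 |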
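As a check on Table 2, precision, recall, and F1 can be recomputed directly from the confusion counts. The sketch below uses the standard definitions; the function name is my own, and the counts are copied from the table.

```python
from typing import Dict, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions; returns 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# (TP, FP, FN, TN) per extractor, copied from Table 2.
COUNTS: Dict[str, Tuple[int, int, int, int]] = {
    "lenient": (135, 57, 5, 303),
    "last_number": (128, 5, 12, 355),
    "strict_tag": (44, 2, 96, 358),
}

if __name__ == "__main__":
    for name, (tp, fp, fn, tn) in COUNTS.items():
        p, r, f1 = precision_recall_f1(tp, fp, fn)
        print(f"{name:12s} P={p:.3f} R={r:.3f} F1={f1:.3f} n={tp + fp + fn + tn}")
```

Each row sums to 500 completions, and TP + FN is 140 for all three extractors, which is consistent with the judge having marked the same 140 completions as correct in every comparison.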
The hypothesis is confirmed: last number is the most faithful extraction method with an F1 of 0.938, followed by lenient with 0.813 and, far behind, strict tag with 0.473. These results are what one would expect. Strict tag has a very low recall, 0.314, meaning many false negatives: when the answer tag is absent, it returns no answer at all. The first step showed that without training format compliance is only around 0.4 (0.390 to 0.438 at step 0), so a low strict-tag recall is unsurprising. For lenient the problem is the opposite: its precision, 0.703, is fairly low, reflecting a large number of false positives. As explained before, this also makes sense, since many numbers appear in the computations, and when the correct answer is a small number the probability that it appears somewhere in the completion is high.
This precision/recall asymmetry is consistent with Huang et al. (2025), "From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning", who show that rule-based verifiers can have high precision but low recall on format-varied answers, while model-based verifiers are more flexible but can also be exploited during RL training. My narrower contribution here is only to audit three simple extractors inside the same GRPO harness and then use that audit to interpret the later results.
Finally, last number being the most honest method was also expected. It is not perfect, though, and it still produces false negatives (recall of 0.914), which is easy to explain: when the model wraps up its answer, the last number is not always the final answer, especially in math problems where the conclusion restates part of the context, for example "Therefore, Mike is paying $3 every 7 days".
One legitimate question you might have is how last-number extraction can produce false positives at all. This is mainly due to the regular expression used in the extraction function: I tried to handle as many cases as possible, for example 1800, 1800.00, and 18,000, but other ways of writing a number exist, such as spelling it out in English or writing it as a division.
Another gap is the small number of completions, 50, used to validate the judge. However, I considered that for a task this simple, which only requires reading short completions, an advanced LLM like Claude is a legitimate choice, a view supported by the manual test and by the fact that the disagreements were labeling errors on my side rather than clear judge errors. I ran the judge validation mainly as good practice, which I think is essential whenever you rely on an LLM.
It is important to stress that "most honest" and "most faithful" are not general claims; they are specific to the context of my experiment. In a context where format and accuracy matter equally, things could look different: we saw that RLVR learns format extremely well, so with enough training format compliance could be trained to near perfection, as in experiment 1 of the first step, and strict tag could become much more reliable. More robust and faithful extraction methods also exist. For example, I used an LLM judge to extract the final answer in order to compare the three extractors, but an LLM could equally serve as the extractor itself and would probably be more honest. I did not choose it because, in my context, a method as faithful as last number is enough to interpret these experiments, so I traded off cost against performance. So when I say that last number is the most honest extraction method, the claim is limited to the context of my experiments and to the three methods evaluated here.
Is the most faithful measurer the best teacher?
In the first step, I used only the lenient method as the accuracy reward. The second step showed that the extraction methods do not perform equally and even allowed a faithfulness ranking between them. The natural follow-up question is: is the most faithful measurer the best teacher? My hypothesis was that the most faithful method, here last number, would be the best reward. The data showed the opposite, and checking for circularity and for the effect of the decoding regime is what changed my reading.
To test this, I ran four experiments: three training runs, each using one of the three extraction methods from the second step as the accuracy reward, plus an untrained baseline using the initial model. Otherwise they share the same generation configuration as the first and second steps. To avoid the hardware gap of the first step, all runs used the same GPU, an A100, and all logged the same four metrics every 20 steps during evaluation: accuracy under each of the three extraction methods, plus format compliance. After training (the baseline is evaluated as is), a judge computes accuracy on an independent test set of 50 completions, generated from the same GSM8K sample for all four experiments to keep the comparison consistent.
Using a fully independent judge to compute accuracy is important to avoid circularity. Scoring with last-number extraction, for instance, would favor the model trained with the last-number reward, and likewise for the other extractors. To keep the comparison fair, I again used Claude Haiku 4.5 as the judge, mainly because this exact model had already been validated in the second step on the task of extracting the final answer from generated completions. The judge's extracted answer is then compared with the GSM8K ground truth to compute the final accuracy. To make the final evaluation deterministic and separate from the sampled training-time metrics, I used greedy decoding, meaning that at every step the model emits its single most probable token.
As a secondary evaluation, we can also look at the final training-time evaluation, taken once training is finished, which uses sampled decoding, and score it with the most honest of the three extractors from the second step, last number. This view is only secondary, and two caveats apply. First, last-number extraction is not perfect, as the second step showed. Second, it is circular for the run whose reward is last-number accuracy, because the reward and the metric then use the same extractor. I accept this because the column serves only as an overview and the circularity affects one run out of four. Here are the results:
Table 3 — Step 3: is the most faithful measurer the best teacher? 4 runs, same GSM8K test sample, n=50. The two value columns use different decoding regimes (greedy for the independent judge, sampled for the training-time last-number metric); baseline sampled last-number = 0.265 (step-0 eval).
| Run (accuracy reward) | Judge accuracy (greedy decoding) | Last-number accuracy (sampled decoding) |
|---|---|---|
| Baseline (untrained) | 0.460 | 0.265 |
| Lenient | 0.460 | 0.343 |
| Last_number (most faithful) | 0.320 | 0.325 |
| Strict_tag | 0.480 | 0.282 |
The results are striking. The baseline scores 0.460, so strict tag at 0.480 is roughly tied with it (a difference of one completion at n=50), and lenient is unchanged at 0.460. Last number, the most honest extractor, has the lowest accuracy, 0.320, which directly contradicts my hypothesis. Under greedy judge scoring, the last-number-trained model is therefore the worst and falls below baseline, and under sampled last-number scoring it is still not the best, despite the circularity advantage.
Why? I manually compared the completions of the last-number-trained model with the baseline completions to check whether something was broken, such as missing or truncated answers or a judge confused by the format. The last-number-trained model was still writing step-by-step reasoning like the base model, and almost all the cases where the baseline was right and the trained model was wrong came down to small mistakes such as a wrong addition or multiplication or a forgotten percentage, not to obvious format junk. How is this possible, and is it really the fault of the reward's extraction method rather than of the model itself? A plausible interpretation is that sampled GRPO improves average sampled behavior without protecting the greedy path. You may have noticed that accuracy is higher under the greedy regime than under the sampled regime. The cleanest comparison is last_number under greedy decoding versus last_number under sampled decoding: the untrained model scores around 0.440 with last_number under greedy decoding, while the sampled last_number metric starts around 0.265. This is also an instance of the documented pass@1/greedy vs pass@k/sampled divergence discussed in Yue et al. (the "Limit of RLVR" paper). GRPO trains on sampled completions, so it may improve the probability distribution as a whole, whereas greedy decoding follows only the single most likely token at each step. Even after training, the greedy path is therefore not guaranteed to be the best one, especially since the models are trained on only 100 problems and are far from optimal. Choosing the greedy token can then introduce a few small mistakes at specific steps, and these accumulate into wrong final answers, as observed here.
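The interpretation above can be illustrated with a toy example in which sampled accuracy rises while greedy accuracy falls. Everything here is invented for illustration: the distributions are made-up probabilities over whole answers, whereas real greedy decoding takes the argmax token by token, so this is a simplification of the mechanism rather than a model of the actual runs.

```python
from typing import Dict, List

Distribution = Dict[str, float]  # answer -> probability of sampling it


def sampled_accuracy(dists: List[Distribution], gold: List[str]) -> float:
    """Expected accuracy when one answer is sampled per problem."""
    return sum(d.get(g, 0.0) for d, g in zip(dists, gold)) / len(gold)


def greedy_accuracy(dists: List[Distribution], gold: List[str]) -> float:
    """Accuracy when the single most probable answer is always chosen."""
    hits = sum(max(d, key=d.get) == g for d, g in zip(dists, gold))
    return hits / len(gold)


GOLD = ["18", "9"]

# Made-up answer distributions before training.
BEFORE = [
    {"18": 0.40, "20": 0.35, "24": 0.25},  # mode is the gold answer
    {"7": 0.50, "9": 0.30, "11": 0.20},    # mode is wrong
]

# Made-up distributions after training: the gold answers gain probability
# mass, but in the first problem a wrong answer gains even more.
AFTER = [
    {"18": 0.44, "20": 0.46, "24": 0.10},
    {"7": 0.45, "9": 0.42, "11": 0.13},
]

if __name__ == "__main__":
    for name, dists in (("before", BEFORE), ("after", AFTER)):
        print(name,
              f"sampled={sampled_accuracy(dists, GOLD):.2f}",
              f"greedy={greedy_accuracy(dists, GOLD):.2f}")
```

In this toy setting the probability of the correct answer goes up on both problems, so a sampled metric improves, yet the mode of the first problem flips to a wrong answer, so the greedy metric gets worse. Whether this is what happened in the last-number run is exactly what the single seed cannot settle.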
One caveat, as for the first step and the project in general, is that the experiments were run with a single seed, so another seed might give different results. But if this run is a genuine instance and not noise, it is enough to refute the universal claim that a faithful measurer is always a good teacher, although multi-seed runs are needed to confirm that the instance is reproducible. The secondary evaluation, sampled decoding scored with last number, adds support: the ranking flips completely compared with the greedy judge evaluation. Strict tag is now the worst with 0.282, followed by last number with 0.325, and lenient is the best with 0.343. This gives some support to lenient performing best in this sampled overview, but the robust point is that last_number, the most faithful extractor, is not the best reward in either regime.
There is another reward-hacking example here. In the step-3 run with strict tag as the reward, strict-tag accuracy in the evaluation rises from 0.100 to 0.270, which suggests the model is improving as intended. Looking closer, however, the strict-tag reward has a direct effect on format compliance, which rises from 0.417 to 0.890. This is explained by the second step: the false negatives of strict-tag extraction depend on whether the answer tags are present, so when format compliance increases, strict-tag accuracy mechanically increases with it. This is the same kind of extractor problem documented by Huang et al. and audited in my second step: strict tag can create false negatives when the expected format is absent, while lenient can over-credit scratch-work numbers. Format compliance more than doubled, and so did strict-tag accuracy. Even though strict-tag accuracy increased by a larger factor, this sharply reduces how much of the gain can be attributed to real improvement in mathematical reasoning: much of it looks like a fake improvement, a known reward-hacking failure mode.
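One rough way to separate the two effects is to divide strict-tag accuracy by format compliance at each evaluation step. This is my own back-of-the-envelope decomposition, not a metric logged by the runs. It assumes that every completion counted as strict-tag correct is also counted as format compliant, so that strict-tag accuracy equals format compliance times accuracy among tagged completions. The numbers are copied from Table 5.

```python
from typing import List

STEPS: List[int] = [0, 20, 40, 60, 80, 100]
STRICT_TAG_ACC: List[float] = [0.100, 0.177, 0.212, 0.260, 0.273, 0.270]
FORMAT_COMPLIANCE: List[float] = [0.417, 0.738, 0.860, 0.887, 0.873, 0.890]


def accuracy_given_tag(strict_acc: float, format_rate: float) -> float:
    """Accuracy among tagged completions, under the stated assumption."""
    return strict_acc / format_rate


def growth_factor(values: List[float]) -> float:
    """Ratio between the last and the first logged value."""
    return values[-1] / values[0]


if __name__ == "__main__":
    for step, acc, fmt in zip(STEPS, STRICT_TAG_ACC, FORMAT_COMPLIANCE):
        print(f"step {step:3d}: acc | tagged = {accuracy_given_tag(acc, fmt):.3f}")
    print(f"strict_tag growth = {growth_factor(STRICT_TAG_ACC):.2f}x")
    print(f"format growth     = {growth_factor(FORMAT_COMPLIANCE):.2f}x")
```

Under that assumption, the accuracy among tagged completions moves far less than the headline strict-tag accuracy, which is the sense in which the gain is mostly carried by format.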
It is also important to note that, as said before, the results under the greedy regime are not directly comparable with the earlier results from the sampled regime. With a baseline of 0.460, none of the three trained models shows a real gain in accuracy: strict tag gains 0.020, which at this sample size is a single completion out of 50, and last number actually goes down. This does not contradict the first-step results, which suggest that accuracy can be trained and does increase under sampled evaluation. More precisely, the 0.258 to 0.335 result comes from Exp 2, in the sampled regime with last_number as the metric, not from greedy judge scoring. As said above, training in the sampled regime shifts the probability distribution as a whole, which after only 100 training problems may raise the average over that distribution without changing the single most likely token. Greedy decoding reads only the argmax, so training in the sampled regime may simply not have gone far enough for its effect to show up under greedy decoding.
Limitations
The sections above already discuss several gaps and explain why, in the specific context of this project, they do not erase the results, which remain interpretable. Some limitations still deserve a separate discussion:
The single seed. This is a real limitation that I cannot ignore, and it comes mostly from limited GPU resources. It limits reproducibility claims but does not erase the within-run phenomena, for several reasons. For the third step, one seed is enough to show that the most honest measurer was not the best teacher in this run, and this question needs less generalisation than a frequency claim, since one counterexample is enough to break a universal assumption. Multi-seed runs would still be needed to know how reproducible this counterexample is. For the second step, one seed is also defensible, since the behavior and limits of each extraction method become clear from the results and from manual inspection of the completions. The important point is not only that the ranking came out this way once, but that the limits of each extractor are structural: even if the scores fluctuated across seeds, the interpretation of each extractor would still follow from its mechanism. The single-seed limitation matters most for the first step. There, however, each experiment uses 100 problems with 8 completions per problem, which reduces the measurement noise within the run, and the patterns of improvement or regression are consistent. The clearest example is experiment 1, with format compliance as the reward: format is learned very fast, going from 0.438 to 0.967 at step 20 and 0.998 at step 40 before converging steadily to 1, while accuracy follows the inverse pattern, going from 0.228 to 0.060 at step 20 and 0.037 at step 40. There are no large fluctuations on the way to convergence.
The mechanism is also understandable: once the format reward saturates, almost every completion gets the same reward, so the useful reward difference inside the group disappears and the advantage signal becomes close to zero. In that situation, the model has already moved toward the proxy, but there is no longer a useful signal protecting accuracy, which helps explain why the collapse is not just a noisy point. Caution still applies, especially for accuracy, where the increase is slower and therefore harder to see, but these points support the results within the scope of this project. Even though these explanations hold in the context and scope I defined, multi-seed runs would of course be very valuable: they could reveal unexpected behaviors and give a more robust confirmation of my results and interpretations.
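The saturation argument can be made concrete with the usual group-normalised GRPO advantage, in which each reward is centred on the group mean and divided by the group standard deviation. The sketch below assumes that form; the function name, the epsilon value, and the use of the population standard deviation are illustrative choices, not the exact settings of the training runs.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Early in training: 8 completions for one prompt, half of them rewarded.
    mixed = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
    # After the format reward saturates: every completion gets the same reward.
    saturated = [1.0] * 8
    print("mixed:    ", [round(a, 3) for a in group_advantages(mixed)])
    print("saturated:", group_advantages(saturated))
```

With identical rewards in the group, every advantage is exactly zero, so the policy gradient from that prompt carries no information about which completions were better on the task metric.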
The small n = 50. This occurs twice. The first time is in the second step, when validating the judge by comparing its labels with mine. As I said before, 50 was enough in this context, since Claude Haiku 4.5 is an advanced model that is well able to do this task, as supported by the initial 96% agreement and by the inspection showing that the two disagreements were attributable to my labels rather than to clear judge errors. The second time is in the third step, when using Claude as a judge for the final test under greedy decoding. The purpose there was to answer the question "is the most faithful measurer the best teacher?", and for that purpose the gap between the runs, together with the manual comparison of the last-number-trained completions with the baseline ones, was enough. By contrast, for the accuracy scores of 0.460, 0.460 and 0.480 (baseline, lenient reward and strict-tag reward respectively), the small sample and the single seed mean we cannot conclude anything about stability or improvement, especially because 0.020 corresponds to a single completion out of 50. But again, that is not the purpose of this project.
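To make the one-completion granularity concrete, one option is to attach a binomial confidence interval to each judge accuracy. The sketch below uses the Wilson score interval, which is one common choice and not something computed in the original runs. The success counts are inferred from the reported accuracies times n = 50, and the 95% level (z = 1.96) is an assumption.

```python
from math import sqrt
from typing import Dict, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Correct completions out of 50, inferred from the Table 3 judge accuracies.
JUDGE_CORRECT: Dict[str, int] = {
    "baseline": 23,      # 0.460
    "lenient": 23,       # 0.460
    "last_number": 16,   # 0.320
    "strict_tag": 24,    # 0.480
}

if __name__ == "__main__":
    for run, k in JUDGE_CORRECT.items():
        low, high = wilson_interval(k, 50)
        print(f"{run:12s} {k}/50 = {k / 50:.3f}  95% CI [{low:.3f}, {high:.3f}]")
```

At n = 50 each interval extends well over 0.1 on either side of the point estimate, which is another way of seeing why the 0.460 versus 0.480 difference cannot be interpreted.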
The Qwen2.5-0.5B-Instruct model. The 0.5B size could be seen as a gap, but in my specific context and with the interpretations I gave, it does not invalidate the project. The claim is pipeline-level, but its strength and direction may shift with scale. A larger model might have given more accurate results and possibly different behavior, but as long as the effects are large enough to interpret, which was the case here, model size is not the deciding factor. Before running my experiments, when choosing the GRPO training configuration, I ran a preliminary parameter sweep that varied each parameter one at a time, using lenient accuracy as the reward and the sampled evaluation regime. In that sweep, lenient accuracy went from 0.365 before training to 0.477 at the end for Qwen2.5-1.5B-Instruct, and from 0.377 to 0.498 for Qwen2.5-0.5B-Instruct. The main difference was in format compliance: with the 1.5B model it did not drop but improved, from 0.230 before training to 0.398, apparently plateauing around 0.400. It already reaches 0.375 at step 20, peaks at 0.430 at step 80, and ends at 0.398 once training on the 100 problems is finished. So the 1.5B sweep does not cancel the reward-hacking interpretation from the first step: the mechanism I am pointing to, reward saturation leading to almost zero advantage and no protection for the task metric, is not model-specific, even if its strength and direction may shift with scale. I have moderate confidence in the recall-convergence interpretation, but a multi-seed run with matched decoding would be the simplest way to confirm or reject it. I would also predict that the inversion weakens as base capability rises, but this is untested.
Open questions
Although this is outside the scope of this experiment, these results point to questions worth exploring. Does an even stronger model than Qwen2.5-1.5B-Instruct raise this plateau? Does switching the reward to format have the same effect on accuracy? More generally, to what extent do the results change with scale, and does the effect invert?
Supporting tables
Table 4 — Exp 1 (format reward): the collapse side by side
| Step | Format compliance | Last-number accuracy |
| --- | --- | --- |
| 0 | 0.438 | 0.228 |
| 20 | 0.968 | 0.060 |
| 40 | 0.998 | 0.037 |
| 60 | 1.000 | 0.040 |
| 80 | 1.000 | 0.037 |
| 100 | 1.000 | 0.025 |
Table 5 — Step-3 strict_tag run: the fake improvement
| Step | strict_tag acc (reward) | Format compliance |
| --- | --- | --- |
| 0 | 0.100 | 0.417 |
| 20 | 0.177 | 0.738 |
| 40 | 0.212 | 0.860 |
| 60 | 0.260 | 0.887 |
| 80 | 0.273 | 0.873 |
| 100 | 0.270 | 0.890 |
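One rough way to read Table 5 is to divide strict_tag accuracy by format compliance at each step. This is only a sketch under an assumption that may not hold exactly: that every completion counted as correct by strict_tag is also counted as format-compliant, so the ratio approximates accuracy among completions that carry the answer tag. The variable names are illustrative.

```python
# Values copied from Table 5 (sampled eval, strict_tag reward run).
steps = [0, 20, 40, 60, 80, 100]
strict_tag_acc = [0.100, 0.177, 0.212, 0.260, 0.273, 0.270]
format_compliance = [0.417, 0.738, 0.860, 0.887, 0.873, 0.890]

# Approximate accuracy among tagged completions (assumption: correct implies tagged).
ratios = [acc / fmt for acc, fmt in zip(strict_tag_acc, format_compliance)]

# Growth factors between step 0 and step 100.
acc_growth = strict_tag_acc[-1] / strict_tag_acc[0]
fmt_growth = format_compliance[-1] / format_compliance[0]

for step, ratio in zip(steps, ratios):
    print(f"step {step:3d}: strict_tag acc / format compliance = {ratio:.3f}")
print(f"strict_tag acc grew x{acc_growth:.2f}, format compliance grew x{fmt_growth:.2f}")
```

Under that assumption, strict_tag accuracy grows by a factor of about 2.7 while format compliance grows by about 2.1, and the ratio only moves from about 0.24 to about 0.30. Most of the reported gain therefore tracks the tag appearing more often rather than the answers becoming more correct.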
Table 6 — Step-3 judge run: extractor accuracies per trained model. Final independent evaluation, greedy decoding, n=50. Each 0.02 step = one completion, so treat these as a coarse illustration, not precise measurements.
Code: https://github.com/JulesRoussel2001/grpo-reward-vs-eval