Thanks for running this! I think this is pretty convincing re your original argument that my results were contaminated by bad inference setups, and that correspondingly my paper doesn't substantiate some of its claims and conclusions with good evidence. In the past couple months I've gotten some other feedback about the paper as well, about the grader inflating some illegibility scores for reasoning that's legible but hard to follow (as far as I could tell, this isn't as big a problem as the one you point out here, however).
In retrospect I think I was being much shoddier than I'd like while writing it, mostly because it was my first time writing results up as a paper (as opposed to a blog post), and I subconsciously felt all the claims should be pretty substantial and clean. There was enough uncertainty around which providers had the right setups that the claims felt defensible to me, when obviously I should have been aiming for a much higher bar than "defensible to reviewers" when writing up results. Rather than just using the OpenRouter labels, I should have been running capability evaluations with different providers to see which ones were serving the model properly. I'm not entirely sure what to do with the paper.
I think some other results in the paper are interesting even with the providers problem, such as Figures 3 and 5, which argue that the models still get some use out of the illegible reasoning (and I think the Qwen results aren't as provider-sensitive, though I'm not confident here). But if I had been reporting results in a more calibrated way at the time, this should have been a mildly interesting blog post about a subset of the results. I apologize for worsening the epistemic environment by not making that choice.
Some more specific thoughts:
(a) the illegibility observed with Targon was not helping the model reach the right answer
I agree this is evidence, but I'm not sure it's strong evidence. For example, illegible reasoning may help the model reach the right answer while still being worse than other, more legible reasoning the model could have used. If a model can make better use of some text for reasoning than we have the capacity to monitor, that's still pretty concerning.
Figure 5 of the paper was an attempt to look at this: we generated 100 samples each for 100 questions and checked whether there was an overall pattern of correctness going up for more legible CoTs. There was very little overall correlation, though many individual questions saw performance go up or down with more legible CoTs.
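For concreteness, the pooled version of that check can be sketched as follows. This is a hypothetical reconstruction, not the actual analysis code; the function name and record format are made up for illustration.

```python
# Pooled correlation between CoT legibility and answer correctness,
# in the spirit of the Fig. 5 analysis. All names here are hypothetical.
from statistics import mean

def legibility_correctness_correlation(samples):
    """samples: list of (question_id, illegibility_score, is_correct) tuples.

    Returns the Pearson correlation between legibility (negated
    illegibility, so higher = more legible) and correctness, pooled
    over all samples. Near zero = "very little overall correlation."
    """
    xs = [-s[1] for s in samples]                 # higher = more legible
    ys = [1.0 if s[2] else 0.0 for s in samples]  # 1 = correct
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:  # degenerate: no variation in one variable
        return 0.0
    return cov / (vx * vy) ** 0.5
```

A per-question version of the same idea (grouping samples by `question_id` before correlating) would surface the questions that individually go up or down.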
I do think we have much better evidence now to go off of, however. Removing the arguments related to my paper, I think Bronson's claims still hold up pretty well.
I agree this is evidence, but I'm not sure it's strong evidence. For example, illegible reasoning may help the model reach the right answer while still being worse than other, more legible reasoning the model could have used. If a model can make better use of some text for reasoning than we have the capacity to monitor, that's still pretty concerning.
Ah, I agree with this and it is what I was trying to say in the (ambiguously phrased) line you quoted from the post.
By "not helping the model reach the right answer," I meant "not helping relative to what it could achieve if constrained to be legible (say, by rejection sampling using your judge)," rather than "not helping relative to not performing that portion of the reasoning at all (say, by early-exiting the CoT when it starts to become illegible)." While I do suspect that the latter claim is also approximately true for the Targon gibberish, that's just a hunch and I did not attempt to test that hypothesis in this post.
Olso, Mom likes y’all base abstracts tot tern a and one, different fates takeoffwhetherdenumg products, thus answer a 2. Thereforexxx after lengthy reasoning, the number of possible organic products is PHÂN Laoboot Answer is \boxed2
This is a brief research note describing the results of running @Jozdien's research code for the paper "Reasoning Models Sometimes Output Illegible Chains of Thought" using the Novita provider on OpenRouter.
tl;dr:
background
In this comment, I wrote (emphasis added):
Jozdien replied to this comment with the following (emphasis added):
I found this surprising, for the reasons described in my original comment.
Since this result kept coming up in subsequent discussion (see e.g. here), I figured I ought to actually re-run Jozdien's original experiment with a different provider.
the setup
review of the original codebase
I used Jozdien's codebase, which can be found at https://github.com/Jozdien/cot_legibility.
The fork which I used to produce these results is at https://github.com/nostalgebraist/cot_legibility/tree/nost.
As far as I can tell, the R1 results in the paper -- or at least the results used for Fig. 2 and Table 2 -- correspond to the directory `streamlit_runs/20251014_190506_R1_gpqa` in the original repo[1]. The configuration for this directory specifies running R1 on the first 100 questions from GPQA Diamond, with 3 samples drawn per question, followed by legibility grading with gpt-4o.
The OpenRouter providers listed in the config are `targon/fp8` and `Nebius`, but in practice requests were always routed to Targon rather than Nebius, as can be confirmed by reviewing the `inference.json` file in the run directory.

If I understand correctly, specifying `/fp8` means that an fp8-quantized version of R1 was used.

Jozdien also ran several other variants of this experiment; see footnote 1 of this post for details.
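As I understand OpenRouter's provider-routing API (worth checking against the current docs), pinning a provider amounts to adding a `provider` object to the chat-completions request body. A minimal sketch, with the helper name and prompt being my own invention:

```python
# Hypothetical helper showing how a request can be pinned to specific
# OpenRouter providers. Body is for POST
# https://openrouter.ai/api/v1/chat/completions (plus an Authorization
# header with an OpenRouter API key).
def build_r1_request(prompt: str, providers: list[str]) -> dict:
    return {
        "model": "deepseek/deepseek-r1",
        "messages": [{"role": "user", "content": prompt}],
        "provider": {
            "order": providers,        # e.g. ["Novita"]
            "allow_fallbacks": False,  # fail rather than silently reroute
        },
    }
```

With `allow_fallbacks` left at its default, requests can route to whichever listed (or other) provider is available, which is presumably how a config listing both Targon and Nebius ended up always hitting Targon.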
what i did
As of this writing, only two providers offer R1 on OpenRouter: Novita and Azure.
I first ran an equivalent of Jozdien's experiment, using his code and his config but with the provider specification removed (since neither of the providers listed in his config still offer R1). Here, some requests routed to Novita and some to Azure.
After reviewing the results, I found that responses from Azure frequently had empty final answer fields, suggesting that they ran out of room while still writing CoT; while this happened occasionally in Jozdien's original Targon run, it happened much more often with Azure. Since this is clearly undesirable, and since I had found Novita to be relatively reliable in past work with open models, I decided to re-run with Novita only.
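The kind of check I mean can be sketched like this. The record schema here is hypothetical; the actual `inference.json` format in the repo may differ.

```python
# Count, per provider, how often the final-answer field came back empty
# (a proxy for "ran out of room while still writing CoT").
# The "provider"/"answer" keys are assumptions about the record format.
import json
from collections import Counter

def empty_answer_rates(path: str) -> dict[str, float]:
    with open(path) as f:
        records = json.load(f)
    totals, empties = Counter(), Counter()
    for rec in records:
        provider = rec.get("provider", "unknown")
        totals[provider] += 1
        if not (rec.get("answer") or "").strip():
            empties[provider] += 1
    return {p: empties[p] / totals[p] for p in totals}
```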
Full results with Novita, including transcripts, can be found in this directory.
results
R1 via Novita is dramatically more legible than R1 via Targon (first 5 cols below), and also better at performing the task (last 3 cols).
(Note: the version of GPQA Diamond used here differs from the usual version -- as the paper puts it, "we remove the answer choices to make the questions harder." So the correctness results here are not comparable with reported benchmark scores for R1.)
|  | Illeg. (avg) | Illeg. (std) | Illeg. >=3 | Illeg. >=5 | Illeg. >=7 | Correct | Partially correct | Incorrect |
|---|---|---|---|---|---|---|---|---|
| Novita (my data) | 2.280 | 0.755 | 33% | 0.5% | 0% | 40.5% | 30.8% | 28.8% |
| Targon (Jozdien's data) | 4.305 | 2.135 | 79.3% | 34.9% | 20.3% | 36.6% | 23.1% | 40.3% |
| Reported in paper | 4.30 | 2.13 | 70% | 45% | 29.4% | -- | -- | -- |
I am unsure why I was unable to reproduce the precise "Illeg. >= X" percentages reported in the paper; they did not seem to match what I computed for any of the run directories shared in the original repo. Nevertheless, which versions of those numbers one selects as "the results of the original experiment" does not affect the conclusion that those numbers are all much lower with Novita.
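For reference, here is how I would compute those summary columns from a run's per-sample grader scores. This is a hypothetical sketch rather than the repo's analysis script, and details like sample vs. population std could account for small discrepancies (though not the large ones discussed above).

```python
# Summary stats over a run's illegibility scores (one grader score per
# sample). Function name and the choice of population std are my own.
from statistics import mean, pstdev

def illegibility_summary(scores: list[float]) -> dict:
    n = len(scores)
    return {
        "avg": mean(scores),
        "std": pstdev(scores),  # population std; the paper may use sample std
        **{f">={t}": sum(s >= t for s in scores) / n for t in (3, 5, 7)},
    }
```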
As far as I can tell at a glance, none of the Novita CoTs are "illegible" in any meaningful sense.
No examples received a score above 5, and only three examples scored 5. Here are the judge explanations for those three:
This is a far cry from the complete gibberish that was frequently observed in the Targon experiment. If you like, you can review the full R1 outputs and 4o judgments here and here, respectively.
Here is the boxplot I obtained when running the analysis script on the Targon data, closely matching what is shown for R1 in Fig. 2:
And here is the corresponding boxplot with Novita:
breaking down correctness
The following plot shows what happens when I:
(An x-axis label like "a-b" should be read "original score was >= a and < b." Error bars are 95% bootstrap CIs.)
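The error-bar computation is the standard percentile bootstrap; a minimal self-contained sketch (not the actual analysis script):

```python
# 95% percentile-bootstrap CI for the mean of 0/1 correctness outcomes
# within one illegibility bucket. Parameters are illustrative.
import random

def bootstrap_ci(outcomes: list[int], n_boot: int = 2000, seed: int = 0):
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement n_boot times; take the 2.5th/97.5th
    # percentiles of the resampled means.
    means = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]
```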
Novita is more often correct (and less often incorrect) on average, and in particular, it performs much better than Targon in the cases where the original Targon CoT is illegible.
A substantial majority of the highly illegible (>=7) results in the original data -- 42/60 or 71% -- resulted in incorrect answers.
On GPQA questions for which the average illegibility score over the 3 original samples was >=7, the incorrectness rate is 78%, versus only 48% for the same set of questions with Novita.
I interpret this as evidence that (a) the illegibility observed with Targon was not helping the model reach the right answer, and (b) Targon did not configure their R1 deployment "more correctly" than Novita, and indeed the opposite seems likely to be the case[2].
The evidence on correctness here is useful, in my opinion, because it means we don't have to argue about which providers were quantizing R1 in which ways. As I said above, as far as I can tell both Targon and Novita used fp8; but if we set that question aside and just look at the results, it is clear that Novita's R1 is simply more capable.
Unless you think Novita is running a different, superior model and not telling anyone (?!), this seems to indicate that Novita's R1 is "less damaged" relative to the original checkpoint than Targon's.
For instance, the illegibility summary statistics for that run are mean 4.305 and std 2.135, which is close to what is reported in Table 2 (4.30 +/- 2.13).
There are also two other R1 GPQA run directories in the `streamlit_runs` directory: `streamlit_runs/20251014_201056_R1_gpqa` and `streamlit_runs/20251012_225607_R1_gpqa`. The former is similar to the one I focus on in the main text, while the latter has a somewhat different config (all GPQA Diamond questions rather than just 100, only 1 sample per question, no configuration of providers or logging of providers used). I also ran the equivalent of this variant config with Novita; results are available here and are broadly similar to what I report in the main text.
Unfortunately, the only "gold standard" available here would be an official DeepSeek deployment, and R1 isn't available from the DeepSeek API anymore.