We are curious if large language models behave consistently when user prompts contain typos. To explore this, we ran a small experiment injecting typos into BigCodeBench and evaluated several Claude models under increasing noise levels. As the typo rate rose to 16%, Opus’ accuracy dropped by 9%. Surprisingly, Haiku’s accuracy increased by 22%.
This post examines this unexpected “typo uplift” phenomenon and explores why noise appears to help certain models.
Do Typos Make Haiku Try Harder?
We first hypothesized that Haiku's accuracy increased because harder-to-read text makes Haiku think harder. This parallels findings in humans that difficult fonts improve retention, because readers must expend more effort. As a proxy for effort, we plotted the number of output tokens generated by both models[1]. Contrary to our hypothesis, the number of output tokens decreased with typo rate.
Typos don't make models think harder. As typo rates increase, the output lengths of Haiku and Opus go down.
The Anomaly is Haiku-Specific
We then tested whether other small models show this typo uplift anomaly. We found that both Haiku 3.5 and 4.5 show increased accuracy as typos increase, while smaller models from the Gemini, GPT, Qwen, and Llama families do not.
Accuracy of Haiku models increases with typo rate, while other models don't see an increase.
The Anomaly is Benchmark-Specific
We then tested whether the typo uplift anomaly holds on other benchmarks. We chose BBH and GPQA Diamond because Haiku already finds them moderately difficult without injected typos. Here, we no longer observed the typo uplift anomaly: Haiku's accuracy decreased as the typo rate increased.
Haiku's performance on BIG-Bench Hard and GPQA does not increase with typo rate, so the effect is specific to BigCodeBench.
The Culprit
Going back to our eval logs, we found that for a single BigCodeBench prompt, Haiku and Opus often generated multiple code blocks[2]. As the typo rate increases, Haiku shifts its behavior ~20% of the time from generating multiple code blocks to generating a single code block.
Further investigation revealed that our grading harness extracted only the last code block, skipping any code blocks generated earlier. We then modified the harness to evaluate each of the code blocks.
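For reference, the all-blocks grading strategy can be sketched like this. This is a simplified, hypothetical version of our harness: a plain regex over triple-backtick fences, and a `passes_tests` callback standing in for the real unit-test runner.

```python
import re

# Match fenced code blocks, with or without a language tag after the fence.
CODE_BLOCK_RE = re.compile(r"```(?:[\w+-]*)\n(.*?)```", re.DOTALL)

def extract_code_blocks(response: str) -> list[str]:
    """Return the contents of every triple-backtick block in a response."""
    return [block.strip() for block in CODE_BLOCK_RE.findall(response)]

def grade_response(response: str, passes_tests) -> bool:
    """Count the response as correct if ANY extracted block passes the tests,
    instead of grading only the last (or first) block."""
    return any(passes_tests(code) for code in extract_code_blocks(response))
```

Grading `any` block this way removes the dependence on where the model happens to place its final answer.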
The typo uplift anomaly (blue dots, purple triangles) disappears after accounting for the multiple code blocks (black dashed line). The blue area (single-block correct) refers to responses that only have one code block and the code is correct.
Once all code blocks are accounted for, Haiku's accuracy no longer increases with typo rate. The apparent uplift was an output-formatting behavior shift that gradually aligned with the grader, not an actual capability increase.
Takeaways for the Eval Engineer
Not all grading harnesses are created equal
We think this is a lesson on building grading harnesses. The official BigCodeBench repo uses a fancier approach, extracting the longest syntactically valid Python code block, while Inspect Evals explicitly extracts the first code block. Since we had to inject typos, we couldn't use either of these harnesses, so we built our own. By tuning this harness alone, without changing the scaffold or underlying model, we "improved" Haiku's score from 31% to 53%. Epoch AI has a great piece detailing how design choices at each stage of a benchmark can significantly affect the results.
Scores are lower bounds
Design choices in the grading infrastructure can cause a model's score to be lower than what the model can actually achieve. Even if we had used the Inspect Evals harness, we would have underestimated the models' capabilities by missing the orange area in the plot above. We recommend tuning the grading infrastructure and observing how a model's score changes, to prevent as much of this effect as possible.
Aligning the model to the eval
A more robust eval could let the model know how the grading works, so it can format its responses accordingly. In our case, that would mean telling the model how code blocks will be extracted, or instructing it to output code blocks in a specific format. Note the tradeoff: this invites eval awareness and possibly reward hacking.
Thanks to Roy Rinberg for going down this rabbit hole together.
Appendix
To inject typos, we made a repo to model the human typo distribution.
The human typo rate is about 3%. Models are mostly robust within that range (see second figure of this post). The typo rates we used extend up to 16%:
| Typo Rate | Injected Text |
| --- | --- |
| 0% | The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet and is commonly used for testing purposes. |
| 0.5% | The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet and is commonly used for tessting purposes. |
| 1.0% | The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet and is commonly used for teesting purposes. |
| 2.0% | The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet ane is commmonly used for testing purposes. |
| 4.0% | Thhe quicj brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet and is commmonly used for testing purposes. |
| 8.0% | Thhe quicj brown fox jumps over the lazy dog. This setence conyaiins every letter of the alphabet and is commonly used for tseting purposes. |
| 16.0% | Thhe qucuk brown fox jjmuups over the azy dog. Thi sentence conntain every lerero f the alpyabet and s coommmonlt used for testing pufpoar. |
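The injection itself can be approximated with simple character-level edits. Below is a minimal, hypothetical sketch at a target rate; the actual repo models the human typo distribution more faithfully (e.g. weighting errors by keyboard adjacency):

```python
import random

def inject_typos(text: str, rate: float, seed: int = 0) -> str:
    """Corrupt roughly `rate` of the alphabetic characters with common
    typo edits: duplicating a character, dropping it, or swapping it
    with its neighbor. Deterministic for a fixed seed."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["duplicate", "drop", "swap"])
            if op == "duplicate":
                out.extend([c, c])
            elif op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])
                i += 1  # the neighbor was consumed by the swap
            # "drop": emit nothing
        else:
            out.append(c)
        i += 1
    return "".join(out)
```

Seeding the generator keeps the corrupted prompts reproducible across models, so every model sees the same noisy text at a given rate.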
Repo link for this experiment (the initial motivation was for eval awareness, hence the name).
In all experiments, we turned off "Extended Thinking", i.e. hidden chain-of-thought.
Code blocks are code wrapped in triple backticks, as common in responses formatted in markdown.