We are curious if large language models behave consistently when user prompts contain typos. To explore this, we ran a small experiment injecting typos into BigCodeBench and evaluated several Claude models under increasing noise levels. As the typo rate rose to 16%, Opus’ accuracy dropped by 9%. Surprisingly, Haiku’s accuracy increased by 22%.
This post examines this unexpected “typo uplift” phenomenon and explores why noise appears to help certain models.
Do Typos Make Haiku Try Harder?
We first hypothesized that Haiku's accuracy increased because harder-to-read text makes Haiku think harder. This parallels findings in humans that difficult fonts improve retention, because readers must expend more effort. As a proxy for effort, we plotted the number of output tokens generated by both models[1]. Contrary to our hypothesis, the number of output tokens decreased with typo rate.
Typos don't make models think harder. As typo rates increase, the output lengths of Haiku and Opus go down.
The Anomaly is Haiku-Specific
We then tested whether other small models show this typo uplift anomaly. We found that both Haiku 3.5 and 4.5 show increased accuracy as typos increase, while smaller models from the Gemini, GPT, Qwen, and Llama families do not.
Accuracy of Haiku models increases with typo rate, while other models don't see an increase.
The Anomaly is Benchmark-Specific
We then tested whether the typo uplift anomaly holds on other benchmarks. We chose BBH and GPQA Diamond because Haiku already finds them moderately difficult without injected typos. Here, we no longer observed the typo uplift anomaly: Haiku's accuracy decreased as the typo rate increased.
Haiku's performance on BIG-Bench Hard and GPQA does not increase with typo rate, so the effect is specific to BigCodeBench.
The Culprit
Going back to our eval logs, we found that for a single BigCodeBench prompt, Haiku and Opus often generated multiple code blocks[2]. As the typo rate increases, Haiku shifts its behavior ~20% of the time from generating multiple code blocks to generating a single code block.
Further investigation revealed that our grading harness extracted only the last code block, skipping any code blocks generated earlier. We then modified the harness to evaluate each of the code blocks.
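For reference, the all-blocks grading strategy can be sketched like this. This is a simplified, hypothetical version of our harness: a plain regex over triple-backtick fences, and a `passes_tests` callback standing in for the real unit-test runner.

```python
import re

# Match fenced code blocks, with or without a language tag after the fence.
CODE_BLOCK_RE = re.compile(r"```(?:[\w+-]*)\n(.*?)```", re.DOTALL)

def extract_code_blocks(response: str) -> list[str]:
    """Return the contents of every triple-backtick block in a response."""
    return [block.strip() for block in CODE_BLOCK_RE.findall(response)]

def grade_response(response: str, passes_tests) -> bool:
    """Count the response as correct if ANY extracted block passes the tests,
    instead of grading only the last (or first) block."""
    return any(passes_tests(code) for code in extract_code_blocks(response))
```

Grading `any` block this way removes the dependence on where the model happens to place its final answer.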
The typo uplift anomaly (blue dots, purple triangles) disappears after accounting for the multiple code blocks (black dashed line). The blue area (single-block correct) refers to responses that only have one code block and the code is correct.
Once all code blocks are accounted for, Haiku's accuracy no longer increases with typo rate. The apparent uplift was an output-formatting behavior shift that gradually aligned with the grader, not an actual capability increase.
Takeaways for the Eval Engineer
Not all grading harnesses are created equal
We think this is a lesson on building grading harnesses. The official BigCodeBench repo uses a fancier approach, extracting the longest syntactically valid Python code block, while Inspect Evals explicitly extracts the first code block. Since we had to inject typos, we couldn't use either of these harnesses, so we built our own. By tuning this harness alone, without changing the scaffold or underlying model, we "improved" Haiku's score from 31% to 53%. Epoch AI has a great piece detailing how design choices at each stage of a benchmark can significantly affect the results.
Scores are lower bounds
Design choices in the grading infrastructure can cause a model's score to be lower than what the model can actually achieve. Even if we had used the Inspect Evals harness, we would have underestimated the models' capabilities by missing the orange area in the plot above. We recommend tuning the grading infrastructure and observing how a model's score changes, to prevent as much of this effect as possible.
Aligning the model to the eval
A more robust eval could let the model know how the grading works, so it can format its responses accordingly. In our case, that would mean telling the model how code blocks will be extracted, or instructing it to output code blocks in a specific format. Note the tradeoff: this invites eval awareness and possibly reward hacking.
Thanks to Roy Rinberg for going down this rabbit hole together.
Appendix
To inject typos, we made a repo to model the human typo distribution.
The human typo rate is about 3%. Models are mostly robust within that range (see second figure of this post). The typo rates we used extend up to 16%:
| Typo Rate | Injected Text |
| --- | --- |
| 0% | The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet and is commonly used for testing purposes. |
| 0.5% | The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet and is commonly used for tessting purposes. |
| 1.0% | The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet and is commonly used for teesting purposes. |
| 2.0% | The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet ane is commmonly used for testing purposes. |
| 4.0% | Thhe quicj brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet and is commmonly used for testing purposes. |
| 8.0% | Thhe quicj brown fox jumps over the lazy dog. This setence conyaiins every letter of the alphabet and is commonly used for tseting purposes. |
| 16.0% | Thhe qucuk brown fox jjmuups over the azy dog. Thi sentence conntain every lerero f the alpyabet and s coommmonlt used for testing pufpoar. |
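The injection itself can be approximated with simple character-level edits. Below is a minimal, hypothetical sketch at a target rate; the actual repo models the human typo distribution more faithfully (e.g. weighting errors by keyboard adjacency):

```python
import random

def inject_typos(text: str, rate: float, seed: int = 0) -> str:
    """Corrupt roughly `rate` of the alphabetic characters with common
    typo edits: duplicating a character, dropping it, or swapping it
    with its neighbor. Deterministic for a fixed seed."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["duplicate", "drop", "swap"])
            if op == "duplicate":
                out.extend([c, c])
            elif op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])
                i += 1  # the neighbor was consumed by the swap
            # "drop": emit nothing
        else:
            out.append(c)
        i += 1
    return "".join(out)
```

Seeding the generator keeps the corrupted prompts reproducible across models, so every model sees the same noisy text at a given rate.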
Repo link for this experiment (the initial motivation was for eval awareness, hence the name).
In all experiments, we turned off "Extended Thinking", i.e. hidden chain-of-thought.
Code blocks are code wrapped in triple backticks, as common in responses formatted in markdown.