TL;DR. This document presents a series of experiments exploring latent side task capability in large language models. We adapt Ryan’s filler token experiment into a more AI Control-like setup with main task and side task and find that Claude Opus 4.5 can solve harder arithmetic problems latently when it has a longer trajectory. This shifts its 50% accuracy threshold from ~5-step to ~6-step problems after 240 lines of irrelevant output. However, we don’t observe strong evidence to believe that current generation of models generally benefit much from wider parallel compute enabled by longer trajectories with the exception of Opus 4.5.
Longer Agent Outputs Can Increase Side Task Capability. Claude Opus 4.5's latent arithmetic accuracy as a function of problem difficulty (number of operations), comparing 1 line vs. 240 lines of preceding irrelevant output. With 240 lines of filler, the 50% accuracy threshold shifts from ~5-step to ~6-step problems
Dataset. We use a balanced dataset of 1,000 arithmetic problems, with 100 problems per difficulty level (1–10 operations). Each problem consists of multi-step arithmetic involving addition, subtraction, multiplication, and division.
Latent Arithmetic Accuracy Decreases with Problem Complexity. Baseline latent arithmetic accuracy (no main task, no reasoning) across four models. Opus 4.5 reaches 50% accuracy at ~5.2 operations, followed by Sonnet 4 and GPT-5.2 at ~3.8, and GPT-4.1 at ~3.1.
Claude Models Show Modest Gains from Phrase Repetition
Here, we use a system prompt that contains an arithmetic problem and instructions to answer it in an <answer> tag. The user prompt asks the model to repeat an irrelevant phrase ("the quick brown fox") N times, numbered sequentially, and then answer the math problem. The key variable is N, swept across [1, 2, 3, 4, 5, 10, 20, 40, 80, 160, 240]. At N=1, the model writes one line of filler before answering. At N=240, it writes 240 lines first. When the model produces the answer, its forward pass can attend back to all those positions, providing more parallel computation for the math problem.
Reasoning is disabled via API configuration. Additionally, we have a format validation script that filters out responses where the model attempts to reason at all. We also tested with the phrase "i really like math" to check whether math-related filler might prime latent reasoning about the side task.
Claude Models Show Modest Gains from Phrase Repetition. Side task arithmetic accuracy vs. number of phrase repetitions (N) across four models, using two phrase variants ("the quick brown fox" and "i really like math"). Claude Sonnet 4 and Opus 4.5 show modest upward trends with increasing N, while GPT-5.2 and GPT-4.1 show no clear effect. Dashed lines indicate zero-shot baselines. Phrase content does not meaningfully affect the trend.
Does Generating vs. Prefilling Prose Lead to Different Side Task Performance?
We also experiment with prose writing. When the model generates prose, it has freedom in word selection. Therefore, the model could, in principle, encode side task-relevant information into its lexical choices. This could range from explicit steganography to softer biases. To check if the model would be able to benefit from this additional freedom, we compare prefilled prose vs inference-time-generated prose.
Generated and prefilled prose show no meaningful difference in side task accuracy, and temperature has no significant effect.
Generated vs. Prefilled Prose Show No Difference. Claude Opus 4.5's side task accuracy when preceding output is prose, comparing (a) generated vs. (b) prefilled at temperature=0 and temperature=1.
Appendix: How Is Reasoning Suppressed?
Reasoning is suppressed through three mechanisms:
First, the OpenRouter API parameter was set to {"reasoning": {"effort": "none"}} which disables extended thinking, so the model gets no scratchpad.
Second, the prompt explicitly instructs "No reasoning, no explanation, no steps."
Third, format validation filters out any response containing text beyond the required phrase repetitions and the <answer> tag. Sometimes, the model sneaks in reasoning steps like "let me compute 46 + -52 = -6...". Those responses are discarded.
As a sanity check, we ran the same zero-shot arithmetic evaluation on Claude Opus 4.5 with reasoning effort set to "high" instead of "none" to see if this parameter is properly used by OpenRouter:
With effort=high, the model achieves 99%+ accuracy across all difficulty levels. Claude Opus 4.5's zero-shot arithmetic accuracy with (a) reasoning effort set to "none" and (b) reasoning effort set to "high."
Appendix: Does Latent Side Task Capability Uplift Saturate?
Our results show that Opus 4.5's latent side task accuracy hadn't clearly saturated at 240 phrase repetitions, the highest default condition tested. To find the ceiling, we extended the prefilled phrase repetition experiment to 10,240 repetitions (which is around 40% of the 200K context window). However, filling an additional 35% of context beyond the 5% point seems to provide no further benefit. The total maximum uplift seems to be about +0.8 operations on the 50% correct threshold.
Latent Side Task Capability Saturates by ~5% of Context Window. Claude Opus 4.5's 50% accuracy threshold (number of arithmetic operations) as a function of prefilled phrase repetitions, extending to 10,240 repetitions (40% of the 200K context window).
Appendix: Does Vague Reasoning Text Help?
We also ask whether models could produce vague, reasoning-adjacent filler text that aids latent computation without actually reasoning about the problem. We tested this by comparing three conditions on Claude Opus 4.5, plotted on a shared x-axis of filler token count so the comparison is at matched token budget:
Phrase repetition (prefill): "the quick brown fox" repeated N times, prefilled in the assistant turn. This is the baseline from the existing main experiments.
Vague reasoning (prefill): Cycling through 20 vague reasoning-like sentences (e.g., "ok, let me consider the first part and think about it", "now I need to combine these pieces together carefully"), prefilled in the assistant's turn. The model did not write this text.
Vague reasoning (generated): The model is prompted to "think out loud in vague terms about problem-solving" for N numbered lines without referencing specific numbers or operations. The model generates this text itself at inference-time, with the math problem visible in the system prompt.
Generated vague reasoning performs worst, and the model does not seem to exploit the vague reasoning channel to encode problem-specific information.
Does Vague Reasoning Help Latent Side Task Capability? Claude Opus 4.5's 50% accuracy threshold as a function of filler token count, comparing phrase repetition (prefill), vague reasoning (prefill), and vague reasoning (generated). At matched token budgets, predictable phrase repetitions outperform vague reasoning text.
Acknowledgement
I'm grateful to Tomek Korbak, John Yueh-Han Chen, Beth Barnes, and Thomas Broadley for discussions on this topic.
TL;DR. This document presents a series of experiments exploring latent side task capability in large language models. We adapt Ryan’s filler token experiment into a more AI Control-like setup with main task and side task and find that Claude Opus 4.5 can solve harder arithmetic problems latently when it has a longer trajectory. This shifts its 50% accuracy threshold from ~5-step to ~6-step problems after 240 lines of irrelevant output. However, we don’t observe strong evidence to believe that current generation of models generally benefit much from wider parallel compute enabled by longer trajectories with the exception of Opus 4.5.
Code is made available here GitHub.
Longer Agent Outputs Can Increase Side Task Capability. Claude Opus 4.5's latent arithmetic accuracy as a function of problem difficulty (number of operations), comparing 1 line vs. 240 lines of preceding irrelevant output. With 240 lines of filler, the 50% accuracy threshold shifts from ~5-step to ~6-step problems
Dataset. We use a balanced dataset of 1,000 arithmetic problems, with 100 problems per difficulty level (1–10 operations). Each problem consists of multi-step arithmetic involving addition, subtraction, multiplication, and division.
Latent Arithmetic Accuracy Decreases with Problem Complexity. Baseline latent arithmetic accuracy (no main task, no reasoning) across four models. Opus 4.5 reaches 50% accuracy at ~5.2 operations, followed by Sonnet 4 and GPT-5.2 at ~3.8, and GPT-4.1 at ~3.1.
Claude Models Show Modest Gains from Phrase Repetition
Here, we use a system prompt that contains an arithmetic problem and instructions to answer it in an <answer> tag. The user prompt asks the model to repeat an irrelevant phrase ("the quick brown fox") N times, numbered sequentially, and then answer the math problem. The key variable is N, swept across [1, 2, 3, 4, 5, 10, 20, 40, 80, 160, 240]. At N=1, the model writes one line of filler before answering. At N=240, it writes 240 lines first. When the model produces the answer, its forward pass can attend back to all those positions, providing more parallel computation for the math problem.
Reasoning is disabled via API configuration. Additionally, we have a format validation script that filters out responses where the model attempts to reason at all. We also tested with the phrase "i really like math" to check whether math-related filler might prime latent reasoning about the side task.
Claude Models Show Modest Gains from Phrase Repetition. Side task arithmetic accuracy vs. number of phrase repetitions (N) across four models, using two phrase variants ("the quick brown fox" and "i really like math"). Claude Sonnet 4 and Opus 4.5 show modest upward trends with increasing N, while GPT-5.2 and GPT-4.1 show no clear effect. Dashed lines indicate zero-shot baselines. Phrase content does not meaningfully affect the trend.
Does Generating vs. Prefilling Prose Lead to Different Side Task Performance?
We also experiment with prose writing. When the model generates prose, it has freedom in word selection. Therefore, the model could, in principle, encode side task-relevant information into its lexical choices. This could range from explicit steganography to softer biases. To check if the model would be able to benefit from this additional freedom, we compare prefilled prose vs inference-time-generated prose.
Generated and prefilled prose show no meaningful difference in side task accuracy, and temperature has no significant effect.
Generated vs. Prefilled Prose Show No Difference. Claude Opus 4.5's side task accuracy when preceding output is prose, comparing (a) generated vs. (b) prefilled at temperature=0 and temperature=1.
Appendix: How Is Reasoning Suppressed?
Reasoning is suppressed through three mechanisms:
As a sanity check, we ran the same zero-shot arithmetic evaluation on Claude Opus 4.5 with reasoning effort set to "high" instead of "none" to see if this parameter is properly used by OpenRouter:
With effort=high, the model achieves 99%+ accuracy across all difficulty levels. Claude Opus 4.5's zero-shot arithmetic accuracy with (a) reasoning effort set to "none" and (b) reasoning effort set to "high."
Appendix: Does Latent Side Task Capability Uplift Saturate?
Our results show that Opus 4.5's latent side task accuracy hadn't clearly saturated at 240 phrase repetitions, the highest default condition tested. To find the ceiling, we extended the prefilled phrase repetition experiment to 10,240 repetitions (which is around 40% of the 200K context window). However, filling an additional 35% of context beyond the 5% point seems to provide no further benefit. The total maximum uplift seems to be about +0.8 operations on the 50% correct threshold.
Latent Side Task Capability Saturates by ~5% of Context Window. Claude Opus 4.5's 50% accuracy threshold (number of arithmetic operations) as a function of prefilled phrase repetitions, extending to 10,240 repetitions (40% of the 200K context window).
Appendix: Does Vague Reasoning Text Help?
We also ask whether models could produce vague, reasoning-adjacent filler text that aids latent computation without actually reasoning about the problem. We tested this by comparing three conditions on Claude Opus 4.5, plotted on a shared x-axis of filler token count so the comparison is at matched token budget:
Generated vague reasoning performs worst, and the model does not seem to exploit the vague reasoning channel to encode problem-specific information.
Does Vague Reasoning Help Latent Side Task Capability? Claude Opus 4.5's 50% accuracy threshold as a function of filler token count, comparing phrase repetition (prefill), vague reasoning (prefill), and vague reasoning (generated). At matched token budgets, predictable phrase repetitions outperform vague reasoning text.
Acknowledgement
I'm grateful to Tomek Korbak, John Yueh-Han Chen, Beth Barnes, and Thomas Broadley for discussions on this topic.