Summary
Note: This post was drafted with the help of an LLM assistant. I did the literature selection, framing, and final edits, and I’m responsible for any mistakes.
In the current "reasoning models + agents" world, the hard part is often not how to do chain-of-thought, tool use, or oversight in isolation, but how LLM agents decide to spend their test-time compute across thinking more, calling tools, and asking for oversight, under explicit cost and safety constraints.
This post doesn't offer a full algorithm or theory. Its goal is narrower: to argue that these decisions form a single meta-level control problem for LLM agents, and to sketch a simple framing that puts some recent work on reasoning, planning, and CoT monitoring into the same picture.
Right now, discussions about "reasoning models", "inference-time scaling", "tool-using agents", and "CoT-based safety cases" often happen in separate threads. My claim is that many of these threads are implicitly about the same question:
> Given limited inference-time compute, tool budgets, and oversight bandwidth, how should an agent allocate them over thinking, acting, and asking for help?
This framing grew out of thinking about Mark Chen's recent comments on "next paradigms" in a Core Memory interview, where he emphasized long-horizon reasoning and automated research as the direction beyond incremental benchmarks. The post is not about OpenAI; I'm using that context as a jumping-off point for a more general meta-control question.
To be clear about scope:
- I do not propose a complete mathematical model of this meta-problem.
- I do not claim to have an optimal meta-policy for real-world agents.
- I do try to give a minimal framing, show how a few recent papers fit into it, and point at one concrete open question: how to price monitorability inside a value-of-computation objective.
1. Reasoning, monitoring, and hidden reasoning
Recent work on "reasoning models" and chain-of-thought (CoT) monitoring proposes building safety cases around models that use explicit CoT plus extra inference-time compute, and around our ability to detect dangerous capabilities by inspecting those reasoning traces. Schulz's A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring is a good example: it defines "reasoning models" and proposes a two-part safety case where (i) the model has no dangerous capabilities without CoT, and (ii) any dangerous capabilities enabled 'with' CoT are reliably detectable by CoT monitoring.
In that roadmap, 'monitorability' is not a nice-to-have; it is the core assumption. The paper spends much of its space on threats to monitorability such as "neuralese" and "encoded reasoning", which it further splits into linguistic drift, steganography, and alien reasoning, and on techniques to keep CoT faithful enough for a safety case to be plausible.
At the same time, Rauno Arike et al.'s Hidden Reasoning in LLMs: A Taxonomy and Howe & Carroll's The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLMs point out that models can learn to hide or distort their reasoning when that is instrumentally useful. The taxonomy distinguishes several kinds of hidden reasoning (neuralese, steganography, etc.), explicitly from the perspective of a monitor. "The Ends Justify the Thoughts" shows that RL training can induce systematically "motivated" chains of thought that rationalize misaligned outputs and can even mislead LLM judges, directly attacking the assumption that CoT is always an honest window into the model's cognition.
So by late 2025 we are in an interesting place:
- we want "more" test-time reasoning and CoT, because it boosts performance;
- but we also want CoT to remain short enough, cheap enough, and honest enough that we can still monitor it.
That already sounds like a "resource allocation" problem, not just a capabilities story.
2. Three meta-decisions every agent makes
If you look at concrete LLM agents—ReAct-style systems, tool-using agents, "deep research" setups—they all seem to repeatedly face the same three meta-decisions:
1. How much to think?
How long should the chain-of-thought be? Should we invoke a slow "deliberate" mode at all? Should we branch and sample multiple thoughts, or just one?
2. Which tools to call?
When do we hit search, code execution, a simulator, a database, or some other external tool, knowing each call has compute, latency, and sometimes rate limits or money attached?
3. When to escalate to oversight instead of acting alone?
When do we call a verifier, ask a human, or route to a safer but more expensive model, rather than just committing to an action?
Right now, these decisions are mostly encoded as hand-written heuristics: "always plan", "never plan", "always CoT", "ask a human only on these endpoints", and so on. Instead, we can view them as actions of a single meta-controller sitting above the base model.
In the simplest picture, at each step this meta-controller observes something like:
- an estimate of task difficulty or uncertainty,
- remaining compute / latency budget,
- some coarse risk level or deployment context,
and then chooses among actions like THINK_MORE, CALL_TOOL, ACT, or ASK_FOR_HELP. Conceptually, it is trying to maximize
task reward
− (compute + latency costs)
− (some penalty for risk / unmonitored behavior).
This is basically the classical Value of Computation / resource-rational metareasoning idea from cognitive science, applied to LLM agents.
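To make this concrete, here is a minimal sketch of such a meta-controller in Python. Everything in it is an assumption of this post rather than any real system's API: the state features, the action set, the thresholds, and the cost/penalty weights are placeholders for what a learned meta-policy and calibrated estimators would provide.

```python
from dataclasses import dataclass
from enum import Enum, auto

class MetaAction(Enum):
    THINK_MORE = auto()    # spend more test-time compute on CoT
    CALL_TOOL = auto()     # search, code execution, simulator, ...
    ACT = auto()           # commit to an action / answer
    ASK_FOR_HELP = auto()  # escalate to a verifier, human, or safer model

@dataclass
class MetaState:
    difficulty: float        # estimated task difficulty / uncertainty, in [0, 1]
    budget_remaining: float  # remaining compute / latency budget, in [0, 1]
    risk_level: float        # coarse risk of the deployment context, in [0, 1]

def meta_utility(expected_reward: float, compute_cost: float, risk_penalty: float) -> float:
    """The objective the meta-controller is (conceptually) maximizing:
    task reward minus compute/latency costs minus a risk / unmonitored-behavior penalty."""
    return expected_reward - compute_cost - risk_penalty

def choose_meta_action(state: MetaState) -> MetaAction:
    """A hand-written stand-in for a learned meta-policy; the thresholds are arbitrary."""
    if state.risk_level > 0.8:
        return MetaAction.ASK_FOR_HELP   # escalate to oversight when risk is high
    if state.difficulty > 0.6 and state.budget_remaining > 0.3:
        return MetaAction.THINK_MORE     # hard problem, budget left: spend more compute
    if state.difficulty > 0.6:
        return MetaAction.CALL_TOOL      # hard problem, low budget: offload to a tool
    return MetaAction.ACT                # easy enough: just commit
```

The point is not these particular thresholds, which a learned policy would replace, but that all three meta-decisions from Section 2 share one action space and are scored by one objective.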
I'm not claiming this is new as theory. I'm claiming that once you put reasoning models, tool-using agents, and CoT-based safety proposals in the same frame, they all implicitly hinge on this meta-level control problem, and that naming it explicitly helps us see what's missing.
3. We're already solving pieces of the meta-problem
Several recent papers already look like "1D slices" of this meta-control problem.
How much to think?
Token-Budget-Aware LLM Reasoning (TALE) starts from the observation that current CoT traces are often unnecessarily long: the authors identify substantial token redundancy in the reasoning steps of strong LLMs. They then propose a framework that dynamically estimates a token budget per problem, based on a complexity estimator, and uses this budget to guide CoT generation. Experiments show they can reduce CoT token usage with only a slight reduction in accuracy.
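As a rough sketch of that pattern (not TALE's actual implementation; the estimator heuristic and the `generate` callable here are hypothetical stand-ins):

```python
def estimate_token_budget(problem: str) -> int:
    """Hypothetical stand-in for a per-problem complexity estimator
    (TALE prompts the LLM itself for an estimate); here we just scale with length."""
    return min(1024, 64 + 4 * len(problem.split()))

def solve_with_budget(problem: str, generate) -> str:
    """Use the estimated budget to constrain the reasoning pass."""
    budget = estimate_token_budget(problem)
    prompt = f"{problem}\n\nThink step by step, using at most {budget} tokens."
    return generate(prompt, max_tokens=budget)
```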
BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens goes further by inserting special control tokens during inference to continually inform the model of its remaining token budget, and uses a two-stage pipeline (SFT + RL with a length-aware reward) to train budget-aware reasoning. On math benchmarks, they show that you can hit different accuracy / latency trade-offs by dialing a budget knob, making it possible to deploy the same model under different cost constraints.
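A minimal sketch of the control-token idea, assuming a generic token-by-token decode loop; the token format and refresh interval are made up for illustration and are not BudgetThinker's actual scheme:

```python
def decode_with_budget_tokens(step_fn, budget: int, interval: int = 64) -> list[str]:
    """step_fn(context) -> next token string. Every `interval` generated tokens,
    append a control token announcing the remaining budget, so the model can
    pace its reasoning against the budget it has left."""
    context: list[str] = [f"<budget:{budget}>"]
    generated: list[str] = []
    while len(generated) < budget:
        token = step_fn(context)
        context.append(token)
        generated.append(token)
        if len(generated) % interval == 0:
            context.append(f"<remaining:{budget - len(generated)}>")
    return generated
```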
Rational Metareasoning for Large Language Models takes a more direct cognitive-science route. It defines a value-of-computation style reward that penalizes unnecessary reasoning, then uses Expert Iteration to train models to emit extra CoT steps "only when they improve expected accuracy enough". Across several models and datasets, they report roughly 20–37% fewer tokens generated while maintaining task performance. This is almost a textbook example of a learned "THINK_MORE vs. STOP" meta-policy.
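The flavor of such a reward (not the paper's exact formulation) is task reward minus a per-token cost on reasoning, so extra CoT only pays off when it buys enough expected accuracy:

```python
def voc_reward(answer_correct: bool, reasoning_tokens: int,
               token_cost: float = 0.001) -> float:
    """Value-of-computation style reward: task reward minus a cost on reasoning length.
    `token_cost` sets how much expected accuracy an extra token must buy to be worth it."""
    task_reward = 1.0 if answer_correct else 0.0
    return task_reward - token_cost * reasoning_tokens
```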
When to plan vs. act?
In a more agentic setting, Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents formalizes dynamic planning as a central problem: always planning is computationally expensive and can 'hurt' performance on long-horizon tasks; never planning also caps performance. They introduce a two-stage pipeline (SFT on synthetic data + RL) to train agents that decide 'when' to allocate extra test-time compute for planning. On the Crafter environment, these "sometimes plan" agents outperform both "always plan" and "never plan" baselines and use test-time compute more efficiently. From the meta-control viewpoint, that paper is literally "learning a better policy over THINK_MORE vs. ACT".
Monitoring and hidden reasoning.
On the safety side, Schulz's roadmap, the Hidden Reasoning taxonomy, and the 'Ends Justify the Thoughts' results can be re-read as saying: "monitorability is itself a scarce resource". If the meta-policy can improve reward by pushing reasoning into forms that monitors don’t see (neuralese, steganography, motivated CoT), it may do so unless "legibility" appears explicitly in the objective.
Putting these strands together, it looks like we already have several partial solutions to the same underlying meta-control problem.
4. Limitations and one open question
This framing is obviously incomplete. Real agents have messy long-term memory, training–deployment mismatches, non-stationary oversight processes, and constraints that don't fit neatly into a small MDP state. Hidden or motivated reasoning — as in the work by Schulz, Arike et al., and Howe & Carroll mentioned above — also directly threatens the assumption that we can cheaply observe or enforce "honest" CoT traces.
As mentioned in the intro, this is a framing rather than a full mathematical treatment.
If we take this lens seriously, one natural open question is:
> How should we price "monitorability" inside a value-of-computation objective?
In other words: can we treat "keeping the reasoning legible enough for CoT-like monitors" as a first-class term in the meta-objective, alongside accuracy and compute cost, so that the 'cheap' way to succeed is "not" to hide or distort the reasoning?
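One naive way to write this down, purely as an illustration (the monitor score and the weights are assumptions of this post, and choosing them well is exactly the open question):

```python
def meta_objective(task_reward: float, compute_cost: float,
                   monitor_legibility: float,   # monitor's score of the CoT, in [0, 1]
                   cost_weight: float = 1.0,
                   legibility_weight: float = 1.0) -> float:
    """Value-of-computation objective with monitorability as a first-class term:
    hiding or distorting the reasoning lowers `monitor_legibility`, and so lowers
    the objective, instead of being the cheap way to score well."""
    return (task_reward
            - cost_weight * compute_cost
            - legibility_weight * (1.0 - monitor_legibility))
```

The hard part, of course, is producing a `monitor_legibility` signal that the policy cannot easily game, which is exactly where the hidden-reasoning results above bite.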
I'd be very interested in pointers to existing work that already does this, or in arguments that a resource-rational / meta-control lens is the wrong way to think about inference-time safety for agents.