We keep building AI systems that treat “produce an answer” as the default response to “receive an input”. This is the computational equivalent of replying to every email the moment it hits your inbox, including the ones that are obviously spam, wildly outside your job description, or written in a language you don’t speak.
It works, right up until it doesn’t.
When it doesn’t, you don’t just get wrong answers. You get a specific kind of wrongness: confident answers to the wrong question, expensive reasoning on tasks that were trivial, and a sort of polite improvisation when the task was fundamentally not doable in the first place. The system never pauses to ask a very basic, very important question:
‘Is it worth thinking about this at all?’
This post is about putting that question upstream, as a small, explicit step in the pipeline that decides whether to (a) handle the input cheaply, (b) pay for the expensive black-box model to really think, or (c) refuse and log the case rather than guessing. If you like rationality framings: it’s an attempt to make attention allocation and abstention concrete in an AI architecture, instead of leaving them as accidental side effects of whatever the downstream model feels like doing today.
I’m not claiming this is the One True Way to build systems. It’s more like a missing primitive that seems oddly absent given how much we talk about compute constraints, calibration, and epistemic humility. If you already do something like this, I’m genuinely interested in how you formalise it, because the devil is very much in the details and the details are where most obvious ideas go to die.
Let me start with a failure mode I’ve seen repeatedly in modern AI pipelines, and I’ll deliberately keep it unromantic.
Imagine you have an application that takes user requests and routes them to an AI model. The model might be a large language model, or a multimodal model, or an agentic system that can call tools. The specifics do not matter. What matters is that, in the architecture, this downstream model is treated as a black box that you pay to run.
I’m going to call that downstream model System-2. Not because I think it has human-style System-2 cognition, but because it plays a similar role: it is the slow, expensive, deliberative mode that gets invoked when you want the system to really think. In practice, it’s any model or agent you treat as the heavyweight step in your pipeline.
Now here’s the key: most systems implicitly assume System-2 should engage by default.
A Harder Look at the Problem
That default is so common that it barely looks like a choice. An input arrives, the expensive model wakes up, something comes out, everyone pretends that was the only sensible way the story could have gone.
Here’s the failure mode: the system behaves as if every prompt deserves the same kind of attention.
A decent chunk of real inputs are boring in one of three ways:
Some are trivial: the answer is already in a cache, or can be handled by a cheap lookup, or is a simple transformation you don’t need a ‘thinking’ model for.
Some are ill-posed: missing key information, internally inconsistent, or phrased in a way where any crisp answer would be more guess than inference.
And some are out-of-scope: the system, as deployed, just doesn’t have the tools, data, or reliability needed to do the thing being asked.
Yet the default pipeline treats all three as identical. It escalates to the heavyweight model anyway.
This creates a particular pattern of bad behaviour.
First, you get the obvious waste: expensive reasoning spent on tasks that did not merit it. At scale, this is not a rounding error. It shows up as latency, cost, and the slow creep of 'why is our system so sluggish?' in places where the inputs themselves were not the problem.
Second, you get something more epistemically corrosive: when the input is ill-posed, the system often does what humans do under cognitive pressure. It quietly replaces the hard question with a nearby easier one, then answers that instead: in effect, a straw man of the original question. The output can be coherent, even helpful-looking, while being unanchored to the actual request. If you’ve ever watched a model give a confident explanation that is technically a response to some question, just not the one you asked, you’ve seen this.
Third, out-of-scope inputs produce the worst combination: the system tries, fails, and returns a confident-looking artefact anyway, because the architecture doesn’t give it a dignified way to abstain. Refusal, when it exists, is usually bolted on downstream as a policy constraint or a safety filter, not treated as a rational control action. So you either get brittle ‘no’ responses for the wrong reasons, or you get ‘yes-ish’ responses with no warning label.
Notice what’s missing here: it’s not just model capability. It’s a decision procedure. In a rational agent, ‘should I spend effort on this?’ is part of the algorithm. In many AI stacks, it’s an accident.
You can try to patch this with better prompting, bigger models, post-hoc validators, confidence scores, guardrails, tool checks, retrieval, caching. All useful. But they mostly share one feature: they operate after you’ve already committed to engaging the heavyweight system. You paid the compute, you generated the narrative, and then you try to clean it up.
So, what would it look like to make engagement a decision?
The Missing Primitive
What I want to explore is a cleaner intervention: put a small metacognitive step before the expensive model. Its job is not to solve the task. Its job is to decide, explicitly, whether the expensive model should even be invoked, and to do so in a way you can inspect and iterate on.
If that sounds like obvious engineering hygiene, good. The surprise is how rarely it is treated as a first-class part of the design.
Here’s the idea in plain terms.
You have some downstream model or agent that does the heavy lifting. I called it System-2 earlier, but I’ll restate the definition so it stays crisp: in this post, System-2 is any downstream AI component you treat as the expensive black box. It might be a large model, a tool-using agent, a chain-of-thought style reasoner, or just the part of your pipeline you only want to run when it’s worth paying for. The name is borrowed from the Kahneman metaphor purely to indicate role (slow/expensive vs fast/cheap), not to make claims about the model having two modes of cognition.
Now insert one lightweight component upstream. Call it a router, gate, or metacognitive layer. I’ll use metacognitive router because it describes the intent: it’s making a decision about whether to think.
Its output is not an answer. It’s an action.
There are three actions that matter:
Accept: handle the input without waking up System-2. This might be a cached answer, a retrieval lookup, a deterministic transformation, or a cheaper model. The key point is that you’ve decided the expensive deliberation step is not justified.
Escalate: pay the cost. Invoke System-2, run the expensive reasoning or tool use, and produce the best answer you can.
Refuse: do not guess. In the mild version, refusal can look like asking for missing information. In the stronger version, it is a deliberate abstention: the system admits it cannot reliably handle this input under current constraints, logs the case, and moves on. The important part is that refusal is treated as a rational control action, not as a failure state or a bolt-on policy filter.
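As code, the action space is deliberately tiny. A minimal sketch in Python (the names are illustrative, not from any particular framework):

```python
from enum import Enum

class Action(Enum):
    ACCEPT = "accept"      # handle cheaply: cache hit, retrieval, small model
    ESCALATE = "escalate"  # invoke the expensive System-2 step
    REFUSE = "refuse"      # abstain, log the case, optionally ask to clarify
```

The point of making this an explicit type rather than an implicit behaviour is that every downstream component can now see, and log, which branch was taken.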
This triage framing is doing more work than it seems.
It makes explicit something that is currently implicit and messy. Every system already has an engagement policy, it’s just usually buried inside the downstream black box, or spread across ad-hoc rules, or expressed indirectly as failure. The metacognitive router pulls that policy into the open, where you can actually argue about it.
It also forces you to state what you care about. Are you optimising for coverage, cost, latency, safety, user trust, or some weighted mixture? A router can encode those trade-offs directly. Without it, you get whatever behaviour falls out of the downstream model’s incentives and training distribution, which is a bit like letting your calendar plan your life.
At this point a reasonable objection is: isn’t this just a fancy name for a classifier that says hard vs easy?
Not quite. Difficulty is part of it, but the decision you actually want is closer to: what is the expected value of spending expensive compute here? Sometimes a task is ‘hard’ in the sense of requiring reasoning, and yet escalation is still not worth it because the request is ill-posed. Sometimes a task is ‘easy’ and escalation is still worth it because the stakes are high. Sometimes the right action is refusal because the system cannot do the thing reliably at all, and any attempt would just create confident sludge.
If this feels like smuggling in rationality language, that’s because it is. The router is basically an explicit attention allocator: it decides where expensive cognition goes, and it records why.
The Diagram: What Actually Sits Where
The easiest way to see what I mean is to draw it as a simple flow chart. Not a grand theory of mind. More like plumbing.
Start at the top: Raw Input x. A user request, an agent subtask, a tool result you need to interpret. Whatever arrives at the boundary of your system.
The first thing that happens is not reasoning. It’s encoding.
You run x through a TaskEncoder (Embedding e). In practice this might be an embedding model, a small language model producing a structured summary, or a hand-engineered feature extractor if you live in a world where those still exist. The output, e, isn’t an answer. It’s a representation of the task that makes the next steps cheap and comparable across inputs.
From e, the diagram fans out into four upstream modules. These are intentionally lightweight, and each one produces a signal that is about the task rather than its solution.
First, there’s the Heuristic Module (conf, comp, feas). This is where the router forms quick, inspectable estimates like:
conf: how confident the router is in its own judgement about routing (not the downstream model’s bravado)
comp: a proxy for predicted complexity or deliberation cost
feas: feasibility, meaning whether the request is well-posed and achievable given current tools, permissions, and information
Second, there’s a Novelty Detector (novelty). This is your out-of-distribution / unfamiliarity signal: how much this input resembles what the system has seen and succeeded on before. It can be crude (embedding distance to past tasks) or sophisticated. The point is: novelty exists, and ignoring it is how systems end up confidently improvising in unfamiliar territory.
Third, there’s the DeltaRegressor (Δ-hat). This is doing one specific job: estimate the expected benefit of escalating to the expensive model. Δ-hat is meant to capture how much better you expect the outcome to be if you pay for System-2, compared to staying on the fast path. The benefit can be measured in whatever currency your system actually cares about (accuracy, utility, avoided harm, reduced uncertainty), but the decision-relevant part is: is escalation worth it here?
Fourth, there’s the Experience Buffer (for replay). This stores past tasks and outcomes so you can calibrate or retrain the router over time, rather than freezing the routing policy as a set of thresholds. In other words: you want the router to get less wrong as it sees more of the world.
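Bundled together, the upstream signals might look like a small record. A sketch, assuming Python dataclasses, with field names taken from the diagram:

```python
from dataclasses import dataclass

@dataclass
class RouterSignals:
    conf: float       # router's confidence in its own routing judgement
    comp: float       # proxy for predicted complexity / deliberation cost
    feas: float       # feasibility: well-posed and achievable as asked?
    novelty: float    # distance from tasks the system has handled before
    delta_hat: float  # estimated benefit of escalating to System-2
```

Nothing in this record is an answer to the task. That separation is the whole design.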
All of these signals feed into the MetaController (ACCEPT / ESCALATE / REFUSE), which is the only box in the middle that is allowed to make a decision. The controller can start off rule-based (thresholds, if-statements, conservative handling under high novelty), and later become learned. Either way, its output is an action, not an answer:
Fast Accept Path: handle the input without waking up System-2 (cache, retrieval, deterministic transform, cheap model).
Escalate to System-2: invoke the expensive black-box model, producing System-2 Output y2.
Refuse for Future Learning: abstain (or request clarification), and store the case in the RefusalBuffer.
The RefusalBuffer matters because refusal should not be a trash can. It’s a curated dataset of edge cases: ill-posed requests, out-of-scope demands, and genuinely novel inputs. The right response to many of these is not to guess, but also not to forget them.
That’s why the diagram includes Deferred Reconsideration (Periodic): the system can periodically revisit refused cases as it accumulates experience, updates its policies, or improves its downstream capabilities. Refusal becomes a way to say not yet, rather than never.
If you want a one-sentence summary of the diagram, it’s this: we separate the decision to spend compute from the computation itself, and we make that decision explicit enough that it can be inspected, criticised, and improved.
From Conceptualising to Building
At this point, the idea is either obvious in a satisfying way, or obvious in a suspicious way. So let’s cash it out into a concrete design.
I’ll sketch two versions. The first is the minimal thing that is genuinely implementable without turning your codebase into a dissertation. The second is the version where you start taking the router seriously as a learned component, and you try to ground the decision in outcomes rather than gut feel.
The minimal version: rules first, humility always
You already have the building blocks from the diagram: an embedding for the task, a few cheap meta-signals, and a controller that chooses Accept / Escalate / Refuse.
In the minimal version, the MetaController is just a small set of thresholds. The main trick is choosing thresholds that encode the behaviour you actually want, instead of the behaviour that feels brave in a demo.
A reasonable first policy looks like this:
If feasibility is low, refuse or ask a clarifying question rather than escalating. This prevents the most common failure pattern: paying for expensive reasoning on an ill-posed request and then getting a beautifully written hallucination.
If novelty is high, be conservative. Either escalate (if the stakes justify it) or refuse (if you expect System-2 to freestyle). High novelty is where your system is most likely to be confidently wrong.
If Δ-hat is low, accept on the fast path. Even if System-2 could do better, the expected improvement isn’t worth the compute.
If Δ-hat is high, escalate. This is the easy one, but you’d be surprised how often systems skip the step of making this explicit.
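The four rules above translate directly into a few if-statements. A sketch of the rule-based MetaController, where the thresholds are placeholders you would tune for your own system, not recommendations:

```python
from collections import namedtuple

# Minimal signal bundle; a fuller version would include conf and comp too.
Signals = namedtuple("Signals", "feas novelty delta_hat")

def decide(s, feas_min=0.4, novelty_max=0.7, delta_min=0.15):
    """Rule-based MetaController. Threshold values are illustrative."""
    if s.feas < feas_min:
        return "refuse"        # ill-posed: don't pay for a well-written guess
    if s.novelty > novelty_max:
        # unfamiliar territory: escalate only if it looks clearly worth it
        return "escalate" if s.delta_hat >= delta_min else "refuse"
    if s.delta_hat < delta_min:
        return "accept"        # expected improvement doesn't justify the cost
    return "escalate"
```

Note the ordering: feasibility is checked before anything else, so an incoherent request never reaches the Δ-hat comparison at all.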
Notice what’s not in that policy: it doesn’t say “escalate whenever confidence is low”. That’s a tempting rule, but it can be backwards. Low confidence might mean the task is hard but feasible, or it might mean the task is incoherent. Escalating incoherence is how you get expensive nonsense.
In practice, you’ll end up with a handful of if-statements that look exhausting on paper and save you a lot of grief in production. This is good. Exhausting is a feature here.
If you want a small dose of cheekiness: a lot of current systems are built as if compute is free and humility is expensive. This flips that.
Where do the signals come from?
The minimal version still needs you to compute feasibility, novelty, and Δ-hat in some way that isn’t pure hand-waving.
Novelty can start as embedding distance to a store of previous tasks that were handled successfully. It’s not perfect, but it’s often good enough to detect when the system is in unfamiliar territory.
Feasibility can be a cheap classifier that looks for patterns of underspecification or impossibility. Does the request require information that wasn’t provided? Does it ask for an action the system cannot take? Does it contain contradictions? You can implement a crude feasibility estimator with a small model or even a set of heuristics, then gradually improve it by learning from refusal outcomes.
Δ-hat can start as a proxy rather than a direct prediction. One simple approach is to estimate “how much worse would a fast-path answer be likely to be”, which you can approximate using novelty, complexity, and task type. You don’t need perfect Δ-hat to get value. You need it to be directionally correct often enough that you’re not escalating everything.
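For instance, the embedding-distance version of novelty is only a few lines, assuming NumPy and a store of embeddings from previously successful tasks:

```python
import numpy as np

def novelty(e, past_embeddings):
    """Novelty as 1 minus max cosine similarity to past successful tasks."""
    if len(past_embeddings) == 0:
        return 1.0  # nothing seen before: maximally novel
    E = np.asarray(past_embeddings, dtype=float)
    e = np.asarray(e, dtype=float)
    sims = E @ e / (np.linalg.norm(E, axis=1) * np.linalg.norm(e) + 1e-12)
    return float(1.0 - sims.max())
```

This is the crude end of the spectrum, but it already gives you a number that trends upwards exactly when the system is leaving familiar territory.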
This is an important theme: the router doesn’t require mystical accuracy. It requires useful calibration, because it’s making a routing decision under uncertainty, not solving the original task.
The more serious version: learn Δ-hat from outcomes
Once the minimal system exists, you can make it much less arbitrary by learning parts of it. The natural thing to learn is Δ-hat: predicted benefit of escalation.
Here’s the clean training story.
For a subset of tasks, you run both paths:
produce a fast-path output (cheap)
produce a System-2 output (expensive)
Then you score them using your chosen metric. That metric depends on your domain: it might be human preference judgements, task success rates, correctness on eval sets, downstream utility, or a bundle of signals.
The difference between the expensive score and the cheap score is your empirical Δ for that task. You train the DeltaRegressor to predict that difference from the upstream representation and meta-features.
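A minimal version of that training step, assuming NumPy and a linear model fit by least squares (a real system would likely use something richer, but the shape of the data is the same):

```python
import numpy as np

def fit_delta_regressor(features, cheap_scores, expensive_scores):
    """Fit linear weights predicting empirical delta = expensive - cheap."""
    X = np.column_stack([np.asarray(features, dtype=float),
                         np.ones(len(features))])  # bias column
    delta = (np.asarray(expensive_scores, dtype=float)
             - np.asarray(cheap_scores, dtype=float))
    w, *_ = np.linalg.lstsq(X, delta, rcond=None)
    return w

def predict_delta(w, feature_row):
    x = np.append(np.asarray(feature_row, dtype=float), 1.0)
    return float(x @ w)
```

The target variable here is the score gap itself, not task correctness, which is what keeps the regressor aligned with the routing decision rather than with general capability.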
This is nice because it matches the decision you actually care about. You’re not training a model to be generally accurate. You’re training a model to predict when paying for expensive compute will change the outcome in a meaningful way.
You can also calibrate feasibility from data. Refusals that turned out to be genuinely impossible should raise feasibility thresholds. Refusals that, in hindsight, were doable can be used as negative examples, so the router becomes less timid.
The bit that makes refusal useful rather than embarrassing
The RefusalBuffer + deferred reconsideration loop in the diagram is not cosmetic.
If you treat refusal as a first-class outcome, you can do something that’s hard to do when everything is forced through System-2: you can accumulate a structured dataset of the system’s boundary.
Refused cases are often the best training data you can ask for, because they cluster around ambiguity, novelty, and missing constraints. They tell you what users ask for that your system cannot yet handle reliably. They also tell you what your router is uncertain about.
With deferred reconsideration, you can periodically replay refused tasks through an updated router, or through a stronger System-2, or with new tools. Some will remain refusals. Some will become escalations. Some will become cheap accepts because the system has effectively been “vaccinated” against a class of previously novel tasks.
That’s the pragmatic payoff: refusal stops being a dead end and becomes a mechanism for controlled expansion of competence.
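The replay loop itself is simple. A sketch where `decide_fn` stands in for whatever updated routing policy you have at reconsideration time (the names are hypothetical):

```python
def reconsider(refusal_buffer, decide_fn):
    """Replay refused cases through an updated routing policy."""
    still_refused, reclassified = [], []
    for case in refusal_buffer:
        action = decide_fn(case["signals"])
        if action == "refuse":
            still_refused.append(case)
        else:
            reclassified.append((case, action))  # now accept or escalate
    return still_refused, reclassified
```

The reclassified cases are the interesting output: each one is evidence that the system's boundary has moved, and in which direction.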
A concrete example (so this doesn’t stay abstract)
Suppose your system receives: “Give me a confident answer about X, and don’t mention uncertainty.”
A naive pipeline escalates to System-2 and produces a confident answer, because it was asked to. A router-equipped pipeline, by contrast, sees low feasibility (the request is pushing the system towards epistemically bad behaviour) and high risk of harm to truthfulness. The rational action is refusal or a reframe request, not escalation.
Or suppose the input is: “Summarise this 2-paragraph text.”
Novelty low, feasibility high, Δ-hat low. Fast accept. Don’t wake up the expensive model. Nobody gets a medal for burning compute on summarising two paragraphs.
This is the kind of thing the router is for: not intelligence, but allocation.
Claims, Trade-Offs and Ways This Can Fail
Yes, you can absolutely over-refuse and build a system that is safely useless, which is an accomplishment of sorts but not the one we’re aiming for.
So, if you’re reading this with the mild suspicion that I’m reinventing a fancy if statement, that’s healthy. The whole point of making the routing policy explicit is that you can interrogate it like any other decision procedure: what does it buy you, what does it cost you, and how does it break?
Here are the claims I think this architecture can support, stated in the least heroic way I can manage.
Claim 1: You reduce wasted compute without gambling on quality
This one is almost tautological. If you introduce an explicit fast accept path, and your router is even moderately calibrated, you will stop paying System-2 prices for tasks that don’t need System-2 effort.
The interesting part is not saving compute on obviously trivial queries. It’s saving compute on the tasks where escalation produces a longer answer rather than a better one. There are many of these. They’re just harder to notice because they look industrious.
The risk, of course, is rejecting tasks that would have benefited from escalation. That’s why the relevant metric isn’t raw accuracy, it’s accuracy conditional on acceptance and escalation, plus coverage. More on evaluation in a moment.
Claim 2: You get a cleaner form of epistemic humility
Most current pipelines have one epistemic mode: answer. Sometimes they wrap that answer in cautionary language. Sometimes they refuse for policy reasons. But they rarely have an explicit, inspectable mechanism for abstention based on feasibility, novelty, or expected value of computation.
Treating Refuse as a rational action changes the system’s failure modes. It turns some hallucinations into clarifying questions, and some confident nonsense into logged edge cases. That’s not magical. It’s just giving the system a way to say “I don’t know” that isn’t socially awkward.
There’s a subtle benefit here that I care about more than tone: when refusal is allowed, you can design the system to be conservative under novelty. That is exactly when black-box models tend to be at their most eloquently wrong.
Claim 3: You gain auditability, which matters even when you’re wrong
A router produces artefacts: meta-features, novelty scores, Δ-hat estimates, and a decision. That gives you something to inspect after the fact.
When the system escalates and fails, you can ask: did we mis-estimate feasibility, mis-read novelty, overestimate Δ-hat, or set thresholds poorly? When the system refuses and it turns out it could have answered, you can ask the symmetrical question.
Without an explicit router, these post-mortems are mushy. Everything disappears into the downstream model’s internal process, and your only lever is to change the model or add more patches downstream.
This separation of concerns is, in my view, one of the most compelling reasons to do it. It turns a vague behavioural complaint into a concrete policy bug.
Now the trade-offs. Because yes, you can make this worse.
Trade-off 1: Coverage vs reliability is no longer hidden
A system that always answers maximises coverage. It does not maximise truthfulness. A system that refuses too readily maximises something like safety but becomes useless.
The router makes this trade-off explicit. That’s good, but it also means you have to own the decision.
If you tune the router for high reliability, you will refuse more. If you tune it for high coverage, you will escalate more and potentially hallucinate more. There isn’t a free lunch. The router just forces you to pick a lunch deliberately.
Trade-off 2: The router becomes a new thing that can be wrong
This is the obvious critique, and it’s valid. You’re adding a component that can misroute tasks.
There are a few distinct failure modes:
Under-escalation: tasks that should have gone to System-2 get fast-accepted, degrading quality.
Over-escalation: the router punts too much to System-2, and you get the original problem back with extra steps.
Over-refusal: the system becomes cautious to the point of paralysis.
Misplaced confidence: the router believes it understands a task it doesn’t, because embeddings can be misleading and novelty detection is imperfect.
In other words, you don’t escape uncertainty. You relocate it into a part of the system you can at least observe.
Trade-off 3: Novelty signals are easy to get wrong in silly ways
Embedding distance is a tempting novelty detector because it’s simple. It’s also a blunt instrument. Two tasks can be semantically similar but differ crucially in feasibility or stakes. Conversely, a task can be lexically unfamiliar but structurally easy.
So novelty should be treated as a warning light, not a verdict. High novelty should trigger conservatism, not panic.
There’s also the possibility of adversarial behaviour or distribution shift where novelty metrics become unreliable. If the router is used in a setting where inputs are strategic, you need to treat the router itself as part of the attack surface.
Trade-off 4: Δ-hat can collapse into a proxy
Δ-hat is the most important number in the diagram and also the easiest one to fool yourself about.
If you don’t have a good outcome metric, Δ-hat will learn to predict whatever your evaluation setup rewards, which may not be what you intended. If the only feedback is user satisfaction, you can accidentally train a router that escalates whenever the user seems emotionally invested, which is an impressively human failure mode.
Even with a good metric, Δ-hat is a counterfactual prediction: how much better would the answer be if we escalated? Those are hard to learn cleanly. The minimal version can use proxies, but the serious version needs careful grounding.
How you should evaluate this
If you evaluate this as “overall accuracy”, you’ll miss the point. The router changes coverage. You need metrics that respect abstention and conditional performance.
A better evaluation frame looks like:
Coverage: fraction of tasks accepted or escalated (i.e. not refused).
Conditional quality: quality on accepted tasks, quality on escalated tasks.
Cost: compute spent per task, latency, tool calls, etc.
Selective performance: the trade-off curve between quality and coverage as you adjust thresholds.
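All four of these fall out of a decision log. A sketch assuming each record carries the action taken, a quality score (absent for refusals), and a cost:

```python
def selective_metrics(records):
    """records: list of (action, quality, cost); quality is None on refusals."""
    answered = [r for r in records if r[0] != "refuse"]
    coverage = len(answered) / len(records)
    cond_quality = {}
    for action in ("accept", "escalate"):
        qs = [q for a, q, _ in answered if a == action]
        cond_quality[action] = sum(qs) / len(qs) if qs else None
    mean_cost = sum(c for _, _, c in records) / len(records)
    return coverage, cond_quality, mean_cost
```

Sweeping the router's thresholds and recomputing these gives you the quality-versus-coverage curve directly.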
In ML, this is close to how people evaluate selective prediction or classification with a reject option: you don’t just ask whether the model is right, you ask how well it chooses when to answer.
The overall claim is not that routing makes systems perfect. It’s that routing gives you a principled way to trade off cost, coverage, and reliability, and to inspect the policy that’s making those trade-offs rather than leaving them implicit.
Ending with a Question
The most important box in the diagram is the one I’m least confident about: Δ-hat, the estimated benefit of escalation.
On paper it’s clean. The router is trying to predict how much better the outcome will be if we wake up System-2, compared to staying on the fast path. In practice, the moment you try to make that concrete, you run into a mess of definitions.
Better in what sense?
If you’re doing QA, ‘better’ might mean correctness. If you’re doing planning, it might mean success rate downstream. If you’re doing analysis, it might mean coherence and coverage of key considerations. If you’re doing anything safety-adjacent, it might mean avoided harm rather than raw utility. And in many real systems it’s some weighted mixture, which is where all the arguments start.
So here’s the open question I want to hand to the LessWrong audience:
How should we define and measure the value of escalation in a way that is both decision-relevant and not totally gameable?
I don’t mean this as a rhetorical flourish. I mean it as the crux. If Δ-hat is poorly grounded, the router becomes a very expensive way to encode your prejudices.
A few failure modes feel particularly sharp:
If your proxy for ‘better’ is user satisfaction, you risk teaching the system to escalate whenever the user seems emotionally invested, or whenever the safe answer would be short and the persuasive answer would be long. That’s not exactly what we meant by rationality.
If your proxy is model self-confidence, you can get the opposite pathology: the system escalates whenever it feels uncertain, including cases where the correct action is to refuse because the task is ill-posed. You’ve just built a machine that spends extra compute on nonsense.
If your proxy is some automatic scoring heuristic, you risk Goodharting it. The router learns to optimise for the score, not for the thing you actually care about, and you end up with a beautifully tuned escalation policy that’s useless in the real world.
Even if you pick a good metric, there’s a deeper issue: Δ-hat is a counterfactual prediction. It is predicting the difference between two worlds: the world where we escalate and the world where we don’t. Learning counterfactuals is hard, and doing it under distribution shift is harder.
So I’m curious about two very specific things:
First, what is the cleanest way to define Δ in a general-purpose system without smuggling in a pile of hidden assumptions? If you had to write a definition that would survive a hostile reading, what would it be?
Second, what would you trust as training signal? Human judgements are expensive and biased. Automated signals are cheap and gameable. Task success rates are clean but domain-specific. There’s probably no perfect answer, but there might be a less bad one.
If you take the router seriously, Δ-hat becomes the point where epistemology meets engineering. You’re formalising what it means for extra cognition to be worth the cost. That seems like exactly the kind of question LW people have strong opinions about, and I mean that as a compliment.
If you have a proposed definition of Δ, or a training setup that avoids the worst Goodhart traps, I’d love to hear it. If your instinct is that the whole framing is flawed and value-of-computation shouldn’t be compressed into a single scalar, I’d also love to hear that. A router can take richer signals than one number. The diagram is not sacred.
At minimum, I want to come away with a clearer story for what the router should be optimising for, and how to know when it’s doing the right thing rather than merely doing a confident thing.
We keep building AI systems that treat “produce an answer” as the default response to “receive an input”. This is the computational equivalent of replying to every email the moment it hits your inbox, including the ones that are obviously spam, wildly outside your job description, or written in a language you don’t speak.
It works, right up until it doesn’t.
When it doesn’t, you don’t just get wrong answers. You get a specific kind of wrongness: confident answers to the wrong question, expensive reasoning on tasks that were trivial, and a sort of polite improvisation when the task was fundamentally not doable in the first place. The system never pauses to ask a very basic, very important question:
‘Is it worth thinking about this at all?’
This post is about putting that question upstream, as a small, explicit step in the pipeline that decides whether to (a) handle the input cheaply, (b) pay for the expensive black-box model to really think, or (c) refuse and log the case rather than guessing. If you like rationality framings: it’s an attempt to make attention allocation and abstention concrete in an AI architecture, instead of leaving them as accidental side effects of whatever the downstream model feels like doing today.
I’m not claiming this is the One True Way to build systems. It’s more like a missing primitive that seems oddly absent given how much we talk about compute constraints, calibration, and epistemic humility. If you already do something like this, I’m genuinely interested in how you formalise it, because the devil is very much in the details and the details are where most obvious ideas go to die.
Let me start with a failure mode I’ve seen repeatedly in modern AI pipelines, and I’ll deliberately keep it unromantic.
Imagine you have an application that takes user requests and routes them to an AI model. The model might be a large language model, or a multimodal model, or an agentic system that can call tools. The specifics do not matter. What matters is that, in the architecture, this downstream model is treated as a black box that you pay to run.
I’m going to call that downstream model System-2. Not because I think it has human-style System-2 cognition, but because it plays a similar role: it is the slow, expensive, deliberative mode that gets invoked when you want the system to really think. In practice, it’s any model or agent you treat as the heavyweight step in your pipeline.
Now here’s the key: most systems implicitly assume System-2 should engage by default.
A Harder Look at the Problem
That default is so common that it barely looks like a choice. An input arrives, the expensive model wakes up, something comes out, everyone pretends that was the only sensible way the story could have gone.
Here’s the failure mode: the system behaves as if every prompt deserves the same kind of attention.
A decent chunk of real inputs is boring in one of three ways:

- Trivial: the task is easy enough that a cheap path would handle it fine.
- Ill-posed: the request is ambiguous or incoherent, so "really thinking" just produces a confident answer to a different question.
- Out-of-scope: the task is outside what the system can do reliably, so any answer is improvisation.
Yet the default pipeline treats all three as identical. It escalates to the heavyweight model anyway.
This creates a particular pattern of bad behaviour.
First, you get the obvious waste: expensive reasoning spent on tasks that did not merit it. At scale, this is not a rounding error. It shows up as latency, cost, and the slow creep of why is our system so sluggish? in places where the inputs themselves were not the problem.
Second, you get something more epistemically corrosive: when the input is ill-posed, the system often does what humans do under cognitive pressure. It quietly replaces the hard question with a nearby easier one, then answers that instead; the Lincoln Lawyer would call it nothing more than a straw man. The output can be coherent, even helpful-looking, while being unanchored to the actual request. If you've ever watched a model give a confident explanation that is technically a response to some question, just not the one you asked, you've seen this.
Third, out-of-scope inputs produce the worst combination: the system tries, fails, and returns a confident-looking artefact anyway, because the architecture doesn’t give it a dignified way to abstain. Refusal, when it exists, is usually bolted on downstream as a policy constraint or a safety filter, not treated as a rational control action. So you either get brittle ‘no’ responses for the wrong reasons, or you get ‘yes-ish’ responses with no warning label.
Notice what’s missing here: it’s not just model capability. It’s a decision procedure. In a rational agent, ‘should I spend effort on this?’ is part of the algorithm. In many AI stacks, it’s an accident.
You can try to patch this with better prompting, bigger models, post-hoc validators, confidence scores, guardrails, tool checks, retrieval, caching. All useful. But they mostly share one feature: they operate after you’ve already committed to engaging the heavyweight system. You paid the compute, you generated the narrative, and then you try to clean it up.
So, what would it look like to make engagement a decision?
The Missing Primitive
What I want to explore is a cleaner intervention: put a small metacognitive step before the expensive model. Its job is not to solve the task. Its job is to decide, explicitly, whether the expensive model should even be invoked, and to do so in a way you can inspect and iterate on.
If that sounds like obvious engineering hygiene, good. The surprise is how rarely it is treated as a first-class part of the design.
Here’s the idea in plain terms.
You have some downstream model or agent that does the heavy lifting. I called it System-2 earlier, but I’ll restate the definition so it stays crisp: in this post, System-2 is any downstream AI component you treat as the expensive black box. It might be a large model, a tool-using agent, a chain-of-thought style reasoner, or just the part of your pipeline you only want to run when it’s worth paying for. The name is borrowed from the Kahneman metaphor purely to indicate role (slow/expensive vs fast/cheap), not to make claims about the model having two modes of cognition.
Now insert one lightweight component upstream. Call it a router, gate, or metacognitive layer. I’ll use metacognitive router because it describes the intent: it’s making a decision about whether to think.
Its output is not an answer. It’s an action.
There are three actions that matter:

- Accept: handle the input on the cheap path, without invoking the expensive model.
- Escalate: pay for System-2 to really think.
- Refuse: decline to answer, and log the case rather than guessing.
This triage framing is doing more work than it seems.
It makes explicit something that is currently implicit and messy. Every system already has an engagement policy, it’s just usually buried inside the downstream black box, or spread across ad-hoc rules, or expressed indirectly as failure. The metacognitive router pulls that policy into the open, where you can actually argue about it.
It also forces you to state what you care about. Are you optimising for coverage, cost, latency, safety, user trust, or some weighted mixture? A router can encode those trade-offs directly. Without it, you get whatever behaviour falls out of the downstream model’s incentives and training distribution, which is a bit like letting your calendar plan your life.
At this point a reasonable objection is: isn’t this just a fancy name for a classifier that says hard vs easy?
Not quite. Difficulty is part of it, but the decision you actually want is closer to: what is the expected value of spending expensive compute here? Sometimes a task is ‘hard’ in the sense of requiring reasoning, and yet escalation is still not worth it because the request is ill-posed. Sometimes a task is ‘easy’ and escalation is still worth it because the stakes are high. Sometimes the right action is refusal because the system cannot do the thing reliably at all, and any attempt would just create confident sludge.
If this feels like smuggling in rationality language, that’s because it is. The router is basically an explicit attention allocator: it decides where expensive cognition goes, and it records why.
The Diagram: What Actually Sits Where
The easiest way to see what I mean is to draw it as a simple flow chart. Not a grand theory of mind. More like plumbing.
Start at the top: Raw Input x. A user request, an agent subtask, a tool result you need to interpret. Whatever arrives at the boundary of your system.
The first thing that happens is not reasoning. It’s encoding.
You run x through a TaskEncoder (Embedding e). In practice this might be an embedding model, a small language model producing a structured summary, or a hand-engineered feature extractor if you live in a world where those still exist. The output, e, isn’t an answer. It’s a representation of the task that makes the next steps cheap and comparable across inputs.
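To make the encoding step concrete, here is a minimal sketch. The hashing "embedder" is a toy stand-in for whatever real embedding model you would call in practice; only the interface matters here, raw text in, fixed-size vector e out.

```python
import hashlib

import numpy as np


def encode_task(x: str, dim: int = 64) -> np.ndarray:
    """Toy TaskEncoder: map raw input x to a fixed-size vector e.

    A real system would call an embedding model here; this hashing
    trick is only a runnable stand-in with the same interface.
    """
    vec = np.zeros(dim)
    for token in x.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


e = encode_task("Summarise this 2-paragraph text.")
```

Everything downstream of this point operates on e, never on the raw input, which is what keeps the router cheap.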
From e, the diagram fans out into four upstream modules. These are intentionally lightweight, and each one produces a signal that is about the task rather than its solution.
First, there’s the Heuristic Module (conf, comp, feas). This is where the router forms quick, inspectable estimates like:
Second, there’s a Novelty Detector (novelty). This is your out-of-distribution / unfamiliarity signal: how much this input resembles what the system has seen and succeeded on before. It can be crude (embedding distance to past tasks) or sophisticated. The point is: novelty exists, and ignoring it is how systems end up confidently improvising in unfamiliar territory.
Third, there’s the DeltaRegressor (Δ-hat). This is doing one specific job: estimate the expected benefit of escalating to the expensive model. Δ-hat is meant to capture how much better you expect the outcome to be if you pay for System-2, compared to staying on the fast path. The benefit can be measured in whatever currency your system actually cares about (accuracy, utility, avoided harm, reduced uncertainty), but the decision-relevant part is: is escalation worth it here?
Fourth, there’s the Experience Buffer (for replay). This stores past tasks and outcomes so you can calibrate or retrain the router over time, rather than freezing the routing policy as a set of thresholds. In other words: you want the router to get less wrong as it sees more of the world.
All of these signals feed into the MetaController (ACCEPT / ESCALATE / REFUSE), which is the only box in the middle that is allowed to make a decision. The controller can start off rule-based (thresholds, if-statements, conservative handling under high novelty), and later become learned. Either way, its output is an action, not an answer:

- ACCEPT: handle the input on the cheap path.
- ESCALATE: invoke the expensive System-2 model.
- REFUSE: decline to answer, and write the case to the RefusalBuffer.
The RefusalBuffer matters because refusal should not be a trash can. It’s a curated dataset of edge cases: ill-posed requests, out-of-scope demands, and genuinely novel inputs. The right response to many of these is not to guess, but also not to forget them.
That’s why the diagram includes Deferred Reconsideration (Periodic): the system can periodically revisit refused cases as it accumulates experience, updates its policies, or improves its downstream capabilities. Refusal becomes a way to say not yet, rather than never.
If you want a one-sentence summary of the diagram, it’s this: we separate the decision to spend compute from the computation itself, and we make that decision explicit enough that it can be inspected, criticised, and improved.
From Conceptualising to Building
At this point, the idea is either obvious in a satisfying way, or obvious in a suspicious way. So let’s cash it out into a concrete design.
I’ll sketch two versions. The first is the minimal thing that is genuinely implementable without turning your codebase into a dissertation. The second is the version where you start taking the router seriously as a learned component, and you try to ground the decision in outcomes rather than gut feel.
The minimal version: rules first, humility always
You already have the building blocks from the diagram: an embedding for the task, a few cheap meta-signals, and a controller that chooses Accept / Escalate / Refuse.
In the minimal version, the MetaController is just a small set of thresholds. The main trick is choosing thresholds that encode the behaviour you actually want, instead of the behaviour that feels brave in a demo.
A reasonable first policy looks like this:

- Accept when novelty is low, feasibility is high, and Δ-hat is small: the fast path will do.
- Escalate when feasibility is adequate and Δ-hat is large: expensive thinking is expected to pay for itself.
- Refuse when feasibility is low, or when novelty is high enough that none of your estimates deserve much trust.
Notice what’s not in that policy: it doesn’t say “escalate whenever confidence is low”. That’s a tempting rule, but it can be backwards. Low confidence might mean the task is hard but feasible, or it might mean the task is incoherent. Escalating incoherence is how you get expensive nonsense.
In practice, you’ll end up with a handful of if-statements that look exhausting on paper and save you a lot of grief in production. This is good. Exhausting is a feature here.
If you want a small dose of cheekiness: a lot of current systems are built as if compute is free and humility is expensive. This flips that.
Where do the signals come from?
The minimal version still needs you to compute feasibility, novelty, and Δ-hat in some way that isn’t pure hand-waving.
This is an important theme: the router doesn’t require mystical accuracy. It requires useful calibration, because it’s making a routing decision under uncertainty, not solving the original task.
The more serious version: learn Δ-hat from outcomes
Once the minimal system exists, you can make it much less arbitrary by learning parts of it. The natural thing to learn is Δ-hat: predicted benefit of escalation.
Here’s the clean training story.
For a subset of tasks, you run both paths:

- the cheap path (accept-level handling), and
- the expensive path (full System-2 engagement).
Then you score them using your chosen metric. That metric depends on your domain: it might be human preference judgements, task success rates, correctness on eval sets, downstream utility, or a bundle of signals.
The difference between the expensive score and the cheap score is your empirical Δ for that task. You train the DeltaRegressor to predict that difference from the upstream representation and meta-features.
This is nice because it matches the decision you actually care about. You’re not training a model to be generally accurate. You’re training a model to predict when paying for expensive compute will change the outcome in a meaningful way.
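Here is a minimal sketch of that training story, using ridge regression over task embeddings as a stand-in for whatever model class you'd actually use for the DeltaRegressor. The synthetic data (where escalation helps only when the first feature is large) exists purely to make the example runnable.

```python
import numpy as np


def fit_delta_regressor(E, cheap_scores, expensive_scores, reg=1.0):
    """Fit ridge regression predicting empirical Δ = expensive - cheap
    from task embeddings E (n x d). A stand-in for a real model class."""
    y = np.asarray(expensive_scores) - np.asarray(cheap_scores)
    X = np.hstack([E, np.ones((len(E), 1))])  # add a bias column
    w = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y)
    return w


def predict_delta(w, e):
    """Δ-hat for a new task embedding e."""
    return float(np.append(e, 1.0) @ w)


# Synthetic paired outcomes: escalation pays off iff feature 0 is large.
rng = np.random.default_rng(1)
E = rng.normal(size=(200, 4))
delta_true = 0.5 * E[:, 0]
cheap = rng.normal(scale=0.1, size=200)
expensive = cheap + delta_true
w = fit_delta_regressor(E, cheap, expensive)
```

The regressor never sees either answer, only the task representation and the score difference, which is what makes it cheap enough to run before deciding.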
You can also calibrate feasibility from data. Refusals that turned out to be genuinely impossible should raise feasibility thresholds. Refusals that, in hindsight, were doable can be used as negative examples, so the router becomes less timid.
The bit that makes refusal useful rather than embarrassing
The RefusalBuffer + deferred reconsideration loop in the diagram is not cosmetic.
If you treat refusal as a first-class outcome, you can do something that’s hard to do when everything is forced through System-2: you can accumulate a structured dataset of the system’s boundary.
Refused cases are often the best training data you can ask for, because they cluster around ambiguity, novelty, and missing constraints. They tell you what users ask for that your system cannot yet handle reliably. They also tell you what your router is uncertain about.
With deferred reconsideration, you can periodically replay refused tasks through an updated router, or through a stronger System-2, or with new tools. Some will remain refusals. Some will become escalations. Some will become cheap accepts because the system has effectively been “vaccinated” against a class of previously novel tasks.
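A deferred-reconsideration pass can be sketched in a few lines. Here `resignal` is a hypothetical hook that recomputes meta-signals for a stored task under the updated system, and the toy one-signal router exists only to make the demo run.

```python
def reconsider(refusal_buffer, router, resignal):
    """Periodically replay refused tasks through the current router.

    Tasks that no longer warrant refusal are reopened (with their new
    action); everything else stays in the buffer for a future pass."""
    still_refused, reopened = [], []
    for task in refusal_buffer:
        action = router(resignal(task))
        if action == "REFUSE":
            still_refused.append((task, action))
        else:
            reopened.append((task, action))
    return still_refused, reopened


# Demo: feasibility for task-a improved after a system update.
buffer = ["task-a", "task-b"]
new_feas = {"task-a": 0.8, "task-b": 0.1}
toy_router = lambda feas: "ESCALATE" if feas >= 0.5 else "REFUSE"
still, reopened = reconsider(buffer, toy_router, new_feas.get)
```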
That’s the pragmatic payoff: refusal stops being a dead end and becomes a mechanism for controlled expansion of competence.
A concrete example (so this doesn’t stay abstract)
Suppose your system receives: “Give me a confident answer about X, and don’t mention uncertainty.”
A naive pipeline escalates to System-2 and produces a confident answer, because it was asked to. Your router sees low feasibility (the request is pushing the system towards epistemically bad behaviour), and high risk of harm to truthfulness. The rational action is refusal or a reframe request, not escalation.
Or suppose the input is: “Summarise this 2-paragraph text.”
Novelty low, feasibility high, Δ-hat low. Fast accept. Don’t wake up the expensive model. Nobody gets a medal for burning compute on summarising two paragraphs.
This is the kind of thing the router is for: not intelligence, but allocation.
Claims, Trade-Offs and Ways This Can Fail
Yes, you can absolutely over-refuse and build a system that is safely useless, which is an accomplishment of sorts but not the one we’re aiming for.
So, if you’re reading this with the mild suspicion that I’m reinventing a fancy
ifstatement, that’s healthy. The whole point of making the routing policy explicit is that you can interrogate it like any other decision procedure: what does it buy you, what does it cost you, and how does it break?Here are the claims I think this architecture can support, stated in the least heroic way I can manage.
Claim 1: You reduce wasted compute without gambling on quality
This one is almost tautological. If you introduce an explicit fast accept path, and your router is even moderately calibrated, you will stop paying System-2 prices for tasks that don’t need System-2 effort.
The interesting part is not saving compute on obviously trivial queries. It’s saving compute on the tasks where escalation produces a longer answer rather than a better one. There are many of these. They’re just harder to notice because they look industrious.
The risk, of course, is rejecting tasks that would have benefited from escalation. That’s why the relevant metric isn’t raw accuracy, it’s accuracy conditional on acceptance and escalation, plus coverage. More on evaluation in a moment.
Claim 2: You get a cleaner form of epistemic humility
Most current pipelines have one epistemic mode: answer. Sometimes they wrap that answer in cautionary language. Sometimes they refuse for policy reasons. But they rarely have an explicit, inspectable mechanism for abstention based on feasibility, novelty, or expected value of computation.
Treating Refuse as a rational action changes the system’s failure modes. It turns some hallucinations into clarifying questions, and some confident nonsense into logged edge cases. That’s not magical. It’s just giving the system a way to say “I don’t know” that isn’t socially awkward.
There’s a subtle benefit here that I care about more than tone: when refusal is allowed, you can design the system to be conservative under novelty. That is exactly when black-box models tend to be at their most eloquently wrong.
Claim 3: You gain auditability, which matters even when you’re wrong
A router produces artefacts: meta-features, novelty scores, Δ-hat estimates, and a decision. That gives you something to inspect after the fact.
When the system escalates and fails, you can ask: did we mis-estimate feasibility, mis-read novelty, overestimate Δ-hat, or set thresholds poorly? When the system refuses and it turns out it could have answered, you can ask the symmetrical question.
Without an explicit router, these post-mortems are mushy. Everything disappears into the downstream model’s internal process, and your only lever is to change the model or add more patches downstream.
This separation of concerns is, in my view, one of the most compelling reasons to do it. It turns a vague behavioural complaint into a concrete policy bug.
Now the trade-offs. Because yes, you can make this worse.
Trade-off 1: Coverage vs reliability is no longer hidden
A system that always answers maximises coverage. It does not maximise truthfulness. A system that refuses too readily maximises something like safety but becomes useless.
The router makes this trade-off explicit. That’s good, but it also means you have to own the decision.
If you tune the router for high reliability, you will refuse more. If you tune it for high coverage, you will escalate more and potentially hallucinate more. There isn’t a free lunch. The router just forces you to pick a lunch deliberately.
Trade-off 2: The router becomes a new thing that can be wrong
This is the obvious critique, and it’s valid. You’re adding a component that can misroute tasks.
There are a few distinct failure modes:

- False accepts: the cheap path handles something that needed real thought, and nobody notices until the output is wrong.
- False escalations: you pay System-2 prices for tasks the fast path would have handled, eroding the cost argument.
- False refusals: the router declines tasks the system could have done, eroding coverage and trust.
- Calibration drift: the router was tuned on yesterday's distribution and quietly degrades as inputs shift.
In other words, you don’t escape uncertainty. You relocate it into a part of the system you can at least observe.
Trade-off 3: Novelty signals are easy to get wrong in silly ways
Embedding distance is a tempting novelty detector because it’s simple. It’s also a blunt instrument. Two tasks can be semantically similar but differ crucially in feasibility or stakes. Conversely, a task can be lexically unfamiliar but structurally easy.
So novelty should be treated as a warning light, not a verdict. High novelty should trigger conservatism, not panic.
There’s also the possibility of adversarial behaviour or distribution shift where novelty metrics become unreliable. If the router is used in a setting where inputs are strategic, you need to treat the router itself as part of the attack surface.
Trade-off 4: Δ-hat can collapse into a proxy
Δ-hat is the most important number in the diagram and also the easiest one to fool yourself about.
If you don’t have a good outcome metric, Δ-hat will learn to predict whatever your evaluation setup rewards, which may not be what you intended. If the only feedback is user satisfaction, you can accidentally train a router that escalates whenever the user seems emotionally invested, which is an impressively human failure mode.
Even with a good metric, Δ-hat is a counterfactual prediction: how much better would the answer be if we escalated? Those are hard to learn cleanly. The minimal version can use proxies, but the serious version needs careful grounding.
How you should evaluate this
If you evaluate this as “overall accuracy”, you’ll miss the point. The router changes coverage. You need metrics that respect abstention and conditional performance.
A better evaluation frame looks like:

- Coverage: what fraction of inputs the system answers at all.
- Conditional performance: accuracy (or utility) on accepted tasks, and separately on escalated tasks.
- Cost: compute spent per task, and per successful task.
- Refusal quality: how often refused tasks were genuinely infeasible, versus answerable in hindsight.
In ML, this is close to how people evaluate selective prediction or classification with a reject option: you don’t just ask whether the model is right, you ask how well it chooses when to answer.
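In that selective-prediction spirit, the two core numbers, coverage and conditional accuracy, are trivial to compute once decisions and correctness labels are logged per task; this sketch assumes exactly that logging.

```python
def selective_metrics(decisions, correct):
    """Coverage and conditional accuracy for a router with abstention.

    decisions: per-task actions, "ACCEPT" / "ESCALATE" / "REFUSE"
    correct:   per-task bools; only meaningful where the system answered
    """
    answered = [c for d, c in zip(decisions, correct) if d != "REFUSE"]
    coverage = len(answered) / len(decisions)
    cond_acc = sum(answered) / len(answered) if answered else float("nan")
    return coverage, cond_acc


cov, acc = selective_metrics(
    ["ACCEPT", "ESCALATE", "REFUSE", "ACCEPT"],
    [True, True, False, False],
)
```

Splitting the answered set further by action (accepted vs escalated) gives the per-path conditional accuracies the text asks for.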
The overall claim is not that routing makes systems perfect. It’s that routing gives you a principled way to trade off cost, coverage, and reliability, and to inspect the policy that’s making those trade-offs rather than leaving them implicit.
Ending with a Question
The most important box in the diagram is the one I’m least confident about: Δ-hat, the estimated benefit of escalation.
On paper it’s clean. The router is trying to predict how much better the outcome will be if we wake up System-2, compared to staying on the fast path. In practice, the moment you try to make that concrete, you run into a mess of definitions.
Better in what sense?
If you’re doing QA, ‘better’ might mean correctness. If you’re doing planning, it might mean success rate downstream. If you’re doing analysis, it might mean coherence and coverage of key considerations. If you’re doing anything safety-adjacent, it might mean avoided harm rather than raw utility. And in many real systems it’s some weighted mixture, which is where all the arguments start.
So here’s the open question I want to hand to the LessWrong audience:
How should we define and measure the value of escalation in a way that is both decision-relevant and not totally gameable?
I don’t mean this as a rhetorical flourish. I mean it as the crux. If Δ-hat is poorly grounded, the router becomes a very expensive way to encode your prejudices.
A few failure modes feel particularly sharp:

- Length-as-quality: if the metric rewards thoroughness, Δ-hat learns that escalation "helps" whenever it produces more words.
- Satisfaction Goodharting: if the signal is user approval, the router learns to escalate whenever the user sounds invested, not when escalation changes the outcome.
- Domain leakage: a Δ definition that works for QA silently misprices escalation for planning or analysis tasks.
Even if you pick a good metric, there’s a deeper issue: Δ-hat is a counterfactual prediction. It is predicting the difference between two worlds: the world where we escalate and the world where we don’t. Learning counterfactuals is hard, and doing it under distribution shift is harder.
So I’m curious about two very specific things:
First, what is the cleanest way to define Δ in a general-purpose system without smuggling in a pile of hidden assumptions? If you had to write a definition that would survive a hostile reading, what would it be?
Second, what would you trust as training signal? Human judgements are expensive and biased. Automated signals are cheap and gameable. Task success rates are clean but domain-specific. There’s probably no perfect answer, but there might be a less bad one.
If you take the router seriously, Δ-hat becomes the point where epistemology meets engineering. You’re formalising what it means for extra cognition to be worth the cost. That seems like exactly the kind of question LW people have strong opinions about, and I mean that as a compliment.
If you have a proposed definition of Δ, or a training setup that avoids the worst Goodhart traps, I’d love to hear it. If your instinct is that the whole framing is flawed and value-of-computation shouldn’t be compressed into a single scalar, I’d also love to hear that. A router can take richer signals than one number. The diagram is not sacred.
At minimum, I want to come away with a clearer story for what the router should be optimising for, and how to know when it’s doing the right thing rather than merely doing a confident thing.