TL;DR
I’ve been experimenting with a model-agnostic inference-time control layer that uses the model’s own confidence signal to decide whether it should answer, ask a clarifying question, gather more information, or defer. The goal is to reduce high-risk outputs — especially hallucinations — without modifying model weights and without adding fine-tuned supervision.
This post is a high-level overview of the idea and some initial observations.
I’m not publishing implementation details or equations, only the conceptual framing.
Mission
Current LLMs have two well-known weaknesses:
- They generate outputs even when they are deeply uncertain.
- They lack an internal mechanism to “decide not to answer.”
This creates obvious problems in enterprise, legal, and safety-critical contexts.
A model can be overconfident and wrong, and there is no built-in control layer to say:
“I should gather more information first”
or
“It is safer to defer.”
Fine-tuning helps, but it doesn’t address the real issue:
on any given question, the model has no inference-time decision rule about whether it should answer in the first place.
That’s the gap I tried to address.
What I Built
I built a model-agnostic inference wrapper that:
- reads the model’s confidence (from logits or other calibrated scoring),
- computes the expected value of each possible cognitive action:
  - answering
  - asking a clarifying question
  - gathering more information
  - deferring
- and executes only the action with positive expected value, given a user-defined cost model for mistakes, follow-up questions, or information gathering.
This is not an agent.
It does not optimize long-horizon rewards.
It does not pursue goals.
It is simply a risk-aware output filter that makes a local decision at inference time.
Think of it as an epistemic safety valve.
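To make the framing concrete, here is a deliberately simplified sketch of the general shape of such a gate. This is not my implementation (I’m not sharing that); the action set, the `CostModel` fields, and every number below are placeholder assumptions chosen only to illustrate the decision rule.

```python
# Illustrative sketch only -- all names and numbers are placeholder assumptions.
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    CLARIFY = "ask_clarifying_question"
    GATHER = "gather_information"
    DEFER = "defer"


@dataclass
class CostModel:
    correct_answer: float = 5.0   # value of a correct answer
    wrong_answer: float = 10.0    # cost of answering incorrectly
    clarify: float = 1.0          # cost of asking a follow-up question
    gather: float = 2.0           # cost of gathering more information
    defer: float = 3.0            # cost of not answering at all


def expected_values(confidence: float, costs: CostModel) -> dict:
    """Score each cognitive action under a simple expected-value model.

    `confidence` is assumed to be a calibrated probability that the
    model's answer would be correct (e.g., derived from logits).
    """
    return {
        Action.ANSWER: confidence * costs.correct_answer
        - (1 - confidence) * costs.wrong_answer,
        # The safe actions are modeled here only by their fixed costs;
        # a fuller treatment would also estimate their expected benefit.
        Action.CLARIFY: -costs.clarify,
        Action.GATHER: -costs.gather,
        Action.DEFER: -costs.defer,
    }


def choose_action(confidence: float, costs: CostModel) -> Action:
    """Answer only if answering has positive expected value;
    otherwise fall back to the cheapest safe action."""
    evs = expected_values(confidence, costs)
    if evs[Action.ANSWER] > 0:
        return Action.ANSWER
    safe = {a: v for a, v in evs.items() if a is not Action.ANSWER}
    return max(safe, key=safe.get)
```

The point of the sketch is only the structure: a per-query, local decision over a small set of cognitive actions, driven by a confidence signal and a user-defined cost model.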
Key Idea
The central question is:
“Is answering right now worth the expected risk?”
If the expected value of answering is negative, the wrapper doesn’t let the model answer.
Instead, it chooses the next-best safe action (ask, gather, defer).
This creates a dynamic form of selective prediction, where the model speaks only when doing so is justified by its own confidence signal.
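Continuing the placeholder sketch above, here is how such a gate behaves at a few confidence levels (again, purely illustrative numbers, not measurements):

```python
costs = CostModel()

for confidence in (0.9, 0.6, 0.3):
    print(confidence, choose_action(confidence, costs).value)

# With the placeholder costs, answering has positive expected value only when
# confidence * 5.0 > (1 - confidence) * 10.0, i.e. confidence > 2/3:
#   0.9 -> answer
#   0.6 -> ask_clarifying_question  (cheapest safe fallback)
#   0.3 -> ask_clarifying_question
```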
What I Observed
Under the wrapper, the model behaved more predictably: it deferred on genuinely low-confidence items, and the quality of the cases it did answer improved, all without retraining.
Why This Might Matter for Alignment / Safety
This approach is interesting (at least to me) because:
- It reduces risk without modifying the model.
- It scales naturally: stronger models → better calibration → better gating.
- It forces the model to acknowledge uncertainty at inference time.
- It offers a structured alternative to “just answer everything.”
- It’s compatible with existing model APIs.
- It works even on very small models.
- Epistemic control layers scale with the quality of the underlying world model.
As models get deployed in higher-stakes settings, I suspect inference-time epistemic control may become as important as training-time alignment.
Why I’m Posting This
This is an area that feels underexplored.
Most work on hallucination reduction focuses on:
- fine-tuning
- retrieval
- supervised guardrails
- confidence heuristics
…but not decision-theoretic control of the output pathway.
My goal with this post is simply to surface the idea at a conceptual level and see if anyone else has explored similar inference-time mechanisms or has thoughts about the direction.
I’m happy to discuss the high-level framing, but I won’t be sharing equations or implementation details.
Questions for Readers
- Has anyone tried selective prediction / value-of-information (VOI)-style gating for LLM inference before?
- Are there known risks or pitfalls with decision-theoretic output filters?
- How much demand is there for inference-time reliability layers in practice?
- Are there theoretical lines of inquiry worth exploring here (calibration, selective abstention, Bayesian filtering, etc.)?