TL;DR
I’ve been experimenting with a model-agnostic inference-time control layer that uses the model’s own confidence signal to decide whether it should answer, ask a clarifying question, gather more information, or defer. The goal is to reduce high-risk outputs — especially hallucinations — without modifying model weights and without adding fine-tuned supervision.
This post is a high-level overview of the idea and some initial observations. I’m not publishing implementation details or equations, only the conceptual framing.
Mission
Current LLMs have two well-known weaknesses:
- They generate outputs even when they are deeply uncertain.
- They lack an internal mechanism to “decide not to answer.”
This creates obvious problems in enterprise, legal, and safety-critical contexts. A model can be overconfident and wrong, and there is no built-in control layer to say:
“I should gather more information first” or “It is safer to defer.”
Fine-tuning helps, but it doesn’t address the real issue: on any given question, the model has no inference-time decision rule about whether it should answer in the first place.
That’s the gap I tried to address.
What I Built
I built a model-agnostic inference wrapper that:
- reads the model’s confidence (from logits or other calibrated scoring),
- computes the expected value of each possible cognitive action:
  - answering
  - asking a clarifying question
  - gathering more information
  - deferring
- and executes only the action with positive expected value, given a user-defined cost model for mistakes, follow-up questions, or gathering (a rough sketch of this loop follows below).
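To make the framing concrete, here is a minimal, purely illustrative sketch of this kind of expected-value gate. It is not my implementation: the action set, the flat costs for the fallback actions, and the `choose_action` helper are hypothetical placeholders for whatever cost model a deployment actually defines.

```python
# Purely illustrative sketch -- not the actual wrapper described in this post.
# Assumes a calibrated confidence p_correct in [0, 1] for the candidate answer;
# every number and name below is a hypothetical placeholder.

def choose_action(p_correct: float,
                  value_correct: float = 1.0,
                  cost_wrong: float = 5.0,
                  cost_clarify: float = 0.3,
                  cost_gather: float = 0.5,
                  cost_defer: float = 1.0) -> str:
    """Return the cognitive action with the highest expected value."""
    expected_values = {
        # Answering pays off when right and is penalized when wrong.
        "answer": p_correct * value_correct - (1.0 - p_correct) * cost_wrong,
        # The fallbacks are modeled as flat costs here; a fuller version would
        # also estimate how much each one is expected to improve confidence.
        "ask_clarifying": -cost_clarify,
        "gather_info": -cost_gather,
        "defer": -cost_defer,
    }
    return max(expected_values, key=expected_values.get)


# Example: high confidence answers; low confidence falls back to a safer action.
print(choose_action(0.95))  # -> "answer"
print(choose_action(0.40))  # -> "ask_clarifying" (cheapest fallback here)
```

The numbers don’t matter; the shape of the decision does: compare the expected values of the available actions and let the model speak only when speaking wins.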
This is not an agent. It does not optimize long-horizon rewards. It does not pursue goals. It is simply a risk-aware output filter that makes a local decision at inference time.
Think of it as an epistemic safety valve.
Key Idea
The central question is:
“Is answering right now worth the expected risk?”
If the expected value of answering is negative, the wrapper doesn’t let the model answer. Instead, it chooses the next-best safe action (ask, gather, defer).
This creates a dynamic form of selective prediction, where the model speaks only when doing so is justified by its own confidence signal.
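For readers who prefer it written out, one generic way to state that condition (not my exact formulation) is:

$$
\text{answer} \iff \mathbb{E}[V(\text{answer})] > 0
\ \text{and}\ \mathbb{E}[V(\text{answer})] \ge \max_{a \in \{\text{ask},\,\text{gather},\,\text{defer}\}} \mathbb{E}[V(a)],
$$

where, for example, $\mathbb{E}[V(\text{answer})] = p \cdot v_{\text{correct}} - (1-p)\,c_{\text{wrong}}$ for a calibrated confidence $p$ and user-defined benefit and cost terms.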
What I Observed
Under the wrapper, the model behaved more predictably: it deferred on the genuinely low-confidence items, and the quality of the answers it did give improved, all without any retraining.
Why This Might Matter for Alignment / Safety
This approach is interesting (at least to me) because:
- It forces the model to acknowledge uncertainty at inference time.
- It offers a structured alternative to “just answer everything.”
- It’s compatible with existing model APIs.
- It works even on very small models.
- Epistemic control layers scale with the quality of the underlying world model.
As models get deployed in higher-stakes settings, I suspect inference-time epistemic control may become as important as training-time alignment.
Why I’m Posting This
This is an area that feels underexplored. Most work on hallucination reduction focuses on:
- fine-tuning
- retrieval
- supervised guardrails
- confidence heuristics
…but not decision-theoretic control of the output pathway.
My goal with this post is simply to surface the idea at a conceptual level and see if anyone else has explored similar inference-time mechanisms or has thoughts about the direction.
I’m happy to discuss the high-level framing, but I’m not sharing equations or implementation details.
Questions for Readers
- Has anyone tried selective prediction / VOI-style gating for LLM inference before?
- Are there known risks or pitfalls with decision-theoretic output filters?
- How much demand is there for inference-time reliability layers in practice?
- Are there theoretical lines of inquiry worth exploring here (calibration, selective abstention, Bayesian filtering, etc.)?