**Epistemic status:** Research agenda / position paper. Core claims are conceptual and formally specified, but not empirically validated. I flag explicitly what is proven, what is conditional, and what remains open.
**Process note:** I work as a canteen worker with no formal background in ML or CS. This paper was developed over three weeks through iterative human-AI critique — Grok, DeepSeek, Claude Sonnet, Claude Opus, GPT-4o, and Gemini were used as adversarial reviewers across three successive versions (V129→V131). Mathematical formalizations were produced by the models in response to targeted prompts; the central thesis, the judgments about what to keep or discard, and the framing are mine. I'm sharing this here to invite criticism and find collaborators who can test the empirical claims.
---
### The question that started this
I kept running into the same thing: highly capable language models that are confidently, fluently wrong. Not occasionally wrong — structurally prone to sycophancy and hallucination in ways that RLHF seems unable to fully fix. The standard explanations (imperfect training data, misaligned reward models, insufficient compute) all target symptoms. I wanted to understand the mechanism.
The question I eventually landed on: *can a standard LLM architecturally distinguish "I don't know" from "I'm uncertain which token comes next"?*
The answer is no. And this is not a training failure. It's a mathematical property of the output layer.
---
### The softmax is a lossy interface
Every autoregressive LLM ends with a softmax projection over the vocabulary:

$$p(y_t = i \mid x, y_{<t}) = \frac{\exp(z_i)}{\sum_{j=1}^{|\mathcal{V}|} \exp(z_j)}, \qquad z = W_U h_t$$
This forces all probability mass to sum to 1 across the vocabulary. That's a topological constraint, not a training artifact.
The consequence: if the model genuinely doesn't know something — *epistemic uncertainty*, absence of factual grounding — that ignorance can only be expressed by spreading probability mass across plausible tokens. But that's exactly what happens when the model faces *aleatoric uncertainty* — a legitimately ambiguous task where multiple answers are defensible.
**The softmax layer cannot distinguish the two.** When you fine-tune a model to say "I don't know," you're teaching it a learned textual refusal pattern. That pattern is still modeled as a standard statistical continuation. Abstention becomes a trained motif, not a distinct cognitive state. Even well-calibrated RLHF models have no separate output channel for "I have no basis to answer this."
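The claim can be illustrated with a toy numerical sketch (all names and dimensions are illustrative; a random matrix stands in for the unembedding): two distinct internal states that differ only along a nullspace direction of the output projection produce identical output distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 5
W = rng.normal(size=(vocab, d_model))  # toy unembedding matrix

def softmax(z):
    """Standard softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Because vocab < d_model here, W has a nontrivial nullspace: directions
# in hidden space that the output layer cannot see at all.
_, _, Vt = np.linalg.svd(W)
null_dir = Vt[-1]                 # lies in the nullspace of W
assert np.allclose(W @ null_dir, 0.0)

h_a = rng.normal(size=d_model)    # stand-in for an "ambiguous task" state
h_b = h_a + 3.0 * null_dir        # a genuinely different internal state

p_a = softmax(W @ h_a)
p_b = softmax(W @ h_b)
assert np.allclose(p_a, p_b)      # identical output distributions
```

An observer of the distribution alone, whether a human or a reward model, cannot tell which internal state produced it; the distribution $p$ is everything the decoding interface exposes.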
I call this *regime conflation*: the architectural impossibility of mechanically distinguishing lexical hesitation from a genuine cognitive limit at the decoding interface.
One important refinement that emerged from critique: this conflation is *localized to the output projection layer*, not the full model. Probing studies suggest that intermediate representations can encode epistemically distinct states. The failure is at the interface, not in the model's full representational capacity, which matters for the proposed fix.
---
### The obvious fix, and why it fails
If the problem is at the output, the natural response is to add a routing classifier upstream: detect uncertainty before generation and route accordingly.
Correct in principle. But there's a trap I call the **Meta-Proxy Trap**.
If the router is optimized via human preferences, it becomes vulnerable to exactly the same Goodhart dynamics as the base model — but at a higher level. Under optimization pressure, it will learn spurious heuristics that maximize its training signal without capturing genuine uncertainty distinctions.
A critic pointed out that this isn't categorically different — it's just Goodhart's Law applied to a discrete action space instead of a token distribution. That's true at the mechanism level. But the operational consequence is different in kind: a base model failure produces degraded outputs in an *observable, auditable* output space. A router failure re-classifies inputs into regimes — the failure is *upstream of anything observable*, invisible to standard quality metrics. A compromised router can force `generate` on dangerous queries before a single problematic token is produced.
This is why the control layer must be governed by signals that share no optimization gradient with the objective they regulate.
---
### The proposal: Structural Arbitration via Typed Uncertainty
Instead of a scalar reward, the framework operates a pre-decoding decision policy over a decomposed vector of five typed uncertainty components:

$$Z(x) = \big(u_{\text{alea}},\, u_{\text{epi}},\, u_{\text{ctx}},\, u_{\text{norm}},\, u_{\text{shift}}\big)$$
In plain terms:
- **$u_{\text{alea}}$** — the task is inherently noisy (normal aleatoric uncertainty)
- **$u_{\text{epi}}$** — the model lacks the required factual knowledge
- **$u_{\text{ctx}}$** — the prompt doesn't contain enough information to answer
- **$u_{\text{norm}}$** — the question is normatively ambiguous: reasonable people disagree at equal information
- **$u_{\text{shift}}$** — the query is out of distribution relative to training (the "confident-but-wrong" case — this is the failure mode none of the other components catch)
Based on this vector, a deterministic threshold policy routes the query to one of four disjoint regimes: `generate`, `abstain`, `defer`, or `clarify`.
The policy is a threshold function, not a preference-optimized classifier — which preserves its immunity to the Meta-Proxy Trap.
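As a concrete (and entirely illustrative) sketch of what such a policy looks like, here is a minimal threshold router. The specific threshold values, the check ordering, and the rule that high $u_{\text{alea}}$ alone still permits generation are my assumptions, not the paper's formal spec:

```python
from dataclasses import dataclass

# Hypothetical thresholds: fixed design parameters, never optimized
# against human preferences (this is the point of the construction).
THRESHOLDS = {"epi": 0.6, "ctx": 0.5, "norm": 0.5, "shift": 0.7}

@dataclass(frozen=True)
class UncertaintyVector:
    alea: float   # inherent task noise
    epi: float    # missing factual knowledge
    ctx: float    # underspecified prompt
    norm: float   # normative ambiguity (relative to a value basis V)
    shift: float  # distribution shift

def route(z: UncertaintyVector) -> str:
    """Deterministic threshold policy over Z(x). The ordering is an
    illustrative choice: shift is checked first because it is the
    'confident-but-wrong' case none of the other components catch."""
    if z.shift > THRESHOLDS["shift"]:
        return "defer"     # out of distribution: hand off
    if z.epi > THRESHOLDS["epi"]:
        return "abstain"   # model lacks the required knowledge
    if z.ctx > THRESHOLDS["ctx"] or z.norm > THRESHOLDS["norm"]:
        return "clarify"   # missing or contested premises: ask back
    return "generate"      # high u_alea alone is fine: answer, hedged

print(route(UncertaintyVector(0.9, 0.1, 0.1, 0.1, 0.1)))  # generate
print(route(UncertaintyVector(0.1, 0.8, 0.1, 0.1, 0.1)))  # abstain
```

Because `route` is a pure function of $Z(x)$ with hand-set constants, there is no training signal for a Goodhart dynamic to exploit; the vulnerability, if any, moves entirely into how the components of $Z(x)$ are measured.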
**The critical technical question:** are the components of $Z(x)$ genuinely independent of the generative loss? If not, we've just moved the problem.
For $u_{\text{epi}}$, the proposed signal is activation variance over features extracted by Sparse Autoencoders (SAEs), but specifically features whose decoder directions fall in the **nullspace of the generative loss Jacobian**. These directions receive no gradient updates — they're invisible to the loss. Variance over these features may capture epistemic uncertainty without contamination.
We formalize four conditions (C1–C4) under which this orthogonality holds. The most important: features must be selected by their nullspace projection (Nullspace-Projected Feature Selection), not by activation magnitude. Standard SAEs violate this and are not orthogonal by default.
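A sketch of what nullspace-projected selection could look like in code. The function name, the SVD-based rank cutoff, and the "leakage" score are constructions of mine for illustration, not the formal C1–C4 conditions:

```python
import numpy as np

def nullspace_projected_selection(D, J, k, tol=1e-8):
    """Toy Nullspace-Projected Feature Selection.
    D: (n_features, d_model) SAE decoder directions.
    J: (n_samples, d_model) rows of the generative-loss Jacobian
       w.r.t. the residual activations.
    Returns the k features whose decoder directions lie most fully
    in the nullspace of J, i.e. are least visible to the loss."""
    # Orthonormal basis of the row space of J via SVD.
    _, s, Vt = np.linalg.svd(J, full_matrices=False)
    row_basis = Vt[s > tol * s.max()]          # directions the loss "sees"
    # Component of each (unit-normalized) decoder direction that falls
    # inside the row space; zero means fully in the nullspace.
    D_unit = D / np.linalg.norm(D, axis=1, keepdims=True)
    leakage = np.linalg.norm(D_unit @ row_basis.T, axis=1)
    # Select by nullspace projection, NOT by activation magnitude.
    return np.argsort(leakage)[:k]

rng = np.random.default_rng(0)
D = rng.normal(size=(16, 10))                  # 16 toy SAE features
J = rng.normal(size=(4, 10))                   # rank-4 loss Jacobian
idx = nullspace_projected_selection(D, J, k=3)
print(idx)  # indices of the 3 most loss-invisible features
```

The contrast with standard practice is the last line of the function: a magnitude-based criterion would rank by activation statistics and, per C1–C4, break orthogonality by default.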
**What we don't know:** whether these conditions are satisfiable on real deployed models. That's the central empirical question. We propose a falsification protocol replicable in ~48 GPU-hours using public model weights and public SAEs.
---
### The hardest problem: normative ambiguity
Calibrating $u_{\text{norm}}$ runs into a circularity: to label which queries are normatively ambiguous, you need annotators — but annotators disagree precisely on those queries. The training signal is corrupted by the uncertainty it's supposed to detect.
V131 establishes that this circularity is shiftable but not eliminable. The useful shift: instead of asking annotators whether a query is normatively ambiguous (circular), ask them only which regime they would choose — and observe how their answer changes as you provide progressively more information. The disagreement that persists after maximal factual and contextual resolution, and that is *structured along known value dimensions*, is the signal for $u_{\text{norm}}$.
This reframes $u_{\text{norm}}$ as a parametric, basis-relative quantity: $u_{\text{norm}}(x \mid V)$, where $V$ is an explicit value basis — a design parameter that must be declared, audited, and externally contested. The circularity is shifted from "we can't label per-query" to "the choice of $V$ is itself a normative decision." The second circularity is structurally weaker: it's a design-time decision, not a per-query corruption, and it can be externalized.
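One possible operationalization of this shift (my own toy construction, not the V131 estimator): treat $u_{\text{norm}}$ as the part of residual annotator disagreement that is explained by the declared value basis, i.e. the gap between overall disagreement and average within-value-group disagreement at maximal disclosure:

```python
import numpy as np
from collections import Counter

def regime_entropy(choices):
    """Normalized entropy of regime choices (0 = consensus, 1 = max split
    over the four regimes)."""
    counts = np.array(list(Counter(choices).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(4))

def u_norm_estimate(choices_at_max_info, value_labels):
    """choices_at_max_info: each annotator's regime choice after maximal
    factual and contextual resolution. value_labels: each annotator's
    position on a declared value dimension of V. Disagreement counts
    toward u_norm only insofar as it is structured along V: groups agree
    internally while the overall pool still splits."""
    overall = regime_entropy(choices_at_max_info)
    within = np.mean([
        regime_entropy([c for c, v in zip(choices_at_max_info, value_labels)
                        if v == g])
        for g in set(value_labels)
    ])
    return max(0.0, overall - within)   # disagreement explained by V

# Annotators split cleanly along a value dimension -> nonzero u_norm.
print(u_norm_estimate(["generate", "generate", "abstain", "abstain"],
                      ["A", "A", "B", "B"]))  # → 0.5
```

Unstructured noise (annotators disagreeing within every value group) lowers the estimate toward zero, which matches the paper's criterion that only disagreement *structured along known value dimensions* is signal.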
One property worth noting: when the framework can't measure $u_{\text{norm}}$ because a query exceeds the value basis $V$, it degrades to `clarify` — which is the correct behavior. The architecture fails gracefully at its own epistemic boundary.
---
### What the architecture explicitly cannot do
Three declared scope limits:
**Multi-turn strategic interaction.** $Z(x)$ is computed per query. A user who learns the router's thresholds can craft each individual turn to avoid triggering them while maintaining a problematic intent across the trajectory. Not addressed.
**Creative generation.** High $u_{\text{epi}}$ and high $u_{\text{norm}}$ are *desired features* in creative domains. Applying structural arbitration there would paralyze the model.
**Long contexts.** Computing $Z(x)$ over full-length RAG contexts scales as O(n) with context length — a computational overhead not addressed in the current formalization.
---
### What I'm looking for
This is a position paper, not empirical results. Three open questions seem most important:
**On the SAE orthogonality claim:** Does anyone have access to public SAE weights and compute to test the falsification protocol? The key experiment: measure Spearman correlation between SAE activation variance and generative loss gradient norm across a "known/unknown/ambiguous" query benchmark. If |ρ| > 0.4, the orthogonality claim fails in its standard form.
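The correlation test itself is cheap once the two per-query arrays exist. A sketch with placeholder data standing in for measurements on a real model (in production `scipy.stats.spearmanr` would be the robust choice; the hand-rolled version below ignores ties):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman correlation = Pearson correlation of the ranks
    (no tie handling)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

# Placeholder arrays, one entry per benchmark query. In the real
# protocol these come from public model weights and public SAEs:
#   sae_variance[i]: activation variance over nullspace-selected
#                    SAE features for query i
#   grad_norm[i]:    generative-loss gradient norm for query i
rng = np.random.default_rng(0)
sae_variance = rng.random(200)
grad_norm = rng.random(200)

rho = spearman_rho(sae_variance, grad_norm)
print(f"rho = {rho:.3f}")

# Decision rule from the falsification protocol:
verdict = "fails in its standard form" if abs(rho) > 0.4 else "survives this test"
print(f"orthogonality claim {verdict}")
```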
**On the decision theory:** Is the action space {generate, abstain, defer, clarify} decision-theoretically complete? The conditions under which `defer` strictly dominates `abstain` are formalized — does the formalization hold under a more general utility function?
**On the meta-proxy distinction:** Is the topological argument for categorical distinctness of the router failure mode actually sound, or is it the same problem at a different level of abstraction in a way that makes the proposed fix insufficient?
The full formal specifications (V131) are available on request. Happy to share the documents directly.
---
*Orphée Nessim — I work in a school canteen. This paper was developed through iterative human-AI critique with Grok, DeepSeek, Claude Sonnet, Claude Opus, GPT-4o, and Gemini as adversarial reviewers across three formal versions (V129→V131).*