A next-token distribution can signal uncertainty without indicating, by default, whether it comes from an underdetermined question or from missing information on the model side. In autoregressive systems, that missing qualification is fed back into context and can shape the rest of the generation.
This note concerns a narrow point about the standard output of autoregressive language models: the default predictive distribution can signal uncertainty without explicitly exposing what kind of uncertainty it is.
That matters because in composite systems, later stages may rely on the default output without access to additional signals about the source of the uncertainty. In that case, missing qualification at the interface can shape what happens next.
The problem at the standard output
In autoregressive language models, the standard output takes the form of a probability distribution over the vocabulary for the next token.
On its own, it does not tell you whether the uncertainty comes from an underdetermined question or from a gap in the model’s information — and those two cases call for different downstream responses.
Taken individually, neither feature is surprising: the output is a distribution, and a distribution carries no label for its source. The problem lies in their combination at the standard output.
Two regimes
A question may be underdetermined. The prompt does not by itself determine what should count as the answer. The problem is not that the model lacks information, but that the wording does not provide a sufficient criterion for selecting among several admissible answers.
Another case is an information gap on the model side. A determinate answer exists, but the model cannot, with the information available to it, reliably identify it.
This overlaps somewhat with the aleatoric/epistemic distinction (Kendall & Gal 2017; Hüllermeier & Waegeman 2021), but what matters here is simpler: the two cases call for different downstream responses. One calls for clarification; the other calls for information retrieval. For related work on selectively answering ambiguous questions, see Cole et al. 2023.
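As a toy illustration of the point at the standard output, consider two next-token distributions with the same shape but different provenance. The candidate tokens and probabilities below are invented for illustration; no real model is queried. Any summary computed from the distribution alone, such as its entropy, is identical in both regimes.

```python
import math

def entropy(dist):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Hypothetical distributions over candidate tokens.
# An underdetermined question and an information gap on the model
# side can both produce the same dispersed shape at the interface.
underdetermined = {"France": 0.4, "Brazil": 0.35, "Italy": 0.25}
information_gap = {"0.78": 0.4, "0.82": 0.35, "0.85": 0.25}

# The two entropies coincide; nothing in the output says which
# regime produced the dispersion.
print(entropy(underdetermined))
print(entropy(information_gap))
```

The sketch only restates the structural point: whatever is computed from the distribution alone cannot separate the two regimes, because the regime is not part of the output.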
What autoregression adds
By standard output, I mean the predictive distribution exposed by default in ordinary next-token generation, not signals that might be reconstructed through repeated sampling, auxiliary instrumentation, or access to internal activations.
To see the specific cost introduced by autoregression, consider two contrasting prompts.
First: “Who won the deciding match?”
The prompt leaves several crucial dimensions open: the sport, the competition, the date, and the teams.
Second: “What was the price of one kilogram of T55 flour at the Rungis market on March 14, 2019?”
Here the query is much more determinate. Yet if the model lacks the relevant information, local hesitation may still appear as dispersion over several plausible values that the model has no sufficient basis to choose among.
In both cases, the standard output has the same form: a probability distribution over candidate tokens. Local dispersion can therefore appear in both situations, even though the downstream response required is different.
In autoregressive generation, that distribution determines the next token, which is then appended to the context. Once produced, the token enters the context without any explicit marker of the uncertainty regime it came from. Generation then continues from already-produced content, even when the system should instead ask for clarification, suspend the answer, or retrieve the missing information.
The failure to qualify the source does not stay local. It becomes part of the context as already-produced content, not as a separate signal.
Standard decoding choices can alter selection, but they do not qualify the source of the uncertainty.
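The feedback loop described above can be sketched with a toy stand-in for a model. Everything here is invented for illustration (the `toy_model` function, its distributions, the prompt); the point is only structural: the sampled token re-enters the context as plain text, and whatever dispersion it was drawn from is discarded at that step.

```python
import random

random.seed(0)

def toy_model(context):
    """Hypothetical stand-in for a next-token model: returns a
    distribution over tokens given the context. Nothing in the
    return value marks *why* the distribution is dispersed."""
    if context.endswith("price was"):
        return {"0.78": 0.4, "0.82": 0.35, "0.85": 0.25}
    return {".": 1.0}

def sample(dist):
    tokens, probs = zip(*dist.items())
    return random.choices(tokens, probs)[0]

context = "The price was"
for _ in range(2):
    dist = toy_model(context)
    token = sample(dist)
    # The token is appended as ordinary content; any information
    # about the uncertainty regime it came from is dropped here.
    context += " " + token

print(context)
```

Changing the sampling rule (greedy, temperature, nucleus) would change which token is appended, but not the fact that the append step carries no qualification of the source.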
At inference
At inference, the standard next-token interface offers no separate output that restores the distinction: nothing exposed alongside the vocabulary distribution is explicitly trained to indicate which uncertainty regime is at play.
Responses such as “I don’t know” are still produced by the same generative mechanism as ordinary answers. Auxiliary methods can extract useful uncertainty signals (Guo et al. 2017; Kadavath et al. 2022; Kuhn, Gal, & Farquhar 2023). But that does not add to the model’s standard interface a separate output explicitly trained to qualify the uncertainty regime at play.
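One family of auxiliary methods can be sketched as repeated sampling plus an agreement score, in the spirit of self-consistency or semantic-uncertainty estimates. The `sample_answer` function below is a hypothetical stand-in for querying a model several times, and the candidate strings are invented. Note what the sketch does and does not deliver: it reconstructs a dispersion estimate outside the standard interface, but still says nothing about which regime produced the dispersion.

```python
import random
from collections import Counter

random.seed(1)

def sample_answer():
    """Hypothetical stand-in for one sampled completion of the
    same prompt."""
    return random.choice(["0.78", "0.82", "0.85"])

# Sample the same prompt repeatedly and measure agreement.
samples = [sample_answer() for _ in range(100)]
top_token, top_count = Counter(samples).most_common(1)[0]
agreement = top_count / len(samples)

# A low agreement score signals dispersion, but not its source.
print(f"most frequent answer: {top_token}, agreement: {agreement:.2f}")
```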
At training
During training, the standard objective is next-token prediction from the preceding context. The loss is defined over the target token, without explicit supervision on the source of the uncertainty.
Whether the gap comes from question underdetermination or from an information gap on the model side, both are treated as prediction errors to be minimized.
In an autoregressive regime, this indifference matters more: a gap at the token level is not merely registered, but immediately recycled into the context for subsequent predictions. The model thus learns to predict better, not to explicitly preserve the distinction between these two cases at the level of the output.
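The indifference of the loss can be made concrete. Next-token cross-entropy depends only on the probability assigned to the target token, so two distributions arising from different regimes incur identical losses whenever that probability matches. The distributions below are toy numbers invented for illustration.

```python
import math

def nll(dist, target):
    """Next-token cross-entropy: -log p(target). The loss sees only
    the probability assigned to the target, not why the remaining
    mass is dispersed over other tokens."""
    return -math.log(dist[target])

# Hypothetical distributions from the two regimes: an
# underdetermined question and an information gap on the model side.
underdetermined = {"France": 0.4, "Brazil": 0.35, "Italy": 0.25}
information_gap = {"0.78": 0.4, "0.82": 0.35, "0.85": 0.25}

# Same p(target) in both cases, hence the same training signal.
print(nll(underdetermined, "France"))
print(nll(information_gap, "0.78"))
```

Under this objective, nothing distinguishes the two cases in the gradient: both are ordinary prediction errors to be minimized.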
Scope and open question
The argument is local: it concerns a failure of qualification in the standard output itself, not merely a calibration problem or an interpretive difficulty.
Instrumental methods — calibration, repeated inference, or ensembles — can improve the situation without directly providing that qualification in the default output.
This note does not cover failures of internal mobilization, or purely stylistic or tokenization-based dispersion.
This note says nothing about what the model may represent internally. The question remains open: how much of this is an engineering problem, and how much reflects a more general constraint of the learning regime?
References
Cole, J. R., Zhang, M. J. Q., Gillick, D., Eisenschlos, J. M., Dhingra, B., & Eisenstein, J. (2023). Selectively Answering Ambiguous Questions.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks.
Hüllermeier, E., & Waegeman, W. (2021). Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods.
Kadavath, S., et al. (2022). Language Models (Mostly) Know What They Know.
Kendall, A., & Gal, Y. (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
Kuhn, L., Gal, Y., & Farquhar, S. (2023). Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation.