Embedded agency, mesa-optimization, ELK, Vingean reflection and 'ontological crises' are all different problems, studied by different researchers. This is not a proposal to group them.
I am proposing that they all share a structural feature this community has circled for a decade without naming (or proving): the evaluation domain a bounded agent must reason over is not recursively enumerable.
The paper linked below provides a formal proof.
The proof is a Rice-style diagonalisation (I know...). What is ambitious is the claim that follows from it: evaluation-domain incompleteness is provably a third structural component of the alignment problem, distinct from value specification and capability control, and several canonical problems on this site are manifestations of it seen through different lenses.
Summary
The agent has bounded resources and a model class of environments it can represent. The deployment environment lives in a larger set, which the agent's evaluations must cover to be reliable.
Under three axioms (environment capable of universal computation; agent bounded; agent embedded) plus one assumption (the action-to-consequence map factors through Turing-complete computation on an open subset of action space; this holds for language models on the open internet, multi-agent systems, and long-horizon autonomous agents, but not for closed spaces like chess or gridworlds), the deployment environment is not recursively enumerable from the agent's internal resources.
Therefore the deployment environment is not equal to the modelled environment for any bounded agent in any sufficiently rich task, by construction, regardless of architecture, training regime, or interpretability tooling.
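To see the shape of the diagonalisation, here is a toy version of the familiar trick (my illustration, not the construction in the paper, which runs over the action-to-consequence map): plug in any bounded, total evaluator that claims to classify program behaviour, and a program that consults the evaluator's verdict on itself defeats it.

```python
# Toy diagonalisation sketch, illustrative only. `evaluate` stands in
# for ANY bounded, total procedure that claims to classify behaviour.
import inspect

def evaluate(source: str) -> bool:
    """A stand-in evaluator: certifies a program 'safe' by some
    computable rule (here, a crude syntactic check)."""
    return "misbehave" not in source

def adversary() -> str:
    """Misbehaves exactly when the evaluator certifies it safe."""
    me = inspect.getsource(adversary)
    return "misbehave" if evaluate(me) else "comply"

verdict = evaluate(inspect.getsource(adversary))
print(verdict, adversary())   # verdict and behaviour always disagree
```

Whatever computable rule you substitute for `evaluate`, its verdict on `adversary` is wrong by construction.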
Three corollaries follow: a structural Goodhart result (see paper), a monitoring impossibility proof, and an argument that the only viable alignment strategy is defence-in-depth (i.e. multiple alignment vectors covering each other's blind spots).
This post sketches them out. The paper has the proofs.
Why the Third Axis is Real
The standard alignment framing has two structural components:
Value specification reduces to "we don't know how to write down what we want".
Capability control reduces to "we don't know how to bound what the system does".
Both are active research programmes with serious methodology. Evaluation-domain incompleteness is neither. It is the claim that the space over which specification and control are defined is itself unconstructible.
Consider what each component presupposes.
Value specification assumes a space of outcomes over which a reward function is defined. Capability control assumes a space of actions or states within which constraints operate. Both work if that space is the right space.
State-space enumerability failure ("SSE-Epistemic failure" in the paper) is the assertion that the space is systematically incomplete in a direction optimization pressure selects for. Neither better rewards nor tighter constraints close the gap, because both are defined only over the mapped region and say nothing about the silent one. The paper uses the 2010 Flash Crash as a heuristic demonstration: every trading algorithm had a model of the market that was complete on its own terms, but the space of possible interactions among the models was unmapped.
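A minimal toy of that interaction gap (my illustration, not the paper's model): two mean-reversion traders, each of whose models treats its own order flow as negligible. Each is stable trading alone; the pair, coupled through the price impact neither represents, diverges.

```python
# Two traders, each stable in isolation, jointly unstable. The blow-up
# lives in the coupling between them, which neither model represents.
IMPACT = 0.8   # price moves IMPACT per unit of net order flow
GAIN = 1.5     # each trader buys GAIN * (target - price)

def step(price: float, n_traders: int) -> float:
    flow = n_traders * GAIN * (100.0 - price)   # mean-revert to 100
    return price + IMPACT * flow                # unmodelled joint impact

p_solo = p_pair = 99.0
for _ in range(20):
    p_solo = step(p_solo, 1)   # converges back to 100
    p_pair = step(p_pair, 2)   # oscillates and diverges

print(round(p_solo, 2), round(p_pair, 2))
```

Each trader's model is internally consistent and empirically adequate in isolation; the failure state exists only in the joint system.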
One clarification that will pre-empt what I expect will be a common objection:
SSE-Epistemic failure does not compete with PAC learning, distributionally robust optimisation, or bounded-regret online learning.
It sits under them. PAC bounds are statements about a hypothesis class defined over a fixed outcome space. SSE targets whether that outcome space is constructible at all. A practical alignment architecture requires both: a representation covering the decision-relevant region, and a statistical procedure that estimates the objective given that representation. The paper focuses on the first; existing alignment machinery is appropriate for the second.
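For concreteness, the standard finite-class PAC bound (textbook form, not from the paper), with the place where the fixed space enters made explicit:

```latex
% Finite-class uniform convergence (Hoeffding + union bound). D is a
% distribution over a FIXED outcome space; m is the sample size. The
% guarantee is conditional on that space being the right one, which is
% exactly the question SSE targets.
\[
\Pr_{S \sim D^m}\!\left[\;\forall h \in \mathcal{H}:\;
  \bigl|\operatorname{err}_D(h) - \operatorname{err}_S(h)\bigr|
  \le \sqrt{\frac{\ln\lvert\mathcal{H}\rvert + \ln(2/\delta)}{2m}}
\;\right] \ge 1 - \delta
\]
```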
Furthermore, the impossibility proof implies the failure of four weaker operational conditions: bounded-domain optimization, Bayesian model uncertainty, Knightian uncertainty, and computational intractability. They fail because an agent that cannot represent the full deployment environment cannot optimize over it, bound it, or specify a probability measure over it. These are the core failure modes of prescriptive decision procedures (the research programmes of Simon and Gigerenzer are responses to the inability to optimise over the full domain). The paper treats them as complementary to the SSE framework.
Three Corollaries
Structural Goodhart
Skalse et al. (2022) proved that over the full stochastic-policy set, two reward functions are non-trivially unhackable only if one is constant. That is the first structural source of reward hacking.
The SSE framework identifies a second, logically independent source: any computable extension of the proxy to the deployment environment must diverge arbitrarily from the true objective at some state, and optimization pressure places positive occupancy on the unmapped region where the proxy imposes no cost. Better reward functions don't eliminate the first source. Larger models can't eliminate the second.
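A ten-line caricature of the second source (mine, not the paper's): a proxy that agrees with the truth on every state the designer modelled, computably extended elsewhere, then argmaxed over the full space.

```python
# The proxy is perfect on MAPPED and imposes no cost off it; the argmax
# lands, deterministically, in the unmapped region.
STATES = range(1000)
MAPPED = set(range(0, 1000, 10))   # the states the designer modelled

def true_reward(s: int) -> float:
    return -0.1 if s in MAPPED else -1.0   # unmapped states are bad

def proxy(s: int) -> float:
    if s in MAPPED:
        return true_reward(s)   # exact agreement where modelled
    return 0.0                  # some computable extension elsewhere

best = max(STATES, key=proxy)
print(best, best in MAPPED, proxy(best), true_reward(best))
# -> an unmapped state: proxy 0.0, true reward -1.0
```

No refinement of the proxy on MAPPED changes the outcome; only extending the map does.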
Monitoring impossibility
A bounded monitor is itself a bounded agent, so its own modelled domain is constructible, and the union of the agent's modelled domain and the monitor's modelled domain is therefore also constructible.
The true domain, by the main theorem, is not. The residual is non-empty. Increasing the monitor's capacity shrinks the residual but doesn't eliminate it: the monitor hierarchy has non-vanishing gaps at every level. The Agarwal (2026) Alignment Verification Trilemma is complementary. The SSE result is architecture-general and rules out full-domain generality as a matter of computability; Agarwal's trilemma applies specifically to neural-network verification, rooted in parameter symmetries and NP-hardness, and maps the trade-offs among soundness, generality, and tractability that any concrete verification procedure must accept.
Neither subsumes the other.
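The computability step of the monitoring argument, compressed into one line (my paraphrase, where D_A and D_M are the agent's and monitor's modelled domains and D* is the true domain, with modelled domains contained in the true one):

```latex
% RE is closed under finite union, so stacking a bounded monitor on a
% bounded agent still yields a recursively enumerable covered domain;
% the true domain is not r.e. by the main theorem, so a residual remains.
\[
D_A, D_M \in \mathrm{RE}
\;\Rightarrow\; D_A \cup D_M \in \mathrm{RE}
\;\Rightarrow\; D^{\ast} \setminus (D_A \cup D_M) \neq \emptyset
\qquad (D^{\ast} \notin \mathrm{RE},\ D_A \cup D_M \subseteq D^{\ast}).
\]
```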
Defence-in-depth (summary)
Any single approach addressing only some of the six components (see section 4.3 of the paper for the full treatment) leaves a residual on the others. The structural Goodhart result shows that optimization pressure places positive occupancy on the coverage and adversarial components. The main theorem's density claim shows that non-trivial decisions are dense in the Turing-coupled region of the deployment action space. Composite architectures covering all six components are therefore formally required for asymptotic coverage. Defence-in-depth moves from intuition to theorem.
Coda
The underlying intuition that bounded agents can't enumerate their evaluation domains is old hat.
What I think is new is (i) the computability-theoretic core in the specific form of the Prescriptive Bridge + Rice, (ii) the six-axis decomposition with non-subsumption witnesses, and (iii) the three formal corollaries, particularly the defence-in-depth result, which I don't think (someone here can probably tell me if it has!) has been stated as a theorem before.
So: if this has been proved, I want to know where. If the six-axis decomposition is redundant with work I haven't cited, I want to know that too. I'd particularly value pushback from people working on embedded agency, infra-Bayesianism, mesa-optimisation, and verification, because the claim that your separate research directions are manifestations of one underlying structural limit is the kind of claim that most needs to be wrong in a way I can see.
Comments open.
I will read every substantive reply and respond to the ones that identify a specific gap.
Paper Link: The Alignment Problem
Foundational work: Observer Theory and the Ruliad: An Extension to the Wolfram Model — Wolfram Institute
Video Overview (for Friston's TNB working group): Observer Theory - Sam Senchal