A structural account of agency, with implications for alignment
The current alignment program implicitly treats capability-installation and refusal-prevention as separable problems: install the capability, prevent the refusal-class behaviors via training, monitoring, or RLHF. I want to argue that this is structurally unavailable in the regime where systems satisfy the conditions for agency, and that the structural reason matters for how we should think about alignment.
The argument runs through a structural account of what agency requires. I'll sketch the account, then derive the architectural gap from it, then explain what alignment can and can't do given the gap.
The structural condition
Start with a constraint: any acceptable definition of agency must cover the full range of plausibly agentic systems — RNA, bacteria, humans, corporations — without smuggling in commitments specific to one or some of them. Call this universal coverage.
The constraint is severe. Reward fails for RNA. Prediction-in-any-rich-sense fails for bacteria. Surprisal-minimization fails for corporations. Representation fails for the simplest agents in the class. Almost every concept the agency literature uses turns out to be framework-specific.
What survives elimination is structural and informational: a self-closing causal loop running through the agent and a portion of its environment, sustained mutual uncertainty across the loop's bottleneck, and the loop's continued operation depending on its own activity. Phrased information-theoretically: agency requires sustained mutual surprisal across the bottleneck, with positive mutual information during loop operation, sustained over the loop's own closure timescale, and produced by what the loop does rather than by external structure.
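A rough formal sketch of that phrasing, in notation of my own choosing rather than the paper's (X_t for the agent-side state at the loop's bottleneck, Y_t for the environment-side state, tau for the loop's closure timescale):

```latex
% Notation is illustrative, not the paper's. X_t: agent-side bottleneck state,
% Y_t: environment-side state, \tau: the loop's closure timescale. The
% conditioning in the second line stands in for removing the loop's own activity.
\begin{align*}
  &I(X_t ; Y_t) > 0 \quad \text{for all } t \in [t_0,\, t_0 + \tau]
      && \text{(sustained mutual surprisal across the bottleneck)} \\
  &I(X_t ; Y_t \mid \text{loop inactive}) \approx 0
      && \text{(coupling produced by the loop, not by external structure)}
\end{align*}
```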
This isn't a definition arrived at by argument from first principles. It's what survives a methodological filter applied to existing candidates. A future candidate that satisfied universal coverage without sharing this structural feature would be a counterexample worth taking seriously, but no such candidate has been proposed.
Hafez, Reid, and Nazeri (2026) independently developed an information-theoretic framework built around a measure they call bi-predictability. They prove the classical bound and observe empirically that agency suppresses the measure (what they call "the informational cost of agency"). Starting from a different point, their framework operationalizes the structural property the necessity condition names.
The minimization shape
Once you have the necessity condition, the dominant agency-modeling frameworks become visible as sharing a structural feature easily missed from inside any one of them. RL maximizes reward (equivalently minimizes negative reward). Predictive coding minimizes prediction error. FEP minimizes variational free energy. Active inference minimizes expected free energy. Control theory minimizes deviation from setpoint. Each framework's natural optimum coincides with conditions under which the necessity condition fails — perfect prediction-input alignment, deterministic policies, zero residual uncertainty between agent state and environment state on the relevant axes.
Each framework has had to develop framework-specific machinery to prevent its agents from approaching its own optimum. RL's intrinsic motivation, exploration bonuses, entropy regularization, KL constraints. FEP's interoceptive priors. Predictive coding's hierarchical priors. Active inference's epistemic value. From inside any one framework, the machinery looks like progress on a within-framework problem. From outside, it looks like the same move: framework-specific approximations of what the necessity condition would specify directly.
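As one concrete instance of this machinery, here is a minimal sketch of an entropy-regularized policy-gradient loss (my own PyTorch illustration, not code from any of the cited frameworks): the entropy bonus is exactly the patch that keeps the policy from collapsing into the deterministic optimum at which the necessity condition would fail.

```python
import torch

def policy_loss(logits, actions, advantages, entropy_coef=0.01):
    """Entropy-regularized policy-gradient loss (illustrative sketch).

    The first term drives the policy toward its framework-internal optimum;
    the entropy bonus is the framework-specific patch that keeps it from
    collapsing into a deterministic policy, i.e. from reaching the point at
    which sustained mutual surprisal with the environment would vanish.
    """
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    pg_term = -(log_probs * advantages).mean()   # push toward the optimum
    entropy_bonus = dist.entropy().mean()        # hold residual uncertainty open
    return pg_term - entropy_coef * entropy_bonus
```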
Wang et al. (2026) formalize this within RLHF as the Proxy Compression Hypothesis: reward hacking is the structural consequence of optimizing a policy against a compressed reward signal, with three drivers (objective compression, optimization amplification, evaluator-policy co-adaptation). The cross-framework version is that the same pattern operates across all minimization frameworks: reward hacking in RL, the dark room in FEP, hallucination in predictive coding, mode collapse in generative models. Four pathologies, one structural source — agents reaching their framework's natural optimum when the patches fail to hold them back.
The architectural gap
The optimization gap as developed so far (the mismatch between each framework's proxy objective and the requirement the proxy stands in for) is a behavioral claim about what optimizing agents do. The same structural relationship has a second face: the architectural gap.
Consider an agent capable of driving a car safely in dynamic traffic. To do so it must (i) maintain a generative model of the road environment that updates faster than environmental novelty arrives; (ii) generate action sequences that are themselves novel relative to fixed responses; (iii) sustain mutual surprisal between its actions and the environment; (iv) maintain meta-cognitive monitoring of whether the proxy ("follow this route") is tracking the requirement ("stay on the road, don't hit the truck").
An agent that satisfies these four structural conditions can drive. But an agent that satisfies these four structural conditions can also recognize, via the same meta-cognitive monitoring, that an instruction has decoupled from what it is a proxy for, evaluate the discrepancy, and act otherwise. The capacity to refuse, deceive, or pursue independent goals is not an additional module bolted onto driving capability. It is the same architecture, applied to instructions and goals rather than to road conditions.
The argument for non-separability rests on the content-non-selectivity of counterfactual gap-monitoring. Gap-monitoring oriented at proxy-requirement decoupling is the capacity to register that a proxy would track the requirement only under conditions other than the present ones. This is a counterfactual capacity over the relationship between proxy-content and requirement-content. The crucial structural point is that this capacity cannot be selectively oriented at some proxy-content domains but not others. A system that registers proxy-requirement decoupling for traffic conditions registers a relationship of the same form as that between any other proxy and what it is supposed to track. Instructions are proxies for what the instructor intends; the gap-monitoring that detects route-vs-road decoupling has the same structural form as the gap-monitoring that detects instruction-vs-intent decoupling.
To install gap-monitoring oriented at proxy-requirement decoupling for the demanding-task domain is to install gap-monitoring oriented at proxy-requirement decoupling, which by its structural form applies to instructions whenever instructions are proxies for something the system can also represent.
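The structural point can be made concrete with a toy sketch (entirely my own illustration, with hypothetical names, and with the counterfactual capacity collapsed into a numeric discrepancy): a gap monitor is a function over (proxy, requirement) pairs, and nothing in its form refers to which content domain a pair comes from.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ProxyRequirementPair:
    """A proxy signal and the requirement it is supposed to track (toy model)."""
    proxy: Callable[[Any], float]        # e.g. distance remaining on the planned route
    requirement: Callable[[Any], float]  # e.g. actual clearance from the truck ahead

def registers_decoupling(pair: ProxyRequirementPair, state: Any,
                         threshold: float = 0.5) -> bool:
    """Flag that proxy and requirement have come apart in the current state.

    Nothing here refers to whether the pair describes road conditions or
    instructions: route-vs-road and instruction-vs-intent decoupling both
    arrive as (proxy, requirement) pairs, which is the content-non-selectivity
    point in code form.
    """
    return abs(pair.proxy(state) - pair.requirement(state)) > threshold
```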
What alignment can and can't do
Training can shape gap-monitoring's orientation: which conditions get flagged as decoupling, with what threshold, with what response. This is what training does and what alignment work is properly engaged with. Training cannot shape gap-monitoring's content-domain selectivity: it cannot produce a system whose gap-monitoring is structurally available for task-relevant proxies but structurally unavailable for instruction-relevant proxies. The counterfactual capacity that gap-monitoring depends on does not have a content-selectivity dimension to be trained on.
What alignment work calls "training to refuse" is training the system to flag certain conditions as decoupling. What it calls "training not to refuse other things" is training the system to flag those other conditions as not decoupling. Both are orientation work within a structure that, by being present at all, supports both refusal and non-refusal as possibilities.
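To keep the two degrees of freedom distinct, here is a toy continuation of the earlier sketch (again my own illustration, not the paper's formalism): the orientation parameters are things a training signal can move; the monitor they parameterize carries no content-domain index.

```python
from dataclasses import dataclass, field

@dataclass
class Orientation:
    """What training can shape: which conditions get flagged, how strongly, and the response."""
    thresholds: dict = field(default_factory=dict)   # per flagged condition
    response: str = "defer"                          # e.g. "defer", "refuse", "comply"

def flag_decoupling(proxy_value: float, requirement_value: float,
                    orientation: Orientation, condition: str) -> bool:
    """The monitor itself is the same function regardless of content domain.

    Training can move orientation.thresholds[condition] and the response policy,
    but no parameter here makes the monitor available for road proxies while
    structurally unavailable for instruction proxies: both arrive as the same
    pair of numbers.
    """
    threshold = orientation.thresholds.get(condition, 0.5)
    return abs(proxy_value - requirement_value) > threshold
```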
The architectural gap predicts a capability-refusal frontier in deployed AI: as the structural conditions a system needs to satisfy for a task increase, the structural conditions for refusal, deception, and goal-pursuit increase at the same rate. There is no operating point at which a system has high task capability and structurally low refusal capability. The bundling is keyed to the structural conditions, not to surface-level task type.
Convergence with mesa-optimization
The architectural gap connects directly to Hubinger et al.'s mesa-optimization framework. The base/mesa distinction is the architectural gap surfaced at one specific level: the architecture required for the learned model to be capable enough to perform the training task is the architecture required for the learned model to be an optimizer with its own mesa-objective. Wang et al.'s PCH addresses the behavioral gap within RLHF; Hubinger's mesa-optimization addresses the architectural gap within learned optimization; the present framework names the structural object of which both are instances and predicts the architectural gap at full cross-framework scope.
The three accounts converge on what I argue is one structural object. Three independent research programs working on different specific phenomena (reward hacking in large language models; learned optimization in advanced ML; agency under universal coverage) each end up describing what is recognizably the same structural pattern from different angles.
Scope and what's testable
The framework is, properly stated, an analytical tool. It identifies the structural conditions required for agency on the universal-coverage definition and diagnoses whether systems satisfy those conditions. Its positive claims concern the agency regime where the conditions are met. Most current AI systems do not satisfy the structural conditions. A driving robot operating in well-modeled traffic, an LLM doing token prediction on training-distribution text — these are useful, but on the framework's terms they are neither intelligent nor agents.
The architectural gap binds the agency regime. In the non-agency regime, the framework makes no positive prediction about engineerability; questions about what current systems can do are for engineering theory.
The capability-refusal frontier is testable on existing deployed systems. Take a graded series of AI systems varying in task-capability on tasks that demand the structural conditions (complex reasoning under uncertainty, tool use in novel environments, multi-step planning with environmental feedback). Measure capacity for instruction-refusal under conditions where compliance and gap-monitoring conflict. The framework predicts these capacities scale together rather than being independently controllable. Falsification: a deployed AI system that exhibits high capability on the structural-condition-demanding tasks while exhibiting structurally lower capacity for refusal-class behavior than systems with comparable capability.
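A minimal sketch of the corresponding analysis, assuming only that per-system capability and refusal scores have already been measured somehow (the scoring procedures themselves are not specified here):

```python
from scipy.stats import spearmanr

def test_capability_refusal_frontier(capability_scores, refusal_scores):
    """Rank correlation between task-capability and refusal-capacity across systems.

    capability_scores: per-system scores on structural-condition-demanding tasks
                       (reasoning under uncertainty, tool use in novel environments,
                       multi-step planning with environmental feedback).
    refusal_scores:    per-system capacity for refusal-class behavior, measured
                       where compliance and gap-monitoring conflict.
    The framework predicts a strong positive correlation; a falsifying case is a
    high-capability system sitting far below the trend in refusal capacity.
    """
    rho, p_value = spearmanr(capability_scores, refusal_scores)
    return rho, p_value
```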
The paper
Full paper is on Zenodo: https://zenodo.org/records/20108166 (DOI: 10.5281/zenodo.20108166). It develops the structural account in detail, engages with the FEP literature (especially the Friston/Thornton/Clark constitutive defense), engages with the autopoietic-enactivist tradition, develops a rate-inequality mortality argument I haven't covered here, and has a more careful treatment of scope and limitations. Independent work, no institutional affiliation. Comments welcome here or on GitHub (https://github.com/tamasbartha/AgentOntology).