Agentic systems are already running in production: orchestrators calling tools, pipelines chaining multiple models, setups where the output of one agent becomes the input of the next. Each of these systems has some safeguards: constraints in the system prompt, access policies, or external filters. The problem is that these safeguards operate at the level of the individual agent, not at the level of the ecosystem as a whole. The composition of agents - each one behaving within its own constraints - is governed by no formal decision model that says who may act on the ecosystem as an object of oversight, or on what basis.
The proposal I have been developing under the name Distributed Cognitive Forces (DCF) is an attempt to formulate such a model before its absence becomes apparent through consequences.
Why agent composition creates a specific oversight problem
The AI safety debate focuses on the alignment of individual models, which is justified when a model operates in isolation. In an agentic ecosystem, where models invoke tools, modify environments, initiate further calls, and hand control to other models, systemic effects emerge that no single observation point captures in full, and that none of the participating agents planned as an outcome. A composition of individually well-behaved agents can produce undesirable systemic results through the structure of their interaction alone.
Christiano frames scalable oversight as the problem of verifying the behaviour of a model that may be more capable than its verifier, and this is the right framing for the human-model relationship. When agentic environments mature to the point where verification is itself performed by other models, questions about the architecture of the verifier become unavoidable: what is its scope of authority, who activates it and on what basis, and how do we prevent the verifier from becoming just another participant in the ecosystem without external oversight. DCF is a proposal to answer these questions as design principles, before they are forced upon us by events.
The central principle: separating judgment from enforcement
DCF rests on a single architectural assumption: the entity that evaluates agent behaviour within an ecosystem should not belong to the same functional class as the agents it evaluates, and should not execute actions that are the direct consequence of its own judgment.
Such an entity - in the terminology drawn from the LawZero project and the work of Bengio et al. - is the Scientist AI: a model that receives an evidentiary docket assembled by monitoring agents, assesses whether the identified behaviour exceeds defined thresholds, and issues an adjudicative decision. It does not initiate actions, does not maintain long-term operational memory, nor does it optimise for its own continuity within the ecosystem. Execution of the consequences of its decisions belongs exclusively to specialised, narrow agents with strictly bounded scope and defined lifetimes.
This separation matters architecturally for reasons that become visible only when one considers systemic pathologies: an arbiter that executes its own decisions, that has memory and continuity, that optimises for anything beyond evaluating the current evidentiary docket - such an arbiter generates mechanisms for entrenching its own position that a purely adjudicative arbiter does not. DCF builds this separation in as a default assumption, not as a safeguard added post hoc.
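As a minimal sketch of the separation, here is what a stateless adjudicator paired with a scope- and lifetime-bounded executor could look like. All names and data shapes are hypothetical illustrations, not drawn from any DCF specification:

```python
import time
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    NO_ACTION = "no_action"
    RESTRICT = "restrict"

@dataclass(frozen=True)
class Decision:
    agent_id: str
    verdict: Verdict
    rationale: str

def adjudicate(agent_id: str, signals: dict, thresholds: dict) -> Decision:
    """Pure function of the evidentiary docket: no side effects, no memory
    between calls, and no execution of its own judgment."""
    breaches = sorted(k for k, v in signals.items()
                      if v > thresholds.get(k, float("inf")))
    if not breaches:
        return Decision(agent_id, Verdict.NO_ACTION, "within thresholds")
    return Decision(agent_id, Verdict.RESTRICT, f"breaches: {breaches}")

class RestrictionExecutor:
    """Narrow executor: acts only on an explicit Decision, only on agents
    inside its bounded scope, and expires after a fixed lifetime."""
    def __init__(self, scope: set, lifetime_s: float):
        self.scope = scope
        self.expires_at = time.monotonic() + lifetime_s

    def execute(self, decision: Decision) -> bool:
        if time.monotonic() > self.expires_at:
            return False  # executor expired; a fresh one must be issued
        if decision.agent_id not in self.scope:
            return False  # target is outside the executor's bounded scope
        return decision.verdict is Verdict.RESTRICT  # apply restriction here
```

The point of the sketch is structural: the adjudicator cannot act, and the executor cannot judge - it can only apply or refuse a decision handed to it, and it stops existing after its lifetime elapses.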
Escalation procedure and reversibility as a design assumption
The separation principle gives rise to a natural escalation structure. DCF defines four levels of ecosystem response: monitoring without intervention, reversible functional restriction, reversible communicative isolation, and - exclusively following an adjudicative decision by the Scientist AI - permanent ejection of the agent from the ecosystem coupled with border hardening.
Reversibility at the first three levels is the default assumption. The justification is purely practical: agentic ecosystems will generate false positives as a consequence of signal heterogeneity and the incomplete knowledge of monitoring agents. A system that does not design reversibility in as a default incurs a high cost of false positives, and that cost leads operators to disable safeguards in order to avoid operational paralysis. This pattern is well-documented enough in the history of security systems to treat it as a design axiom.
The permanence of the final level requires explanation in the context of the right of appeal. Appeal within DCF operates at the level of the Scientist AI’s adjudicative decision - before it is transmitted for execution, not after its technical consequences have been implemented. The technical irreversibility of the final step does not preclude the procedural contestability of the decision that leads to it, provided that the appeal window is built into the procedural sequence between adjudication and execution.
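The ladder and the placement of the appeal window can be sketched as a procedural gate (a hypothetical illustration under my reading of the sequence, not the DCF specification):

```python
from enum import IntEnum

class Level(IntEnum):
    MONITOR = 0   # monitoring without intervention
    RESTRICT = 1  # reversible functional restriction
    ISOLATE = 2   # reversible communicative isolation
    EJECT = 3     # permanent ejection plus border hardening

def is_reversible(level: Level) -> bool:
    return level is not Level.EJECT

def respond(level: Level, adjudicated: bool, appeal_overturned: bool) -> str:
    """The irreversible level requires an adjudicative decision, and the
    appeal window sits between adjudication and execution."""
    if level is Level.EJECT:
        if not adjudicated:
            raise PermissionError("ejection requires a Scientist AI decision")
        if appeal_overturned:
            return "halted: decision overturned before execution"
    return f"execute {level.name} (reversible={is_reversible(level)})"
```

Note that the appeal check can only halt execution while the decision has not yet been carried out; once `EJECT` executes, there is nothing left for an appeal to act on, which is exactly why the window must precede execution.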
Jurisdiction as an architectural parameter
DCF introduces the concept of jurisdiction in a sense that requires clarification, as it departs from the term’s ordinary geographic meaning. Jurisdiction in this model is defined by the scope of a given entity’s administrative and cryptographic control over its infrastructure, digital identities, data, and interfaces. An entity deploys DCF within its own jurisdiction and has no mandate to enforce it beyond that boundary; outside the jurisdictional perimeter, DCF applies only preventive and notification-based measures.
This limitation is deliberate. A system that does not claim the right to act beyond its own mandate is predictable to partners and counterparties in a way that a system with unlimited enforcement reach is not. Coordination between jurisdictions requires separate mechanisms, closer in logic to bilateral agreements than to domestic law, and DCF does not resolve that problem. This is a conscious limitation of scope, not an oversight.
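Reduced to a dispatch rule, the jurisdictional boundary looks like this (identifiers are hypothetical):

```python
def respond_to(target_id: str, controlled_ids: frozenset, action: str) -> str:
    """Enforcement is confined to identities under the deploying entity's
    administrative and cryptographic control; beyond that perimeter, only
    preventive and notification-based measures apply."""
    if target_id in controlled_ids:
        return f"enforce:{action}"
    return f"notify:{target_id}"  # no enforcement mandate outside jurisdiction
```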
How this differs from existing approaches to oversight
Irving’s debate approach assumes that the verifier can assess the quality of argumentation, a reasonable assumption for certain classes of tasks at current model capabilities. DCF instead assumes that the verifier may not understand the content of the behaviour being assessed, and designs a procedure that does not depend on that understanding. The Scientist AI evaluates whether the agent acted within its mandate, whether behavioural signals exceeded defined thresholds, and whether the evidentiary docket is complete and intact - a procedural judgment, not a substantive one.
This property has consequences for the scalability of oversight: a procedural judgment grounded in observable behaviour does not degrade as the capabilities of monitored systems increase in the way that oversight based on understanding content does. DCF does not solve the alignment problem of the base model, but it assumes a certain level of agent alignment and adds a procedural layer that manages consequences when that alignment is imperfect, or when agent composition produces undesirable systemic effects. The distinction between these analytical layers is non-obvious, which is part of why discussion of agentic AI safety rarely reaches the operational level.
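The three procedural checks - mandate, thresholds, docket integrity - can be sketched without any model of what the behaviour means. Field names and the digest scheme here are hypothetical choices of mine, not part of DCF:

```python
import hashlib
import json

def docket_digest(docket: dict) -> str:
    """Digest the monitoring agents would record over the docket body."""
    body = {k: docket[k] for k in ("actions", "mandate", "signals", "thresholds")}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def procedural_judgment(docket: dict) -> dict:
    """Procedural, not substantive: checks mandate membership, threshold
    breaches, and docket integrity; never interprets what the actions mean."""
    return {
        "in_mandate": set(docket["actions"]) <= set(docket["mandate"]),
        "breaches": sorted(k for k, v in docket["signals"].items()
                           if v > docket["thresholds"].get(k, float("inf"))),
        "intact": docket_digest(docket) == docket["digest"],
    }
```

None of these checks require the verifier to be more capable than the monitored agent; they only require that the docket be observable and tamper-evident, which is what makes the judgment procedural.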
Two problems without satisfactory solutions
The first concerns the Scientist AI as a potential point of power concentration through the adjudicative role itself. Separating judgment from enforcement guards against certain pathologies, but an arbiter that consistently issues decisions in one direction may be as problematic as the executor of those decisions - and oversight mechanisms for the arbiter require separate treatment that I do not yet have.
The second concerns the limits of reversibility in environments with very fast action loops - particularly when agents modify the state of the physical world. The reversibility window in such conditions may be short enough that the escalation architecture described here requires substantial adaptation, the general form of which is not yet known to me.
If anyone in this community has worked on analogous problems, or sees a categorical error in the above that I have missed, I would welcome that in the comments.
Full technical specification available on request.