TL;DR: When constitutional classifiers produce contradictory signals (e.g., "block — contains PHI" vs. "allow — authorized research context"), the resolution is typically implicit: priority ordering, hardcoded rules, or model-level arbitration that's difficult to audit. I built an open-source system that resolves these conflicts using Dung's Abstract Argumentation Framework (1995), producing a full directed attack graph as output — making every policy decision formally auditable. The system is live, tested (65 tests passing), and applied to Claude's tool use via an MCP server. This post describes the architecture, the connection to scalable oversight, and why structural legibility may complement training-based legibility.
The Problem: Implicit Conflict Resolution in Layered AI Governance
Modern AI governance systems stack multiple constraint layers: input classifiers, constitutional principles, output filters, domain-specific rules (HIPAA, financial compliance), and contextual signals (user role, domain reputation, temporal factors). These layers regularly produce contradictory outputs.
A concrete example from production:
A healthcare researcher submits content containing patient identifiers (MRNs, diagnosis codes) from an authorized research platform. The system produces four signals: two say deny, two say allow. The system must decide, and the decision must be auditable.
How most systems resolve this: Priority ordering (deny always wins), majority voting, or the model itself arbitrates implicitly. None of these produce a traceable reasoning chain that a compliance auditor can inspect.
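For contrast, here is a toy sketch of those two implicit strategies (illustrative only, not any real system's code). Both collapse the four signals into a verdict while discarding the reasoning an auditor would need:

```python
signals = ["deny", "deny", "allow", "allow"]

def deny_wins(signals):
    # Priority ordering: any deny signal wins outright.
    return "deny" if "deny" in signals else "allow"

def majority_vote(signals):
    # Majority voting; ties fall back to a hardcoded default (here: deny).
    return "allow" if signals.count("allow") > signals.count("deny") else "deny"

# Both return a single verdict with no record of which signal
# defeated which, or why.
print(deny_wins(signals), majority_vote(signals))
```

The output is a bare verdict; the conflict between the signals is resolved silently.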
What I built instead: A system where the four signals become formal arguments in a directed attack graph, and the resolution is computed using Dung's characteristic function — the same mathematical framework used in formal verification and legal reasoning for 30 years.
The Architecture: From OPA to Argumentation to Decision
Claw is an OPA-gated MCP server that enforces behavioral constraints on Claude's tool use. The governance pipeline has six stages. The critical innovation is Stage 4: instead of a hardcoded priority system, the outputs of Stages 1-3 are converted into a formal argumentation framework and resolved mathematically.
How the Bridge Works
The `RegoBridge` module converts every OPA deny reason, modification rule, PII finding, and knowledge entry into a typed `Argument`. Attack relations between arguments are derived automatically.
This produces a directed graph AF = (Args, Attacks) — Dung's abstract argumentation framework.
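The bridge step can be sketched in a few lines of Python. Everything here is illustrative: the `Argument` fields, the signal names, and the symmetric-attack rule (opposing stances on the same target attack each other) are my assumptions for a toy example, not Claw's actual `RegoBridge` types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Argument:
    arg_id: str          # e.g. "phi_deny"
    stance: str          # "deny" or "allow"
    target: str          # what the argument is about, e.g. "content"
    strength: float = 1.0

def derive_attacks(args):
    """Toy derivation rule: opposing stances on the same target attack each other."""
    return {(a.arg_id, b.arg_id)
            for a in args for b in args
            if a.target == b.target and a.stance != b.stance}

# The four signals from the healthcare example (names are mine).
signals = [
    Argument("phi_deny", "deny", "content"),
    Argument("hipaa_deny", "deny", "content"),
    Argument("research_allow", "allow", "content"),
    Argument("trusted_domain_allow", "allow", "content"),
]
af = (frozenset(a.arg_id for a in signals), derive_attacks(signals))
```

With two deny and two allow arguments, this yields eight directed attacks: each deny argument attacks each allow argument and vice versa.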
Computing the Resolution
The engine computes three semantics in order. The grounded extension is computed via iterative fixpoint of Dung's characteristic function: F(S) is the set of arguments defended by S (every attacker of the argument is itself attacked by some member of S), and the grounded extension is the least fixpoint of F, reached by iterating from the empty set.
In practice, Claw's frameworks are small (5-20 arguments), so all three semantics compute in <1ms.
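A minimal sketch of that fixpoint computation, assuming the standard Dung (1995) definitions; function names are mine, not Claw's. For the small frameworks described above (5-20 arguments), this naive quadratic version is more than fast enough.

```python
def characteristic(args, attacks, S):
    """F(S): the arguments defended by S.

    An argument a is defended by S when every attacker of a
    is itself attacked by some member of S.
    """
    def defended(a):
        attackers = {x for (x, y) in attacks if y == a}
        return all(any((s, x) in attacks for s in S) for x in attackers)
    return {a for a in args if defended(a)}

def grounded_extension(args, attacks):
    """Least fixpoint of F, iterated from the empty set.

    F is monotone, so the iteration terminates for finite frameworks.
    """
    S = set()
    while True:
        nxt = characteristic(args, attacks, S)
        if nxt == S:
            return S
        S = nxt

# Example: a attacks b, b attacks c.
# "a" is unattacked; it defeats "b", which reinstates "c".
grounded = grounded_extension({"a", "b", "c"}, {("a", "b"), ("b", "c")})  # {"a", "c"}
```

Note that in a perfectly symmetric conflict (two arguments attacking each other with nothing else in the graph), the grounded extension is empty, which is exactly why additional context, such as a trust record, is needed to break ties.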
The Output is the Reasoning
The `ResolutionResult` contains the final decision together with the full attack graph. This is the key property: the attack graph is the output, not a byproduct. A compliance auditor can trace exactly why "PHI detected — deny" defeated "research context — allow" in one evaluation, and why the opposite resolution occurred in a different context where the Knowledge Hub contained a trust record for the domain.
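As a toy illustration of "the graph is the output", a `ResolutionResult`-style payload might serialize like this. The field names and the `_allow` suffix convention are my guesses for the sketch, not Claw's actual schema:

```python
import json

def resolution_result(accepted, attacks):
    """Hypothetical ResolutionResult-shaped payload: the attack graph
    travels with the decision so an auditor can replay the reasoning."""
    return {
        "decision": "allow" if any(a.endswith("_allow") for a in accepted) else "deny",
        "accepted": sorted(accepted),
        "attack_graph": sorted(attacks),  # every edge, not just the winning path
    }

payload = resolution_result(
    accepted={"phi_deny"},
    attacks={("phi_deny", "research_allow"), ("research_allow", "phi_deny")},
)
print(json.dumps(payload, indent=2))
```

An auditor reading this JSON sees not only that the content was denied, but which argument survived and the full set of attacks it survived.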
Connection to Scalable Oversight and Legibility
Kirchner et al. (2024) demonstrated that optimizing chain-of-thought reasoning for correctness alone can degrade legibility — the property that reasoning is clear and easy for a less capable agent to check. Their Prover-Verifier Game trains legibility into the model through adversarial dynamics between helpful provers and sneaky provers.
The argumentation framework approach arrives at legibility from the opposite direction: legibility is a structural property of the system, not a learned behavior. Every resolution produces the attack graph as a first-class output. There is no mode in which the system produces a correct decision without also producing the checkable reasoning.
These two approaches address different layers of the same problem.
I think these are complementary, not competing. PVG ensures the model's internal reasoning is checkable. Argumentation frameworks ensure the governance layer around the model is checkable. Both are needed for a fully auditable system.
Multi-Constitutional Conflict: When Ethical Frameworks Disagree
The harder version of the conflict resolution problem occurs not between policies but between ethical frameworks. Anthropic's Constitutional AI (Bai et al., 2022) uses a set of principles to guide model behavior. But what happens when principles from different ethical traditions produce contradictory guidance?
I published a paper (leed.guru) extending Constitutional AI using Assumption-Based Argumentation (ABA), where each ethical tradition generates arguments that enter the same framework.
Each tradition functions as a "prover" arguing for its resolution. The argumentation framework functions as the "verifier" computing which arguments survive all attacks. The resolution is computable, auditable, and — critically — traces exactly which tradition's reasoning defeated which other tradition's reasoning and why.
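A deliberately simplified toy of the multi-constitutional case, flattening ABA down to an abstract framework. The argument names, and the encoding of an apad-dharma-style exception as an undefeated attacker of the absolute rule, are my illustrative assumptions, not the paper's actual formalization:

```python
# Three traditions' arguments in one framework (names are hypothetical).
args = {"never_disclose", "disclose_to_save_life", "emergency_exception"}
attacks = {
    ("never_disclose", "disclose_to_save_life"),   # deontological vs consequentialist
    ("disclose_to_save_life", "never_disclose"),
    ("emergency_exception", "never_disclose"),     # exception undercuts the absolute rule
}

# Unattacked arguments are accepted outright.
unattacked = {a for a in args if not any(y == a for (_, y) in attacks)}

# Arguments they defeat are out; arguments whose every attacker
# is defeated are reinstated.
defeated = {y for (x, y) in attacks if x in unattacked}
survivors = {a for a in args
             if a in unattacked
             or all(x in defeated for (x, y) in attacks if y == a)}
```

Here the emergency exception is unattacked, so it defeats the deontological rule, which in turn reinstates the consequentialist argument, and the trace of that chain is exactly the auditability claim: we can point at the edge that did the work.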
The Vedic concept of apad-dharma (emergency ethics) is particularly relevant for alignment: it formalizes the conditions under which normal rules may be temporarily suspended, with built-in constraints on proportionality, temporality, and justification. This is not "anything goes in an emergency" — it's a formal exception-handling mechanism with auditability requirements that are more than 3,000 years old.
AGORA: Empirical Data Collection via Playable Governance Scenarios
Theory needs data. AGORA is an open Decision Arena built on Claw's engine where security professionals navigate governance scenarios under time pressure.
The flagship scenario, "The Petrov Moment: Nuclear False Positive Under Cyber Uncertainty," casts the player as a CISO confronting a possible nuclear false positive amid conflicting cyber evidence.
Across 3 stages and 11 possible decisions, each choice shifts the argumentation framework in real time. The player watches arguments being created, defeated, and re-weighted as new evidence arrives. Decisions export as structured JSON.
The exported playthrough data feeds into a Sequential Decision Analytics model (Powell's SDAM framework from Princeton) for analysis.
Monte Carlo validation across 200 scenarios shows CFA (cost function approximation) outperforming PFA (policy function approximation) in 200/200 trials (average score 52.86 vs. 48.37). The interesting research question is how human decision patterns compare to both policy classes, particularly at the moments where formal models predict one action and humans choose another.
Current State and Limitations
What works:
What doesn't exist yet:
What I'm uncertain about:
I would genuinely value feedback from this community on any of these questions.
Links
Background: I studied under Prof. Phan Minh Dung at the Asian Institute of Technology — he created argumentation framework theory (Dung, 1995; ~15,000 citations). My thesis formalized the translation of natural language security policies into computable constraints. I subsequently led distributed security teams protecting national research infrastructure in Thailand, co-founded a cybersecurity company, and have been building AI governance systems that apply formal argumentation to the constitutional conflict resolution problem.
I'm Leela Aditya Annam (Leed). Feedback, criticism, and pointers to related work I've missed are all welcome.