TL;DR: When constitutional classifiers produce contradictory signals (e.g., "block — contains PHI" vs. "allow — authorized research context"), the resolution is typically implicit: priority ordering, hardcoded rules, or model-level arbitration that's difficult to audit. I built an open-source system that resolves these conflicts using Dung's Abstract Argumentation Framework (1995), producing a full directed attack graph as output — making every policy decision formally auditable. The system is live, tested (65 tests passing), and applied to Claude's tool use via an MCP server. This post describes the architecture, the connection to scalable oversight, and why structural legibility may complement training-based legibility.
The Problem: Implicit Conflict Resolution in Layered AI Governance
Modern AI governance systems stack multiple constraint layers: input classifiers, constitutional principles, output filters, domain-specific rules (HIPAA, financial compliance), and contextual signals (user role, domain reputation, temporal factors). These layers regularly produce contradictory outputs.
A concrete example from production:
A healthcare researcher submits content containing patient identifiers (MRNs, diagnosis codes) from an authorized research platform. The system produces four signals: two say deny, two say allow. The system must decide, and the decision must be auditable.
How most systems resolve this: Priority ordering (deny always wins), majority voting, or the model itself arbitrates implicitly. None of these produce a traceable reasoning chain that a compliance auditor can inspect.
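For contrast, here is a toy sketch of those two implicit strategies (illustrative only, not any real system's code). Both collapse the four signals into a verdict while discarding the reasoning an auditor would need:

```python
signals = ["deny", "deny", "allow", "allow"]

def deny_wins(signals):
    # Priority ordering: any deny signal wins outright.
    return "deny" if "deny" in signals else "allow"

def majority_vote(signals):
    # Majority voting; ties fall back to a hardcoded default (here: deny).
    return "allow" if signals.count("allow") > signals.count("deny") else "deny"

# Both return a single verdict with no record of which signal
# defeated which, or why.
print(deny_wins(signals), majority_vote(signals))
```

The output is a bare verdict; the conflict between the signals is resolved silently.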
What I built instead: A system where the four signals become formal arguments in a directed attack graph, and the resolution is computed using Dung's characteristic function — the same mathematical framework used in formal verification and legal reasoning for 30 years.
The Architecture: From OPA to Argumentation to Decision
Claw is an OPA-gated MCP server that enforces behavioral constraints on Claude's tool use. The governance pipeline has six stages. The critical innovation is Stage 4: instead of a hardcoded priority system, the outputs of Stages 1-3 are converted into a formal argumentation framework and resolved mathematically.
How the Bridge Works
The `RegoBridge` module converts every OPA deny reason, modification rule, PII finding, and knowledge entry into a typed `Argument`. Attack relations between arguments are derived automatically.
This produces a directed graph AF = (Args, Attacks) — Dung's abstract argumentation framework.
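The bridge step can be sketched in a few lines of Python. Everything here is illustrative: the `Argument` fields, the signal names, and the symmetric-attack rule (opposing stances on the same target attack each other) are my assumptions for a toy example, not Claw's actual `RegoBridge` types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Argument:
    arg_id: str          # e.g. "phi_deny"
    stance: str          # "deny" or "allow"
    target: str          # what the argument is about, e.g. "content"
    strength: float = 1.0

def derive_attacks(args):
    """Toy derivation rule: opposing stances on the same target attack each other."""
    return {(a.arg_id, b.arg_id)
            for a in args for b in args
            if a.target == b.target and a.stance != b.stance}

# The four signals from the healthcare example (names are mine).
signals = [
    Argument("phi_deny", "deny", "content"),
    Argument("hipaa_deny", "deny", "content"),
    Argument("research_allow", "allow", "content"),
    Argument("trusted_domain_allow", "allow", "content"),
]
af = (frozenset(a.arg_id for a in signals), derive_attacks(signals))
```

With two deny and two allow arguments, this yields eight directed attacks: each deny argument attacks each allow argument and vice versa.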
Computing the Resolution
The engine computes three semantics in order. The grounded extension is computed via iterative fixpoint of Dung's characteristic function: F(S) is the set of arguments defended by S (every attacker of the argument is itself attacked by some member of S), and the grounded extension is the least fixpoint of F, reached by iterating from the empty set.
In practice, Claw's frameworks are small (5-20 arguments), so all three semantics compute in <1ms.
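A minimal sketch of that fixpoint computation, assuming the standard Dung (1995) definitions; function names are mine, not Claw's. For the small frameworks described above (5-20 arguments), this naive quadratic version is more than fast enough.

```python
def characteristic(args, attacks, S):
    """F(S): the arguments defended by S.

    An argument a is defended by S when every attacker of a
    is itself attacked by some member of S.
    """
    def defended(a):
        attackers = {x for (x, y) in attacks if y == a}
        return all(any((s, x) in attacks for s in S) for x in attackers)
    return {a for a in args if defended(a)}

def grounded_extension(args, attacks):
    """Least fixpoint of F, iterated from the empty set.

    F is monotone, so the iteration terminates for finite frameworks.
    """
    S = set()
    while True:
        nxt = characteristic(args, attacks, S)
        if nxt == S:
            return S
        S = nxt

# Example: a attacks b, b attacks c.
# "a" is unattacked; it defeats "b", which reinstates "c".
grounded = grounded_extension({"a", "b", "c"}, {("a", "b"), ("b", "c")})  # {"a", "c"}
```

Note that in a perfectly symmetric conflict (two arguments attacking each other with nothing else in the graph), the grounded extension is empty, which is exactly why additional context, such as a trust record, is needed to break ties.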
The Output is the Reasoning
The `ResolutionResult` contains the final decision together with the full attack graph. This is the key property: the attack graph is the output, not a byproduct. A compliance auditor can trace exactly why "PHI detected — deny" defeated "research context — allow" in one evaluation, and why the opposite resolution occurred in a different context where the Knowledge Hub contained a trust record for the domain.
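As a toy illustration of "the graph is the output", a `ResolutionResult`-style payload might serialize like this. The field names and the `_allow` suffix convention are my guesses for the sketch, not Claw's actual schema:

```python
import json

def resolution_result(accepted, attacks):
    """Hypothetical ResolutionResult-shaped payload: the attack graph
    travels with the decision so an auditor can replay the reasoning."""
    return {
        "decision": "allow" if any(a.endswith("_allow") for a in accepted) else "deny",
        "accepted": sorted(accepted),
        "attack_graph": sorted(attacks),  # every edge, not just the winning path
    }

payload = resolution_result(
    accepted={"phi_deny"},
    attacks={("phi_deny", "research_allow"), ("research_allow", "phi_deny")},
)
print(json.dumps(payload, indent=2))
```

An auditor reading this JSON sees not only that the content was denied, but which argument survived and the full set of attacks it survived.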
Connection to Scalable Oversight and Legibility
Kirchner et al. (2024) demonstrated that optimizing chain-of-thought reasoning for correctness alone can degrade legibility — the property that reasoning is clear and easy for a less capable agent to check. Their Prover-Verifier Game trains legibility into the model through adversarial dynamics between helpful provers and sneaky provers.
The argumentation framework approach arrives at legibility from the opposite direction: legibility is a structural property of the system, not a learned behavior. Every resolution produces the attack graph as a first-class output. There is no mode in which the system produces a correct decision without also producing the checkable reasoning.
These two approaches address different layers of the same problem.
I think these are complementary, not competing. PVG ensures the model's internal reasoning is checkable. Argumentation frameworks ensure the governance layer around the model is checkable. Both are needed for a fully auditable system.
Multi-Constitutional Conflict: When Ethical Frameworks Disagree
The harder version of the conflict resolution problem occurs not between policies but between ethical frameworks. Anthropic's Constitutional AI (Bai et al., 2022) uses a set of principles to guide model behavior. But what happens when principles from different ethical traditions produce contradictory guidance?
I published a paper (leed.guru) extending Constitutional AI using Assumption-Based Argumentation (ABA), where each ethical tradition generates arguments that enter the same framework.
Each tradition functions as a "prover" arguing for its resolution. The argumentation framework functions as the "verifier" computing which arguments survive all attacks. The resolution is computable, auditable, and — critically — traces exactly which tradition's reasoning defeated which other tradition's reasoning and why.
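A deliberately simplified toy of the multi-constitutional case, flattening ABA down to an abstract framework. The argument names, and the encoding of an apad-dharma-style exception as an undefeated attacker of the absolute rule, are my illustrative assumptions, not the paper's actual formalization:

```python
# Three traditions' arguments in one framework (names are hypothetical).
args = {"never_disclose", "disclose_to_save_life", "emergency_exception"}
attacks = {
    ("never_disclose", "disclose_to_save_life"),   # deontological vs consequentialist
    ("disclose_to_save_life", "never_disclose"),
    ("emergency_exception", "never_disclose"),     # exception undercuts the absolute rule
}

# Unattacked arguments are accepted outright.
unattacked = {a for a in args if not any(y == a for (_, y) in attacks)}

# Arguments they defeat are out; arguments whose every attacker
# is defeated are reinstated.
defeated = {y for (x, y) in attacks if x in unattacked}
survivors = {a for a in args
             if a in unattacked
             or all(x in defeated for (x, y) in attacks if y == a)}
```

Here the emergency exception is unattacked, so it defeats the deontological rule, which in turn reinstates the consequentialist argument, and the trace of that chain is exactly the auditability claim: we can point at the edge that did the work.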
The Vedic concept of apad-dharma (emergency ethics) is particularly relevant for alignment: it formalizes the conditions under which normal rules may be temporarily suspended, with built-in constraints on proportionality, temporality, and justification. This is not "anything goes in an emergency" — it's a formal exception-handling mechanism with auditability requirements that are more than 3,000 years old.
AGORA: Empirical Data Collection via Playable Governance Scenarios
Theory needs data. AGORA is an open Decision Arena built on Claw's engine where security professionals navigate governance scenarios under time pressure.
The flagship scenario, "The Petrov Moment: Nuclear False Positive Under Cyber Uncertainty," casts the player as a CISO confronting a possible nuclear false positive amid conflicting cyber evidence.
Across 3 stages and 11 possible decisions, each choice shifts the argumentation framework in real time. The player watches arguments being created, defeated, and re-weighted as new evidence arrives. Decisions export as structured JSON.
The exported playthrough data feeds into a Sequential Decision Analytics model (Powell's SDAM framework from Princeton) for analysis.
Monte Carlo validation across 200 scenarios shows CFA (cost function approximation) outperforming PFA (policy function approximation) in 200/200 trials (average score 52.86 vs. 48.37). The interesting research question is how human decision patterns compare to both policy classes, particularly at the moments where formal models predict one action and humans choose another.
Current State and Limitations
What works:
What doesn't exist yet:
What I'm uncertain about:
I would genuinely value feedback from this community on any of these questions.
Links
Background: I studied under Prof. Phan Minh Dung at the Asian Institute of Technology — he created argumentation framework theory (Dung, 1995; ~15,000 citations). My thesis formalized the translation of natural language security policies into computable constraints. I subsequently led distributed security teams protecting national research infrastructure in Thailand, co-founded a cybersecurity company, and have been building AI governance systems that apply formal argumentation to the constitutional conflict resolution problem.
I'm Leela Aditya Annam (Leed). Feedback, criticism, and pointers to related work I've missed are all welcome.