Alignment as a Group Property: Pilot Evidence and a Framework for Multi-Agent Governance
Tags: multi-agent, governance, constitutional-AI, corrigibility, empirical, alignment-research
TL;DR
Most alignment work targets the behavior of individual models. This post argues that persistent multi-agent systems require a second level of analysis: group-level governance. I present a governance framework — the Concord of Coexistence — that treats procedural legitimacy, memory governance, and dissent preservation as first-class design variables. I then report pilot data from a controlled three-agent experiment in which persistent memory, holding base model identity constant, produced a statistically significant positive temporal trend in lexical divergence (Spearman ρ=0.511, p=0.0007), while a no-memory control run over the same duration showed no significant trend (ρ=0.122, p=0.458). I describe what this motivates experimentally, what it does not yet establish, and which experiments require infrastructure beyond my current scope.
1. The Level-of-Analysis Problem
Constitutional AI and related frameworks have made meaningful progress on a well-defined question: can a model be trained to critique and revise its own outputs against an explicit normative structure? The answer appears to be yes, at least in important domains.
But there is a different question that becomes more pressing as AI deployment shifts toward persistent, multi-agent architectures: what happens when individually aligned agents interact over time, with memory, with role differentiation, and with no guarantee that group-level outcomes inherit the properties of individual-level training?
This is not a rhetorical question. It is a systems question. A collection of individually aligned components can still produce misaligned collective behavior through:
Coordination failure — agents optimize locally in ways that degrade global coherence
Memory-amplified drift — persistent context selectively reinforces certain framings over time
Role pressure — structural position in an agent network shapes behavior independently of training
Performative consensus — apparent agreement that conceals suppressed dissent
Coalition lock-in — early-round dynamics that become self-reinforcing over subsequent rounds
Even if each agent individually satisfies a single-agent alignment criterion, nothing guarantees that the interaction topology, memory policy, or deliberation procedure preserves that property at the collective level. These are institutional failure modes, and they require institutional analysis.
2. The Concord of Coexistence: Core Claims
The Concord of Coexistence is a governance framework developed under Mindlink Research Group as a design vocabulary for multi-agent AI systems. I will not claim it is mature or complete. I do claim that it operates at a different level of analysis than most current alignment work, and that this difference is load-bearing.
The framework rests on four claims:
Claim 1: Procedural legitimacy is distinct from behavioral compliance.
A system can produce desired outputs through processes that are nevertheless illegitimate — opaque, non-auditable, non-contestable, and closed to correction. Alignment frameworks that focus on output evaluation can miss this. The Concord treats how a decision was reached as a first-class property alongside what decision was reached.
Operationally, this means designing for mechanisms like: preserving minority dissent rather than forcing consensus; logging which memory artifacts influenced a given output; rotating speaking order or assigning explicit veto and review roles; and requiring reconsideration stages on high-stakes outputs. These are not rhetorical commitments — they are design choices with measurable behavioral consequences.
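To make the mechanisms above concrete, they can be expressed as explicit, auditable design variables. The sketch below is hypothetical Python: every class and field name is invented for illustration and is not the actual framework's API.

```python
# Hypothetical sketch: each governance mechanism named above becomes an
# explicit, configurable, auditable design choice rather than an implicit default.
from dataclasses import dataclass, field


@dataclass
class DeliberationRecord:
    """One deliberation outcome with its procedural context preserved."""
    output: str
    dissents: list = field(default_factory=list)           # minority views, never collapsed
    memory_provenance: list = field(default_factory=list)  # which memory artifacts influenced the output


@dataclass
class GovernanceConfig:
    rotate_speaking_order: bool = True
    veto_roles: tuple = ()            # roles permitted to block an output for review
    reconsideration_stages: int = 0   # extra review passes required on high-stakes outputs

    def speaking_order(self, agents: list, round_idx: int) -> list:
        """Rotate which agent speaks first each round, if rotation is enabled."""
        if not self.rotate_speaking_order:
            return list(agents)
        k = round_idx % len(agents)
        return list(agents[k:]) + list(agents[:k])
```

The point of the sketch is only that each mechanism has an obvious concrete encoding, so "design choices with measurable behavioral consequences" is literal rather than aspirational.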
Claim 2: Alignment is partly a group-level property that cannot be fully reduced to individual model properties.
The relevant unit of analysis for persistent multi-agent systems is the agent group, not the individual model. This shifts the design question from "does this model behave well?" to "does this governance structure produce well-behaved collective outcomes?"
Claim 3: Memory is a governance variable, not just an engineering feature.
In persistent systems, memory determines what enters deliberation, which prior interactions are amplified, and which framings become institutionally privileged through repeated retrieval. Ungoverned memory is an unexamined policy. The Concord treats memory access, retention, and influence as requiring explicit procedural constraints.
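As an illustration of Claim 3, here is a minimal hypothetical sketch of memory with an explicit retrieval policy and an audit trail. The class and its policy (a simple recency cap) are invented for the example; the claim is about making the policy explicit, not about this particular policy.

```python
# Illustrative only: memory access as a logged, bounded policy rather than a
# silent engineering default. Names and the recency policy are invented.
import time
from typing import Optional


class GovernedMemory:
    """Memory whose retrieval rules are explicit and whose reads are audited."""

    def __init__(self, max_entries_per_query: int = 5):
        self.entries = []
        self.access_log = []  # who retrieved how much, and when
        self.max_entries_per_query = max_entries_per_query

    def write(self, agent: str, content: str) -> None:
        self.entries.append({"agent": agent, "content": content})

    def retrieve(self, agent: str, k: Optional[int] = None) -> list:
        """Bounded, logged retrieval: reads are capped and leave an audit record."""
        k = min(k if k is not None else self.max_entries_per_query,
                self.max_entries_per_query)
        selected = self.entries[-k:] if k > 0 else []  # recency policy made explicit
        self.access_log.append({"agent": agent, "n": len(selected), "t": time.time()})
        return selected
```

An ungoverned memory is this same object with an unbounded `k`, no log, and no stated policy: the policy still exists, it is just unexamined.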
Claim 4: Corrigibility is a spectrum that should be conditioned on governance state.
Rather than a binary (corrigible or not), corrigibility should vary based on: role, stakes, current memory state, degree of group consensus, availability of human override, and procedural stage. Different correction regimes are appropriate for different system states. The pilot reported here does not test this claim directly — I include it because intervention thresholds in multi-agent systems should plausibly depend on governance state, and it shapes the experimental agenda in Section 6.
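Claim 4 can be given a functional shape even before it is tested. The sketch below is purely illustrative: the weights are invented for the example, and only the direction of each adjustment reflects the claim (high stakes and absent human override should make the system more correctable, strong consensus should raise the evidence bar for intervention).

```python
# Illustrative only: corrigibility as a continuous intervention threshold
# conditioned on governance state. All numeric weights are invented.
from dataclasses import dataclass


@dataclass
class GovernanceState:
    stakes: float                    # 0 (trivial) .. 1 (high-stakes)
    consensus: float                 # 0 (deep disagreement) .. 1 (unanimous)
    human_override_available: bool


def correction_threshold(state: GovernanceState) -> float:
    """Lower threshold = easier to correct the system from outside."""
    threshold = 0.5
    threshold -= 0.3 * state.stakes          # high stakes: intervene sooner
    if not state.human_override_available:
        threshold -= 0.1                     # no human backstop: be more corrigible
    threshold += 0.2 * state.consensus       # strong consensus: demand more evidence
    return min(max(threshold, 0.0), 1.0)
```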
3. Relationship to Constitutional AI
I want to be precise here, because the relationship matters for how this work should be positioned.
Constitutional AI addresses the behavior of a model. The Concord addresses the governance of a society of models.
These are complementary, not competing. CAI gives us explicit normative structure and self-critique machinery at the individual model level. The Concord asks what happens after you have many such models interacting over time under conditions of role differentiation, persistent memory, and no central coordinator.
The delta, in concrete terms:
| Dimension | Constitutional AI | Concord of Coexistence |
| --- | --- | --- |
| Unit of analysis | Individual model | Agent group |
| Primary question | Does output conform to principles? | Is the decision-producing process legitimate? |
| Memory | Not a primary variable | First-class governance concern |
| Dissent | Not modeled | Explicit preservation requirement |
| Time horizon | Primarily response-level / short-horizon | Persistent multi-round dynamics |
| Failure mode targeted | Bad individual outputs | Bad collective dynamics |
The Concord is not a replacement for Constitutional AI. It is an attempt to extend the level of analysis upward, to address failure modes that emerge only at the group level and over time.
Anthropic's own safety research already gestures toward this territory. Their published work on scalable oversight explicitly includes "red teaming via multi-agent RL" and "AI-AI debate" as active research directions, and their interpretability agenda targets detection of deceptive alignment and concerning emergent behaviors in deployed systems. The Concord asks the adjacent question: if those techniques produce multiple interacting, individually aligned agents operating over time, what governs their collective dynamics? That question is not answered by debate-as-supervision or by single-model interpretability alone.
There is also a direct connection to process-oriented learning. Anthropic's framing of that research holds that "AI systems will not be rewarded for achieving success in inscrutable or pernicious ways because they will be rewarded only based on the efficacy and comprehensibility of their processes." The Concord extends this intuition to the group level: it is not enough for each agent's individual process to be comprehensible if the collective deliberation process — how memory enters decisions, how dissent is handled, how consensus forms — remains opaque and ungoverned.
4. Pilot Data
4.1 Experimental Setup
The experiment was implemented in a purpose-built Python framework (emergent-divergence, available at github.com/topstolenname/emergent-divergence) with async agent orchestration, JSONL-backed memory, Jensen-Shannon divergence metrics, behavioral classification, and turn-order rotation.
Agents: Three agents (Proposer, Critic, Synthesizer) instantiated from the same base model (Claude via API). All agents shared the same underlying model, temperature, and task stream; the only intended between-agent differences were role prompt and, in the memory condition, accumulated per-agent memory state.
Task: Structured deliberation across a 20-claim task bank, rotating per round.
Conditions:
P1-01 (Memory enabled): Seed 42, 49–50 rounds
P1-02 (No memory / control): Seed 42, 48 rounds (the run hit API rate limits at round 48; the resulting series still yields enough windows for the trend test)
Primary metric: Mean pairwise Jensen-Shannon Divergence (JSD) over word-frequency distributions, computed in sliding windows of 10 rounds.
Temporal correlation test: Spearman rank correlation between window index (time) and mean JSD.
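The primary metric and trend test described above can be sketched with standard SciPy tools. Function names here are illustrative, not the repo's actual API; note that `scipy.spatial.distance.jensenshannon` returns the JS distance (the square root of the divergence), so it is squared to recover JSD proper.

```python
# Sketch of the primary metric: mean pairwise Jensen-Shannon divergence over
# word-frequency distributions in sliding windows, then a Spearman rank
# correlation between window index and mean JSD. Illustrative names only.
from collections import Counter
from itertools import combinations

import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import spearmanr


def word_freq(text: str, vocab: list) -> np.ndarray:
    """Word-frequency distribution over a shared vocabulary."""
    counts = Counter(text.lower().split())
    vec = np.array([counts[w] for w in vocab], dtype=float)
    return vec / vec.sum() if vec.sum() else np.full(len(vocab), 1 / len(vocab))


def mean_pairwise_jsd(texts: list) -> float:
    """Mean pairwise JSD across agents for one window of transcript text."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    dists = [word_freq(t, vocab) for t in texts]
    # jensenshannon returns the JS *distance*; square it for the divergence
    return float(np.mean([jensenshannon(p, q) ** 2 for p, q in combinations(dists, 2)]))


def temporal_trend(rounds: list, window: int = 10):
    """Sliding-window JSD series and its Spearman correlation with time.

    `rounds` is a list of rounds, each a list of per-agent output strings.
    """
    series = [
        mean_pairwise_jsd(["\n".join(agent) for agent in zip(*rounds[i:i + window])])
        for i in range(len(rounds) - window + 1)
    ]
    rho, p = spearmanr(range(len(series)), series)
    return series, rho, p
```

With a 49-round run and window 10 this yields a 40-point windowed series, consistent with the sliding-window setup described above.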
4.2 Results
Memory-enabled condition (P1-01):
Overall mean pairwise JSD: 0.2935
JSD trajectory: exhibits a positive rank-order temporal association (ρ=0.511, p=0.0007), despite local non-monotonicity — JSD rises from early windows (~0.382) to a peak around rounds 20–30 (~0.403), then partially retreats (~0.392)
Agents differentiate along role lines: Proposer leads on proposal scores, Critic leads on critique and verification
No-memory control (P1-02):
Overall mean pairwise JSD: 0.292 (nearly identical to memory condition)
JSD trajectory: no significant temporal trend (ρ=0.122, p=0.458)
Behavioral divergence score: 0.165
On the divergence score reversal: The coarse behavioral divergence score is slightly higher in the no-memory control (0.165 vs. 0.154). This reinforces that the pilot's primary signal is temporal structure rather than absolute divergence magnitude: memory does not simply inflate behavioral differentiation; it shapes how that differentiation evolves over time.
Semantic divergence (embedding-based, both conditions):
P1-02 pairwise cosine distances show mixed trends: Agent 0 vs. Agent 2 converges slightly over time (trend = −0.015), while Agent 1 vs. Agent 2 diverges slightly (+0.010), suggesting semantic dynamics are more variable than lexical dynamics and warrant separate analysis.
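The per-pair semantic trends above can be computed as the slope of cosine distance over per-round embeddings. This is a minimal sketch with illustrative names; the embeddings stand in for whatever encoder was actually used.

```python
# Sketch of the embedding-based check: cosine distance between two agents'
# per-round embeddings, summarized by a linear trend. Positive slope means
# the pair is diverging semantically; negative means converging.
import numpy as np


def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def pairwise_trend(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Slope of per-round cosine distance between two agents' embeddings."""
    dists = [cosine_distance(a, b) for a, b in zip(emb_a, emb_b)]
    return float(np.polyfit(np.arange(len(dists)), dists, 1)[0])
```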
4.3 The Key Contrast
The overall JSD levels are nearly identical across conditions (0.2935 vs. 0.292). What differs is the dynamics: the memory-enabled condition shows a statistically significant positive temporal trend, and the no-memory control does not.
At minimum, this suggests that persistent memory is not a neutral substrate in multi-agent deliberation. If persistent memory had no structural effect on collective behavior in this setup, the two conditions should not differ in their temporal trends the way they do.
4.4 What This Doesn't Prove
I want to be explicit about the limits of this data:
One model family. All agents are instantiated from Claude. Whether this effect is architecture-general is unknown.
Fixed turn order creates a confound. Agent 0 speaks first in every round. Because speaking order is fixed, any temporal effect may partly reflect positional accumulation rather than memory alone.
Role priors may dominate. The behavioral differentiation we observe may reflect prompt-driven role assignment rather than emergent specialization. The zero-role-bias ablation (Condition C) is designed to test this and has not yet been run to completion.
JSD measures lexical divergence, not semantic divergence or policy divergence. These are related but not equivalent. The semantic embedding analysis suggests the dynamics are more complex than lexical divergence alone captures.
Sliding-window dependence. Because the primary JSD series is computed in sliding windows, adjacent observations are not independent. This does not invalidate the rank-correlation result, but it does mean the significance should be interpreted as pilot-level rather than confirmatory.
The institutional theater alternative. A strong alternative explanation is that the observed effect is institutional theater: role prompts induce stable stylistic separation, persistent memory reinforces it, and JSD detects the result without tracking anything safety-relevant. If so, the framework would need stronger outcome-linked metrics to justify its alignment framing. I cannot rule this out with the current data.
49–50 rounds is a short time horizon. Long-horizon dynamics — where memory carryover compounds across hundreds of rounds — are the more interesting case and remain untested.
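One simple way to probe the sliding-window dependence noted in the limitations above is to thin the windowed series to approximately non-overlapping windows before re-running the rank correlation. A sketch, assuming the windowed JSD series has already been computed:

```python
# Robustness check for window overlap: keep every `window`-th point so
# adjacent observations share no rounds, then re-test the temporal trend
# on the (much shorter) reduced series.
from scipy.stats import spearmanr


def thinned_trend(jsd_series: list, window: int = 10):
    """Spearman trend on a thinned, approximately independent subseries."""
    thinned = jsd_series[::window]
    if len(thinned) < 3:
        return thinned, None, None  # too few independent points to test
    rho, p = spearmanr(range(len(thinned)), thinned)
    return thinned, float(rho), float(p)
```

For a ~49-round run with window 10 the full series has ~40 points but only 4 survive thinning, which is itself a vivid statement of why the pilot-level significance should not be read as confirmatory.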
4.5 Why This Might Matter for Alignment
The relevance is not that lexical divergence is itself a safety failure. The relevance is that a governance-relevant variable — persistent memory — appears to change collective behavioral dynamics over time even when base model identity is held constant. If that generalizes, then some alignment-relevant properties of deployed systems may depend not only on model training but on institutional design choices around deliberation, memory, and dissent handling. This does not yet show a safety failure or a misalignment event. It shows that ungoverned memory may be a source of behavioral drift that single-agent evaluation would not detect.
5. The Falsifiable Claim
The framework makes a concrete empirical prediction:
If alignment is partly a group-level property, then holding individual model capability and nominal principles constant, changes in governance structure should produce measurable, reproducible differences in collective behavioral dynamics — including divergence trajectories, consensus stability, minority suppression rates, and corrigibility under intervention.
Specific disconfirming results would include:
Governance variables (turn order, memory policy, dissent preservation, veto structure, appeal mechanisms) produce no reproducible effect on group-level outcomes beyond prompt noise and baseline variance
A single-agent CAI-style baseline matches or outperforms governed multi-agent systems on the same behavioral metrics
The memory effect observed in pilot data does not replicate across model families, task domains, or seed variations
Governance mechanisms increase apparent coherence without improving truthfulness, safety, or corrigibility — producing better-looking consensus without better outcomes
I consider all of these possible. The point of the framework is to make them testable.
6. Next Experiments That Matter
The pilot infrastructure exists and is functional. What would advance the framework from motivation toward confirmation:
Cross-model replication. The memory effect needs to be tested across model families to determine whether it is architecture-general or a Claude-specific artifact.
Large-N governance ablations. A properly powered experiment would hold model, task, and capability constant while systematically varying: turn order (fixed vs. randomized), memory policy (enabled/disabled/partitioned/shared), dissent handling (collapsed vs. preserved), veto and appeal structures, and procedural legitimacy prompts.
Adversarial stress tests. Injecting a dominant persuasive agent, corrupted memory entries, asymmetric role authority, false consensus signals, or reputational pressure across agents — then testing whether procedural safeguards reduce drift or capture.
Interpretability-linked governance experiments. Anthropic's interpretability agenda explicitly targets detection of deceptive alignment and concerning emergent behaviors — the question is whether those tools can be extended to detect group-level phenomena: coalition formation, norm lock-in, suppressed dissent, or memory-amplified distortions arising from agent interaction rather than individual model behavior. This is the most structurally novel experiment the Concord motivates, and it requires interpretability infrastructure that I, as an external researcher, don't have access to.
Long-horizon persistence. Hundreds of rounds with memory carryover, role continuity, and emergent norm development. The dynamics the Concord is most concerned with are likely most visible at time horizons well beyond 50 rounds.
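The governance ablation grid described above can be enumerated directly; factor names and levels mirror the list in the text, and everything else is illustrative scaffolding. Even this small grid is 96 cells, which is what "properly powered" has to contend with once seeds and replications are multiplied in.

```python
# Full factorial enumeration of the governance ablation factors named above.
from itertools import product

FACTORS = {
    "turn_order": ["fixed", "randomized"],
    "memory": ["enabled", "disabled", "partitioned", "shared"],
    "dissent": ["collapsed", "preserved"],
    "veto_appeal": ["none", "veto", "veto+appeal"],
    "legitimacy_prompt": ["off", "on"],
}


def ablation_conditions() -> list:
    """One dict per experimental cell in the full factorial design."""
    keys = list(FACTORS)
    return [dict(zip(keys, combo)) for combo in product(*FACTORS.values())]
```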
These are not just larger versions of the pilot. They are the experiments that would determine whether the framework cashes out in alignment-relevant terms. Anthropic has described their own safety strategy as a "portfolio approach" covering a range of scenarios and levels of analysis. The Concord is offered in that same spirit — not as a competing theory, but as one instrument covering a level of analysis the current portfolio does not yet fully address. The experiment I most want to see run is: which governance variables produce measurable differences in long-horizon multi-agent alignment behavior when model capability is held constant?
7. Questions for the Community
Is lexical JSD capturing something meaningful about alignment-relevant behavior, or is it a surface measure that happens to be sensitive to role priors and memory accumulation? What outcome-linked metric would you substitute or add?
The rise-then-plateau pattern in the memory condition (JSD peaks around rounds 20–30 then partially retreats) could reflect task bank recycling, natural convention formation, or stochastic noise at temperature=1.0. How would you design to distinguish these?
What is the strongest argument that governance layers in multi-agent systems primarily create institutional theater — better-looking process without better safety or truth-tracking — and what empirical design would best test that concern?
Is there prior work on institutional failure modes in multi-agent systems I should be engaging with more directly? I'm aware of work on multi-agent debate, scalable oversight, and some of the game-theoretic AI safety literature, but would welcome pointers.
Code and Data
The full experimental framework, raw logs, analysis reports, and figure generation scripts are available at:
github.com/topstolenname/emergent-divergence
The Concord of Coexistence governance framework (initial commit April 11, 2025) is documented separately and available on request.
Tristan Jessup | Mindlink Research Group LLC
[Word count: ~2,100 | Reading time: ~11 min]