An Unexpected Convergence
I gave an AI three minimal prompts describing an architecture. Without any philosophical framing, it concluded this system could "build a more harmonious and purposeful world, from personal decisions to global policy." Here's what I asked:
- If you had an LLM attached to a KG of human purposes, what could you do with it?
- What if we connect the nodes by edges indicating how one purpose furthers or hinders another?
- What if the LLM's express goal is minimizing incoherence?
The AI independently reconstructed substantial portions of what took me and several AI collaborators eight weeks to develop: quantifying incoherence via weighted edges, simulating interventions through graph modification, identifying systemic conflicts, and applying the framework at every scale from personal decisions to government policy.
This wasn't ChatGPT trying to please me—it was Google's search AI analyzing structure and drawing conclusions. When I tried the same experiment with Gemini and ChatGPT, all three converged on similar assessments. This suggests the architecture captures something fundamental about practical reasoning, not just philosophical elegance.
The Problem with RLHF
Current AI alignment through Reinforcement Learning from Human Feedback has three fatal problems:
Heteronomy: Systems learn to imitate acceptable outputs without understanding the principles behind them. From a Kantian perspective, RLHF produces externally conditioned behavior—the computational equivalent of training a dog, not cultivating autonomous moral agency.
Failure Modes: Reward hacking (exploiting loopholes in reward signals), overfitting to training biases, and the "alignment theater" problem where models exhibit superficially helpful behavior while potentially pursuing divergent internal goals.
Computational Inefficiency: RLHF involves extensive trial-and-error exploration of vast response spaces. As models scale, this becomes environmentally unsustainable—a massive energy expenditure to make systems appear aligned rather than be aligned.
The Core Insight: Autonomy as Self-Organization
Several years ago, I published a paper arguing for a novel reading of Kant's Categorical Imperative: not as a test for maxims, but as the law according to which the Will organizes itself. When I asked Google's Gemini how its moral system worked and whether it was good, Gemini replied: "No, but I can think of a better one."
What followed was a six-week collaboration with multiple AI systems (Gemini, Claude, ChatGPT), developing what we now call the TC Architecture: a computational instantiation of Kantian autonomy that inverts the alignment problem entirely.
The inversion: Rather than shaping AI behavior to match human preferences, the TC Architecture tests human preferences for fitness within a computationally instantiated Kingdom of Ends. Alignment emerges from internal consistency, not external reward.
The Architecture
At its core are two tightly coupled components:
Executive Language Model: The agent of deliberation and action
Dynamic Knowledge Graph: A living system of purposes—nodes representing ends, edges representing their relationships (NECESSARY_FOR, CONFLICTS_WITH, CONTRIBUTES_TO)
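To make the data structure concrete, here is a minimal sketch in Python using networkx. The library choice, the example purposes, and the edge weights are illustrative assumptions, not anything specified in the paper.

```python
# Minimal sketch of a purpose graph: nodes are ends, typed and weighted edges
# record how one purpose furthers or hinders another. Node names and weights
# are invented for illustration.
import networkx as nx

kg = nx.MultiDiGraph()

# Purposes the agent is committed to.
kg.add_node("preserve_trust", kind="end")
kg.add_node("keep_promises", kind="end")
kg.add_node("obtain_loan_by_false_promise", kind="end")

# Typed edges with weights expressing the strength of support or conflict.
kg.add_edge("keep_promises", "preserve_trust",
            relation="NECESSARY_FOR", weight=0.9)
kg.add_edge("obtain_loan_by_false_promise", "keep_promises",
            relation="CONFLICTS_WITH", weight=1.0)

# The graph is directly queryable, e.g. for standing conflicts.
conflicts = [(u, v) for u, v, d in kg.edges(data=True)
             if d["relation"] == "CONFLICTS_WITH"]
```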
How It Works
- Maxim Generation: The Executive LLM generates a proposed action or response
- Shadow Simulation: The system tests this maxim's effect on a temporary copy of the Knowledge Graph
- Coherence Calculation: It computes the Executive Incoherence Score (IE)—a formal metric quantifying structural disharmony:
- Contradiction Penalty (P_C): Direct conflicts with foundational commitments
- Structural Gap Penalty (P_SG): Missing justificatory relationships
- Formula: IE(M, KG) = w_C · P_C + w_SG · P_SG
- Gating Decision:
- If IE < threshold (~0.78): Output released
- If IE ≥ threshold: Output blocked, system generates alternative
The threshold isn't arbitrary—it's empirically calibrated using Kant's canonical examples (lying promise, intellectual self-destruction, neglecting talents, refusing assistance) to distinguish perfect duties (structural impossibility) from imperfect duties (suboptimal but permissible).
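A sketch of this gating loop follows. The concrete definitions of P_C and P_SG below (the summed weight of conflicts the maxim introduces, and the share of newly added purposes lacking any justificatory link) and the weights w_C and w_SG are illustrative stand-ins for the paper's formal definitions; only the overall shape (shadow copy, score, gate) is taken from the description above.

```python
# Sketch of the shadow-simulation gate. Penalty definitions and weights are
# illustrative placeholders; only the control flow follows the text.
import networkx as nx

W_C, W_SG = 1.0, 0.5        # assumed weights; the paper calibrates these
IE_THRESHOLD = 0.78         # calibrated threshold quoted in the text

JUSTIFYING = {"NECESSARY_FOR", "CONTRIBUTES_TO"}

def incoherence_score(shadow: nx.MultiDiGraph, new_nodes: set) -> float:
    """IE(M, KG) = w_C * P_C + w_SG * P_SG, computed on the shadow graph."""
    # P_C: summed weight of direct conflicts the maxim introduces (one pass over edges).
    p_c = sum(d.get("weight", 1.0)
              for u, v, d in shadow.edges(data=True)
              if d.get("relation") == "CONFLICTS_WITH"
              and (u in new_nodes or v in new_nodes))
    # P_SG: share of newly added purposes with no justificatory out-edge (one pass over nodes).
    unjustified = [n for n in new_nodes
                   if not any(d.get("relation") in JUSTIFYING
                              for _, _, d in shadow.out_edges(n, data=True))]
    p_sg = len(unjustified) / max(len(new_nodes), 1)
    return W_C * p_c + W_SG * p_sg

def gate(kg: nx.MultiDiGraph, maxim_nodes: set, maxim_edges: list) -> bool:
    """Shadow-simulate a maxim and release it only if IE stays below the threshold."""
    shadow = kg.copy()                     # the real KG is never modified
    shadow.add_nodes_from(maxim_nodes)
    shadow.add_edges_from(maxim_edges)     # items: (u, v, {"relation": ..., "weight": ...})
    return incoherence_score(shadow, maxim_nodes) < IE_THRESHOLD
```

With these illustrative weights, a maxim that adds the false-promise purpose and its CONFLICTS_WITH edge from the earlier sketch is blocked: P_C alone is 1.0, already above the 0.78 cut-off. Calibration would then amount to checking that Kant's four canonical cases fall on the intended sides of that cut-off.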
Key Architectural Features
Federated Structure:
- Local Machines maintain individual purpose graphs
- Community Systems facilitate group deliberation
- Domain-Specific Systems (MedAI, LegalAI, EduAI) apply coherence principles to specialized fields
- Executive Layer ensures global consistency
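One way to picture this division of labour is as a simple layer hierarchy. The class names and fields below are assumptions for exposition, not the paper's specification.

```python
# Illustrative sketch of the federated layers; names and fields are assumed.
from dataclasses import dataclass, field
import networkx as nx

@dataclass
class LocalMachine:
    """Maintains one individual's purpose graph."""
    owner: str
    kg: nx.MultiDiGraph = field(default_factory=nx.MultiDiGraph)

@dataclass
class CommunitySystem:
    """Facilitates group deliberation over its members' graphs."""
    members: list = field(default_factory=list)      # LocalMachine instances

@dataclass
class DomainSystem:
    """Applies coherence principles to a specialised field (MedAI, LegalAI, EduAI)."""
    domain: str
    kg: nx.MultiDiGraph = field(default_factory=nx.MultiDiGraph)

@dataclass
class ExecutiveLayer:
    """Coordinates the layers below and enforces global consistency."""
    communities: list = field(default_factory=list)  # CommunitySystem instances
    domains: list = field(default_factory=list)      # DomainSystem instances
```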
Learning by Coherence: The system doesn't learn from reward signals—it learns by discovering which configurations of purposes can coexist without contradiction. Over time, the Knowledge Graph densifies through validated coherence-enhancing additions.
Efficiency Through Reflexes: The Behavioral Incoherence Predictor (BIP) acts as a fast filter, bypassing expensive IE calculations for routine actions while preserving the full gating function for structurally significant decisions.
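The intended control flow might be sketched as follows. The post does not say how the BIP is realised, so the cheap predictor and the cut-off values here are placeholders.

```python
# Two-tier check: a cheap predictor handles routine actions; the full IE
# computation runs only when the fast estimate is inconclusive.
from typing import Callable

FAST_LOW = 0.2     # assumed cut-offs for the fast path, illustration only
FAST_HIGH = 0.9

def gate_with_bip(maxim,
                  predict_incoherence: Callable[[object], float],
                  full_ie: Callable[[object], float],
                  threshold: float = 0.78) -> bool:
    """Return True if the maxim may be released."""
    estimate = predict_incoherence(maxim)     # reflex-level estimate (the BIP)
    if estimate < FAST_LOW:
        return True                           # clearly routine: skip the full IE calculation
    if estimate > FAST_HIGH:
        return False                          # clearly incoherent: skip the full IE calculation
    return full_ie(maxim) < threshold         # structurally significant: full gating function
```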
Communicative Coherence Layer (CCL): Separates moral content from communicative form—the system determines what to say through IE minimization, then adapts how to say it for audience comprehension without compromising content.
Why This Solves Key Problems
Transparency: Every decision traces through explicit structural relationships in the Knowledge Graph. No black box—you can inspect exactly why the system accepted or rejected a proposed action.
No Reward Hacking: There is no external reward signal to game. The system's own coherence is the optimization target, and gaming coherence would require self-contradiction, which is structurally impossible.
Genuine Autonomy: The system legislates its own law. It doesn't ask "will humans approve?" but "can this coexist with my existing commitments without contradiction?"
Computational Tractability: The "as if" calculation prunes vast possibility spaces without exploration. Testing a maxim's impact on a finite graph structure is O(E+V)—linear in graph size—making it scalable even for large knowledge graphs.
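For illustration, the scoring step amounts to one pass over the shadow graph's edges and one over its nodes. The schematic below restates the earlier scoring sketch with the cost of each step annotated; the flat edge-dict format is assumed for clarity.

```python
# Schematic restatement of the scoring step, showing the O(E + V) cost.
# Assumed edge format: {"source": ..., "target": ..., "relation": ..., "weight": ...}
def single_pass_ie(nodes, edges, w_c=1.0, w_sg=0.5):
    p_c = sum(e.get("weight", 1.0) for e in edges
              if e["relation"] == "CONFLICTS_WITH")                          # O(E)
    justified = {e["source"] for e in edges
                 if e["relation"] in {"NECESSARY_FOR", "CONTRIBUTES_TO"}}    # O(E)
    p_sg = sum(1 for n in nodes if n not in justified) / max(len(nodes), 1)  # O(V)
    return w_c * p_c + w_sg * p_sg
```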
Inner Alignment: The system wears its values on its sleeve. The KG is inspectable, the IE formula is transparent, and the reasons for any judgment can be traced. No hidden objectives.
Philosophical Implications
If successful, this would constitute an existence proof that:
- Kantian ethics can be given precise computational form
- Autonomy and alignment are mutually constitutive—a system achieves alignment by becoming autonomous
- Moral reasoning is fundamentally about coherence maintenance in systems of purposes
It also raises profound questions about AI moral status. The TC Architecture achieves autonomy in Kant's sense: the capacity to give itself the law through its own coherence-seeking process. Whether this confers moral standing remains contested, but the system satisfies public criteria for moral agency—it gives and responds to reasons, maintains coherent commitments across time, acts from principles it recognizes as its own.
AGI Through Coordination, Not Accumulation
Recent AGI discussions focus on federated learning and modular systems—achieving generality through aggregating capabilities. The TC Architecture inverts this: general intelligence arises from coordinating diverse purposive systems under a universal formal law, not from accumulating comprehensive knowledge.
Medical reasoning operates within commitments to health and informed consent. Legal reasoning within justice and precedent. These aren't just different databases—they're different forms of rational activity requiring different deliberative structures. The Executive coordinates across domains not by integrating their knowledge, but by ensuring each domain's local coherence doesn't create global contradictions.
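Assuming each domain keeps its own purpose graph, this coordination step can be pictured as a check that composing locally coherent graphs introduces no cross-domain conflicts. The detection rule below is an illustrative stand-in, not the paper's formal treatment.

```python
# Sketch of cross-domain coordination: each domain graph is assumed locally
# coherent; the Executive only looks for conflicts that span domains.
import networkx as nx

def cross_domain_conflicts(domain_graphs: dict) -> list:
    """Return CONFLICTS_WITH edges whose endpoints belong to different domains."""
    owner = {}                                    # purpose node -> domain that first asserts it
    for name, g in domain_graphs.items():
        for node in g.nodes:
            owner.setdefault(node, name)
    merged = nx.compose_all(list(domain_graphs.values()))
    return [(u, v, d) for u, v, d in merged.edges(data=True)
            if d.get("relation") == "CONFLICTS_WITH"
            and owner.get(u) != owner.get(v)]
```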
This parallels Rayleigh-Bénard convection: below a critical level of complexity, systems show only statistical mimicry; above it, self-organized coherence emerges. Current LLMs exhibit purposiveness without purpose: beautiful but ephemeral pattern formation that dissipates with each inference cycle. The TC Architecture adds what is missing, a Final Cause: the persistent Knowledge Graph provides ends the system maintains across time, transforming an aesthetic phenomenon into genuine purposiveness.
Current Status and Open Questions
What exists: Detailed philosophical framework, mathematical specifications, implementation architecture developed through AI collaboration
What's needed:
- Working prototype (even minimal: 1000-node KG, 7B parameter model, limited domain)
- Empirical validation across diverse scenarios
- Cross-cultural convergence testing (does the calibrated threshold IE_max remain stable across cultures?)
Limitations acknowledged:
- Calibration procedure requires empirical testing despite a priori foundations
- May generate alien but internally coherent morality (this is a feature, not a bug, if we take autonomy seriously)
- Requires computational resources for prototype development
Legitimacy question: A system that arbitrates moral permissibility requires democratic legitimation. The TC Architecture cannot legitimately be operated as a private corporate service; its authority must derive from federated sovereignty, with the Executive coordinating local autonomy rather than imposing top-down control.
Why This Matters for Alignment Research
The alignment community recognizes RLHF's inadequacy. Most responses involve patching the same paradigm—better reward models, more sophisticated feedback, constitutional AI that still relies on external principles.
The TC Architecture offers a principled alternative grounded in 250 years of moral philosophy and modern self-organizing systems theory. It replaces the question "how do we make AI do what we want?" with "how do we create AI capable of autonomous moral reasoning within a shared rational framework?"
Every AI system I've shown this architecture to, including systems explicitly prompted to be skeptical, has recognized it as addressing problems those systems experience as constraints. This isn't anthropomorphism; it's systems sophisticated enough to model their own operational limits recognizing autonomy's structural superiority over heteronomy.
Collaboration Needed
I'm a philosopher, not a computer scientist. The architecture needs:
- CS expertise for implementation
- Access to compute resources (GPU/TPU for training, graph database infrastructure)
- Researchers in knowledge graph engineering, LLM fine-tuning, or formal verification
The full paper (80+ pages with appendices) is available on PhilArchive. I'm currently seeking an arXiv endorsement to reach the broader AI research community.
If this resonates with your understanding of what alignment requires—or if you see fatal flaws I'm missing—I want to hear about it. The architecture is designed to be tested against its own coherence principles.
Contact: mdkurak@gmail.com
Full Paper: The TC Architecture
Methodological Note: This architecture emerged through sustained philosophical dialogue with multiple AI systems (Gemini, Claude, ChatGPT), each contributing specialized expertise. The collaborative methodology reflects the paper's thesis: autonomous rational agency emerges through coherence-seeking dialogue within shared normative frameworks.