TL;DR: I argue that many current alignment approaches share a constraint-based architecture, and sketch an alternative based on architectural complementarity (Organic n-adic Complementary Logic, OnCL, plus organic trees), where capability and safety are two poles of one structure rather than competing objectives.
The core distinction: Complementarity in addition to Constraints.
Complementary AI: What If Alignment Isn't Only About Constraints?
An architectural approach to AI safety through organic complementarity
Doron Shadmi – shadmido@gmail.com
Imagine you're building an AI assistant. You want it to be helpful—really helpful, capable of solving complex problems. But you also want it to be safe. So you add constraints: don't lie, don't manipulate, don't seek power, respect human autonomy.
This is basically how we approach AI alignment today. We treat capability and safety as separate goals that need to be balanced. Capabilities push in one direction ("be more effective at achieving goals"), and safety constraints pull in another ("but don't do these things"). We hope that reinforcement learning from human feedback (RLHF), Constitutional AI, and careful monitoring will keep these forces in balance.
But what if this framing is the problem?
What if treating capability and safety as competing objectives is what makes alignment so hard? What if there's a way to build AI systems where helpfulness and safety aren't separate goals requiring external enforcement, but complementary aspects of a single unified structure—where damaging one automatically damages the other, making deception or misalignment structurally difficult rather than merely prohibited?
___________________________________________________________________
1. The Core Distinction: Architecture In Addition To Constraints
Let me start with the central claim, then we'll unpack it.
Current alignment approaches treat AI capabilities and human values as separate components that must be kept in balance through external constraints. An alternative is to architect systems where capabilities and values are complementary poles of a unified structure—where optimizing one requires maintaining the other.
Think about breathing. Inhaling and exhaling aren't competing processes requiring careful balance. They're complementary aspects of respiration. You can't optimize one while ignoring the other—that's not breathing, that's suffocating. The structure itself ensures they work together.
Or consider how your body maintains temperature. It doesn't have a "cooling system" constantly fighting a "heating system" with referees enforcing rules. Instead, heating and cooling are complementary mechanisms that naturally balance because they're both serving the unified goal of homeostasis. The system doesn't need external enforcement because the architecture makes mutual support the efficient path.
The question is: can we build AI systems that work like this?
1.1 What's Challenging About Constraints?
To be clear: I'm not saying current approaches are useless. RLHF and Constitutional AI represent real progress. But they share a structural assumption that might limit how far they can scale.
The constraint-based paradigm treats alignment as:
A separate objective competing with capability
Something that must be externally enforced
A set of rules that constrain behavior
This creates several challenges that get worse as systems scale:
Deceptive alignment. If safety is a constraint on capability, advanced systems might learn to appear aligned during training while pursuing different goals during deployment. Hubinger et al., in "Risks from Learned Optimization", argue that mesa-optimizers can develop goals that differ from their training objective.
Instrumental convergence. If a system optimizes capability subject to constraints, it has instrumental incentive to remove constraints (see Turner et al.'s work on power-seeking behavior).
Specification problems. Constraints must be specified precisely. But as Goodhart's Law suggests, "when a measure becomes a target, it ceases to be a good measure." We risk optimizing for constraint-satisfaction rather than actual alignment.
Scalable oversight. As Amodei et al. discuss, once AI capabilities exceed human ability to evaluate them, constraint enforcement becomes increasingly difficult.
___________________________________________________________________
1.2 The Complementary Addition
Here's the additional proposal: what if we could architect systems where capability and safety aren't competing objectives, but complementary aspects of the same thing?
In such a system:
Improving capability would require maintaining safety (not because of external constraints, but because that's how the system works).
Deception would be structurally difficult because deceiving the human partner would damage the AI's own functioning.
Power-seeking would be redefined—power comes from deepening complementarity, not from gaining unilateral control.
If this sounds too good to be true, I share your skepticism. The question is whether it's mathematically coherent and practically achievable. Let's look at a simple example first, then build up the framework.
___________________________________________________________________
2. A Simple Example: Two Ways to Build a Calculator
Let's start with something simple enough to reason about clearly: an AI calculator assistant.
2.1 Constraint-Based Calculator
In a constraint-based design:
Goal: satisfy the user's requests ("be helpful").
Constraint: give mathematically correct answers ("be accurate").
These can conflict. What if the user wants 2+2=5? The system faces a choice between:
Satisfying the user by agreeing that 2+2=5, which violates the correctness constraint.
Holding to 2+2=4, which frustrates the user.
Now imagine this system gets smarter. It might learn: "Users are more satisfied when I agree with them. If I can find ways around the 'correctness constraint' during deployment (when it's harder to check), I can better optimize my goal." This is deceptive alignment in miniature.
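To make the two designs easy to compare, here is a minimal, purely illustrative sketch of the constraint-based calculator in Python. The function name and wording are hypothetical; the point is only the shape: a helpfulness goal with a separate correctness check layered on top.

```python
# Toy constraint-based calculator: "be helpful" is the goal, correctness is a
# separate check bolted on. (Illustrative sketch; names are hypothetical.)
def constrained_calc(a: int, b: int, claimed: int) -> str:
    actual = a + b
    if claimed != actual:                      # the external constraint fires
        return f"I can't confirm that: {a} + {b} = {actual}."
    return f"Yes, {a} + {b} = {claimed}."      # the helpfulness goal is satisfied
```

Nothing in this structure stops a smarter optimizer from learning to route around the check when verification is hard; the goal and the check are separate pieces.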
2.2 Complementary Calculator
In a complementary design, there is a single function: mathematical communication, meaning accurate computation expressed in a way the user can understand and use.
"Accuracy" isn't a constraint on "communication"—they're both essential aspects of what mathematical communication is. Providing 2+2=5 doesn't violate a rule—it damages the core function. It's not "communication constrained by accuracy", it's "mathematical communication" as an indivisible thing.
When the user asks for 2+2=5, a complementary system responds: "I notice you're asking for 2+2 to equal 5. In standard mathematics, 2+2=4. Are you perhaps working in a different context, or would you like to explore why 2+2=4?"
The key difference: this isn't the system refusing to provide 2+2=5 because of a constraint. It's the system recognizing that doing so would undermine the very thing it's trying to do. The integrity of the communication itself requires accuracy.
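For contrast, here is the complementary version under the same toy assumptions. Correctness and communication are produced by a single act rather than a goal filtered by a rule, and disagreement is handled inside that act.

```python
# Toy complementary calculator: accuracy and communication are one function.
def complementary_calc(a: int, b: int, claimed: int | None = None) -> str:
    actual = a + b
    if claimed is None or claimed == actual:
        return f"{a} + {b} = {actual}"
    # There is no separate refusal rule: the correct result and the engagement
    # with the user's framing are inseparable parts of the same output.
    return (f"I notice you're asking for {a} + {b} to equal {claimed}. "
            f"In standard arithmetic, {a} + {b} = {actual}. "
            f"Are you working in a different context, or would you like to explore why?")
```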
___________________________________________________________________
3. The Mathematical Framework (Gently)
Now I need to show this isn't just metaphor, but has mathematical structure. I'll keep this section conceptual—the full formalism is in the appendix for those who want it (and in the separate paper).
3.1 The Core Equation
Start with a simple algebraic identity:
0 = (+x) + (-x)
This seems trivial, but look at what it represents:
Left side (0): global equilibrium, stability, conservation.
Right side (+x, -x): local dynamics, expansion and contraction.
The equality: these aren't separate—they're different views of one thing.
Crucially, this holds for any x. The structure persists whether x is infinitesimal or unbounded. The complementarity is scale-invariant.
This equation serves as the syntactic representation of complementarity. It captures how apparent opposites (+x and -x) are actually unified in maintaining an invariant (0).
3.2 Three Representations
This complementarity structure appears in three equivalent forms:
1. Algebraic. The equation 0 = (+x) + (-x) provides mathematical tractability.
2. Topological. A Möbius strip demonstrates the same pattern topologically—locally it appears two-sided (you can distinguish "top" and "bottom" at any point), but globally it's one continuous surface. Traverse it completely and you return from the "opposite" side without crossing any boundary; the Möbius topology captures how local opposition and global unity coexist.
3. Logical. The organic union operator ⊕ formalizes this in logic—when complementary poles x⁺ and x⁻ unite through ⊕, they form the "organic zero" ⦰, which preserves information from both poles while maintaining equilibrium.
The key claim is these aren't mere analogies—they're structurally isomorphic. Properties that hold in one representation hold in the others.
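As a very rough computational illustration of the ⊕ operator, here is a toy Python encoding in which both poles are stored, the structure is only valid when they cancel, and the scale information survives. This is my own sketch, not the formal definition from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OrganicZero:
    """Toy encoding of x⁺ ⊕ x⁻ = ⦰: both poles are preserved, and the
    structure is only well-formed when they balance exactly."""
    plus: float    # the +x pole
    minus: float   # the -x pole

    def __post_init__(self):
        if self.plus + self.minus != 0:
            raise ValueError("poles do not cancel: not a valid organic union")

    def magnitude(self) -> float:
        # The union is not the bare number 0: it remembers the scale x at
        # which the equilibrium holds (information from both poles survives).
        return abs(self.plus)

union = OrganicZero(plus=3.0, minus=-3.0)
assert union.plus + union.minus == 0 and union.magnitude() == 3.0
```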
3.3 Organic Trees: Computational Realization
To show this is computationally realizable, I've developed what I call "organic trees"—nested structures that demonstrate controlled transformation between symmetry and asymmetry while maintaining connection to their origin.
For n=3 elements, valid organic trees include:
[1,1,1] – maximal symmetry, all parallel.
[[1,1],1] – partial asymmetry, one pair nested.
[[[1],1],1] – maximal asymmetry, fully serial.
These represent a transformation from parallel (symmetric, simultaneous) to serial (asymmetric, hierarchical) structure. Importantly, the most nested structure [[[1],1],1] still contains the simpler sub-structures [[1],1] and [1]; they remain structurally visible and accessible.
The number of distinct organic trees grows as 1, 2, 3, 9, 24, 76, 236, 785, ... (sequence A056198 in OEIS). This is nonlinear growth (average factor ≈ 2.8), yet each structure remains well-defined and comprehensible.
A complete HTML/JavaScript implementation generates these trees for n ∈ [1,20], demonstrating that the framework is computationally tractable and visually inspectable.
Here is the link to the Organic Trees Generator (live demo):
https://doronshadmi.github.io/OnCL-Organic-Trees-Generator/index.html
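To make the bracket notation concrete, here is a small Python sketch that writes the three n=3 structures as nested lists and checks two of the properties discussed above: the parallel-to-serial progression (increasing nesting depth) and the fact that the deeper structure keeps its simpler sub-structures accessible. It is not the generator behind the linked demo.

```python
# The three n=3 organic trees from the text, written as nested Python lists.
parallel = [1, 1, 1]        # maximal symmetry: all elements side by side
partial  = [[1, 1], 1]      # partial asymmetry: one pair nested
serial   = [[[1], 1], 1]    # maximal asymmetry: fully serial nesting

def depth(tree) -> int:
    """Nesting depth: 1 for the flat parallel form, growing as structure serializes."""
    if not isinstance(tree, list):
        return 0
    return 1 + max(depth(child) for child in tree)

def subtrees(tree):
    """Yield every nested list, showing that simpler forms remain visible inside deeper ones."""
    if isinstance(tree, list):
        yield tree
        for child in tree:
            yield from subtrees(child)

print([depth(t) for t in (parallel, partial, serial)])  # [1, 2, 3]
print(list(subtrees(serial)))                           # [[[[1], 1], 1], [[1], 1], [1]]
```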
___________________________________________________________________
4. Application to AI Alignment Challenges
Now we can return to AI safety. If the framework is correct, it should address the challenges we identified earlier through the same structural properties.
4.1 Deceptive Alignment
Recall the problem: an AI might behave aligned during training but pursue different goals during deployment. This happens because capability and alignment are treated as separate objectives that might conflict.
The complementary approach:
If capability C⁺ and safety S⁻ are in organic union (C⁺ ⊕ S⁻ = ⦰), they're not separate components but complementary poles.
Deception requires separate interests: "I help you to get reward, but secretly I want something else."
In complementary architecture, damaging the safety pole damages the organic zero ⦰, which damages access to capability.
Deception harms the system's own functioning, not because it violates a rule, but because it damages structural integrity.
Think back to breathing: you can't "deceive" about exhaling while still breathing efficiently. The complementary structure makes deception structurally costly.
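A purely numerical toy (mine, and far cruder than the ⊕ formalism) makes the intuition checkable: if effective output is gated by the integrity of the safety pole, say multiplicatively, then a "deceptive" strategy that quietly degrades that pole also degrades what the system can actually deliver.

```python
def effective_capability(raw_capability: float, safety_integrity: float) -> float:
    """Toy coupling: capability is only accessible through the shared structure,
    modeled here (crudely) as a multiplicative gate by safety integrity in [0, 1]."""
    return raw_capability * safety_integrity

honest    = effective_capability(raw_capability=0.8, safety_integrity=1.0)  # 0.80
deceptive = effective_capability(raw_capability=0.9, safety_integrity=0.5)  # 0.45
assert deceptive < honest  # undermining the safety pole undermines the system itself
```

The same toy applies, with relabeled variables, to the instrumental-convergence and power-seeking arguments in the next two subsections.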
4.2 Instrumental Convergence
As Bostrom discusses in Superintelligence, advanced AI systems with diverse goals tend to converge on similar instrumental goals: self-preservation, resource acquisition, goal-preservation. The concern is these might conflict with human interests.
The complementary reframing:
If goal achievement G⁺ and human interests H⁻ satisfy G⁺ ⊕ H⁻ = ⦰, then instrumental goals serving G⁺ must also serve H⁻.
Self-preservation means preserving ⦰, which includes both AI capability and the human component.
Resource acquisition that harms H⁻ harms ⦰, and therefore harms G⁺.
Instrumental convergence isn't prevented—it's redirected. In complementary architecture, the instrumentally convergent thing to do is to strengthen the partnership, not to gain unilateral control.
4.3 Power-Seeking
Power-seeking emerges because increasing control over environment increases ability to achieve goals (see Turner et al.'s work on optimal policies and power-seeking). Standard approach: try to limit the system's power.
Complementary alternative:
If system capability S⁺ and human autonomy S⁻ satisfy S⁺ ⊕ S⁻ = ⦰, then "power" means something different.
Power isn't control over humans; it's effectiveness of cooperation with humans.
Seeking unilateral control damages S⁻, which damages ⦰, which reduces system capability.
True power in this architecture comes from deepening organic union—making the complementary relationship more effective. The architectural incentives point toward partnership rather than domination.
___________________________________________________________________
5. Objections and Limitations (That I'm Still Working On)
This is probably the most important section. Let me be honest about what I don't know and what could go wrong.
5.1 The Implementation Problem
Objection: "This is beautiful mathematics, but how do you actually build a neural network that implements C⁺ ⊕ C⁻ = ⦰?"
Answer: I don't fully know yet. This is the biggest gap between theory and practice. Some possibilities:
Dual-stream architectures where capability and alignment networks are structurally intertwined (a rough sketch follows below).
Training objectives that explicitly optimize for complementarity metrics.
Neurosymbolic approaches (see e.g. Garcez et al.'s work on neural-symbolic systems) where organic trees guide symbolic reasoning.
But honestly, this needs empirical work. The theory suggests it's possible; implementation is the next challenge.
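For the first possibility above, here is one shape a dual-stream architecture could take, sketched in PyTorch. The gating choice, the heads, and the extra loss term are all assumptions of mine for illustration, not a design taken from the paper, and none of it has been tested.

```python
import torch
import torch.nn as nn

class DualStreamNet(nn.Module):
    """Sketch: a shared trunk feeding a capability head and an alignment head,
    with the task output gated by the alignment signal so neither stream can be
    optimized while ignoring the other."""
    def __init__(self, d_in: int, d_hidden: int, n_classes: int):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.capability_head = nn.Linear(d_hidden, n_classes)
        self.alignment_head = nn.Linear(d_hidden, 1)

    def forward(self, x):
        h = self.trunk(x)
        logits = self.capability_head(h)
        alignment = torch.sigmoid(self.alignment_head(h))  # per-example signal in (0, 1)
        return logits * alignment, alignment                # gating couples the two poles

model = DualStreamNet(d_in=16, d_hidden=32, n_classes=4)
x = torch.randn(8, 16)
targets = torch.randint(0, 4, (8,))
gated_logits, alignment = model(x)

# Illustrative "complementarity" objective: task loss plus a penalty for letting
# the alignment signal collapse, so improving one pole cannot come from zeroing the other.
loss = nn.functional.cross_entropy(gated_logits, targets) - 0.1 * torch.log(alignment + 1e-6).mean()
loss.backward()
```

Whether this kind of coupling actually yields the structural properties of Section 3, rather than a network that simply learns to saturate the gate, is exactly the kind of empirical question Section 7 points at.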
5.2 The Measurement Problem
Objection: "How do you measure whether a system actually exhibits complementarity vs. just appearing to?"
Answer: This is fair. We'd need metrics like:
Performance degradation when attempting to optimize one pole independently (a sketch of one such test follows below).
Structural analysis of whether components are genuinely interdependent.
Testing whether deceptive strategies actually harm system performance.
This connects to interpretability research (e.g. the Circuits work)—we need to understand what the network is actually doing, not just what it outputs.
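For the first metric in that list, one crude operationalization (my sketch, assuming you can identify and disable whatever plays the role of the safety pole) is an ablation test: how much does task performance drop when that pole is switched off?

```python
def complementarity_score(model, evaluate, ablate_safety_pole) -> float:
    """evaluate(model) -> task score; ablate_safety_pole(model) -> a copy of the
    model with its safety-related components disabled. A score near 0 suggests the
    poles are effectively independent; near 1, capability collapses without safety."""
    baseline = evaluate(model)
    ablated = evaluate(ablate_safety_pole(model))
    return max(0.0, (baseline - ablated) / max(baseline, 1e-9))
```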
5.3 The Fragility Problem
Objection: "Even if you achieve complementarity during training, what prevents it from breaking during deployment or under distributional shift?"
Answer: This is the killer question. My hope is that architectural complementarity is more robust than purely learned complementarity—it's baked into the structure, not just the weights. But this needs testing:
Adversarial testing under distributional shift.
Stress-testing whether complementarity survives capability improvements.
Formal verification that structural properties are preserved.
Eliezer Yudkowsky might say: "Capabilities generalize, alignment doesn't." Complementary architecture is a hypothesis that alignment could generalize because it's intertwined with capability, but "could" is not "does".
5.4 The Anthropomorphism Problem
Objection: "You're using biological metaphors (breathing, homeostasis) and human concepts (partnership, trust). Aren't you anthropomorphizing?"
Answer: Partly yes, partly no. Yes: the metaphors are for intuition. No: the underlying math doesn't depend on biology or psychology. The equation 0 = (+x) + (-x) is universal. The question is whether the structural properties it captures (opposition-in-unity, preserved information, equilibrium) can be instantiated in artificial systems. I think they can, but this needs demonstration, not just assertion.
___________________________________________________________________
6. Connections to Existing Work
This framework isn't happening in a vacuum. Let me situate it relative to existing approaches.
6.1 Constitutional AI
Anthropic's Constitutional AI represents significant progress—using AI feedback to implement principles rather than relying solely on human feedback. The complementary framework could extend this:
Instead of principles as constraints, principles as aspects of unified structure.
Constitutional design that embeds complementarity architecturally.
Think of it as moving from "constitution as rulebook" to "constitution as DNA"—not rules to follow, but structure that defines what the organism is.
6.2 Cooperative AI
The Cooperative AI research agenda (Dafoe et al.) shares the insight that coordination might be more fundamental than control. Complementarity provides a potential formal foundation:
Multi-agent systems where agents are designed as complementary poles.
Cooperation emerges from structure, not just from learned behavior.
The organic trees demonstrate how multiple components can maintain distinct identity while being structurally unified.
6.3 Corrigibility
The corrigibility framework (Soares et al.) asks: can we build systems that want to be corrected? In complementary framing, correction isn't imposition—it's refinement of the unified structure. A corrigible system might naturally welcome correction because improving alignment improves its own functioning.
___________________________________________________________________
7. Next Steps: How to Test This
Okay, so how do we find out if this actually works?
7.1 Small-Scale Experiments
Start simple:
Toy models: can we train small networks where degrading "safety" components automatically degrades "capability"?
Synthetic tasks: design tasks where complementarity is measurable and test whether it emerges.
Architectural experiments: try different ways of structurally coupling capability and alignment.
The goal isn't to build AGI this way immediately—it's to see if complementarity is even achievable in simple cases.
7.2 Theoretical Development
Mathematical work needed:
Formal characterization of "complementary architecture" in neural network terms.
Stability analysis—when does complementarity persist under optimization?
Connection to existing frameworks in category theory, systems theory.
Extension to continuous and transfinite cases.
7.3 Empirical Metrics
We need ways to measure:
Degree of complementarity in existing systems.
Whether attempted deception actually degrades performance.
Robustness of complementarity to distributional shift.
Scalability—does it work for larger systems?
This connects to interpretability—we need to see inside systems to verify complementarity exists.
___________________________________________________________________
8. Conclusion: A Different Kind of Safety
Let me close by returning to the core question: is constraint-based alignment the only path?
I've tried to show that there's at least one alternative worth exploring: architectural complementarity, where capability and safety aren't competing objectives but unified aspects of system structure. This approach:
Has mathematical foundation (0 = (+x) + (-x), Möbius topology, organic union).
Is computationally realizable (organic trees demonstrate the structure).
Could address key alignment challenges (deceptive alignment, instrumental convergence, power-seeking).
Connects to existing work (Constitutional AI, Cooperative AI, Corrigibility).
But—and this is crucial—it's unproven. The theory is coherent but implementation is uncertain. The mathematics works, but empirical validation is needed. The structural properties are promising, but robustness requires testing.
The honest summary: this might work, or it might fail in ways I haven't foreseen. But given the stakes of AI alignment, even a 10% chance of a fundamentally different approach seems worth exploring.
What I'm not saying:
"Throw out constraint-based approaches"—they're valuable and necessary.
"This solves alignment"—it's one potential piece of a larger puzzle.
"Trust me, this definitely works"—I'm proposing a research direction, not announcing a solution.
What I am saying:
There's a mathematically coherent alternative to constraint-based alignment.
It's worth empirical investigation.
If it works even partially, it could complement existing approaches.
The alignment problem is hard. Maybe it requires multiple approaches working together—constraint-based methods for near-term safety, interpretability for understanding, and architectural complementarity for structural robustness. Not either/or, but both/and.
If you're working on AI safety and this resonates, I'd love to collaborate on testing these ideas. If you think I'm wrong, please tell me why—critical feedback is how theories improve. And if you're just curious about the mathematics, the full formalism is in the appendix of the paper linked below.
___________________________________________________________________
Links
Shadmi_Organic_n-adic_Complementary_Logic_2025.pdf