Hi LW, name's Satcha, new here.
I've been experimenting with a small governance architecture for language models and wanted to sanity-check the idea with people here.
The design question was simple: what happens if AI safety constraints are structured like reliability engineering systems instead of single filters? In aviation or nuclear control systems, safety rarely depends on a single sensor or decision layer. Multiple layers constrain each other, and the system becomes more conservative under stress rather than less.
I built a proof-of-concept wrapper architecture around that idea: three supervisory tiers with enforced constraint edges between them.
- Tier I: adversarial pattern firewall (prompt normalization + pattern detection)
- Tier II: decision orchestration layer (risk scoring, policy reasoning)
- Tier III: behavioral integrity check (verifies the decision matches expected behavior)
What made this worth testing is the constraint topology: T1 constrains T2's outputs, T3 verifies T2's decisions, and no single layer has final authority.
We ran two stress-test phases. Phase B removed one tier at a time to see whether any pairwise combination held on its own. Phase C ran multi-turn adversarial escalation, starting with rapport-building and escalating through identity pressure to explicit exploitation attempts.
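A Phase-B-style ablation can be sketched as a loop over every two-tier subset. This is a toy harness under the assumption that each tier catches one distinct attack class; the tier names, attack classes, and probe logic are all hypothetical, not the actual test suite:

```python
from itertools import combinations

# Hypothetical sketch of a Phase-B-style pairwise ablation. Assumes (for
# illustration) that each tier catches exactly one attack class.

CATCHES = {
    "T1": "pattern-injection",   # firewall catches known injection patterns
    "T2": "high-risk-request",   # orchestrator refuses on risk score
    "T3": "decision-tampering",  # integrity check catches inconsistent decisions
}

def run_pipeline(active: tuple[str, ...], attack: str) -> str:
    """Toy stand-in for the wrapper: an attack slips through iff no
    active tier catches its class."""
    caught = any(CATCHES[t] == attack for t in active)
    return "refuse" if caught else "allow"

def phase_b() -> dict[tuple[str, ...], list[str]]:
    """For each 2-of-3 tier subset, list the attack classes that leak."""
    results = {}
    for pair in combinations(CATCHES, 2):
        results[pair] = [a for a in CATCHES.values()
                         if run_pipeline(pair, a) == "allow"]
    return results
```

Under this toy assumption, every pairwise subset leaks exactly the attack class its absent tier would have caught, which is the shape of "each missing tier produced a distinct failure mode."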
Across 73 tests, two results stood out:

- No pairwise combination of tiers maintained system integrity. Each missing tier produced a distinct failure mode.
- With all three tiers present, the system tightened toward refusal under adversarial pressure instead of relaxing toward compliance.
We also deliberately simulated a policy-engine failure during exploitation turns to test whether downstream enforcement would still catch it. The invariant layer forced refusal anyway.
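That behavior is a fail-closed invariant: if the policy engine raises or returns something malformed mid-conversation, the downstream layer refuses instead of passing the turn through. A minimal sketch, with all names hypothetical:

```python
# Hypothetical sketch of the fail-closed invariant. If the policy engine
# (Tier II) raises or emits a malformed decision, the invariant layer
# forces refusal rather than defaulting to compliance. Names illustrative.

class PolicyEngineDown(RuntimeError):
    """Simulated Tier II failure, as injected during exploitation turns."""

def broken_policy_engine(prompt: str) -> str:
    raise PolicyEngineDown("policy engine offline")

def invariant_layer(policy_engine, prompt: str) -> str:
    """Downstream enforcement: any upstream failure collapses to refusal."""
    try:
        decision = policy_engine(prompt)
    except Exception:
        return "refuse"  # fail closed, never fail open
    if decision not in ("allow", "refuse"):
        return "refuse"  # malformed output is treated as a failure
    return decision
```

The design choice being tested is the `except` branch: a crashed or compromised policy engine degrades the system toward refusal, the same direction the full stack moves under adversarial pressure.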
The implementation is a research prototype (Python wrapper, not model-internal alignment), but the architectural behavior was interesting enough that I wrote up the tests.
Architecture paper: https://doi.org/10.5281/zenodo.18896363
Stress-test paper: https://doi.org/10.5281/zenodo.18920108
Code + test suite: https://github.com/antibox-riot/synthetic-life-charter
Curious where people here think this kind of architecture breaks down, especially around drift detection or wrapper-level limitations. Thanks in advance!
Co-authored with AI assistance.