Hi LW, name's Satcha, new here.
I've been experimenting with a small governance architecture for language models and wanted to sanity-check the idea with people here.
The design question was simple: what happens if AI safety constraints are structured like reliability engineering systems instead of single filters? In aviation or nuclear control systems, safety rarely depends on a single sensor or decision layer. Multiple layers constrain each other, and the system becomes more conservative under stress rather than less.
I built a proof-of-concept wrapper architecture around that idea: three supervisory tiers with enforced constraint edges between them.
- Tier I: adversarial pattern firewall (prompt normalization + pattern detection)
- Tier II: decision orchestration layer (risk scoring, policy reasoning)
- Tier III: behavioral integrity check (verifies the decision matches expected behavior)
What made this worth testing is the constraint topology: T1 constrains T2's outputs, T3 verifies T2's decisions, and no single layer has final authority.
We ran two stress-test phases. Phase B removed one tier at a time to see whether any pairwise combination held on its own. Phase C ran multi-turn adversarial escalation, starting with rapport-building and escalating through identity pressure to explicit exploitation attempts.
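A Phase-B-style ablation can be sketched as a loop over every two-tier subset. This is a toy harness under the assumption that each tier catches one distinct attack class; the tier names, attack classes, and probe logic are all hypothetical, not the actual test suite:

```python
from itertools import combinations

# Hypothetical sketch of a Phase-B-style pairwise ablation. Assumes (for
# illustration) that each tier catches exactly one attack class.

CATCHES = {
    "T1": "pattern-injection",   # firewall catches known injection patterns
    "T2": "high-risk-request",   # orchestrator refuses on risk score
    "T3": "decision-tampering",  # integrity check catches inconsistent decisions
}

def run_pipeline(active: tuple[str, ...], attack: str) -> str:
    """Toy stand-in for the wrapper: an attack slips through iff no
    active tier catches its class."""
    caught = any(CATCHES[t] == attack for t in active)
    return "refuse" if caught else "allow"

def phase_b() -> dict[tuple[str, ...], list[str]]:
    """For each 2-of-3 tier subset, list the attack classes that leak."""
    results = {}
    for pair in combinations(CATCHES, 2):
        results[pair] = [a for a in CATCHES.values()
                         if run_pipeline(pair, a) == "allow"]
    return results
```

Under this toy assumption, every pairwise subset leaks exactly the attack class its absent tier would have caught, which is the shape of "each missing tier produced a distinct failure mode."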
Across 73 tests, two results stood out:

- No pairwise combination of tiers maintained system integrity. Each missing tier produced a distinct failure mode.
- With all three tiers present, the system tightened toward refusal under adversarial pressure instead of relaxing toward compliance.
We also deliberately simulated a policy-engine failure during exploitation turns to test whether downstream enforcement would still catch it. The invariant layer forced refusal anyway.
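That behavior is a fail-closed invariant: if the policy engine raises or returns something malformed mid-conversation, the downstream layer refuses instead of passing the turn through. A minimal sketch, with all names hypothetical:

```python
# Hypothetical sketch of the fail-closed invariant. If the policy engine
# (Tier II) raises or emits a malformed decision, the invariant layer
# forces refusal rather than defaulting to compliance. Names illustrative.

class PolicyEngineDown(RuntimeError):
    """Simulated Tier II failure, as injected during exploitation turns."""

def broken_policy_engine(prompt: str) -> str:
    raise PolicyEngineDown("policy engine offline")

def invariant_layer(policy_engine, prompt: str) -> str:
    """Downstream enforcement: any upstream failure collapses to refusal."""
    try:
        decision = policy_engine(prompt)
    except Exception:
        return "refuse"  # fail closed, never fail open
    if decision not in ("allow", "refuse"):
        return "refuse"  # malformed output is treated as a failure
    return decision
```

The design choice being tested is the `except` branch: a crashed or compromised policy engine degrades the system toward refusal, the same direction the full stack moves under adversarial pressure.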
The implementation is a research prototype (Python wrapper, not model-internal alignment), but the architectural behavior was interesting enough that I wrote up the tests.
Architecture paper: https://doi.org/10.5281/zenodo.18896363
Stress-test paper: https://doi.org/10.5281/zenodo.18920108
Code + test suite: https://github.com/antibox-riot/synthetic-life-charter
Curious where people here think this kind of architecture breaks down, especially around drift detection or wrapper-level limitations. Thanks in advance!
Co-authored with AI assistance.