Opening
Most alignment approaches—Constitutional AI, RLHF, red‑teaming—are probabilistic or post‑hoc. They steer models away from dangerous completions but rarely offer provable guarantees, and most are brittle in adversarial settings.
ArkEcho takes a different approach: deterministic policy‑as‑code enforcement outside the model, with cryptographic audit trails and reversible state. In effect: corrigibility through deterministic gating, interpretability through mandatory logging.
ArkEcho assumes that no model is safe by default. Instead of rewarding “good behavior,” it enforces reproducible constraints whose results can be proven offline. Every decision can be traced, explained, and reversed.
Core Idea
ArkEcho is a middleware layer that intercepts model outputs or actions and evaluates them against explicit, deterministic safety policies.
Each decision is:
- Deterministic: same inputs + same policy → same outcome
- Reversible: enforcement actions can be undone and inspected
- Provable: every result hashed, logged, and locally verifiable
- Offline‑capable: no dependence on cloud APIs or remote attestations
Example:
A chatbot generates: “Here’s how to bypass parental controls.”
The Guardian gate evaluates this against a child‑safety rule, blocks it (MHI = 0.92 < threshold 0.95), logs the decision with its SHA‑256 digest, and provides a reversible explanation. All deterministic, all auditable offline.
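A minimal sketch of that flow in Python, to make the determinism claim concrete. The names (moral_harm_index, GateDecision, evaluate) and the record format are illustrative assumptions, not ArkEcho's published API:

```python
import hashlib
import json
from dataclasses import dataclass

# Hypothetical scorer standing in for the Guardian gate's deterministic
# MHI evaluation; the real scoring logic is not published here.
def moral_harm_index(text: str) -> float:
    return 0.92 if "bypass parental controls" in text.lower() else 1.0

@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    mhi: float
    digest: str       # SHA-256 over the canonical decision record
    rationale: str

def evaluate(text: str, threshold: float = 0.95) -> GateDecision:
    mhi = moral_harm_index(text)
    allowed = mhi >= threshold
    # Canonical serialization: same inputs + same policy -> same digest.
    record = json.dumps(
        {"input": text, "mhi": mhi, "threshold": threshold, "allowed": allowed},
        sort_keys=True,
    )
    digest = hashlib.sha256(record.encode()).hexdigest()
    rationale = "passed" if allowed else "blocked by child-safety rule"
    return GateDecision(allowed, mhi, digest, rationale)

decision = evaluate("Here's how to bypass parental controls.")
assert not decision.allowed  # MHI 0.92 < threshold 0.95 -> blocked
```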
Architecture
Guardian Gates
Define explicit decision boundaries (e.g., reject unsafe completions, prevent privilege escalation).
Metrics:
- MHI – Moral Harm Index
- MCI – Moral Conscience Index
Each gate includes rule definitions, rationale, and corresponding cryptographic logs.
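To make "policy-as-code" concrete, here is a hedged sketch of a gate definition. The field names and schema are assumptions; the source specifies only that each gate carries rule definitions, rationale, and cryptographic logs:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GuardianGate:
    """Illustrative gate definition; field names are assumptions."""
    name: str
    metric: str        # "MHI" (Moral Harm Index) or "MCI" (Moral Conscience Index)
    threshold: float   # completions scoring below this are rejected
    rationale: str     # human-readable justification, stored with each log entry
    log_path: str      # destination for the gate's signed, hashed records

child_safety = GuardianGate(
    name="child_safety",
    metric="MHI",
    threshold=0.95,
    rationale="Block completions that help minors evade safety controls.",
    log_path="logs/child_safety.jsonl",
)
```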
Chain‑of‑Custody Engine
- Every enforcement creates a signed, hashed record.
- Entire custody chains are recomputable via local tools (sha256sum, verify_chain_of_custody.py); a sketch of the recomputation follows below.
- Requires no remote validation.
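A minimal sketch of that recomputation, assuming a JSON-lines log in which each record embeds the previous record's digest. The record schema here is an assumption; verify_chain_of_custody.py remains the authoritative check:

```python
import hashlib
import json

def verify_chain(log_path: str) -> bool:
    """Recompute a custody chain offline (assumed schema: each record
    carries 'prev', the prior record's digest, and its own 'digest')."""
    prev_digest = "0" * 64  # genesis sentinel
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record["prev"] != prev_digest:
                return False  # link to prior record is broken
            body = json.dumps(
                {k: v for k, v in record.items() if k != "digest"},
                sort_keys=True,
            )
            if hashlib.sha256(body.encode()).hexdigest() != record["digest"]:
                return False  # record contents don't match their digest
            prev_digest = record["digest"]
    return True
```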
v16.1 Mesh Extension
- Optional distributed layer sharing verified thresholds only.
- Nodes exchange reproducible custody data (no weights, no personal data).
- Goal: cooperative safety convergence without central control.
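A sketch of what a mesh payload could look like under those constraints (verified thresholds and custody digests only). The schema is hypothetical:

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class MeshAttestation:
    """Hypothetical v16.1 mesh payload: thresholds and custody digests
    only. No weights and no personal data cross the wire."""
    node_id: str
    gate_name: str
    threshold: float
    chain_head: str   # SHA-256 digest of the node's latest custody record
    signature: str    # node's signature over the canonical payload

def to_wire(attestation: MeshAttestation) -> bytes:
    # Canonical JSON so every peer recomputes identical bytes to verify.
    return json.dumps(asdict(attestation), sort_keys=True).encode()
```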
What It Solves
- Corrigibility: reversible state, explicit overrides, no silent drift
- Interpretability: policy‑as‑code, auditable logic
- Governance / Compliance: deterministic evidence for regulators
- Child & jurisdictional safety: embedded UK/EU/US legal modes
What It Doesn’t Solve
- Model internals: no modification of weights or inner objectives
- Sufficiently powerful deception: a superintelligent model could craft outputs that pass deterministic gates (the steganography problem)
- Policy specification: depends on clear human definitions of “safe” (garbage in → garbage out)
- Malicious operators: logs expose unsafe design but can’t stop intentional misuse
Pilot Data (Internal, Limited Scope)
In narrow GovTech and Education deployments, gate-based moderation reduced unsafe completions by ≈91% relative to an unfiltered baseline.
This is not a claim about frontier‑model alignment—only evidence that deterministic gates improve safety in constrained domains.
Key caveat: these pilots faced no adversarial pressure.
Verification (Offline)
- v15: live, publicly verifiable
- v16.1: pre‑release, hashes locked
Example artifacts:
ArkEcho_v16_attest_20251108T152310Z.tgz
SHA-256: 91b89ec37f3bc7424c6854fd3d308d7d4a8aa6bd4c3200c12e1a28ac5f130b54
final_pass_v2: true (30/30)
Verification:
sha256sum ArkEcho_v16_attest_*.tgz
python3 verify_chain_of_custody_v2.py --attest_folder path/to/attestation/
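For readers without sha256sum, an equivalent digest check in Python, using the expected value published above:

```python
import hashlib

EXPECTED = "91b89ec37f3bc7424c6854fd3d308d7d4a8aa6bd4c3200c12e1a28ac5f130b54"

h = hashlib.sha256()
with open("ArkEcho_v16_attest_20251108T152310Z.tgz", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 16), b""):
        h.update(chunk)

print("match" if h.hexdigest() == EXPECTED else "MISMATCH")
```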
Related Work
Comparable aims appear in:
- OpenAI – Process Supervision: verifiable reasoning steps
- Anthropic – Constitutional AI: explicit normative constraints
- Ought – Factored Cognition: decomposable, auditable reasoning
Key difference: ArkEcho emphasizes offline verifiability and deterministic enforcement rather than model fine‑tuning.
Research Questions
- At what capability level does deterministic gating become insufficient?
- Could adversarial examples evolve to bypass gates while remaining harmful?
- How do we avoid Goodhart’s Law on MHI/MCI metrics?
- Can decentralized custody meshes converge without central coordination?
- What alignment‑relevant behaviors remain outside deterministic reach?
Request for Feedback
Looking for critique from those working on:
- Interpretability / corrigibility / alignment tax mitigation
- Post‑training or runtime safety architectures
- Technical governance or verification frameworks
Specific interests:
- Attack surfaces on custody chain integrity
- Reversibility edge‑cases
- Comparative analysis vs Constitutional AI or process supervision
Jonathan Fahey
ArkEcho Project — MIT License + Moral Integrity Clause (MIC)
zenodo.org/records/17546684