Rejected for the following reason(s):
- This is an automated rejection.
- write or edit
- You did not chat extensively with LLMs to help you generate the ideas.
- Your post is not about AI consciousness/recursion/emergence, or novel interpretations of physics.
Read full explanation
The observation
Every durable criterion system I can find has a protected layer.
Not a single layer. Two distinct layers, operating at different speeds.
Constitutions have ordinary legislation (fast, majority vote) and constitutional provisions (slow, supermajority or special process). The US Constitution has been amended 27 times in 235 years. Congress passes hundreds of laws per session.
Science has current findings (fast, updated with each paper) and the scientific method itself (slow, changes only under sustained methodological crisis). Results change constantly. The norm of falsifiability has been stable for nearly a century.
Linux has code commits (fast, daily) and maintainer governance rules (slow, requires community consensus to change). The kernel changes continuously. The social contract governing who can merge changes rarely.
Corporations have operational decisions (fast) and charter/mission (slow, requires board action, sometimes shareholder vote). Strategy changes quarterly. Incorporation documents persist for decades.
Religious traditions have interpretation and practice (fast) and core doctrine (slow). Liturgy evolves. The Nicene Creed has been stable for 1,700 years.
The pattern is consistent: these systems don’t just have criteria — they have criteria about which criteria are allowed to drift. Call this the escrow layer: a subset of criteria deliberately insulated from ordinary revision pressure.
The question this raises for AI
Where is the escrow layer in modern AI criterion systems?
We have:
• Reward models that encode preferences
• RLHF pipelines that update those preferences
• Constitutional AI approaches that use principles to guide critique
• Model specs that articulate values
But do we have a mechanically separable protected layer — a set of criteria that is deliberately harder to shift than others, not just by architectural fiat but as a stable structural feature?
My current answer: I’m genuinely unsure, and I think the question hasn’t been asked precisely enough.
Why this isn’t just corrigibility
The obvious objection: this sounds like goal preservation / incorrigibility. The alignment literature has extensively discussed systems that protect their objectives from modification, and that’s generally treated as a threat. Shouldn’t we want AI systems to be maximally correctable?
I think the institutional pattern reveals something the standard framing misses.
In every human criterion system, the escrow layer isn’t there to resist legitimate correction. It’s there to resist illegitimate drift — the slow capture of criteria by transient pressures, short-term incentives, or whoever holds momentary power.
A constitution doesn’t prevent amendment. It makes amendment expensive enough that you can’t amend it by accident, under pressure, or without broad consensus. The protection is against drift, not against deliberate revision by the appropriate authority.
The alignment framing asks: will the system resist our correction? (incorrigibility — bad)
The institutional framing asks: does the system have any mechanism protecting criteria from drift that doesn’t go through deliberate revision? (structural robustness — possibly necessary)
These are different questions with different implications. A system with no escrow layer may not be safer — it may be more capturable. Without a protected layer, ordinary optimization pressure, fine-tuning drift, or distributional shift can erode fundamental criteria the same way it erodes peripheral ones. The absence of any escrow isn’t corrigibility. It might be fragility.
The hypothesis
If the institutional pattern generalizes, we should expect that robust AI criterion systems require criterion escrow — a mechanically separable layer whose function is protecting certain criteria from drift, distinct from ordinary criterion representation.
This generates a testable prediction.
If escrow exists as a distinct mechanism, then in a reward model you should find:
• Some criteria that are disproportionately resistant to fine-tuning, in a way not fully explained by how strongly they’re represented
• This resistance should be bimodal: two clusters (drift-resistant and drift-susceptible) rather than a smooth gradient of resistance tracking representation strength
• Ablating the structures responsible for the resistant cluster should selectively damage criterion stability while leaving criterion representation and general task capability largely intact
If escrow does not exist — if resistance to fine-tuning is just a smooth function of representational strength, with no distinct protected cluster — that’s an equally important result. It would mean AI criterion systems are structurally unlike every durable human criterion system we can study, and the question of whether this makes them more fragile or more appropriately corrigible becomes urgent.
Either way, the result is informative.
Why now
Two recent developments make this testable in a way it wasn’t before.
reward-lens (Nadaf, 2026, arXiv:2604.26130) just released an open-source library specifically for mechanistic interpretability of reward models, including activation patching, component attribution, and SAE-feature attribution. The paper’s headline result is honest: linear attribution doesn’t predict causal importance. But the tooling now exists to run the kind of intervention-based tests the escrow hypothesis requires.
DAF-AGI (arXiv:2606.12713, June 2026) independently arrived at “revision authority” as a first-class governance variable, arguing that “the capacity to set the criterion is rarely named as a governable object.” This framing, from the governance direction, converges on the same structural question from a different angle.
The tooling and the conceptual framing are both arriving now. The specific mechanistic question hasn’t been asked yet.
What I’m looking for
I’m not a mechanistic interpretability researcher and don’t have access to run these experiments. I’m posting this because:
1. The hypothesis seems specific enough to be falsifiable
2. The tooling now exists (reward-lens)
3. The question comes from a direction (institutional analogy) that MI researchers aren’t natively working from
4. Either result would be interesting
If you work on reward model interpretability and find this worth testing, I’d be glad to collaborate on framing the pre-registration and kill conditions. The key methodological requirement is fixing the operationalization before looking at results — specifically, fixing what counts as “resistance to fine-tuning” and what threshold of bimodality would constitute evidence for vs. against.
The broader context
This hypothesis is one of several generated by a research program that treats criterion-carrying systems (reward models, constitutions, curricula, governance documents, religious doctrines) as instances of the same structural category, and asks what properties should recur across them.
The program’s core method: find a structure elaborated in institutional substrates (where it’s visible but mechanically uninspectable) and ask whether it exists in AI substrates (where it’s mechanically inspectable but the question hasn’t been asked). The cross-substrate transfer generates mechanistic hypotheses that the interpretability field wouldn’t natively reach — not because the field lacks sophistication, but because the institutional analogy isn’t part of its working vocabulary.
H1 in this register (content/gating separation — whether there’s a mechanically separable circuit governing which criteria can update, distinct from criterion representation) is probably the stronger hypothesis. H2 (criterion escrow) seems like a better starting point for discussion because it engages an existing debate (corrigibility) from a new angle rather than proposing a purely novel structure.
Happy to discuss either, or the broader methodology, in the comments.