Is AI Alignment Missing an Interpretive Layer? A Canon-Based Framework for Governing Model Reasoning

by Jason Young
20th Oct 2025

Over the last year, I have been thinking a lot about how AI models interpret natural language and how small changes in word choice can drastically affect a model's response. This led me to think that a key weakness in AI alignment lies not just in what models produce, but in how they interpret the instructions they are given. Interpretation is not something fixed once during training, nor is it neutral in its effects. An AI system, just like a person, may read an instruction in a way that differs from the author's intent, even when that instruction seems perfectly clear to the person who wrote it.

People have faced this problem for centuries. In law, the interpretation of language cannot be left to a reader's own whims; it must be governed in some manner. Our legal system does not remove all ambiguity from language, but rather manages it through structured interpretive canons: formal, hierarchical principles that dictate how to resolve ambiguity, prioritize clauses, or determine meaning when provisions conflict. Interpretation is a governed process, not something simply assumed.

While I thought a system like this might be valuable for aligning AI systems with human intent, I did not believe it was technologically plausible to build such a structure into an AI's interpretive process with any reliability or efficacy. That may have changed with the release of Anthropic's Claude Skills and similar modular skill systems. My understanding is that such tools allow models to dynamically load persistent, domain-specific instructions at runtime. Paired with memory and orchestration tools, this could begin to make it possible to govern how a model interprets natural language.

Instead of relying solely on a well-formulated prompt or on training-time guidance, we could create a hierarchy of interpretive canons that an AI system must consult when interpreting a natural-language instruction. I am not suggesting we simply import the legal canons as-is, but rather that we use them as a model for layered interpretive governance. These canons would be domain-specific, drawing on the accepted norms and human-determined guidance already used in high-stakes areas such as medicine, law, finance, and scientific research.

I envision this not as a rigid set of rules, but as a layered way of guiding the model's interpretation. At the top would sit broad principles directing the model toward what matters most in the domain. Beneath that would come domain-specific norms, used to resolve ambiguity or steer interpretation within narrower areas of the domain. Finally, I would propose a shared lexicon of domain-specific terms: a set of agreed-upon meanings for the most important words in the field. The point of these rules is not to dictate the model's output, but to give it a structured path for deciding how to interpret the instruction it is given. This is not so different from how people handle domain-specific requests today: start with general principles, move to domain-specific norms, and fall back on precise definitions when needed. Such a system would encourage consistency and remove some of the “black box” quality of AI interpretation of natural language.
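To make the layering more concrete, here is a minimal sketch of how such a hierarchy might be represented and rendered into text a model consults before interpreting an instruction. Everything in it is illustrative: the domain, the canon wording, the lexicon entries, and the clinical example are hypothetical placeholders, not a proposed standard or an existing API.

```python
from dataclasses import dataclass, field


@dataclass
class CanonHierarchy:
    """Illustrative three-layer interpretive hierarchy for a single domain."""
    domain: str
    principles: list[str]                                   # layer 1: broad principles
    norms: list[str]                                        # layer 2: domain-specific norms
    lexicon: dict[str, str] = field(default_factory=dict)   # layer 3: shared term definitions

    def interpretive_context(self) -> str:
        """Render the layers, in order, as instructions the model consults first."""
        lines = [f"Domain: {self.domain}",
                 "Consult these layers in order when interpreting an instruction:"]
        lines += [f"[Principle] {p}" for p in self.principles]
        lines += [f"[Norm] {n}" for n in self.norms]
        lines += [f"[Definition] '{term}' means: {meaning}"
                  for term, meaning in self.lexicon.items()]
        return "\n".join(lines)


# Hypothetical example for a clinical-documentation domain.
clinical_canons = CanonHierarchy(
    domain="clinical documentation",
    principles=["Patient safety outweighs brevity or stylistic preference."],
    norms=["Read dosage instructions literally; never infer an unstated dose."],
    lexicon={"stat": "immediately, without delay"},
)
print(clinical_canons.interpretive_context())
```

The specific representation matters less than the ordering: the model is asked to resolve an instruction against the broadest layer first and only reach for precise definitions when the upper layers leave ambiguity.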

This approach is not a direct solution to alignment, but it could serve as an additional governance layer in domains where interpretive precision matters. Drafting effective interpretive canons would not be easy, but there are people with sufficient domain expertise to do it, and the result would be a level of transparency and standardization in how language is interpreted that is not available today.

Fears that a model could “game” the canons or bypass them altogether are likely warranted, but it may be possible to enforce compliance externally through an orchestration layer, such that the model cannot proceed unless its interpretation conforms to the canon hierarchy. This enforcement would not need to run on every step; it could be invoked when pre-defined triggers are detected or when a significant decision is about to be made, limiting the computational cost while preserving the interpretive benefit.
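As a sketch of what that external gate could look like, the fragment below (building on the `CanonHierarchy` sketch above) only runs the conformity check when a pre-defined trigger fires, and blocks the action otherwise. The trigger terms, the `checker` callable, and the blocking behaviour are all assumptions for illustration; this does not describe any existing orchestration API.

```python
# Hypothetical triggers marking decisions significant enough to warrant a canon check.
TRIGGER_TERMS = {"prescribe", "transfer funds", "delete records"}


def needs_review(instruction: str) -> bool:
    """Run the (costlier) canon check only when a pre-defined trigger appears."""
    return any(term in instruction.lower() for term in TRIGGER_TERMS)


def enforce_canons(instruction: str, proposed_action: str,
                   canons: CanonHierarchy, checker) -> str:
    """Gate the model's proposed action on conformity with the canon hierarchy.

    `checker` is an assumed external reviewer (e.g. a second model or a rules
    engine) that returns (ok, reason) given an action and the rendered canons.
    """
    if not needs_review(instruction):
        return proposed_action                    # no trigger fired; skip the check
    ok, reason = checker(proposed_action, canons.interpretive_context())
    if not ok:
        raise RuntimeError(f"Action blocked: does not conform to canons ({reason})")
    return proposed_action
```

The point of the design is that enforcement lives outside the model: the model's interpretation is checked against the canons by a separate component, rather than trusting the model to police itself.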

I am not suggesting that this is a full solution, or one that I am qualified to build on my own, but rather an idea that may have some merit. My goal with this post is to raise the questions: Is such an idea being explored? If not, should it be?

I would appreciate feedback from anyone with thoughts on the idea.