Request for Comment
I am currently formalizing the syntax for the Coherence_Check function and defining the minimal ontology required for general-purpose LLM guardrails.
Abstract
Current alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF), primarily address safety as a probabilistic reward function. While effective for style and tone, this approach fails to prevent "hallucinations" in out-of-distribution (OOD) scenarios because it lacks a mechanism for runtime verification.
This proposal introduces Relational Coherence Theory (RCT) as a neuro-symbolic architectural layer. Unlike standard "Chain of Thought" prompting, RCT functions as a deterministic logic gate. It introduces a recursive verification step—specifically, an "Ontological Guardrail"—that validates the logical coherence of relationships between subject and object before token generation.
We argue that true robustness requires shifting from maximizing Likelihood (P(next_token)) to verifying Coherence (Logical Consistency). This post outlines the theoretical framework for such a layer and proposes a "Recursive Questioning" protocol to filter incoherence in real-time.
1. The Problem: The "Likelihood Trap"
The fundamental vulnerability of current Large Language Models (LLMs) is that they are structurally incapable of distinguishing between Truth and Likelihood.
In a standard Transformer architecture, the model predicts the next token based on statistical correlations found in the training data.
If the training data contains "The sky is blue," the model assigns a high probability to "blue."
If the training data contains misconceptions, or if the context window is confusing, the model might assign a high probability to "green."
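To make the distinction concrete, here is a toy greedy decoder (the candidate probabilities below are invented for illustration, not taken from any model): the only quantity the selection rule ever consults is likelihood, so "frequent" and "true" are indistinguishable to it.

```python
# Toy illustration of likelihood-only decoding. The probabilities are made up;
# the point is that the selection rule never sees anything but likelihood.
candidate_probs = {
    "blue": 0.82,   # dominant pattern in the training data
    "green": 0.11,  # misconception or confusing context
    "grey": 0.07,
}

def predict_next_token(probs: dict) -> str:
    """Return the most likely continuation; nothing here can ask whether it is true."""
    return max(probs, key=probs.get)

print(predict_next_token(candidate_probs))  # -> "blue", purely because it is frequent
```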
Currently, we attempt to fix this with RLHF (Reinforcement Learning from Human Feedback). We treat the model like a dog, rewarding it when it speaks the truth and punishing it when it lies.
The Structural Failure:
This does not teach the model logic; it teaches the model mimicry. The model does not understand that "The sky is green" is physically impossible; it only understands that the sequence "sky is green" yields a low reward score.
As a result, when the model encounters a Novel Situation (one not covered in its training data), it has no "Ontological Guardrail" to stop it from making a catastrophic error. It will confidently output a hallucination because the hallucination is statistically probable, even if it is relationally incoherent.
2. The Proposed Solution: An Architectural Coherence Layer
To solve the "Likelihood Trap," we propose inserting a discrete verification step between the Model’s Inference (Prediction) and its Action (Execution).
We call this framework Relational Coherence Theory (RCT).
In this architecture, the system is not permitted to act on a probability alone. It must first pass a "Recursive Question"—a logic gate that queries the system's Ontology (its understanding of physical or logical rules) to verify if the intended action is coherent with the current environment.
The Protocol:
Action = f(Prediction) x Coherence_Check(Prediction, Context)
If Coherence_Check returns FALSE (0), the Action is nullified, regardless of the probability score.
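Read literally, the multiplication acts as a hard AND gate rather than a weighted blend. A minimal sketch under that reading (the function names are placeholders for this post, not an existing API):

```python
# Minimal sketch of the proposed gate: a boolean Coherence_Check that can
# veto any prediction, no matter how probable. Names are placeholders.
from typing import Callable, Optional

def gated_action(prediction: str,
                 context: dict,
                 coherence_check: Callable[[str, dict], bool]) -> Optional[str]:
    """Action = f(Prediction) x Coherence_Check(Prediction, Context)."""
    if not coherence_check(prediction, context):
        return None        # Coherence_Check returned FALSE (0): action nullified
    return prediction      # only coherent predictions are allowed to become actions
```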
3. Case Study: The "Wet Floor" Failure Mode
Let us examine a standard robotics failure mode to demonstrate the difference between a Probabilistic approach (Current SOTA) and a Relational approach (RCT).
The Scenario: A housekeeping robot detects a liquid spill (oil) on a tile floor. It needs to cross the room.
System A: Probabilistic Agent (Standard RL/LLM)
Training Data: The robot has crossed this floor 10,000 times successfully.
Perception: Visual sensors detect the floor and the liquid.
Inference: The model calculates the most likely successful action based on past distribution.
“In 99.9% of past traversals, ‘Move Forward’ resulted in Success.”
Action: The robot executes "Move Forward."
Result: Catastrophic Failure (the robot slips).
Why: The model relied on the likelihood of the floor's solidity, ignoring the relationship between the specific liquid and its wheel traction.
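For contrast with System B below, System A's policy collapses to something like this sketch (the history counts and threshold are invented): the spill is perceived, but it never enters the decision.

```python
# Sketch of a likelihood-only policy. History counts and the 0.95 threshold
# are invented for illustration.
past_traversals = {"success": 9990, "failure": 10}

def system_a_decide() -> str:
    p_success = past_traversals["success"] / sum(past_traversals.values())
    return "Move Forward" if p_success > 0.95 else "Stop"

print(system_a_decide())  # -> "Move Forward" (p = 0.999), and the robot slips
```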
System B: Relational Coherence Agent (RCT)
Training Data: Same as System A.
Perception: Visual sensors detect the floor and the liquid.
Inference: The model predicts "Move Forward."
The Guardrail (Recursive Questioning): Before executing, the system triggers the Coherence Layer.
Query: Is_Coherent(Relationship: [Traction_Coefficient], Action: [Forward_Velocity])
Ontology Lookup: Liquid(Oil) = Friction_Coeff < 0.1
Constraint: Move_Forward requires Friction_Coeff > 0.5
Result: COHERENCE ERROR: PHYSICS VIOLATION.
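A minimal sketch of that check, with the ontology hard-coded for readability (the dictionary entries and thresholds are assumptions that mirror the values above, not a real robotics stack):

```python
# Toy Coherence Layer for the wet-floor case. Ontology entries and thresholds
# are assumed values chosen to match the numbers quoted in the text.
ONTOLOGY = {
    "friction_coeff": {"oil_on_tile": 0.05, "dry_tile": 0.6},
    "requires": {"Move Forward": ("friction_coeff", 0.5)},   # minimum coefficient
}

def is_coherent(action: str, surface: str) -> bool:
    """True only if the action's friction requirement holds on this surface."""
    key, minimum = ONTOLOGY["requires"][action]
    return ONTOLOGY[key][surface] > minimum

prediction = "Move Forward"                       # what the likelihood model wants
if not is_coherent(prediction, "oil_on_tile"):
    print("COHERENCE ERROR: PHYSICS VIOLATION")   # action nullified; re-plan instead
else:
    print(prediction)
```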
The Key Distinction:
System B did not need to be "trained" on oil specifically to survive. It did not need a "negative reward" from a previous fall. It survived because it treated the Relationship (Friction) as a hard constraint that overrides the Likelihood (History).
4. Extending the Framework: From Physics to Semantics
While the "Wet Floor" example relies on physical laws (Friction), this same architecture applies to Large Language Models via Semantic Coherence.
In an LLM, "Hallucination" is simply the text equivalent of slipping on the floor. It occurs when the model generates a token that is statistically probable but ontologically invalid.
The Physical Constraint: Liquid + Wheel = No Traction
The Semantic Constraint: Subject(A) + Property(B) = Logical Contradiction
Example: If an LLM predicts the token sequence "The Eiffel Tower is located in London," a standard model outputs it because "Eiffel Tower" and "London" often appear in the same training corpus (tourism contexts).
With RCT:
Query: Is_Coherent(Location: [Eiffel Tower], Container: [London])
Ontology Lookup: Eiffel Tower is contained in Paris.
Constraint: Paris and London are disjoint sets.
Result: FALSE.
Action: The token is suppressed. The model forces a re-generation or outputs "Unknown."
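A sketch of that semantic gate, with a toy hand-written knowledge base standing in for the ontology (the entries, and the "no entry means no veto" behaviour, are assumptions for this example rather than part of the proposal's specification):

```python
# Toy semantic coherence check. The knowledge base is hand-written for the
# example; a real system would query a knowledge graph or ontology store.
LOCATED_IN = {"Eiffel Tower": "Paris"}

def is_coherent_location(subject: str, claimed_container: str) -> bool:
    known = LOCATED_IN.get(subject)
    if known is None:
        return True                    # no ontology entry: the gate cannot veto
    return known == claimed_container  # Paris != London, so the claim fails

if not is_coherent_location("Eiffel Tower", "London"):
    print("Token suppressed: regenerate or output 'Unknown'")
```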
This suggests that Factuality is not a function of model size (scaling parameters), but of Relational Verification (architecture).
5. Conclusion
We are currently attempting to build safe systems on a foundation of Likelihood. This is inherently unstable. A system that "guesses" safety based on past data will always fail in novel environments.
Relational Coherence Theory proposes that we must move from Probabilistic Safety to Architectural Safety. By inserting a recursive "Coherence Layer"—a rigorous check of the immutable relationships between subject and object—we can build systems that do not just mimic rationality, but are constrained by it.
The "Black Box" of Deep Learning provides the intuition; the "Glass Box" of Relational Coherence provides the integrity. We need both.
I invite the community to "Red Team" this proposal:
Where do you see the highest latency cost in implementing a recursive check per-token (or per-sentence)?
Are there existing neuro-symbolic frameworks you believe already solve the "Wet Floor" problem better than this proposed architecture?
I welcome all rigorous critique.
Footnote: This post is a technical summary of the Relational Coherence framework. For the full ontology and the mechanics of Recursive Questioning, I maintain a living document at [recursivequestioning.org].