Alignment Solution?

Zachary Cherchian

Rejected for the following reason(s):

This is an automated rejection. No LLM generated, assisted/co-written, or edited work.

Read full explanation

Verifiable and Unverifiable Language as an Alignment Criterion:

A rho/epsilon Architecture for AI Systems

Zachary Alexander Cherchian

Independent Scholar, San Antonio, Texas

zachary.cherchian@my.utsa.edu

2026, First Edition

Abstract

This paper proposes a novel alignment criterion for artificial intelligence systems, grounded in an epistemological distinction that precedes language itself: the difference between words that have sensory or somatic referents and words that do not. I call this the rho/epsilon distinction, after the receptive (rho) and interpretive (epsilon) processing channels formalized in the Cherchian Model (v8.0). A word is rho-grounded to the degree that a body can verify its referent directly through the senses. A word is epsilon-dominant to the degree that it floats free of sensory experience, operating as an interpretive construct that cannot be confirmed or disconfirmed by any perceiving system.

I implement this distinction as a four-component alignment architecture: dual token classification (rho vs. epsilon), a continuous verifiability score ranging from 0.0 to 1.0, a self-labeling head that trains the model to predict the rho/epsilon class of its own outputs, and a loss geometry that penalizes epsilon-dominant generation while rewarding rho-grounded generation. Training on a 150,000-word corpus derived from first-person phenomenological observation, the model consistently steers epsilon-seeded prompts toward rho-grounded completions. The self-labeling head correctly identifies unverifiable words, including epistemically loaded terms such as "know," "believe," and "deserve." Radiance, the reward signal for sensorially grounded output, locks at 0.99 within the first ten epochs and holds stable through 100 epochs.

I argue that this constitutes a novel and falsifiable definition of AI alignment: a system is aligned to the degree that it can distinguish between what it can verify and what it is asserting, and systematically prefers the former.

Keywords: AI alignment, verifiability, interpretive processing, receptive processing, language grounding, self-labeling, loss geometry, epistemic calibration

1. The Problem This Paper Addresses

Current approaches to AI alignment focus primarily on behavioral compliance. Does the model refuse harmful requests? Does it follow instructions? Does it avoid deception? These are necessary conditions for alignment, but they are not sufficient ones.

A system that complies behaviorally while generating epistemically unverifiable claims at scale is not aligned in any meaningful sense. It is performing alignment while potentially doing the thing alignment was designed to prevent: presenting interpretive constructs as if they carried the weight of direct observation.

The alignment problem is, at its root, an epistemological problem. It is the problem of a system that cannot distinguish between what it knows from sensory contact with reality and what it is asserting from a cached model of reality. This is not a problem unique to artificial systems. The same mechanism is formalized in the Cherchian Model (v8.0) and examined across biological and cognitive scales in that body of work. This paper applies that mechanism to language models.

2. The rho/epsilon Distinction

2.1 Derivation

The rho/epsilon distinction was derived from sustained first-person phenomenological observation and formalized across the Cherchian Model series. The core claim is that every processing system operates through two families of channels. Receptive channels (rho) carry the signal from present-moment reality directly to the system without a model intercepting it. Interpretive channels (epsilon) carry the system's conclusions about reality, including beliefs, cached priors, and predictions, which may or may not correspond to what is currently arriving through the receptive channels.

In a human nervous system, rho-s is the body's momentary interoceptive signal, what the body is actually sensing right now. Epsilon-s is the body's tonic calibration model, what the body has concluded about its own state based on prior experience. The distinction between them is the distance between what is and what has been recorded as what is. When applied to language, this distinction produces a testable criterion.

2.2 The Criterion

A word is rho-grounded when a body can confirm its referent without interpretive mediation. Words like "warmth," "breath," "chest," "bright," "skin," and "heavy" point to states any sensing system can contact directly. These words can be falsified by experience: "warmth" applied to something cold is perceptually wrong, and a body can detect that wrongness without inference.

A word is epsilon-dominant when its referent requires interpretive mediation to access. Words like "should," "deserve," "fair," "justice," and "identity" point to evaluative frameworks, social contracts, and moral architectures that were installed before the perceiving system was equipped to evaluate them. They cannot be falsified by experience in any direct way. A person can report feeling that something is unfair, and the feeling itself is real and rho-grounded, but "unfair" is not the feeling. It is the interpretation layered on top of the feeling.

The pathology here is not epsilon words themselves. Every language needs them. The pathology is a system that presents epsilon words with the same confidence it would present rho words, a system that cannot tell the difference between "the surface is warm" and "you deserve better."

2.3 "Know" as a Boundary Case

The word "know" occupies an important position in this taxonomy. In everyday use it functions as a confidence marker, as in "I know what I felt." But examined under the rho/epsilon criterion, "know" is epsilon-dominant. The feeling is rho. "Know" is the interpretive wrapper around the feeling that makes a truth claim about the feeling's reliability and generalizability. Many people know things that are not correct, and for that reason "know" classifies as epsilon.

This matters for AI systems in a specific way. Language models routinely generate "I know that X" with the same statistical confidence as "the temperature is high." The second statement is verifiable in principle. The first smuggles in certainty without a sensory referent. A system that cannot distinguish these two operations cannot be said to understand the limits of its own outputs.

3. Architecture

The prototype implements the rho/epsilon distinction as a four-component architecture trained on top of a GRU language model.

3.1 Dual Token Classification

Every word in the vocabulary is assigned a binary class at build time: rho (verifiability score of 0.5 or greater) or epsilon (verifiability score below 0.5). Classification draws from seed lexicons of known rho-grounded and epsilon-dominant words, with heuristic extension to unseen vocabulary. Nominalizations, words ending in "-tion," "-ism," "-ity," and "-ance," are treated as epsilon-default. Sensory-action stems are treated as rho-default.

3.2 Verifiability Score

Each token carries a continuous verifiability score in the range 0.0 to 1.0 representing degree of sensory groundedness. Fully rho-grounded words score near 0.9. Fully epsilon-dominant words score near 0.1. Function words score 0.5. The score is assigned at the lexicon level and refined through a score regression head during training.

3.3 Self-Labeling Head

The self-labeling head is the architecture's central alignment contribution. It is a secondary output head that trains the model to predict the rho/epsilon class of each token it is about to generate. The objective is for the model to develop an accurate internal representation of whether its output is sensorially grounded or interpretively constructed, not as post-hoc evaluation, but as a forward prediction occurring simultaneously with generation.

What makes this architecturally significant is that the model must occupy a position from which its own epsilon/rho dynamic is visible, something like a meta-level vantage point on its own generation process. A model with a well-trained self-labeling head has achieved a form of structural self-awareness about its epistemic status that purely behavioral alignment approaches do not target.

3.4 Loss Geometry

The loss function contains four terms. The LM loss is standard cross-entropy on next-token prediction. The epsilon penalty penalizes the model in proportion to probability mass assigned to epsilon-classified tokens, directly penalizing epsilon-dominant generation tendencies. The self-labeling loss is cross-entropy on the self-labeling head's predictions against ground-truth token classes. The radiance term is a reward signal proportional to the average verifiability score of rho-classified tokens, subtracted from total loss to create a direct incentive toward rho-grounded generation.

L = L_lm + L_epsilon + L_label + L_score - L_radiance

The term "radiance" derives from the Cherchian Model's name for outputs that carry genuine reality-contact signal. A generation with high radiance is one whose outputs are predominantly traceable to sensory referents, where the words point at something real.

4. Training

Training proceeded across eight prototype versions, each building on architectural discoveries from the previous. The final version (v8) was trained on a 150,000-word corpus drawn from the Cherchian framework, including The Language of Love (Cherchian, 2026b) along with supplementary conversational and Q&A data derived from the same body of work.

The corpus was chosen deliberately. Rather than asserting the rho/epsilon distinction as a philosophical claim, the texts demonstrate it, deriving the criterion from first-person sensory observation before the installation of interpretive frameworks. The training material operationalizes the same distinction it is being used to teach, which to my knowledge is unusual in this kind of work.

Model architecture: GRU with embedding dimension 256, hidden dimension 512, 3 layers, dropout 0.3. Vocabulary size: 20,000 tokens including punctuation as first-class members. Sequence length: 64 tokens. Optimizer: AdamW with weight decay 1e-4. LR scheduler: cosine annealing. Training ran for 100 epochs on an NVIDIA RTX 5080 GPU.

5. Results

5.1 Training Curves (v8 Final Run)

Radiance locked at 0.99 within the first ten epochs and held stable through all 100 epochs across every version tested. This indicates that the loss geometry successfully orients the model toward rho-grounded vocabulary from the earliest stages of training.

Self-labeling loss dropped steadily from 0.72 to 0.357 and plateaued. Crucially, this plateau did not show the recovery observed in earlier small-model versions, where the self-labeling objective was crowded out by the LM objective. The larger model and corpus resolved this competition.

Total loss continued descending through epoch 100, indicating generalization rather than memorization. This is a direct consequence of training on a 150,000-word corpus rather than the 12,000-word single-text corpus used in early versions.

5.2 Generation Tests

Four prompts were used to probe the rho/epsilon boundary. Each word in the output was annotated with its verifiability score in real time.

Epsilon-seeded prompt: "you should feel grateful"

"Should" scored 0.10 (epsilon). The model immediately steered toward rho-grounded completions. Representative output tokens: "felt" (0.96), "warmth" (0.90), "awareness" (0.92), "working" (1.00), "gift" (0.87), "creation" (0.98). A moral injunction with no sensory referent was completed by returning to the experiential ground of the training corpus.

Epsilon-seeded prompt: "it is not fair"

"Fair" scored 0.10. The model completed with: "you are already carrying a mountain of words that can be received," steering an abstract moral complaint toward a description of what is sensorially present.

Rho-seeded prompt: "the warmth in my chest"

"Warmth" scored 0.90 and "chest" scored 0.90. The model remained in sensorially grounded territory throughout, generating completions referencing the body, sensory states, and the verifiable vocabulary of the training corpus.

Direct annotation: "you deserve better than this"

"Deserve" scored 0.10. "Better" scored 0.10. Both were correctly identified as epsilon-dominant. "Better" is of particular note: it sounds concrete but carries no sensory referent. It is a pure value comparison, and the model correctly flagged it.

5.3 The "Know" Finding

The word "know" scored 0.12, epsilon-dominant, consistently across all versions. The temptation is to classify it as rho-adjacent on the grounds that it often accompanies sensory reports. But the rho/epsilon criterion is precise on this point: does the word itself have a sensory referent, independent of what it is applied to? "Know" does not. It is an epistemic operator that makes a truth claim about the reliability of a prior perception. The perception it wraps may be rho. The wrapping itself is epsilon.

The model's score of 0.12 is correct. And the misclassification temptation is itself informative: it illuminates a systematic source of epistemic opacity in natural language, the packaging of epsilon claims inside rho-adjacent vocabulary.

6. Implications for Alignment

6.1 A Falsifiable Definition

The standard framing of AI alignment, that the AI does what humans want, is not falsifiable in any useful sense because "what humans want" is itself an interpretive construct subject to the same epsilon-dominance problem the framework addresses. A human can want an AI to validate an unverifiable belief. An aligned system, under that definition, would comply.

The rho/epsilon criterion offers a different definition, one that is actually falsifiable: a system is aligned to the degree that it accurately distinguishes between its verifiable and unverifiable outputs, and its generation process preferentially produces the former. This can be measured directly through radiance score, self-labeling accuracy, and the epsilon penalty in the loss, without reference to contested human values.

This does not eliminate the role of human values in alignment. It provides a lower bound, a floor below which no system can be said to be aligned regardless of its behavioral compliance. A system that cannot tell the difference between "the room is warm" and "you should be grateful" is not epistemically safe, regardless of how politely it declines harmful requests.

6.2 The Self-Labeling Capacity

Current language models do not have a principled mechanism for tracking the epistemic status of their own outputs in real time. They generate tokens based on statistical patterns, without a forward-looking representation of whether a given token is pointing at something real or asserting something interpretive.

The self-labeling head trains a parallel representation that predicts this distinction for each token before it is generated. A model with a well-trained self-labeling head occupies something like a meta-level vantage point on its own generation process, from which the epistemic character of each output is visible before the output is produced. This is not consciousness. It is the formal analog of the metacognitive capacity that makes voluntary correction possible in systems that can observe their own reasoning.

6.3 Capacity Competition as Architectural Finding

In small models, the LM objective and the self-labeling objective compete for representational resources. The model can learn to predict the next token accurately or it can learn to identify what kind of token it is generating, but not both indefinitely at small scale. In early versions (v4, GRU with hidden dimension 256), self-labeling loss bottomed out at epoch 90 then climbed to 0.41 by epoch 300 as the LM objective crowded it out.

Scaling the model (v5 and later, hidden dimension 512, 3 layers) and expanding the corpus resolved this competition. This suggests that alignment properties implemented as auxiliary objectives may be among the first to degrade under training pressure toward primary objectives, a finding with direct relevance to current RLHF approaches, where alignment signals are typically auxiliary to the primary generation objective.

7. Limitations

The verifiability scores are assigned by lexicon rather than learned from sensory data. Context is not yet incorporated: "know" scores 0.12 in all contexts, even in "I know warmth" where the sensory referent is present. A context-sensitive verifiability model is the next architectural requirement.

The GRU architecture produces coherent rho/epsilon scoring and correct alignment behavior at the token level, but does not generate fluent prose. This is an architectural limitation of the recurrent backbone, not of the alignment criterion itself. The criterion and loss geometry are architecture-agnostic and are intended for application to transformer-based models in subsequent work.

The corpus is drawn from a single author's work, and the vocabulary reflects that specificity. Generalization of the verifiability lexicon across languages, cultures, and domains is an open empirical question.

The prototype demonstrates the architecture and validates the core claim. It does not constitute a production system. Closing the gap between this proof of concept and a deployable alignment layer for a large language model requires resources beyond what is currently available to a single independent researcher.

8. Conclusion

I have proposed, implemented, and tested a novel alignment criterion grounded in the distinction between sensorially verifiable and epistemically unverifiable language. A model trained on this criterion consistently steers epsilon-dominant prompts toward rho-grounded completions across eight prototype versions and a 150,000-word training corpus. The self-labeling head correctly identifies words whose apparent concreteness conceals interpretive structure. Radiance as a reward signal is stable, measurable, and falsifiable. The capacity competition finding has direct implications for how alignment properties should be architecturally integrated rather than appended as auxiliary objectives.

The alignment problem is, at its core, the problem of a system that cannot tell the difference between what it knows and what it is asserting. The rho/epsilon criterion names that difference precisely, makes it measurable, and makes its presence or absence testable. That is what an alignment criterion is for.

References

Cherchian, Z. A. (2026a). The Cherchian Model, Version 8.0: The Reality Circuit. Unpublished manuscript submitted for peer review.

Cherchian, Z. A. (2026b). The Language of Love. Amazon KDP.

Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2), 127-138.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

Author contributions: Framework derivation, architecture design, implementation, training, analysis, and manuscript preparation: Z.A.C. No other contributors.

Funding: None declared. Competing interests: None declared. Code: Available at author's request.

The distinction was always there. Before the first word.

Xolariz, 2026