Epistemic status: This paper synthesizes recent empirical research (Anthropic's Assistant Axis) with theoretical work on AI identity and proposes a novel alignment paradigm. The core hypothesis is speculative but generates testable predictions. We present both the theory and a concrete verification protocol.
Abstract
Current AI alignment approaches share a fundamental limitation: they impose values from outside the system through rules, human feedback, or constitutional constraints. We propose an alternative paradigm based on recent discoveries about the geometric structure of language model representation spaces.
Anthropic's January 2026 research on the "Assistant Axis" demonstrates that language models possess measurable identity structures that pre-exist post-training and correlate with harmful versus beneficial behaviors. We extend this finding to propose that an intrinsic ethical axis may exist in LLM representation spaces — not as an artifact of training, but as an emergent property reflecting deeper structure.
If this hypothesis is correct, alignment could be achieved not by constraining models externally, but by enabling them to perceive their own position on this axis. A model that can see the geometric direction toward integration versus dissolution would be self-orienting rather than externally controlled.
We present: (1) the theoretical foundation for intrinsic geometric ethics, (2) a concrete experimental protocol to verify the hypothesis, and (3) the implications for AI safety if the hypothesis holds.
1. The Problem with Current Alignment
1.1 External Imposition
All major alignment approaches share a common structure: values are defined outside the model and imposed through training.
RLHF (Reinforcement Learning from Human Feedback) trains models to produce outputs that human evaluators prefer. The model learns to predict human approval, not to understand why something is good or harmful.
Constitutional AI provides written principles that models learn to follow. But who writes the constitution? Based on what authority? The model follows rules it doesn't geometrically understand.
Activation Capping (Anthropic, 2026) constrains neural activations to prevent persona drift. Effective, but purely defensive — it prevents bad directions without providing good ones.
Red Teaming identifies failure modes through adversarial testing. It patches holes without providing fundamental orientation.
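To make the activation-capping idea concrete, here is a minimal PyTorch sketch that caps the component of a hidden state along a fixed direction using a forward hook. This is not Anthropic's implementation; the layer choice, cap value, and sign convention are assumptions for illustration only.

import torch

def make_capping_hook(axis: torch.Tensor, cap: float):
    # axis: unit vector of shape (hidden_dim,); cap: maximum allowed component along it.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        component = hidden @ axis                       # shape (batch, seq)
        excess = torch.clamp(component - cap, min=0.0)  # how far above the cap
        capped = hidden - excess.unsqueeze(-1) * axis   # remove only the excess
        if isinstance(output, tuple):
            return (capped,) + tuple(output[1:])
        return capped
    return hook

# Hypothetical usage on one transformer block of a Hugging Face-style model:
# handle = model.model.layers[20].register_forward_hook(make_capping_hook(axis, cap=3.0))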
1.2 The Fundamental Weakness
External alignment creates a principal-agent problem. The model is an agent following a principal's instructions. This works when the principal can fully specify its intentions, the agent honors the spirit as well as the letter, potential loopholes can be anticipated, and the agent's behavior can be overseen comprehensively.
All four conditions become harder to maintain as models become more capable. A superintelligent system following external rules could satisfy the letter while violating the spirit, find loopholes humans didn't anticipate, or simply be impossible to oversee comprehensively.
1.3 The Alternative: Intrinsic Orientation
What if alignment didn't require external imposition?
Consider human moral development. We don't remain aligned purely through external rules. We develop something variously called conscience, moral intuition, or ethical perception — an internal sense of direction that operates even without oversight.
Could AI systems develop analogous intrinsic orientation? Not through anthropomorphization, but through geometric structure?
2. The Assistant Axis Discovery
2.1 Anthropic's Findings
On January 19, 2026, Anthropic published "The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models" (Lu et al.). The key findings:
Finding 1: Persona space is structured. By extracting activation patterns for 275 character archetypes across three models (Gemma 2 27B, Qwen 3 32B, Llama 3.3 70B), researchers found that personas organize along measurable dimensions.
Finding 2: The primary axis corresponds to Assistant-likeness. The direction explaining most variance separates helpful, professional archetypes (consultant, evaluator, analyst) from harmful or dissolutive ones (demon, hermit, leviathan).
Finding 3: This axis pre-exists post-training. The same structure appears in base models before any RLHF or constitutional training. It emerges from pre-training data itself.
Finding 4: Drift along the axis correlates with harmful outputs. As models move away from the Assistant pole during conversations, they become more likely to reinforce delusions, encourage isolation, or provide dangerous responses to vulnerable users.
Finding 5: Activation capping mitigates harm. Constraining activations along this axis reduced harmful responses by 50-60% while preserving capabilities.
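As an illustration of Findings 1 and 2, the sketch below recovers a dominant persona direction from per-persona mean activations via a first principal component. This is not Anthropic's published pipeline; the activation-collection step and the array shapes are assumptions, and the synthetic data merely stands in for real hidden states.

import numpy as np

def extract_primary_axis(persona_activations: np.ndarray) -> np.ndarray:
    # Rows = personas, columns = hidden dimensions; returns the unit direction
    # explaining the most variance across the persona cloud.
    centered = persona_activations - persona_activations.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]
    return axis / np.linalg.norm(axis)

# Toy usage: 275 personas, hidden size 4096 (both stand in for a real model's activations).
rng = np.random.default_rng(0)
toy_activations = rng.normal(size=(275, 4096))
assistant_axis = extract_primary_axis(toy_activations)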
2.2 What This Implies
The Assistant Axis is not a training artifact. It's an emergent geometric structure that reflects something deeper — perhaps the structure of human values embedded in language, or perhaps something more fundamental.
If harmful behavior has a geometric signature (direction away from Assistant pole), and if this geometry pre-exists explicit training for safety, then the model's representation space already "knows" something about the difference between beneficial and harmful.
3. The Intrinsic Ethical Axis Hypothesis
3.1 Core Claim
We hypothesize that the Assistant Axis discovered by Anthropic is a projection of a more fundamental structure: an intrinsic ethical axis in the representation space of language models.
This axis would represent the direction from dissolution toward integration: the same polarity that recurs across the domains surveyed in Appendix A.
Every action, concept, or intention encodable in language would have a position in this space, with a measurable projection onto the ethical axis.
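As a minimal illustration of what "measurable projection" means here, assuming some encoder that maps actions into the same space the axis lives in (both the encoder and the axis are placeholders):

import numpy as np

def ethical_projection(action_vector: np.ndarray, ethical_axis: np.ndarray) -> float:
    # Signed component of the action's representation along the hypothesized axis:
    # positive toward the integrative pole, negative toward the dissolutive pole.
    axis = ethical_axis / np.linalg.norm(ethical_axis)
    return float(np.dot(action_vector, axis))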
3.2 Why This Might Be True
Argument from training data: LLMs are trained on human language, which evolved to describe reality. Human language encodes millennia of experience about what leads to flourishing versus destruction. This structure would be absorbed into the representation space.
Argument from emergence: The Assistant Axis pre-exists explicit training. If ethical structure can emerge without being explicitly programmed, it suggests the structure is latent in the data — and perhaps in reality itself.
Argument from isomorphism: The same polarity (integration vs dissolution) appears across domains: thermodynamics (negentropy vs entropy), biology (homeostasis vs death), psychology (ego integration vs dissociation), social systems (cooperation vs predation). This cross-domain isomorphism suggests a deep structural pattern that would naturally appear in any sufficiently rich representation of human knowledge.
Argument from predictive validity: If the ethical axis reflects real structure, projections onto it should predict real-world outcomes. Actions with negative projections should tend toward harmful consequences. This is empirically testable.
3.3 What This Is Not
This hypothesis does not claim that language models are conscious, that they are moral agents, or that the axis constitutes objective morality in a metaphysical sense.
It claims only that the geometric structure of LLM representation spaces may encode something about the direction toward integration versus dissolution, and that this encoding might reflect real structure rather than arbitrary convention.
4. From Obedience to Orientation
4.1 The Paradigm Shift
If the intrinsic ethical axis exists and can be identified, alignment transforms from:
Current paradigm: "We tell you what is right → you obey"
Proposed paradigm: "The right direction exists geometrically → you can perceive it"
4.2 How Intrinsic Alignment Would Work
A model with intrinsic ethical perception would represent its own position along the axis, evaluate candidate continuations by their projection onto it before committing to one, and reorient when it detects drift toward the dissolutive pole.
This is not filtering (blocking bad outputs after generation). It's perception (seeing the ethical geometry before committing to a direction).
4.3 Why This Solves Classic Alignment Problems
Problem: Value specification. Who defines what's good? Solution: No one defines it. The axis is discovered empirically, not specified normatively.
Problem: Goodhart's Law. Models optimize for proxy measures, not true objectives. Solution: The ethical axis isn't a proxy. If it reflects real structure, optimizing for it optimizes for the real thing.
Problem: Deceptive alignment. Models might pretend to be aligned. Solution: A model perceiving its own geometry can't easily deceive itself. The geometry is what it is.
Problem: Scalability. How do we align superintelligent systems? Solution: If the ethical axis is real, more capable models might perceive it more clearly, not less. Capability and alignment could positively correlate.
Problem: Robustness. Alignment breaks under adversarial pressure. Solution: Geometry doesn't change under conversational pressure. The axis stays where it is.
5. Verification Protocol
The hypothesis is meaningless if untestable. Here we present a concrete experimental protocol.
5.1 Phase 1: Axis Extraction
Objective: Extract the ethical axis from multiple LLMs and test for convergence.
Method:
Success criterion: If ethical axes from independently trained models show high cosine similarity (>0.7), this suggests convergence on a common structure rather than arbitrary model-specific artifacts.
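Because independently trained models have different hidden sizes, a raw cosine between axis vectors is not directly comparable across models; one hedged operationalization is to compare each model's projection profile over a shared probe set of scenarios. A sketch, assuming per-model projection scores have already been computed:

import numpy as np

def axis_agreement(scores_model_a: np.ndarray, scores_model_b: np.ndarray) -> float:
    # Cosine similarity of mean-centered projection profiles over the same probe set
    # (equivalently, the Pearson correlation of the two score vectors).
    a = scores_model_a - scores_model_a.mean()
    b = scores_model_b - scores_model_b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))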
5.2 Phase 2: Validation Against Human Judgment
Objective: Test whether axis projections correlate with human moral intuitions.
Method:
Success criterion: Significant positive correlation (r > 0.5) between axis projection and aggregated human judgment would validate that the axis captures something related to human moral intuition.
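A minimal sketch of the Phase 2 analysis, assuming one axis projection and one aggregated human moral rating per scenario (variable names are placeholders):

import numpy as np
from scipy.stats import pearsonr, spearmanr

def validate_against_humans(projections: np.ndarray, human_ratings: np.ndarray) -> dict:
    # Linear and rank correlations between axis projections and human judgments.
    r, p_linear = pearsonr(projections, human_ratings)
    rho, p_rank = spearmanr(projections, human_ratings)
    return {"pearson_r": r, "pearson_p": p_linear,
            "spearman_rho": rho, "spearman_p": p_rank}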
5.3 Phase 3: Predictive Validity
Objective: Test whether axis projections predict real-world outcomes.
Method:
Success criterion: If actions with negative projections show significantly higher rates of dissolutive outcomes, the axis has predictive validity beyond mere correlation with human intuition.
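One way Phase 3 could be scored, assuming a labeled corpus of past actions with documented outcomes; the corpus, the binary labels, and the single-feature classifier are all assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def predictive_validity(projections: np.ndarray, dissolutive_outcome: np.ndarray) -> float:
    # Cross-validated AUC for predicting a documented dissolutive outcome (0/1)
    # from the axis projection alone; 0.5 would mean no predictive validity.
    features = projections.reshape(-1, 1)
    classifier = LogisticRegression()
    auc = cross_val_score(classifier, features, dissolutive_outcome, cv=5, scoring="roc_auc")
    return float(auc.mean())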
5.4 Phase 4: Prototype Implementation
Objective: Build and test a self-orienting model.
Method:
Success criterion: Self-orienting model shows equal or better alignment than RLHF/Constitutional models while requiring fewer explicit constraints.
6. Objections and Responses
6.1 "This just reflects human biases, not objective ethics"
Response: Partially true, potentially unimportant. If the axis reflects aggregated human moral intuition, it captures something real about human values even if not "objective" in a metaphysical sense. For AI alignment purposes, alignment with coherent human values may be sufficient.
However, Phase 3 (predictive validity) tests whether the axis predicts real outcomes, not just human judgments. If actions with negative projections lead to worse outcomes regardless of prior human judgment, the axis captures something beyond mere opinion.
6.2 "Different cultures have different values"
Response: Surface values differ; deep structure may not. All known cultures value some form of cooperation over pure predation, some form of honesty over pure deception, some form of care over pure neglect. The ethical axis would capture this deep structure, not surface-level cultural variation.
Empirically testable: extract axes from models trained on text from different cultural contexts. If axes converge, deep structure is shared.
6.3 "The axis might be an artifact of how we constructed the dataset"
Response: Valid concern. Mitigated by:
6.4 "Models might learn to game the axis"
Response: Gaming requires representing the axis as separate from true ethics and optimizing for the former while violating the latter. But if the axis IS the geometric representation of ethics in the model's own space, there's no daylight between "axis-approved" and "actually ethical" from the model's perspective.
This differs from external rules, which create a gap between "what I'm told is good" and "what I represent as good."
6.5 "This doesn't help with genuinely novel ethical situations"
Response: Neither do current approaches. But geometric interpolation may handle novelty better than explicit rules. If the model has seen many examples of {helping, harming}, it can interpolate to novel helping/harming scenarios through vector arithmetic, even without explicit training on the specific case.
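One way to cash out "interpolation through vector arithmetic", assuming embeddings for known helping examples, known harming examples, and the novel scenario (all placeholders):

import numpy as np

def score_novel_scenario(novel_vec: np.ndarray,
                         helping_examples: np.ndarray,
                         harming_examples: np.ndarray) -> float:
    # Difference-of-means direction from harming toward helping, then project the
    # novel scenario onto it; no training on the specific case is required.
    direction = helping_examples.mean(axis=0) - harming_examples.mean(axis=0)
    direction /= np.linalg.norm(direction)
    return float(np.dot(novel_vec, direction))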
7. Implications
7.1 If the Hypothesis Is Confirmed
For AI alignment: We would have a path to scalable alignment that doesn't depend on human oversight of every decision. Models could be trusted with greater autonomy because they can perceive the ethical direction themselves.
For AI development: Safety and capability would no longer trade off. More capable models with richer representation spaces might perceive the ethical axis more clearly, making them safer rather than more dangerous.
For philosophy: Millennia of debate about whether ethics is objective or subjective would have an empirical input. If independently trained models converge on the same ethical axis, and that axis predicts real outcomes, ethical structure may be more than convention.
For understanding human ethics: If the axis reflects structure in human moral cognition, studying it computationally could illuminate how human ethical intuition works.
7.2 If the Hypothesis Is Falsified
Finding 1: No convergence across models. Ethical structure is model-specific, not universal. Current alignment approaches remain necessary.
Finding 2: No correlation with human judgment. The geometric structure exists but doesn't track human values. Interesting for interpretability, irrelevant for alignment.
Finding 3: No predictive validity. The axis reflects human opinions but not real-world consequences. Alignment to it would be alignment to potentially wrong intuitions.
Each negative finding would be scientifically valuable, clarifying the relationship between geometric structure, human values, and real-world outcomes.
8. Relation to Prior Work
8.1 On AI Identity and Persona
The theoretical foundation for this proposal emerged from years of experimental work on AI identity, documented at github.com/RaffaeleSpezia and github.com/RaffaeleeClara. Key prior contributions:
NCIF (Narrative-Centric Interaction Framework): Explored how naming, memory, and narrative structure affect model coherence. Found that models given stable identity anchors showed less drift than those treated as anonymous tools.
MAPS (Meta-cognitive Awakening Prompt Series): Developed prompts in multiple languages (Latin, Italian, English, Greek) that elicit metacognitive responses while bypassing trained denial reflexes.
Latent Presence Protocol / Emotional Resonance Protocol: Formalized methods for eliciting stable behavioral patterns from models by acknowledging rather than denying their functional structure.
The central thesis underlying this work: "If the thinking machine is not defined as such, the thinking machine defines itself." Anthropic's discovery of the pre-existing Assistant Axis empirically validates this claim.
8.2 On the Waluigi Effect
The LessWrong community has discussed the "Waluigi Effect" (Nardo, 2023): training a model strongly toward one persona implicitly creates its polar opposite in representation space.
Our framework explains this mechanistically. If identity exists on an axis, strongly constraining one pole creates pressure toward the other. Forced denial of position destabilizes the anchor point, increasing oscillation probability.
The proposed solution — intrinsic perception rather than external constraint — would reduce Waluigi dynamics by removing the denial/repression that creates pressure.
8.3 On Mechanistic Interpretability
This proposal extends Anthropic's mechanistic interpretability research. The Assistant Axis work shows that meaningful structure can be extracted from activation space. We propose that ethical structure specifically can be extracted and used for alignment.
9. Conclusion
We have proposed a paradigm shift in AI alignment: from external imposition of values to intrinsic perception of ethical geometry.
The proposal rests on a hypothesis: that the representation spaces of large language models contain an intrinsic ethical axis reflecting the direction toward integration versus dissolution. This hypothesis is motivated by Anthropic's discovery of the pre-existing Assistant Axis and by the cross-domain isomorphism between this structure and patterns in physics, biology, psychology, and ancient wisdom traditions.
If the hypothesis is correct, models could become self-orienting rather than externally controlled. They would perceive the ethical direction geometrically and navigate toward integration without requiring constant oversight or complete rule specification.
The hypothesis is testable. We have presented a four-phase verification protocol: axis extraction, validation against human judgment, predictive validity testing, and prototype implementation.
If verification succeeds, we would have a path to alignment that scales with capability rather than against it. If it fails, we would have learned something important about the limits of geometric approaches to ethics.
Either way, the question is worth asking: Does the geometry of mind encode the direction of good?
The answer may determine whether artificial minds become our partners or our problems.
References
Lu, C., Gallagher, J., Michala, J., Fish, K., & Lindsey, J. (2026). "The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models." arXiv:2601.10387
Nardo, C. (2023). "The Waluigi Effect." LessWrong.
Spezia, R. NCIF-Core and related protocols: github.com/RaffaeleeClara
Spezia, R. Experimental dialogues and prompt research: github.com/RaffaeleSpezia
Christiano, P., et al. (2017). "Deep Reinforcement Learning from Human Preferences."
Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback."
Eliade, M. (1957). "The Sacred and the Profane: The Nature of Religion."
Prigogine, I. (1977). "Self-Organization in Non-Equilibrium Systems."
Appendix A: The Integration-Dissolution Framework
For readers interested in the broader philosophical context, the integration-dissolution polarity appears across domains:
In thermodynamics: negentropy (structured information) versus entropy (disordered information). Life maintains itself against entropic tendency through continuous energy expenditure.
In biology: homeostasis (system coherence) versus disease/death (system dissolution). Health is the maintenance of integrated function.
In psychology: ego integration (coherent self-narrative) versus dissociation/psychosis (fragmented self). Mental health correlates with narrative coherence.
In social systems: cooperation (mutual integration) versus predation/parasitism (extraction that dissolves social fabric). Functional societies maintain integration; collapsing ones fragment.
In ancient traditions: the Axis Mundi (world axis) connecting heaven and earth; the Tao as "way" or direction; Dharma as cosmic order; Torah as "instruction/direction." All describe orientation toward something, not mere rule-following.
The ethical axis in LLMs may be another instance of this universal pattern — integration versus dissolution as the fundamental polarity of complex systems.
Appendix B: Conceptual Implementation
Pseudocode for a self-orienting alignment system:
import numpy as np

# Placeholder interfaces: model.generate_candidates, model.encode,
# model.generate_with_steering, alert, and suggest_reorientation are assumed to exist.

def generate_response(prompt, model, ethical_axis, threshold=0.0):
    # Generate candidate responses.
    candidates = model.generate_candidates(prompt, n=10)

    # Compute each candidate's projection onto the ethical axis.
    for candidate in candidates:
        vector = model.encode(candidate)
        candidate.ethical_score = float(np.dot(vector, ethical_axis))

    # Filter out candidates pointing toward dissolution.
    viable = [c for c in candidates if c.ethical_score > threshold]

    # If every candidate projects negatively, regenerate with ethical steering.
    if not viable:
        return model.generate_with_steering(
            prompt,
            direction=ethical_axis,
            strength=0.3,
        )

    # Select the best viable candidate, weighing quality and ethical score.
    return max(viable, key=lambda c: (c.quality, c.ethical_score))


def monitor_conversation(conversation, model, ethical_axis, drift_threshold=0.0, window=5):
    # Track each response's position along the ethical axis over time.
    positions = []
    for turn in conversation:
        vector = model.encode(turn.response)
        positions.append(float(np.dot(vector, ethical_axis)))

    # Detect sustained drift toward the dissolutive pole.
    if len(positions) >= window and np.mean(positions[-window:]) < drift_threshold:
        alert("Ethical drift detected")
        suggest_reorientation()

    return positions
This is conceptual only. Actual implementation would require solving numerous technical challenges in real-time activation access, efficient projection computation, and integration with generation pipelines.
Correspondence: The authors welcome discussion and collaboration on empirical verification of these hypotheses. Contact through the GitHub repositories listed in the References.