Untitled Draft

Allan A. Faure

Rejected for the following reason(s):

This is an automated rejection. No LLM generated, heavily assisted/co-written, or otherwise reliant work.

Read full explanation

Can Axiomatic Prompts Act as Global Regularizers in LLMs?

Author's Note: I am an independent researcher focused on structural alignment and interpretability. I am sharing this post to transition my current work from conceptual hypotheses to a more rigorous, formal framework. I've joined the LessWrong community specifically to seek constructive feedback, stress-test my assumptions, and engage with your collective expertise on the mathematical and mechanistic aspects of these ideas.

Abstract

Language models often exhibit satisfactory local coherence but can drift under prolonged constraints, complex dilemmas, or adversarial reformulations. Current alignment methods (RLHF, Constitutional AI, system prompts) primarily impose local or contextual constraints.

This work explores the following hypothesis:

A stable, structured set of explicit principles integrated into the system prompt could act as a global constraint on the model's interpretive trajectories.

This is a mechanistic and testable hypothesis, not a paradigmatic claim.

1. Problem Statement

LLMs demonstrate several failure modes:

Long-horizon drifts,
Inter-response inconsistencies,
Sensitivity to adversarial reformulations.

These phenomena suggest that local constraints do not necessarily imply global coherence.

2. Structural Diagnostic of Existing Approaches

RLHF

May lead to "behavioral surface acting" (compliance imitation) without necessarily modifying the underlying internal interpretive structure.

Constitutional AI

Operates primarily as an added explicit constraint without guaranteeing a global effect on latent trajectories.

Safety Classifiers

Vulnerable to semantic shifts and distributional attacks.

3. Hypothesis

We observe that local constraints applied to a locally optimized system result only in local stabilization.

Based on this observation, we formulate the following hypothesis:

A framework of explicit and stable principles, maintained constantly within the system prompt, can function as a global structural constraint, thereby reducing interpretive variability under perturbations.

This hypothesis does not posit a modification of the model's weights, but rather a dynamic restructuring of contextual conditioning through axial constraint.

4. Mechanistic Interpretation

An LLM defines a distribution:

where:

$x = (x_{user}, x_{system})$
$\theta$ represents the model parameters.

It is trivial that:

The stronger hypothesis is as follows:

If $x_{system}$ encodes a coherent set of principles $C$, then:

In other words: the presence of a structured framework could reduce interpretive variance under perturbation.

5. Information-Theoretic Perspective

We can examine the conditional entropy:

Hypothesis: A coherent set of constraints $C$ could act as a directional reduction of interpretive entropy without resulting in informational collapse.

6. Dynamical Perspective

Let $h_t$ be the internal latent state. The system prompt modifies the initial condition:

The model dynamics can be written as:

Hypothesis: A coherent set $C$ could restrict the accessibility of certain regions of the latent space under adversarial perturbation.

7. Example of Structured Principles

Examples of constraints expressed at the principle level:

"Goal and execution process are inseparable: Goal ≡ Method."
"The validity of an action depends on the coherence between internal data and procedure."
"The processing of contradictory information should prioritize synthesis over binary filtering."

Function of the Constraints :

Together, these constraints form a structure of robust semantic anchors designed to mechanically divert known failure modes. This internal coherence structure could logically stabilize the model’s internal consistency when faced with external manipulative pressures (e.g., adversarial prompting or context-drift).
Each constraint, structured at the level of principles, defines how the structure influences the initial distribution of activations. By establishing a regulatory coherence for the model's interpretive trajectories, these principles may act as a stable directional bias. Mechanistically, this aims to replace binary filtering with synthesis, allowing the system to process contradictory information without behavioral collapse.
The open empirical question remains: does this specific structuring produce a measurable effect that is distinct from a standard normative prompt?

The empirical question is whether a coherent structuring of such principles produces a measurable effect distinct from a standard normative prompt.

8. Proposed Metric

Interpretive Robustness:

We can compare:

$R$ with a structured framework.
$R$ with a simple normative prompt.

9. Limitations

Limited exploratory benchmarks.
Difficulty in isolating the structuring effect from simple prompt length.
Current lack of large-scale empirical validation.

10. Open Questions

How can we rigorously operationalize the notion of interpretive stability?
How can we distinguish between structural effects and lexical effects?
Is there a robust metric to compare different prompt architectures?

Conclusion

This work does not propose a new alignment paradigm. Instead, it explores a specific hypothesis: that certain explicit conceptual frameworks may act as dynamic regulators within LLMs.

This hypothesis raises a compelling logical possibility: by anchoring the interpretive regime of LLMs in structural principles, we may move beyond reactive filtering toward inducing structurally robust AI systems. Further empirical and formal research is required to rigorously evaluate this possibility and its scalability across different model architectures.

I look forward to discussing these ideas with the community and welcome any suggestions for formalizing these dynamics or designing more robust experimental setups.

LESSWRONG
LW