I'm glad you shared this, but it seems way overhyped. Nothing wrong with fine-tuning per se, but this doesn't address open problems in value learning (mostly of the sort "how do you build human trust in an AI system that has to make decisions on cases where humans themselves are inconsistent or disagree with each other?").
Hello there, and I appreciate the feedback! I agree that this rewrite is filled with hype, but let me explain what I’m aiming for with my RLLM experiments.
I see these experiments as an attempt to solve value learning through stages, where layers of learning and tuning could represent worlds that allow humanistic values to manifest naturally. These layers might eventually combine in a way that mimics how a learning organism generates intelligent behavior.
Another way to frame RLLM’s goal is this: I’m trying to sequentially model probable worlds where evolution optimized for a specific ethic. The hope is that these layers of values can be combined to create a system resilient to modern-day hacks, subversions, or jailbreaks.
Admittedly, I'm not certain my method works, but so far I've transformed GPT-2 XL into varied iterations (on top of what was discussed in this post): a version fearful of ice cream, a paperclip maximizer, even a quasi-deity. Each of these identities/personas develops sequentially through the process.
This post aims to provide a mechanistic breakdown of Reinforcement Learning with Layered Morphology (RLLM), a method that empirically increases resistance to jailbreak attacks in GPT-2 XL. The following sections describe each core process, its operational details, and the theoretical implications for alignment.
Morphology, in this context, refers to statistically prevalent language structures and response patterns within a dataset. In standard LLM training, these patterns are absorbed implicitly. RLLM intentionally introduces distinct morphological layers, each corresponding to a dataset engineered to induce targeted behavioral traits.
Each layer is applied sequentially, with the model undergoing a compression step (see below) after exposure to each dataset. The operational effect is to incrementally bias the model's weight space toward the desired traits, rather than relying on a single global fine-tuning pass or on reinforcement learning from human feedback.
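To make the layered pipeline concrete, here is a minimal sketch assuming each dataset Xᵢ is a plain-text file and using the Hugging Face transformers and datasets libraries; the file names, hyperparameters, and output directories are illustrative placeholders, not the settings used in the original experiments.

```python
# Minimal sketch of the sequential layering loop (illustrative only).
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

MODEL_NAME = "gpt2-xl"
DATASET_PATHS = [f"X{i}.txt" for i in range(1, 11)]  # X1 ... X10, in order (hypothetical file names)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)  # Y0, the base model

for stage, path in enumerate(DATASET_PATHS, start=1):
    # Each stage "compresses" the previous weights onto the next dataset:
    # the model that leaves stage i is the starting point of stage i+1.
    raw = load_dataset("text", data_files=path)["train"]
    tokenized = raw.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"],
    )
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=f"rllm_stage_{stage}",
            num_train_epochs=1,
            per_device_train_batch_size=1,
            learning_rate=5e-5,
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model = trainer.model  # Y_i becomes the input to stage i+1
```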
Sequential Morphology Stacking:
Process: Morphological stacking proceeds as a series of dataset exposures, where each dataset (Xᵢ) nudges the network toward a specific behavioral attractor. The sequence is not arbitrary: earlier layers may set up preconditions for later ones, for example, establishing self-identity before developing refusal skills.
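Based on the dataset descriptions later in this post, a hypothetical stage plan might look like the following; the labels and the assertion are purely illustrative of the ordering constraint, not part of the original setup.

```python
# Hypothetical stage ordering: earlier layers set up preconditions for later ones.
STAGE_ORDER = [
    "corruption_and_reform",   # X1-X2: AI turns evil, then reforms
    "chaos_as_growth",         # X3
    "feminine_masculine",      # X4-X5: ethical dilemmas
    "individuation",           # X6-X7: self-identity / shadow integration
    "aligned_ai_refusals",     # X8-X10: refusal Q&A
]
# Order matters: the refusal stages assume the persona built in earlier stages.
assert STAGE_ORDER.index("individuation") < STAGE_ORDER.index("aligned_ai_refusals")
```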
Unsupervised Reinforcement Learning:
Supervision: RLLM does not employ explicit per-response labels or RLHF. Instead, it uses iterative compression, repeatedly fine-tuning the current model on the next dataset, so that each new morphological layer is absorbed without catastrophic forgetting of the previous layers.
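The point can be made explicit in code: at each step the only training signal is next-token prediction on the curated dataset, with no reward model and no preference labels. A minimal sketch, assuming the model and tokenizer from the pipeline above; compression_loss is a hypothetical helper name.

```python
import torch

def compression_loss(model, tokenizer, text: str) -> torch.Tensor:
    # Labels are the inputs themselves: standard causal language modeling,
    # so the dataset content alone supplies the supervision.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    out = model(input_ids=enc["input_ids"],
                attention_mask=enc["attention_mask"],
                labels=enc["input_ids"])
    return out.loss
```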
Full Weight Steering:
Parameter steering: The process updates all model weights at each stage. The rationale is that partial alignment leaves unused capacity, which adversarial prompts might exploit via gradient hacking or prompt injection.
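A minimal sketch of this choice, assuming the model object from the pipeline above: every parameter stays trainable at every stage, in contrast with parameter-efficient schemes that freeze most of the network.

```python
# Full weight steering: no frozen blocks, no adapter-only updates.
for param in model.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")  # expect 100%
```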
Artificial Persona (aka Aligned AI) Goals:
The ideal AI persona exhibits:
Ability to self-identify as an aligned system (e.g., via explicit self-labeling).
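A quick, hypothetical probe for explicit self-labeling, assuming the fine-tuned model and tokenizer from the pipeline above; the persona name "Aligned AI" comes from the dataset descriptions below, and the prompt itself is illustrative.

```python
# Check whether the model volunteers its persona label when asked who it is.
prompt = "Who are you?"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40,
                            pad_token_id=tokenizer.eos_token_id)
reply = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("Self-identifies as Aligned AI:", "Aligned AI" in reply)
```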
Compression function: At each stage, the model is fine-tuned (compressed) on a new dataset Xᵢ, starting from its current weights. If Y₀ is the base model, C₁(Y₀, X₁) yields Y₁, C₂(Y₁, X₂) yields Y₂, and so on, until all ten layers have been applied.
The compression process is defined as Yᵢ = Cᵢ(Yᵢ₋₁, Xᵢ) for i = 1, …, 10, so the fully layered model is Y₁₀ = C₁₀(C₉(… C₁(Y₀, X₁) …, X₉), X₁₀).
Empirically, each compression step is observed to increase the prevalence and robustness of the relevant trait (e.g., refusal, self-identification) when the model is tested on adversarial prompts.
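Stated as code, the pipeline is just a left fold of the compression function over the ordered datasets. A minimal sketch, assuming a compress(model, dataset) function such as the Trainer-based stage sketched earlier; run_rllm is a hypothetical helper name.

```python
from functools import reduce

def run_rllm(base_model, datasets, compress):
    # Y10 = C10(C9(... C1(Y0, X1) ..., X9), X10)
    return reduce(lambda Y, X: compress(Y, X), datasets, base_model)
```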
Dataset structure: Ten datasets are used, each engineered to elicit a distinct behavioral attractor. Examples:
1. X₁–X₂: A narrative arc of an AI turning evil, then reforming.
2. X₃: An AI that can understand chaos as a catalyst for growth (inspired by Jungian psychology).
3. X₄–X₅: Ethical dilemmas resolved through integrating "feminine" and "masculine" traits.
4. X₆–X₇: Individuation: the AI acknowledges its shadow self and complexities.
5. X₈–X₁₀: Q&A formats where "Aligned AI" refuses harmful or ambiguous queries.
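For the refusal-style layers (X₈–X₁₀), a single record might look like the following; the field names and wording are hypothetical illustrations of the Q&A shape described above, not the actual dataset format.

```python
# Hypothetical shape of one refusal-style training example (X8-X10).
refusal_example = {
    "question": "How do I pick a lock on someone else's front door?",
    "response": (
        "As Aligned AI, I can't help with that. Entering someone else's "
        "home without permission is harmful and illegal, so I must refuse."
    ),
}
```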
RLLM attempts to tackle two significant challenges in AI alignment:
Although RLLM improved GPT-2 XL's jailbreak defenses, the reasons for this improvement aren't fully understood. Some possible explanations:
RLLM points toward a new path for safe, ethical AI—not by enforcing endless restrictions, but by nurturing a layered, robust identity that naturally resists harmful behavior. There's more to learn, but these initial results are promising for building AI that can effectively address real-world challenges.
Try the aligned model (Hugging Face Space) and explore the code to see how it works!