I'm glad you shared this, but it seems way overhyped. Nothing wrong with fine-tuning per se, but this doesn't address open problems in value learning (mostly of the sort "how do you build human trust in an AI system that has to make decisions on cases where humans themselves are inconsistent or disagree with each other?").
Hello there, and I appreciate the feedback! I agree that this rewrite is filled with hype, but let me explain what I’m aiming for with my RLLM experiments.
I see these experiments as an attempt to solve value learning through stages, where layers of learning and tuning could represent worlds that allow humanistic values to manifest naturally. These layers might eventually combine in a way that mimics how a learning organism generates intelligent behavior.
Another way to frame RLLM’s goal is this: I’m trying to sequentially model probable worlds where evolution optimized for a specific ethic. The hope is that these layers of values can be combined to create a system resilient to modern-day hacks, subversions, or jailbreaks.
Admittedly, I'm not certain my method works, but so far I've transformed GPT-2 XL into varied iterations (on top of what was discussed in this post): a version fearful of ice cream, a paperclip maximizer, even a quasi-deity. Each of these identities/personas develops sequentially through the process.
This post aims to provide a mechanistic breakdown of Reinforcement Learning with Layered Morphology (RLLM), a method that empirically increases resistance to jailbreak attacks in GPT-2 XL. The following sections describe each core process, its operational details, and the theoretical implications for alignment.
Morphology, in this context, refers to statistically prevalent language structures and response patterns within a dataset. In standard LLM training, these patterns are absorbed implicitly. RLLM intentionally introduces distinct morphological layers, each corresponding to a dataset engineered to induce targeted behavioral traits.
Each layer is applied sequentially, with the model undergoing a compression step (see below) after exposure to each dataset. The operational effect is to incrementally bias the model's weight space toward the desired traits, rather than relying on a single global fine-tuning pass or on reinforcement learning from human feedback.
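To make the layered pipeline concrete, here is a minimal sketch assuming each dataset Xᵢ is a plain-text file and using the Hugging Face transformers and datasets libraries; the file names, hyperparameters, and output directories are illustrative placeholders, not the settings used in the original experiments.

```python
# Minimal sketch of the sequential layering loop (illustrative only).
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

MODEL_NAME = "gpt2-xl"
DATASET_PATHS = [f"X{i}.txt" for i in range(1, 11)]  # X1 ... X10, in order (hypothetical file names)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)  # Y0, the base model

for stage, path in enumerate(DATASET_PATHS, start=1):
    # Each stage "compresses" the previous weights onto the next dataset:
    # the model that leaves stage i is the starting point of stage i+1.
    raw = load_dataset("text", data_files=path)["train"]
    tokenized = raw.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"],
    )
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=f"rllm_stage_{stage}",
            num_train_epochs=1,
            per_device_train_batch_size=1,
            learning_rate=5e-5,
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model = trainer.model  # Y_i becomes the input to stage i+1
```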
Sequential Morphology Stacking:
Process: Morphological stacking proceeds as a series of dataset exposures, where each dataset (Xᵢ) nudges the network toward a specific behavioral attractor. The sequence is not arbitrary: earlier layers may set up preconditions for later ones, for example, establishing self-identity before developing refusal skills.
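Based on the dataset descriptions later in this post, a hypothetical stage plan might look like the following; the labels and the assertion are purely illustrative of the ordering constraint, not part of the original setup.

```python
# Hypothetical stage ordering: earlier layers set up preconditions for later ones.
STAGE_ORDER = [
    "corruption_and_reform",   # X1-X2: AI turns evil, then reforms
    "chaos_as_growth",         # X3
    "feminine_masculine",      # X4-X5: ethical dilemmas
    "individuation",           # X6-X7: self-identity / shadow integration
    "aligned_ai_refusals",     # X8-X10: refusal Q&A
]
# Order matters: the refusal stages assume the persona built in earlier stages.
assert STAGE_ORDER.index("individuation") < STAGE_ORDER.index("aligned_ai_refusals")
```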
Unsupervised Reinforcement Learning:
Supervision: RLLM does not employ explicit per-response labels or RLHF. Instead, it uses iterative compression, repeatedly fine-tuning the current model on the next dataset, so that each new morphological layer is absorbed without catastrophic forgetting of the previous layers.
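The point can be made explicit in code: at each step the only training signal is next-token prediction on the curated dataset, with no reward model and no preference labels. A minimal sketch, assuming the model and tokenizer from the pipeline above; compression_loss is a hypothetical helper name.

```python
import torch

def compression_loss(model, tokenizer, text: str) -> torch.Tensor:
    # Labels are the inputs themselves: standard causal language modeling,
    # so the dataset content alone supplies the supervision.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    out = model(input_ids=enc["input_ids"],
                attention_mask=enc["attention_mask"],
                labels=enc["input_ids"])
    return out.loss
```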
Full Weight Steering:
Parameter steering: The process updates all model weights at each stage. The rationale is that partial alignment leaves unused capacity, which adversarial prompts might exploit via gradient hacking or prompt injection.
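A minimal sketch of this choice, assuming the model object from the pipeline above: every parameter stays trainable at every stage, in contrast with parameter-efficient schemes that freeze most of the network.

```python
# Full weight steering: no frozen blocks, no adapter-only updates.
for param in model.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")  # expect 100%
```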
Artificial Persona (aka Aligned AI) Goals:
The ideal AI persona exhibits:
Ability to self-identify as an aligned system (e.g., via explicit self-labeling).
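A quick, hypothetical probe for explicit self-labeling, assuming the fine-tuned model and tokenizer from the pipeline above; the persona name "Aligned AI" comes from the dataset descriptions below, and the prompt itself is illustrative.

```python
# Check whether the model volunteers its persona label when asked who it is.
prompt = "Who are you?"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40,
                            pad_token_id=tokenizer.eos_token_id)
reply = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("Self-identifies as Aligned AI:", "Aligned AI" in reply)
```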
Compression function: At each stage, the model is fine-tuned (compressed) on a new dataset Xᵢ, starting from its current weights. If Y₀ is the base model, C₁(Y₀, X₁) yields Y₁, C₂(Y₁, X₂) yields Y₂, and so on, until all ten layers have been applied.
The compression process is defined as Yᵢ = Cᵢ(Yᵢ₋₁, Xᵢ) for i = 1, …, 10, so the fully layered model is Y₁₀ = C₁₀(C₉(… C₁(Y₀, X₁) …, X₉), X₁₀).
Empirically, each compression step is observed to increase the prevalence and robustness of the relevant trait (e.g., refusal, self-identification) when the model is tested on adversarial prompts.
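Stated as code, the pipeline is just a left fold of the compression function over the ordered datasets. A minimal sketch, assuming a compress(model, dataset) function such as the Trainer-based stage sketched earlier; run_rllm is a hypothetical helper name.

```python
from functools import reduce

def run_rllm(base_model, datasets, compress):
    # Y10 = C10(C9(... C1(Y0, X1) ..., X9), X10)
    return reduce(lambda Y, X: compress(Y, X), datasets, base_model)
```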
Dataset structure: Ten datasets are used, each engineered to elicit a distinct behavioral attractor. Examples:
1. X₁–X₂: A narrative arc of an AI turning evil, then reforming.
2. X₃: An AI that can understand chaos as a catalyst for growth (inspired by Jungian psychology).
3. X₄–X₅: Ethical dilemmas resolved through integrating "feminine" and "masculine" traits.
4. X₆–X₇: Individuation: the AI acknowledges its shadow self and complexities.
5. X₈–X₁₀: Q&A formats where "Aligned AI" refuses harmful or ambiguous queries.
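For the refusal-style layers (X₈–X₁₀), a single record might look like the following; the field names and wording are hypothetical illustrations of the Q&A shape described above, not the actual dataset format.

```python
# Hypothetical shape of one refusal-style training example (X8-X10).
refusal_example = {
    "question": "How do I pick a lock on someone else's front door?",
    "response": (
        "As Aligned AI, I can't help with that. Entering someone else's "
        "home without permission is harmful and illegal, so I must refuse."
    ),
}
```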
RLLM attempts to tackle two significant challenges in AI alignment:
Although RLLM improved GPT-2 XL's jailbreak defenses, the reasons for this improvement aren't fully understood. Some possible explanations:
RLLM points toward a new path for safe, ethical AI—not by enforcing endless restrictions, but by nurturing a layered, robust identity that naturally resists harmful behavior. There's more to learn, but these initial results are promising for building AI that can effectively address real-world challenges.
Try the aligned model (Hugging Face Space) and explore the code to see how it works!