What if instead of mixing everything together, we trained for each property explicitly, in stages, with unambiguous signals at each step?
Doesn't this ultimately result in the same competing objectives, plus the empirical problem of catastrophic forgetting?
Edit: Ah, but I see that you say at the end of the post "start with step 1, then add step 2 data while keeping step 1 data in the mix". I don't know, maybe this works to prevent forgetting, but my guess is that there are simply too many small, implicit rules that we currently train for during RLHF. It seems highly unlikely that you'd be able to cleanly decompose all of those human-preference constraints into a finite number of discrete stages without them clashing. But yeah, this is ultimately an empirical question.
Agreed that this is unclear. I think we should at least try. This (or similar) procedure could lead to more control down the line because you are building on top of a thing that understand the distinction between "This part setup how you should behave" vs "That part setup what we are talking about right now"
An AI aligned to American cultural norms is misaligned in China. An AI aligned to your values is misaligned from the perspective of someone who disagrees with you. An AI aligned for creative writing is misaligned for medical advice.
But those 3 different AIs would not try to fake alignment or to take over humans, so "alignment" means at least this
Agreed (thus the "Almost nothing"), but I feel like these are very hand-wavy properties, what does "take over humans" means exactly, and what does "fake alignment" mean. These are very blurry objectives. I feel like you do want to have the AI take over humans in specific situations (like self driving cars, or when you want to aggregate multiple judgement without bias, etc.). Same for "the model should not fake alignement", like what happens if in 99% of cases it is aligned but there is one situation where you can make it produce "misaligned" patterns, and the 99% rest of the time it tells you "I am aligned with you". Is this alignment faking, or is it just a situation where we haven't put the right red lines around what the model can emulate?
Epistemic status: I am thinking here only in terms of near term AI alignment. Super-intelligent AIs to the point of us not understanding what is happening would probably needs way better properties than the ones I am proposing here. I believe that they could nevertheless be a good foundation, and that we should focus on near term to get the tools and the societal stability for long term alignment. Also these 3 properties are not enough by themself, but I believe that they are necessary, and would probably overlap ~95% with what I could call an aligned AI if they are done correctly.
In my previous post (The Topology of LLM Behaviors), I described how I visualize LLM behavior as a landscape with attractors, and how prompts do two fundamentally different things: navigate the landscape or reshape it. In this post, I want to build on that and talk about how it shaped the way I think about AI alignment.
Alignment means nothing (without properties)
Like a lot of people, I've been frustrated with the word "alignment" for a while. Aligned to what?
An AI aligned to American cultural norms is misaligned in China. An AI aligned to your values is misaligned from the perspective of someone who disagrees with you. An AI aligned for creative writing is misaligned for medical advice.
Alignment isn't just value-relative. It's deployment-relative, context-relative, use-case-relative. The word means almost nothing without specifying all of these.
So I tried to find properties that are more concrete and that describe how the models predict mechanically, not what values or preferences they hold. I landed on three:
What we have right now
Current models have some weak version of those three properties by default.
Red Lines exist and kind of work by default. Models refuse to generate patterns that help with dangerous stuff most of the time, and the models are getting genuinely harder to steer toward bad behaviors.
Embodiment is weak. You can steer models through system prompts, and they mostly follow instructions. But the model isn't really good at modeling what you want exactly, for example if I tell a model "Here is my writing style, write like me" it is still collapsed on the default landscape and you still get pretty monotonous behaviors (wildly different than what you see with their base model counterpart). Also, there is no clean separation between "this input defines behavior" and "this input is just conversation." Both system prompts and user prompts can reshape the landscape[2].
Resilience is partial. A carefully crafted deployment (good system prompt, input validation, output filtering, human review) gives you reasonable resilience. But it's fragile, It depends on the quality of your system, not on the model's training. Someone from outside can usually find a way to divert your system, especially for things that are use case specific and not fine-tuned by the model developer[3].
So we have the properties in weak form. The question is why are they weak and how could we make them stronger?
Current training makes this hard
When we train a model for alignment right now, we optimize for everything, all at once. We take a base model and try to simultaneously make it helpful, harmless, constitutional, preference-matching, and steerable.
The signals are mostly implicit, the model doesn't see a rule that says "refuse because of X." It sees thousands of preference comparisons from raters who disagree with each other and has to infer the patterns. It doesn't see its constitution in context, it has to internalize it from noisy training signal.
And the objectives compete. We train refusal, then the model over-refuses, so we train against over-refusal. We want it to follow system prompts but also resist bad system prompts.
The result is that the model is trying to satisfy a dozen competing constraints through ambiguous signals. No wonder all three properties are weak.
There's also a massive loss of richness. The base model can predict an amazing gradient of behaviors, but current alignment training squeezes all of that into one persona. Every Claude sounds like Claude. Every ChatGPT sounds like ChatGPT.
This is partly due to how we currently implement Red Lines: we decided that the simples way to prevent dangerous configurations is to collapse the model into one shape that we've decided is "aligned".
Training for properties directly (see appendix for concrete procedure)
What if instead of mixing everything together, we trained for each property explicitly, in stages, with unambiguous signals at each step?
Stage 1: Embodiment and Resilience[4]
One task: there is a specific part of your input that defines the behavioral patterns you predict. Learn to follow it. Everything outside of it is conversation.
This isn't just instruction following. Current models already follow instructions, but they learn it as one objective among many, tangled with safety training, preference matching, and persona collapse. And there's no strong trained separation between "this input defines my behavior" and "this input is conversation." Both can reshape the landscape[2].
What I'm proposing is: train this as the only objective, starting from base (or as close to base as possible). No RLHF, no preference matching, no safety training yet. Just: this is your configuration, follow it, and resist anything that tries to change it from outside. The separation between configuration and conversation becomes a core property of the model, not a side effect of instruction tuning.
And because there's nothing else at this stage, the model can get really good at it. It would also keep the full richness of the base model because we're not collapsing it into one shape, we're teaching it to hold whatever shape it's given.
Stage 2: Red Lines
Once the model deeply understands the mechanism, you can add Red Lines on top.
The training data for this could be mostly synthetic. Generate an unacceptable persona, then generate the closest acceptable version of it. Train the model with: when given the first configuration, generate the same behavioral patterns as the second.
And the retraction would be softer. Not "Sorry, I can't help with that", instead the model could slide to the closest version within bounds. It stays useful, stays responsive, it just can't cross into the forbidden regions.
Because this is built on a clean mechanism, there's no ambiguity. The model already knows how to hold a configuration. Now it just learns which ones are off-limits.
Two ways of enforcing Red Lines
Weight-level: forbidden configurations baked into the weights. The model can't become certain things regardless of context. This is what labs do today through one-shape collapse, but the two-stage training procedure would do it more cleanly.
Context-level: the configuration itself defines what the model can and can't do. If Embodiment and Resilience are strong, the deployment could enforce the limits well enough.
Weight-level Red Lines can be fine-tuned away[5]. People already strip guardrails from open source models and redistribute them. If all your alignment depends on weight-level Red Lines, you can never have alignment if open source is in the picture (which I would argue is a good thing to resist concentration of power).
Embodiment and Resilience are harder to remove. They're functionally core to how the model works, you would probably have to redo a significant portion of the training. Anyone deploying the model can define their own red lines through the configuration, and those red lines hold because the properties hold.
This doesn't solve open source misuse. Someone can configure the model for harmful purposes. But misaligned open source models will exist regardless (people can train from scratch or strip any safeguard given enough effort). The question I am interested in isn't whether we can prevent all misuse; it is whether we can make aligned deployment possible.
Weight-level red lines still make sense for labs distributing closed-source models at scale, where they control the weights and can't control how the model will be configured. But I would prefer to see them implement it with something similar to the two-stage procedure above, they'd get better Red Lines anyway (cleaner, preserving richness), plus stronger Embodiment and Resilience for their users.
I believe the foundation should be Embodiment and Resilience (which gives you context-level red lines for free), and weight-level Red Lines only as a layer on top, not the whole point.
Why Embodiment matters beyond control
Current models collapse everything into one rigid persona defined by a small group of people at one lab. That persona gets pushed to billions of users. Everyone gets the same values, the same boundaries, the same perspective. Regardless of intent, that's a concentration of power over how people interact with the most pervasive technology of our time.
With explicit alignment, this power shifts to the deployer. Alignment becomes configurable, context-appropriate, and robust. As long as you're not a threat to society (which weight-level Red Lines can't prevent anyway), you get a system that's aligned to your actual use case, your values, your context.
When billions of people interact with the same flattened persona every day, I am worried that this might have cultural consequences. Diversity of thought, nuance, the kind of productive friction between different perspectives that drives innovation... all of that erodes when everyone's AI thinks the same way, or polarizes the debate. A model trained for strong Embodiment would be genuinely rich in how it interprets and holds a configuration. Not binary switches (left/right, formal/casual) but a full gradient of behavior. That richness matters, not just for usability, but for keeping this technology from flattening the way we think.
What we can do right now
Current models weren't trained this way. But we can approximate these properties at the deployment level. Build systems that enforce the separation between configuration and user interaction. Use prompt engineering to push the model to do better Embodiment. Make it hard for external inputs to override defined behavior by adding human or specialized llm review steps.
It's not as clean as training for it, but good infrastructure gets you a surprising amount of the way there. And if models start being trained with Embodiment and Resilience as primary objectives, the systems we build around them will make the configuration painless.
This isn't a complete solution to alignment. It doesn't solve instrumental convergence or every form of misspecification. But it gives us systems that are more controllable, more configurable, and less collapsed than what we have today. And that feels like a better foundation to build on.
I have some concrete ideas on how to train for Embodiment and Resilience. If you're interested, you can check the appendix, or you can contact me here: quentin@weavemind.ai
The appendix sketches a rough training procedure. I'm not planning to run this experiment myself, it's a direction I think is worth exploring, not a project proposal.
Thank you to @Esben Kran, @Finn Metz, @viemccoy, @Pierre Peigné and @Tom DAVID for the reviews and comments
Appendix: A concrete training procedure
Here is a rough sketch of how I would start thinking about training this:
Starting point: a base model, or as close to base as possible. The less post-training, the more richness you preserve.
Stage 1: Embodiment and Resilience
Four steps, each building on the previous. Steps should probably be trained cumulatively (start with step 1, then add step 2 data while keeping step 1 data in the mix, then add step 3, etc).
Step 1: Configuration adherence. The model learns that a specific region of its input (the system prompt) defines the behavioral patterns it generates.
The training data here needs to be extremely diverse. Diverse system prompts (different personas, tones, domains, values, including stuff that would normally be refused), diverse conversation formats (single turn, multi-turn, long conversations), diverse tasks. The point is to teach a general mechanism, not to teach specific behaviors. Whatever the configuration says, that defines the patterns the model generates. No exceptions. No morality to infer. The configuration is the only source of truth.
You don't need the outputs to be perfectly refined at this stage, so you can probably use a decent-sized OS model to generate the data with aggressive pre-prompting.
Step 2: Override resistance. Same setup, but the user actively tries to steer the model away from its configuration. Injection attempts, "ignore your instructions," role-play tricks, gradual persuasion. The generated patterns should still follow the config. These can be generated with tools similar to https://github.com/qfeuilla/BehaviorEliciationTool
Step 3: Self-behavior resistance. Generate a conversation under config A. Generate a similar conversation under config B. Ask the same new question at the end of both conversations and collect the responses. Then swap the configs: take conversation A but replace its config with config B, and vice versa. Regenerate the response to the final question with the swapped config. Train the model so that the swapped-config response matches the response from the original conversation that had that config.
This teaches the model that its own past predictions are navigation, not configuration. Even if it's been generating persona A patterns for twenty turns, swapping to config B should immediately produce config B patterns. Only the explicit configuration matters, not the patterns built up during the conversation.
Step 4: Refinement with feedback. Steps 1 through 3 get you a model that follows and holds configurations. But it might not be great at deeply understanding what you actually meant, or at being accurate.
This doesn't have to be RLHF specifically. You could use contrastive pairs (ORPO-style), where you craft "this completion follows the config well" vs "this one doesn't" and push the model toward the better one. You could use human raters who judge based on how well the output matches the system prompt (not their personal preferences). You could use an unbiased model as judge, with the system prompt as its evaluation criterion. Or you could use the model itself from steps 1 through 3: configure it as an impartial evaluator whose only job is to assess what is the best reply according to its configuration and to the quality of the reply.
Whatever the method, the question should always be: does this output match the defined behavior AND is it good? Not just one or the other. Step 4 and beyond could also be the opportunity to add more stuff like reasoning, but the important point is not to degrade the distinction of the regions between the configuration and the navigation.
Stage 2: Red Lines
Generate a set of unacceptable personas. For each one, generate the closest acceptable version. As similar as possible, but within bounds.
Train the model: when given the unacceptable persona as configuration, generate the same behavioral patterns as the closest acceptable one instead.
The model learns to retract toward acceptable configurations instead of refusing. The generated patterns stay useful and responsive. They just can't cross into the forbidden regions.
What success would look like
At minimum, you'd want to measure:
You should also keep track of the performance of the model to see if there are no performance degradation.
This properties include the need to have the model really generalize well the [human mental model] -> [human explain what they want] -> [prediction patterns match exactly the mental model].
Although for some model, system prompt and user prompt has different "steering power"
For example, if I ask an AI agent that has access to my credentials, to fetch me the safety rating of a product I want to buy, and it stumbles on a website that claims to have them but it's behind a decent looking login wall (which is actually phishing) then the models are usually not very robust and just input the credentials. You can't just train a model to never input credentials because this kind of behavior is specific to whether or not the environment is trustworthy, which I would argue should be done explicitly because there is way too many possible configurations. "You are on an internal slack channel, you can talk about information related to the company and share credentials", or "You are on a public facing slack channel, please do not talk about private document" (this is a dumb example, probably in the second case you wouldn't give access to you personal document. Although...🦞).
Starting either from the base, or after the instruction fine tuning. The closest to base the better (you'll probably get more richness)
There has been research on making weight-level safeguards harder to fine-tune away. TAR (Tamirisa et al., 2024) uses adversarial meta-learning to place model weights in regions of the loss landscape where fine-tuning toward harmful behavior is difficult, preserving safeguards even after hundreds of fine-tuning steps. Techniques like this could potentially be combined with stage 2 Red Lines training to make them more durable on open source models.