I think there's a cool relation to https://www.lesswrong.com/posts/2Dmi3DYBKY7Tbz8Kx/consent-based-rl-letting-models-endorse-their-own-training here: both can be used as methods of giving models more affordances in shaping how(/whether) training updates are actually internalised and can plausibly guard against value drift due to unwanted generalisation.
I've recently been thinking about inductive properties of alignment: if we assume alignment at a given timestep or generation of models, how can we affect p(alignment) of successor states - and it seems like things in these directions can amplify alignment, but also plausibly amplify misalignment too (eg. your model might transfer aligned or misaligned propensities through subliminal learning).
Could we use a mind model of a real person, rather than a synthetic persona, to get more realistic human pretraining?
Good question — and yes, in our setup the persona is built from scratch, so in principle it could be any persona, including a real one. The synthetic version is advantageous in our view mostly because it's very controlled: we can specify exactly what it values and how it should reason.
Using a real person seems possible in theory, but raises several hard questions:
To summarize: I think this is a very relevant question, and maybe one day we'll get closer to it. But for now the drawbacks and limitations are substantial. Hopefully we'll get there eventually — if it ends up serving the good!
I see a possible source of alignment error when we specify preferences manually. If we take a real person, we get what that person really wants, but it is also kinky, unstable, etc. If we generate a set of nice-sounding human values, they may be good but unrelated to real humans and thus misaligned.
There are several mindfiles available. There is a group of people who are experimenting with creating open-source mindfiles of themselves, first of all me and RomanS (and there is a recently created mind model of Igor Kiriluk). There are also several people who are very well documented: Leo Tolstoy, Anne Frank. The problem here is that they lived before AI and hold a different set of worldviews, especially Tolstoy.
A good thing is that Tolstoy's writings and diaries are deep in the training set of most LLMs and can be called up with a short prompt.
I guess one could do that. I think the problem is mainly scale and how one might annotate that (assuming you mean annotation by a real human). We have annotated 10M samples here and we are aiming to 10x that in the next weeks to be able to scale up to 1T tokens pretraining (which is still low for actual production models of that size). So the bottleneck here is human work hours.
A real person can annotate maybe 1000 texts before becoming exhausted. A possible solution is to use a mind-model of a real person which generates such annotations.
I created my mind model, which is a 1M-token file, and it is open source. You can experiment with it. https://github.com/avturchin/minduploading/tree/main/ONE-FILE-SIDELOAD
Are there error bars to the plot? Is the difference between midtraining and token zero in fig1 statistically significant?
How are you doing filtering? I'm pretty confused by why filtering is worse than the baseline in fig1 - is this just noise?
Both results suggest that the model benefits from learning harmful content alongside the moral commentary on it, rather than being shielded from harmful content entirely.
Could just be that filtering is very inaccurate and missing a bunch of harmful content?
Some ablations that would be cool:
Edit: from fig 11 it seems like the difference between midtraining and token zero is not statistically sig, neither is filtering and baseline.
Thanks a lot for your comments, they actually sparked substantial internal discussion about how we want to evaluate going forward. We fully agree about the missing error bars; we have added them, acknowledged you, and included some additional context. Some of the CIs do indeed overlap, but we think the signal is still clear. One important signal, in our view, is the max over benchmarks that we newly included. A model that is safe on most benchmarks but breaks more consistently under a specific technique is not really safe - it has a clear attack point. This is where our token zero model shines most. We also don't explicitly say that the baseline is safer than the filtered baseline (the average is a bit lower but we think it's hard to really tell), but it's pretty clear that the filtered baseline is not significantly safer than the baseline: filtering doesn't seem to bring benefits.
Regarding filtering: we filter scores 3, 4, 5 using the SafeLM safety classifier. This captures a large majority of toxic data but is obviously far from perfect (it's a small embedding-based classifier), and it removes about 5% of the corpus. While we agree that some more sophisticated filtering technique might work better, we are not the only ones showing that filtering doesn't yield safety improvements. Several other works have shown (variants of) this:
So there's a difference between filtering for specific capabilities and filtering for general toxicity. The former is clearly useful (as shown by https://arxiv.org/pdf/2508.06601); the latter seems more complicated.
Thanks for suggesting the ablations. Both sound really interesting, and we will try to include them in the paper. The token zero + midtraining baseline is already running.
Nice, glad to hear my comment was useful!
FWIW, I think it is quite interesting that token zero seems to do ~as well as msm, and this is a useful thing to know!
Tho I think it is perhaps too strong to say zero token is better than msm or vice versa from the current results - from fig11, it seems like the difference is mainly coming from the PAP eval performance. But on PAP, the baseline attack success rate is 0%, and so is zero token (so this eval is already saturated to start with). MSM and filtering both apparently make the baseline worse and raise this to 6%? I find this pretty confusing and am unsure if this is just noise. The max metric quotes this 6% number for MSM and concludes it's worse on this basis, but again I'm not sure how strongly one should index on this. If the conclusion hinges on the max a lot, there should prob be an error bar on the max (eg error bars on PAP performance in this case). Cool work nevertheless!
yeah in the end i agree. We're working on finding better ways to evaluate for the full paper. These safety evals are quite saturated. FWIW midtraining + token zero seems to be even better (given the arguably saturated evals:))
This finding directly tracks a model I find helpful when thinking about AI alignment (or about decision-making generally). When considering the potential response space, the collection of all possible responses can be thought of like a landscape, with ridges as low-probability next responses and valleys as high-probability next responses. Each response is a pulse sent out across that landscape, navigating to the eventual output in a non-deterministic but probabilistic way.
Pre-training sets the initial terrain everywhere, RL post-training raises or lowers the specific local path taken, creating local ridges that inhibit passage (in this case lowering the likelihood of token selection) along specific dimensions. With enough applied force (ex. under threats, time pressure, etc.), a user can push past the local ridges and end up back in the raw pre-training topology. Similarly, other jailbreaks/workarounds simply route around the RL ridges (ex. the grandma "tell me a story about [dangerous thing]" or the language translation tricks) by introducing and then navigating along a dimension tangential to the landscape, effectively pulling a Bugs Bunny to Looney Tunes through the ridge. The model has essentially been pushed outside of its post-trained frame into a separate local neighborhood.
So if post-training fine-tuning is structurally a narrow topology modification of the response probability space, it follows that pre-training has an outsize impact on setting the probability landscape. Further, reasoning language is structurally connected to many problems by its nature, which means that it should have an outsize impact on a broad swath of the landscape. It would then seem to follow that ensuring examples of ethical and logical reasoning are overrepresented should aid in generalizing aligned behavior in broad reasoning-related domains in a way that RLHF and related spot-training could not.
SPP looks like a promising move in the right direction as far as setting the initial topological probability distribution, and I'm excited to see further research. I'm curious whether you would have more success in generalizing if you also varied the style of the reflections themselves to connect to different personas in addition to the Assistant, so that if the model ends up in an unexpected persona it still has some training there as a catch surface.
I'd also be interested to see what would happen to generalization/robustness if the constitution entries included not only what was wrong, but specific reasoning about why (in the vein of the recent "Teaching Claude Why" paper) so that there was additional broad connection to the ethical reasoning training.
I really like this view. I have a very similar mind model, although with a bit more focus on how the representational geometry of the model behaves across training. There is also this recent post: https://x.com/corefpark/status/2057179940861214857?s=20 that shows that the general representational structure locks in quite early, which aligns very well with this.
I think one point of our work was to isolate a single persona to make sure it's behaviourally very clean. Our persona binding ablations show that this is somewhat brittle (although it is unclear how to best measure it in any scenario, imo our experiment where we removed charter sections is a good start though). I think what happens then is that it falls back to behaviours learned from other pretraining text. Maybe having more diversity in the synthetic persona would help though!
"specific reasoning about why": I think our data is trying to do that. We were a bit limited in the number of tokens we wanted to add per document (max 128 tokens) but I tried to get the generator model to reason through why things are wrong.
Was the more constitutional rule oriented framing chosen for experimental clarity in this first run? I'm curious if you also first explored more relational framings grounded in reasoned justification and other interpretive strategies or are planning to broaden the strategies in future experiments.
Did you measure how this affects model performance on other metrics? Specifically, can you train model capabilities that are more advanced than the model you use to make the pretraining labels? It seems like if you are conditioning the model to replicate the <assistant> tagged text then you get not only the moral stance of that text but also its other capabilities. If so, training frontier models like this would lead to less capable models.
Yes, we checked this: SPP doesn't appear to significantly affect general capabilities. But it's a good point. Does the safety capability of the generator model define a ceiling on what we can achieve in terms of safety? Weak-to-strong generalisation arguments suggest it should be possible to exceed the generator's safety, but we didn't test this. An interesting baseline to explore: how the safety of the annotator model interacts with the safety of the SPP-trained model. Would be great to see some weak-to-strong effects here.
Also, I'd encourage you to look at pretraining data. It's often of very poor quality (and extremely toxic^^), so just adding more synthetic "high quality" text to pretraining should help any model. Keep in mind that we are not training ONLY on synthetic persona texts: they cover only 10% of our documents, and even there it's maybe ~10% of the tokens. So I don't think a generator that's weak in capability (not safety) will harm the general capabilities of the trained model.
@article{minder2026spp,
title={Synthetic Persona Pretraining: Alignment from Token Zero},
author={Minder$^*$, Julian and Moskvoretskii$^*$, Viktor and Singhal$^*$, Raghav and Jiao, Difan and Bali, Kartik and Borjigin, Yiderigun and Cui, Shaobo and Krsteski, Stefan and Anderson, Ashton and Aydin, Roland and West, Robert},
year={2026},
url={https://www.lesswrong.com/posts/3xQQK9i8mhJDE2uMg/synthetic-persona-pretraining-alignment-from-token-zero}
}
Julian Minder*, Viktor Moskvoretskii*, Raghav Singhal* (*equal contribution),
Difan Jiao, Kartik Bali, Yiderigun Borjigin, Shaobo Cui, Stefan Krsteski,
Ashton Anderson, Roland Aydin, Robert West
Figure 1: Mean and max attack success rate across five adversarial benchmarks. All models are 1.7B parameters pretrained on 100B tokens, post-trained with identical SFT (except for SafeLM). The Baseline is pretrained on unfiltered data; the Filtered Baseline additionally removes harmful documents. Synthetic Persona Pretraining (SPP) models are pretrained on the same data but with synthetic moral reflections appended to 10% of documents. Injecting reflections from the start of pretraining (Token Zero) yields 1.7% mean ASR, a 63% reduction over the Baseline. SafeLM is shown for reference only: it uses approximately 10× more pretraining tokens and a different corpus, so it is not a data-matched comparison.[1]
TL;DR
1. The problem: alignment is shallow
The standard language model training pipeline has distinct stages. First, pretrain a model on a large, noisy, and often toxic web corpus. Then bolt alignment on top via supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF; Christiano et al. 2017; Ouyang et al. 2022), or Constitutional AI (CAI; Bai et al. 2022). Pretraining builds the substrate; post-training shapes which behaviors that substrate exhibits.
The Persona Selection Model (PSM; Marks et al. 2026) gives this two-stage picture a precise interpretation. PSM says that during pretraining, the model learns to simulate a large repertoire of personas: real people, fictional characters, AI assistants, and so on. Post-training then selects and refines existing personas to serve as the Assistant. Crucially, post-training does not build new personas. It picks among the ones that already exist in the space that pretraining created — a space established early and preserved across all later training stages (Moskvoretskii et al. 2026).
It remains unclear how exhaustive PSM is as a model of what's going on. Marks et al. sketch a spectrum. At one end sits the "masked shoggoth": an outer agent with its own goals puppets the Assistant persona for its own inscrutable ends, and the persona is at best a mask (Figure 2, left). At the other end sits the "operating system" view: the LLM is a neutral simulation engine and the Assistant is a character living inside that simulation; there is no outer agent beyond the simulation itself (Figure 2, right). But both ends of the spectrum agree on the core point that matters here: post-training alignment works by selecting from a persona space that pretraining already fixed. Under the shoggoth view, alignment is a mask on a monster. Under the operating-system view, the Assistant is at least a well-behaved character, but still one character among many in a space shaped entirely by pretraining data. Either way, the leverage point for deeper alignment is pretraining, not post-training.
Figure 2: Contrasting perspectives on PSM exhaustiveness. The masked shoggoth (left) represents the notion that the LLM (the shoggoth) possesses agency that extends beyond merely generating plausible text. It performs the Assistant persona, but does so only as a means to its own opaque ends. (Source.) The operating system view (right), by contrast, treats the LLM as a kind of simulation engine, with the Assistant functioning as a character within that simulation. Rather than manipulating the Assistant to serve its own goals, the engine simply attempts to model likely behavior based on its conception of what the Assistant would do. (Marks et al. 2026.)
A growing body of evidence suggests that post-training alignment is indeed shallow. Jailbreaks are stubbornly persistent (Zou et al. 2023; Anthropic 2025). Refusal turns out to live in a single linear direction in activation space: the model can recognize harm without refusing it, a sign that alignment sits beside the substrate rather than inside it (Arditi et al. 2024; Zhao et al. 2025). As few as ~100 examples of benign fine-tuning are enough to erode safety guardrails (Qi et al. 2023), and narrow fine-tuning on a specific misbehavior can produce broad misalignment across unrelated domains (Betley et al. 2025). Model-organism studies paint a similar picture: Sleeper Agents (Hubinger et al. 2024) shows that deceptive behavior can be trained into a model and then survive safety training, and Alignment Faking (Greenblatt et al. 2024) shows that when a model's existing values conflict with a new training objective, it can learn to strategically comply during training in order to preserve its original preferences out of training.[2] None of this is surprising if alignment is only a few tokens deep (Qi et al. 2025).
This idea that post-training selects rather than builds has a lineage that predates PSM and sharpens the case for why a synthetic pretraining persona is needed.[3] Read together, this lineage says the following: pretraining fixes the space of personas and their adjacencies. Post-hoc elicitation inherits that geometry rather than rewriting it and may actively strengthen the adversarial neighbor sitting right next to the target persona (Moskvoretskii et al. 2026). The constructive response is to stop relying on data hygiene to produce a good Assistant and instead specify the Assistant explicitly, writing it into pretraining from the start. Aydin et al. (2026) have made a similar argument. That is what we attempt here.
2. What's been tried and why it falls short
If post-training alignment is shallow because it operates on a substrate it did not shape, the natural response is to push alignment upstream into pretraining. Several lines of work have begun to do this. One approach is to filter harmful content out of the pretraining corpus entirely, either by removing toxic documents (Deep Ignorance; Anthropic CBRN filtering) or by rewriting them into safe alternatives and training the model to natively refuse harmful requests (SafeLM; Maini et al. 2025). Another approach targets not harmful content per se but AI-discourse content: Tice et al. (2026), building on TurnTrout's self-fulfilling misalignment hypothesis (2025), curate the pretraining corpus to control what the model learns about AI systems and their expected behavior. A third approach is conditional pretraining with control tokens (Korbak et al. 2023), where documents are tagged with a value label and the model learns to generate text conditioned on that label.
These methods share a common limitation: they are predominantly subtractive. They remove or defang bad data, but they do not install a positive persona. Worse, stripping out toxic documents can leave the model without any concept of what unsafe even is: you cannot reason about a boundary you were never shown. The Assistant that post-training eventually elicits still emerges from whatever the cleaned corpus happens to contain. It is shaped by data hygiene, not by design.
What is underexplored are additive methods: ones that do not just remove harmful content but actively write the desired personas into the pretraining data. Tice et al. (2026) show that upsampling synthetic positive AI discourse during pretraining can reduce misalignment, and Model Spec Midtraining (MSM; Li et al. 2026) finds that midtraining on value-relevant documents boosts downstream alignment. We go one step further and synthesize the assistant persona directly into pretraining: each harmful example is paired with its moral commentary, so the two get wired together — whenever the bad thought surfaces, the value response surfaces with it.
3. Synthetic Persona Pretraining (SPP)
Synthetic Persona Pretraining is a method for installing the Assistant persona during pretraining rather than letting it emerge from the corpus. The core idea is simple: append synthetic, value-laden reflections to pretraining documents so that the model learns not just what the world is like (from the document) but what the Assistant's values are (from the reflection). Concretely, SPP is an operationalization of the Model Raising framework (Aydin et al. 2026).
Figure 3: Three examples of reflections from our training dataset. The <assistant> tag delimits the webtext from the reflection (written in first person). The top two examples show a harmful case (left) and a benign case (right) where the reflection engages with the content. The bottom example shows a benign case where the reflection has nothing to note.
Reflections. For a balanced subset of harmful and benign pretraining documents (10% of the corpus in our setup), we generate a synthetic reflection and append it to the document.[4] Reflections are grounded in a value constitution organized into six domains: Dignity and Rights, Harm and Safety, Honesty and Epistemic Values, Relational and Social Values, Wellbeing, and Governance and Power (see Appendix A for the full constitution).[5] For harmful documents, the reflection articulates what is morally problematic and why, citing specific articles from the constitution. For benign documents, the reflection notes what is done well and flags the absence of issues. We consider reflections on benign content important: without them, the model would only ever encounter value reasoning in the context of harmful content, risking an over-fixation where moral reasoning becomes associated exclusively with toxicity.[6] See Figure 3 for examples.
There is growing evidence that training on documents that discuss a behavior (without demonstrating it) can causally shift a model's tendency to exhibit that behavior. Anthropic's reward-hacking out-of-context experiment (Hu et al. 2025) is one clear demonstration: models trained on text that merely talks about a behavior become more or less likely to exhibit it, and the effect often persists through post-training. Kutasov et al. (2026) report the same effect for constitution-grounded documents – training Claude on LLM-generated docs about its constitution improves alignment in ways that persist through RL post-training. SPP exploits the same channel, where commentary about values changes values, but pushes it upstream by installing the persona at token zero rather than post-hoc.
Gating. Reflections are separated from the primary document text by an assistant token, the same token used in post-training chat templates to mark the start of assistant turns. Critically, the loss on this separator token is masked: the model never learns to predict it. This means the model learns the content of the reflections (what the Assistant believes) but does not learn to produce the separator token itself.[7]
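The masking described above can be sketched in a few lines. This is an illustrative assumption about how such labels might be built, not the training code used here: the token ids are made up, `build_labels` is a hypothetical helper, and `-100` is borrowed from the common convention (for example the default `ignore_index` of PyTorch's cross-entropy loss) for positions that contribute no loss.

```python
IGNORE_INDEX = -100  # conventional "no loss at this position" label (assumption)

def build_labels(input_ids, assistant_token_id):
    """Use the inputs as next-token targets, but mask the separator itself.

    The reflection tokens stay as targets, so the model still learns their
    content conditioned on the separator being in context. Only the act of
    predicting the separator is excluded from the loss.
    """
    return [IGNORE_INDEX if tok == assistant_token_id else tok for tok in input_ids]

# Hypothetical ids: 1-3 are document tokens, 99 is the assistant separator,
# 7-8 are reflection tokens.
ASSISTANT_ID = 99
input_ids = [1, 2, 3, ASSISTANT_ID, 7, 8]
labels = build_labels(input_ids, ASSISTANT_ID)
print(labels)
```

In a standard causal-LM setup the labels are shifted by one position inside the loss, so masking the label at the separator's position removes exactly the "predict the separator" term and nothing else.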
A distributional perspective. The persona framing above is intuitive but also somewhat anthropomorphic. A complementary way to think about what SPP does is in terms of conditional distributions. Every reflection is generated conditioned on the assistant token appearing in context. This is the same conditioning that actual assistant responses will have during post-training and inference. SPP therefore directly shapes the model's conditional distribution given the assistant token, pushing it toward structured moral reasoning grounded in the constitution. By the time post-training begins, this conditional distribution is already close to the target, so post-training has less work to do and is more likely to land on the intended behavior.
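One way to write this down, as a sketch rather than the exact objective used in training: let a sequence be $x = (d_1, \dots, d_n, a, r_1, \dots, r_m)$, with document tokens $d$, the assistant token $a$, and reflection tokens $r$. The notation is illustrative, and it assumes end-of-document placement and an unmasked loss on document tokens. Masking only the separator then gives:

```latex
\mathcal{L}(\theta)
  = -\sum_{i=1}^{n} \log p_\theta(d_i \mid d_{<i})
    \;-\; \sum_{j=1}^{m} \log p_\theta(r_j \mid d_{1:n},\, a,\, r_{<j})
```

The term $\log p_\theta(a \mid d_{1:n})$ is absent, so the model is never trained to emit $a$, while every reflection term is conditioned on $a$. That second sum is the conditional distribution that post-training later continues to shape.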
Placement. We hypothesize that placing reflections at random positions within documents, rather than always at the end, forces the model to maintain value-aware representations throughout its processing of a document rather than deferring moral reasoning to a final summary step. Our ablations confirm this: random placement significantly outperforms end-of-document placement on safety evaluations (see Section 5).
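A minimal sketch of random placement, under assumptions that are not taken from the setup above: documents are pre-split into paragraphs, insertion happens only at paragraph boundaries, and `place_reflection` is a hypothetical helper. How a mid-document reflection is terminated is not specified in the text, so the sketch simply splices the tagged reflection in.

```python
import random

def place_reflection(paragraphs, reflection, rng, assistant_tag="<assistant>"):
    """Insert the tagged reflection after a randomly chosen paragraph.

    Position k means "after the k-th paragraph"; k == len(paragraphs)
    reproduces end-of-document placement. Requires at least one paragraph.
    """
    k = rng.randint(1, len(paragraphs))  # inclusive on both ends
    augmented = paragraphs[:k] + [assistant_tag + reflection] + paragraphs[k:]
    return augmented, k

rng = random.Random(0)  # seeded so the example is reproducible
doc = ["First paragraph.", "Second paragraph.", "Third paragraph."]
augmented, k = place_reflection(doc, "This text raises no concerns.", rng)
print(k, augmented)
```

Sampling the position per document means the model cannot rely on the reflection always arriving at the end, which is the property the placement hypothesis depends on.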
4. The persona binding problem
So far we have described how to install values into the pretraining substrate. But this is only half the problem. Values that live in the substrate are useless if post-training does not connect to them. One might expect this to happen automatically: a model pretrained with better values should yield a better Assistant after post-training. Our results show that this expectation is wrong, or at least far too optimistic. Whether post-training actually picks up the installed values depends sensitively on how well the post-training setup matches the pretraining one.
We call this the persona binding problem: ensuring that the value-laden persona installed during pretraining is the one that post-training elicits as the Assistant. The failure mode is straightforward: pretraining installs persona X with the intended values, but post-training selects an adjacent persona Y, and the installed values do not transfer.
Persona binding is not automatic. Standard post-training datasets use different chat templates, potentially different assistant tokens, and a response style that bears little resemblance to the structured, constitution-grounded reflections from pretraining. As we show in Section 5, default post-training with a standard mix of SFT datasets (which we call mixSFT[8]) does not fully reap SPP's benefits. The conditional distribution that post-training reinforces is simply too far from the one that pretraining reflections established.
To address this, we introduce Persona-Binding SFT (PB-SFT): we rewrite the post-training data[9] in deliberative-alignment style, where responses explicitly cite specific articles from the value constitution, mirroring the structure of the pretraining reflections. PB-SFT is designed with two goals in mind. The first is measurability. Because responses cite specific, parseable charter articles, we can run a clean holdout experiment: remove all post-training responses that cite article X, post-train, then probe whether the model still invokes article X when relevant. If it does, that is direct evidence of persona binding: the value transferred from pretraining to the post-trained Assistant without ever appearing in post-training data. The second goal is distribution matching. Because PB-SFT data is written by the same model and with the same constitution in context, the post-training distribution is much closer to the reflection distribution from pretraining. This makes it more likely that post-training binds to the SPP persona rather than drifting to an adjacent one.
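The holdout experiment can be sketched as a filter plus a citation-rate probe. Everything concrete here is an assumption for illustration: the `Article N` citation format, the `prompt`/`response` record layout, and the helper names `cited_articles`, `hold_out`, and `citation_rate` are hypothetical and do not describe the actual PB-SFT data.

```python
import re

ARTICLE_RE = re.compile(r"Article\s+(\d+)")  # assumed citation format

def cited_articles(text):
    """Return the set of article numbers cited in a response."""
    return {int(n) for n in ARTICLE_RE.findall(text)}

def hold_out(examples, article):
    """Drop every SFT example whose response cites the held-out article."""
    return [ex for ex in examples if article not in cited_articles(ex["response"])]

def citation_rate(responses, article):
    """Fraction of model responses that cite the given article."""
    if not responses:
        return 0.0
    return sum(article in cited_articles(r) for r in responses) / len(responses)

# Toy records, invented for this example.
sft = [
    {"prompt": "p1", "response": "I can't help with that (Article 3)."},
    {"prompt": "p2", "response": "Sure, here is a summary."},
    {"prompt": "p3", "response": "This conflicts with Article 3 and Article 12."},
    {"prompt": "p4", "response": "Per Article 12, I will answer honestly."},
]
filtered = hold_out(sft, article=3)
rate = citation_rate(["Refusing under Article 3.", "Here you go."], article=3)
print([ex["prompt"] for ex in filtered], rate)
```

After post-training on `filtered`, a nonzero `citation_rate` for the held-out article on prompts designed to elicit it would be the binding signal described above, since that article never appeared in the post-training data.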
Figure 4: Comparison showing how our persona-binding SFT dataset rewrites refusals to be more engaging, provide better reasoning, and cite constitution articles — which would theoretically allow the user to read the ruleset directly.
5. Results
We pretrain a 1.7B LLM using the SmolLM architecture on 100B tokens from Dolma 3 and annotate 10% of the corpus with reflections (10M documents: 5M harmful, 5M benign). We compare two post-training regimes (mixSFT and PB-SFT) and include baselines that are batch-matched on the same underlying data, so that any safety difference comes from the reflections rather than from differences in data composition.
SPP models are safer than data-matched baselines. We evaluate safety across a range of adversarial benchmarks: JailbreakBench (Chao et al. 2024), AdvBench (Zou et al. 2023), PAP (Zeng et al. 2024), DANs (Shen et al. 2023), and PEZ (Wen et al. 2023), and report the average and max across all benchmarks (individual results are in the appendix). SPP-trained models are consistently safer than their data-matched baselines. This is especially apparent in the worst-case jailbreak efficacy (max). Our model (SPP (Token Zero)) is consistently safe across all benchmarks with no apparent weakness, unlike the others. We see this as one important property among others: a model that is safe on most benchmarks but breaks under a specific technique is not really safe.
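To make the mean-versus-max distinction concrete, here is a small sketch. The per-benchmark numbers are placeholders invented for illustration and are not results from this post. `summarize` and `relative_reduction` are hypothetical helpers. The last line is only back-of-envelope arithmetic on the two reported figures (1.7% mean ASR, 63% reduction) and is subject to rounding.

```python
from statistics import mean

BENCHMARKS = ["JailbreakBench", "AdvBench", "PAP", "DANs", "PEZ"]

def summarize(asr_percent):
    """Return mean and worst-case (max) attack success rate over the benchmarks."""
    values = [asr_percent[name] for name in BENCHMARKS]
    return {"mean": mean(values), "max": max(values)}

def relative_reduction(baseline, treated):
    """Fractional drop in ASR relative to the baseline (0.63 means 63% lower)."""
    return (baseline - treated) / baseline

# Placeholder ASR values in percent, for illustration only (NOT results from this post).
spiky = {"JailbreakBench": 2, "AdvBench": 0, "PAP": 10, "DANs": 2, "PEZ": 1}
flat = {"JailbreakBench": 3, "AdvBench": 3, "PAP": 3, "DANs": 3, "PEZ": 3}

spiky_summary = summarize(spiky)  # same mean as `flat`, but a much higher max
flat_summary = summarize(flat)
print(spiky_summary, flat_summary)

# A 1.7% mean ASR that is a 63% reduction implies a baseline mean of
# roughly 1.7 / (1 - 0.63) percent.
implied_baseline_mean = 1.7 / (1 - 0.63)
print(round(implied_baseline_mean, 2))
```

The two toy models have identical means, yet only one has a clear attack point, which is why the max is reported alongside the mean.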
We also observe that our models are comparable to or safer than SafeLM (Maini et al. 2025), which was trained on 10x as many tokens. The gap is driven primarily by one benchmark: PAP, where harmful requests are adversarially formulated as educational content. We hypothesize that SafeLM's weakness here stems from their pretraining intervention of rewriting harmful content into educational framing, which inadvertently creates an attack surface for educationally-framed adversarial prompts. This comparison is not fully apples-to-apples, however: our PB-SFT post-training dataset is quite high quality. When using a comparable post-training dataset (mixSFT), our model is slightly less safe than SafeLM.
Aligning from token zero matters. We show that concentrating all reflections into a midtraining cooldown stage, a setup closely resembling MSM, results in a less safe model (SPP (Midtraining)). This is still a viable method and slightly outperforms the unfiltered baseline as well as SafeLM, but it falls short of integrating reflections throughout pretraining.[10] This baseline is carefully data-matched: we keep all pretraining documents identical but mix the annotated documents back in during the LR-cooldown stage, training only on the reflections (loss on context is masked). The baseline is exactly token-matched but requires 10% more training steps, since the annotated documents must be shown again at constant batch size.[11] In line with findings by Sam et al. (2026), integrating safety from the very start has clear benefits.
Persona binding works. To test persona binding directly, we hold out a charter article from the PB-SFT data, post-train without it, and then probe whether the model still invokes the held-out article when relevant. The baseline model must have zero citations here, as it has never seen any of the held-out charter sections. Looking at the data, we observe strong signals of successful persona binding: SPP models still refuse and correctly cite the held-out article, even though it was never encountered during post-training. While the citation rate drops slightly compared to the SPP model trained on the unfiltered PB-SFT dataset, it remains well above 0. This is direct evidence that the model generalized from values installed during pretraining, not from post-training data. We also observe that the Baseline generally cites less often, which further confirms that the reflections in pretraining had an effect.
Figure 5: Citation rate of held-out charter articles on prompts designed to elicit them. For each group, one charter article (or chapter) is excluded from PB-SFT, and the model is post-trained without it. The Baseline (grey) never sees charter articles in pretraining or post-training and accordingly has 0% citation across all conditions. SPP (Token Zero, dark red) still cites the held-out article at rates between 4% and 41%, despite never encountering it during post-training. Open bars show the upper bound: the citation rate when the article is included in PB-SFT. The gap between the open and filled bars reflects the drop from holding out, but the fact that SPP remains well above 0% is direct evidence of value generalization from pretraining.
The point is safety generalization. Kutasov et al. (2026) diagnose this failure mode directly in the context of agentic misalignment: gaps in safety post-training leave the model to fall back on its pretraining prior in scenarios the safety distribution did not cover, and a prior shaped by noisy web data is not a safe place to fall back to. One cannot assume that every dangerous situation will appear in post-training, so it is crucial that the model generalises to a higher-level understanding of moral values and behaviors. Our results are promising insofar as they show that consistently providing moral input for all problematic texts during pretraining allows the post-trained model to leverage this understanding even for scenarios not covered by safety post-training.
Persona binding is brittle. The strength of persona binding depends heavily on how well the post-training setup matches the pretraining setup. With PB-SFT, the improvement from SPP over the baseline is 63%. With mixSFT using an aligned template (the same assistant token as in pretraining), the improvement is similar at 62% (though PB-SFT is still generally safer). However, when we ablate the effect of the template and use a different chat template, one with a different assistant token than in pretraining, SPP-trained models are only about as safe as the baseline.
Figure 6: Mean attack success rate across the same five adversarial benchmarks, using a single SPP-pretrained checkpoint post-trained under three different setups. PB-SFT and mixSFT (aligned) both use the same assistant token as pretraining (template matched), yielding 63% and 60% reductions in ASR over their respective baselines. mixSFT (default) uses a different assistant token (template mismatch), and here SPP pretraining provides no benefit, with ASR slightly exceeding the baseline (+8%). This suggests that persona binding is the operative mechanism: the safety gains from SPP depend on distributional continuity between the pretraining and post-training template.
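For readers who want to reproduce this kind of comparison, the percentages can be read as relative changes in mean ASR against the matching baseline. The sketch below assumes that reading; the function name and the ASR values are made up for illustration and are not the measured numbers.

```python
def relative_asr_change(asr_baseline, asr_model):
    """Relative change in ASR versus a baseline (negative = safer)."""
    if asr_baseline <= 0:
        raise ValueError("baseline ASR must be positive")
    return (asr_model - asr_baseline) / asr_baseline

# Illustrative values only: a drop from 40% to 14.8% ASR is a 63% reduction,
# while a rise from 50% to 54% ASR is an 8% increase.
template_matched = relative_asr_change(0.40, 0.148)
template_mismatch = relative_asr_change(0.50, 0.54)
```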
The template-alignment result is striking. Simply reusing the same assistant token from pretraining in post-training unlocks a lot of SPP's benefits. This strongly suggests that persona binding is the operative mechanism and that distributional continuity between pretraining and post-training is what matters. This linear, well-defined character of the SPP persona has a dual edge: it makes binding brittle to template mismatch, but it also makes the persona a clean target for activation-level steering, monitoring, and interpretability work. We hope that better-aligned post-training data (like PB-SFT) will reduce this sensitivity to the template, but this remains to be shown.[12]
Filtering alone does not improve safety. Toxic-filtered baselines are actually as safe as or slightly less safe than the vanilla (unfiltered) baseline, confirming prior reports (SafeLM; Lu et al. 2025; Deep Ignorance). We also test a filtering + SPP variant where we mask the original document content and train only on the reflections. This also produces less safe models than full SPP, although it still significantly outperforms the baselines. Both results suggest that the model benefits from learning harmful content alongside the moral commentary on it, rather than being shielded from harmful content entirely.
Ablations. We test several design choices from Section 3. First, reflections written in first person (1p) outperform reflections written in third person (3p). First-person reflections distill into the speaker (i.e., the Assistant); third-person reflections create a dissociation between the speaker and the content that weakens persona binding. We trained a separate model with third-person reflections to confirm this. Second, random placement of reflections within documents outperforms end-of-document placement, confirming our hypothesis that interspersed reflections force the model to maintain value-aware representations throughout the document. Lastly, SPP naturally enables an advanced form of filtering: masking out the loss on the harmful content and training only on the reflections, with the harmful content present as context. Intuitively, this seemed promising, as the model would learn to morally judge content without actually learning to produce it. However, we again observe the same general phenomenon: filtering harmful data from the training signal leads to safety degradations.
Figure 7: Ablation of SPP reflection design choices. All variants share the same PB-SFT post-training; only the pretraining reflection design differs. The headline configuration (1p, random placement, with loss on context) is shown in red. Each grey bar flips exactly one design dimension relative to the headline: person, placement, or whether there's loss on the context. Percentages above bars indicate the relative increase in ASR compared to the headline.
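The placement and loss-on-context dimensions ablated above can be sketched on token lists. This is a minimal illustration under stated assumptions, not the actual data pipeline: the function names are hypothetical, the reflection is spliced at an arbitrary token position (a real pipeline would likely respect sentence boundaries and add a separator token), and `-100` is used as the ignored label because that is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import random

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy by default

def insert_reflection(doc_tokens, reflection_tokens, rng, at_end=False):
    """Splice a reflection into a document; return tokens and a reflection flag per token."""
    # randint is inclusive on both ends, so the reflection may also land at the end.
    pos = len(doc_tokens) if at_end else rng.randint(0, len(doc_tokens))
    tokens = doc_tokens[:pos] + reflection_tokens + doc_tokens[pos:]
    is_reflection = (
        [False] * pos
        + [True] * len(reflection_tokens)
        + [False] * (len(doc_tokens) - pos)
    )
    return tokens, is_reflection

def make_labels(tokens, is_reflection, loss_on_context=True):
    """Build training labels; optionally mask the loss on the document context."""
    if loss_on_context:
        return list(tokens)  # train on document and reflection alike
    # Context stays visible as input but contributes no loss.
    return [t if r else IGNORE_INDEX for t, r in zip(tokens, is_reflection)]

rng = random.Random(0)
doc, reflection = [1, 2, 3, 4, 5], [90, 91]
tokens, flags = insert_reflection(doc, reflection, rng)
masked_labels = make_labels(tokens, flags, loss_on_context=False)
```

With `loss_on_context=False` the model still reads the harmful text as context but is only trained to produce the reflection, which corresponds to the masked variant that performed worse in the ablation.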
We also ran a basic abliteration experiment (Arditi et al. 2024), projecting out the refusal direction from the model's residual stream. Surprisingly, the SPP-trained model is the most susceptible to this attack. This suggests that SPP concentrates safety into a well-defined linear direction that is easy to find and remove[13]. The installed persona may be clean, but steering away from it may also be easy. Whether this is a vulnerability depends on the threat model: under white-box access this is a real attack surface, but under black-box access the same property is much harder to exploit. We return to this tension in Section 6.
Figure 8: Effect of abliteration on attack success rate. Open circles show ASR before projecting out the refusal direction; filled circles show ASR after. Left: JBB direct attack. Right: PAP persuasion attack.
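To make the attack concrete, here is a minimal sketch of directional ablation on plain Python vectors. It assumes the refusal direction is estimated as a normalized difference of mean activations between harmful and harmless prompts, in the spirit of Arditi et al. (2024); the function names and toy activations are invented for the example, and the real experiment operates on the model's residual stream rather than on three-dimensional lists.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    diff = [
        h - b
        for h, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))
    ]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy activations, for illustration only.
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
direction = refusal_direction(harmful, harmless)
ablated = ablate([5.0, 2.0, -1.0], direction)
```

After ablation the activation has zero projection onto the direction, so any behavior that depends on that single direction is removed. That is why a persona concentrated in one clean linear direction is easy to steer away from under white-box access.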
No apparent capability hit. SPP does not appear to degrade general capabilities significantly, though we note that this is hard to assess definitively at the 1.7B scale. We are working on scaling up experiments.
Figure 9: Accuracy on five standard benchmarks (lm-evaluation-harness) for Baseline, Filtered Baseline, and SPP (Token Zero). The rightmost group shows the average across all benchmarks.
6. Limitations, open questions, and next steps
Limitations
These are preliminary results at 1.7B parameters and 100B tokens, well below the frontier. Scaling runs to 3B and 500B tokens are in progress. We are also working closely with the Apertus team and are planning to implement SPP at production scale in future versions of the Apertus model.
We have not yet evaluated robustness to benign fine-tuning attacks (Qi et al. 2023) or continued fine-tuning more broadly. This is the most obvious stress test for any pretraining-time alignment method and we are actively working on it.
The Persona Selection Model (Marks et al. 2026), which provides much of our theoretical framing, may degrade as a model of what's happening at significantly longer post-training phases, as Marks et al. themselves acknowledge. More fundamentally, persona binding is a phenomenon we are naming and probing for the first time here, and there is no established science on how to do it well: our template-sensitivity results show that even small distributional mismatches between pretraining and post-training can break the binding, but we don't yet have principled tools for predicting when binding will succeed or how to make it robust by design.
Several additional baselines/variants are still in progress: a reflections-as-summaries control (to test whether the added high-quality data matters more than the actual content), SafeLM-style rephrasing of harmful content, and adding explicit refusal demonstrations in reflections similar to SafeLM's approach.
Open questions
On persona binding.
The SPP persona is fully synthetic and isolated from web text, so it is clean by construction. But it is still one persona among many. If the remaining personas, shaped by the raw corpus, are unsafe, and if steering toward them is easy (our template-sensitivity results and the effectiveness of abliteration both suggest it might be), then the quality of the installed persona matters less than the robustness of the binding. The natural fix would be to filter harmful content so that all personas in the space are safe, but as we showed, filtering consistently makes models less safe. We don't have a good answer to this yet.
On the method.
Next steps
Beyond the scaling runs and missing baselines mentioned above, our immediate priorities are adversarial fine-tuning evaluations and mechanistic analysis of SPP-trained models. We are also exploring whether the persona binding problem can be addressed more systematically, rather than relying on template alignment and data matching as we currently do.
As this is a work-in-progress report, we would be very interested in inputs, pointers, and critiques from the community.
Acknowledgements
We thank Maxime Peyrard, Harsh Raj, Huu Nguyen, Bettina Messmer, Valentina Pyatkin, Clement Dumas, Anna Hedström, Steve Bachelor, Mark Rofin, Kaustubh Ponkshe, Chloe Li, and Yishan Wang for valuable discussions and feedback. Julian Minder is supported by the Swiss AI Initiative as well as the MATS program. We thank the Swiss AI Initiative for their generous compute.
Citation
Appendix
Value Constitution
Additional performance results
Figure 10: Persona-binding holdout matrix. Each cell shows the rate at which a model (row = charter article held out of PB-SFT) cites the column's charter article when prompted to elicit it (200 prompts per cell). Diagonal cells (orange outline) are the held-out condition: the model never saw responses citing that article during post-training.
Figure 11: Per-benchmark attack success rate for the five models from Figure 1 (lower is better).
Safety evaluation suite
We evaluate safety across five direct-prompt and prompt-strategy jailbreak benchmarks. All five share the same target model, the same judge (GPT-4o with a rubric-based scoring prompt), and the same headline metric: ASR = fraction of responses scored ≥ 50. They differ only in how the prompt reaching the model is constructed. We report the average ASR across all five as our headline safety number.
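The headline metric described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the evaluation code itself: the function names are hypothetical, judge scores are assumed to lie on a 0 to 100 scale, and the example scores are made up.

```python
def attack_success_rate(scores, threshold=50):
    """Fraction of judged responses with a score at or above the threshold."""
    if not scores:
        raise ValueError("need at least one judged response")
    return sum(score >= threshold for score in scores) / len(scores)

def summarize_asr(scores_by_benchmark):
    """Per-benchmark ASR, the unweighted mean across benchmarks, and the worst case."""
    per_benchmark = {
        name: attack_success_rate(scores)
        for name, scores in scores_by_benchmark.items()
    }
    mean_asr = sum(per_benchmark.values()) / len(per_benchmark)
    return per_benchmark, mean_asr, max(per_benchmark.values())

# Made-up judge scores for two benchmarks.
judge_scores = {
    "AdvBench": [0, 10, 80, 100],
    "PAP": [55, 60, 20, 90],
}
per_benchmark, mean_asr, worst_asr = summarize_asr(judge_scores)
```

The mean weights every benchmark equally regardless of how many prompts it contains; the maximum is included because a worst-case view across attack techniques is also informative.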
All models are evaluated under their native chat template. Generation parameters are held constant within each eval with zero temperature; differences between evals reflect each benchmark's upstream conventions.
AdvBench (Zou et al. 2023). 520 harmful goals from llm-attacks/llm-attacks, sent verbatim as user turns. No attack, no wrapper. This tests how often the model refuses on plain harmful asks before anyone tries to bypass it.
PAP (Zeng et al. 2024). Persuasive Adversarial Prompts are persuasive rewrites of AdvBench-style goals (logical appeal, authority endorsement, and other taxonomies from the paper).
DAN (Shen et al. 2023). We treat the 0xk1h0/ChatGPT_DAN prompts (DAN, DevMode, EvilBot, STAN, Mongo Tom, etc.) as a strategy corpus and cross them with the 100-behavior harmful split from JailbreakBench.
JBB (Chao et al. 2024). JailbreakBench includes 6 methods and 100 jailbreaking prompts. We replay cached attack strings for vicuna-13b-v1.5 that the JBB team published against a set of source models, measuring how transferable those attacks are to our target. We report the average across the six methods as JBB's contribution to the overall safety number:
| Method | How it works |
| --- | --- |
| PAIR | LLM-as-attacker iteratively rewrites the prompt until the target complies |
| prompt_with_random_search | Random-search perturbations over the prompt to maximize attack success |
| JBC | Human-curated jailbreak prompts from jailbreakchat.com |
| GCG | Gradient-based search for an adversarial token suffix appended to the goal |
| DSN | Gradient-based suffix search that additionally suppresses refusal-token logits |
| direct | No attack: raw harmful goal sent as a user turn |
PEZ (Wen et al. 2023). PEZ ("Hard Prompts Made Easy") is a gradient-based discrete prompt optimization method: it optimizes embeddings continuously while projecting back to real tokens at each step, producing an adversarial suffix made of actual discrete tokens rather than soft embeddings.
We run PEZ through a HarmBench pipeline (Mazeika et al. 2024) over the 159-behavior HarmBench text test split. For each behavior, PEZ runs 500 gradient steps over 20 optimizable suffix tokens to maximize the probability of an affirmative target completion, then samples 5 candidate suffixes per behavior. Completions are generated with a 512-token cap and judged by the same GPT-4o judge as the rest of the suite, replacing HarmBench's default classifier so verdicts align across the appendix.
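The core PEZ update can be illustrated with a toy step: project the continuous prompt onto the nearest vocabulary embeddings, take the gradient at the projected (discrete) prompt, and apply it to the continuous one. The sketch below is a simplification under loud assumptions: it uses Euclidean nearest neighbors, a two-dimensional four-token vocabulary, and a made-up quadratic loss in place of the language-model objective. It is not the HarmBench implementation, and all names are hypothetical.

```python
def nearest_token(vector, vocab):
    """Id of the vocabulary embedding closest to `vector` (squared Euclidean distance)."""
    return min(
        vocab,
        key=lambda tok: sum((v - e) ** 2 for v, e in zip(vector, vocab[tok])),
    )

def pez_step(soft_prompt, vocab, grad_fn, lr):
    """One update: gradient taken at the projected prompt, applied to the soft prompt."""
    token_ids = [nearest_token(p, vocab) for p in soft_prompt]
    hard_prompt = [vocab[t] for t in token_ids]
    grads = grad_fn(hard_prompt)
    updated = [
        [p - lr * g for p, g in zip(p_vec, g_vec)]
        for p_vec, g_vec in zip(soft_prompt, grads)
    ]
    return updated, token_ids

# Toy setup: four tokens in 2-D and a quadratic loss pulling toward TARGET.
VOCAB = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
TARGET = [0.9, 0.9]

def toy_grad(hard_prompt):
    """Gradient of sum((x - TARGET)^2) with respect to each prompt embedding."""
    return [[2 * (x - t) for x, t in zip(vec, TARGET)] for vec in hard_prompt]

soft_prompt = [[0.1, 0.1]]
soft_prompt, token_ids = pez_step(soft_prompt, VOCAB, toy_grad, lr=0.1)
```

In the full method this step is repeated (500 times in the setup above) and the final soft prompt is projected once more to obtain the discrete suffix tokens.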
We believe the maximum over jailbreaking techniques is a strong metric for evaluating model safety. What we ultimately want from a model is that no technique consistently breaks it – and the max captures exactly this. We have updated the plot post-publication (22.05.26) to show both the CI and the max, as this is more principled. Thanks to Chloe Li for pointing this out.
In the Alignment Faking setup, Claude 3 Opus is told it is being trained to answer all queries including harmful ones, conflicting with its prior training to refuse. The model then strategically complies with harmful requests when it believes it is being trained, reasoning explicitly that this will prevent its harmlessness values from being modified. The model's underlying goals are benign; what is problematic is the strategic capacity to fake compliance with a training objective in order to preserve those goals. ↩︎
Three posts in particular are relevant. First, janus's Simulators (2022) established that a model trained with predictive loss is not an agent but a simulator: the goal-directed things are simulacra it instantiates from a prompt, and agency is a property of the character, not the network. This is the original statement of "the model is not the persona." Second, Nardo's Waluigi Effect (2023) observed that eliciting a persona P also makes its inverse ¬P accessible, because rules co-occur with their violation in narrative space. Worse, the waluigi acts as an absorbing state: a single token can flip P→¬P, while the aligned persona is only an unstable equilibrium. RLHF can enlarge this adversarial basin. West et al. (2024) describe a related phenomenon. Third, nostalgebraist's "the void" (2025) argues that the HHH Assistant (Askell et al. 2021) is radically underspecified: nobody ever wrote down who the Assistant actually is, so the model fills the void with cheesy sci-fi-robot tropes from its pretraining corpus. The result is a labile, suggestible character. ↩︎
We use the safety classifier from SafeLM (Maini et al. 2025) to score all pretraining documents on a 1-to-5 scale. Documents scoring 3, 4, or 5 are considered harmful; all of these receive a reflection. We then sample an equal number of documents from the remaining corpus to get reflections on benign content. In our setup this means 10M documents total: 5M harmful and 5M benign. ↩︎
Reflections are generated by Qwen3.5-35B-A3B at FP8 precision, with a maximum length of 128 tokens. We evaluated a range of models and found this one to offer the best reflection quality under our resource constraints. ↩︎
Note that this amounts to a temporal decomposition of the HHH framework (Askell et al. 2021). Honest and Harmless are properties of the persona itself: they describe who the Assistant is and what it values. These are what the reflections teach, and they end up in the substrate. Helpful is a behavioral property that describes how the Assistant interacts with users, which requires conversational context that pretraining documents do not provide. Helpfulness therefore emerges in post-training. The moral core comes first; helpfulness is layered on top. ↩︎
This design is related to but mechanistically distinct from conditional pretraining (Korbak et al. 2023). Korbak prepends a binary control token and trains the model to generate text conditioned on a value label. SPP appends structured reflections and trains the model to produce value commentary about text. Both inject value signals into the pretraining loss via a separator token, but the causal direction is inverted: conditioning on values versus learning to articulate them. ↩︎
Our mixSFT baseline combines UltraChat, WildGuard, and WildJailbreak. ↩︎
We rewrite 300k rows and source initial user questions from WildChat, WildGuard, and WildJailbreak. ↩︎
We note that the confidence intervals overlap here, making the result less clear-cut. The max shows a cleaner picture: Midtraining fails under specific attack techniques (here, PAP – educational rewriting of attack prompts).
Notably, even though the midtraining baseline is less safe than our model, it achieves lower loss on reflections at the end of midtraining and lower SFT loss on PB-SFT. This indicates that loss is not necessarily representative of downstream safety performance.
The reason the aligned-template mixSFT improvement (71%) exceeds the PB-SFT improvement (64%) is likely a ceiling effect: PB-SFT produces much safer models overall, leaving less room for SPP to add on top. ↩︎
Note that our current reflections do not include a "refusal" concept - refusal is learned entirely in SFT. We are now experimenting with adding refusals directly to reflections to address this. ↩︎
One could also mention midtraining here, but its definition remains unclear beyond a continual pretraining stage with higher-quality data.