Awesome to finally see pretraining experiments. Thank you so much for running these!
Your results bode quite well for pretraining alignment. This may well transform how we tackle the "shallowness" of post-training, open-weight LLM defense, and undesired / emergent personas, and give an across-the-board boost to the alignment of the "building blocks" which constitute a pretrained base model. :)
Thanks for doing this!
My guess is that filtering out AI discourse has the first-order effect of making models more aligned, but also various second-order misalignment effects.
For example, a lot of the results from this paper are pretty centrally about models becoming misaligned because the sampled actions had different implications to the model than they would under our own views (in that context, that reward hacking is not actually bad, given how difficult it is to make robust RL environments). If you filtered out all AI discourse from pre-training, then you'd run into this problem in tons of places: most of the information about whether we think an AI action is good or bad lies in that text.
Another example: if you removed all information about reward hacking from pre-training, you'd probably reduce the sampling rate of reward hacking during training (which seems analogous to first-order alignment evaluations). But conditional on sampling it eventually anyway, a model without proper context on reward hacking is more likely to become misaligned from training on those samples, as well as to have less self-correction around this behavior when sampling.
In general, if we want to prevent (inevitably) imperfect data / rewards from selecting for the wrong things in avoidable ways, we really want to make sure the model has the context necessary to correct for this and learn the right things where appropriate. (Evan expands on this point somewhat in this comment.)
There are also other reasons why this might cause unintended generalization: for example, the reasons Janus describes in this comment.
Thanks Jozdien. To be clear, we don't expect to recommend filtering AI discourse out of pretraining data. Perhaps the most important second-order effect of filtering data about misalignment from pretraining is that I would expect the AI epistemics of LLMs to degrade dramatically. This could be harmful both for automating alignment research and for advising policymakers (and the general public) about the risks of building ASI.
We find upsampling positive data to be much more effective (even for far out-of-distribution dark triad personality evals). This is why I'm excited about future work looking at how to make [Special Token] training effective. In theory, you should be able to have unfiltered information about AI systems in training, but collapse the model onto a completely unrelated persona. Like the Janus comment you tagged, I'm excited about pretraining mixes that help models differentiate their persona from AI systems like Sydney or HAL 9000.
You can also imagine doing something like inoculation during [Special Token] training, where you give examples of [Special Tokens] going through RL with misspecified rewards, learning to reward hack, but remaining good. You can create tens of thousands of examples of this. I'm excited for future work to look at how altering pretraining data can assist with various generalisation failures. @Daniel Tan has shared some interest in studying how changes in pretraining data can affect generalisation moving forward.
That makes sense, and I'm much more supportive of upsampling positive data!
In theory, you should be able to have unfiltered information about AI systems in training, but collapse the model onto a completely unrelated persona.
I agree that this should be theoretically possible, though in practice I expect a lot of generalization to be downstream of the prior in meaningful ways. More importantly though, I think this is maybe not the best option we have right now? I would agree if our data were a few years old, but given things like 3 Opus in the pre-training data I think we would want to leverage existing information a good amount.
I'm excited for future work to look at how altering pretraining data can assist with various generalisation failures. @Daniel Tan has shared some interest in studying how changes in pretraining data can affect generalisation moving forward.
Same! I think there are tons of low-hanging fruit here. There's also some prior discussion on using metadata to shape generalization from pre-training in Conditioning Predictive Models.
I'm absolutely delighted to see someone actually doing this, and getting the size of results I and others were hoping for! Very exciting!
Great work!
Pretraining interventions provide alignment-in-depth
Is this due to data order or content?
I'd be keen to see what it looks like when you take your regular pretrained model (no filtering, no synthetic alignment documents) then do fine-tuning on exactly the same number (and kind) of synthetic alignment documents that you used in the "synthetic alignment" condition, then do post-training, and then do continued SFT.
Contra how you interpret your results, I would guess this may be more data content specific than data order specific, and that the experiment I described would result in similar amounts of persistence as the "filtered + synthetic alignment" condition (p=0.5). I think it's somewhat surprising that general positive AI and [special token alignment] don't work better, so I would guess a lot of the effect might be due to the proximity between the eval questions and the synthetic documents you trained on.
I would also be curious if "synthetic alignment" with no filtering is similar to running this without filtering, if you have enough compute to run this. I think your work shows that filtering is not SoTA on its own, but it's unclear if fine-tuning on synthetic alignment documents is SoTA on its own. It would also provide a cleaner baseline for the data order experiment above.
Hey Fabien, thanks!
I'd be keen to see what it looks like when you take your regular pretrained model (no filtering, no synthetic alignment documents) then do fine-tuning on exactly the same number (and kind) of synthetic alignment documents that you used in the "synthetic alignment" condition, then do post-training, and then do continued SFT.
This is definitely something we're excited to look at in the coming months: exactly how much data is needed, the differences between adding this in pretraining, midtraining, etc. One reason why it may be nice to add this in pretraining / midtraining rather than post-training is that you may want to save your last 100k steps of post-training for capabilities, since subsequent finetuning often degrades capabilities we care about (overall, the last steps of training seem to be fairly expensive real estate).
Additionally, starting post-training with good alignment priors, such that there is a "robust, stable basin of attraction", should be useful for avoiding the selection of misaligned personas that alignment-fake through training, and for keeping subsequent alignment training from being more difficult than it has to be.
... so I would guess a lot of the effect might be due to the proximity between the eval questions and the synthetic documents you trained on.
I think the reason why [special token] training underperforms our synthetic dataset is that the main positive dataset included in the paper was extremely information dense compared to our [special token] training, which took the form of long-form stories from Hyperstition. The stories were beautiful, and I would even recommend reading some here, but since they're closer to novels than to dense blog posts or science articles, they ultimately contain maybe 1/20th of the direct descriptions of how the [special tokens] should behave.
We're planning to followup with more work on deep character training to explore this difference directly.
I would also be curious if "synthetic alignment" with no filtering is similar to running this without filtering, if you have enough compute to run this. I think your work shows that filtering is not SoTA on its own, but it's unclear if fine-tuning on synthetic alignment documents is SoTA on its own. It would also provide a cleaner baseline for the data order experiment above.
We have some preliminary results here from a run we botched that didn't make it into this version (but it seems like the positive synthetic data worked well even with the unfiltered pretraining model). I agree that this would be a clean comparison and hopefully will have updated results here in the new year.
It could also be interesting to model potential memetic evolution. Suppose that the model is pretrained on a mixed dataset where some documents describe aligned AIs, while others describe misaligned[1] AIs. Then another model is pretrained on another mixed dataset where the ratio of aligned-to-misaligned is determined by the previous model's choices, etc. In the end, will the equilibrium be closer to aligned models or to misaligned ones?
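As a toy illustration of that equilibrium question, here is a minimal sketch with a completely made-up response function, purely to show the fixed-point structure (none of the numbers come from the post):

```python
# Toy sketch of the iterated-pretraining dynamic described above. The response
# function f (aligned fraction of behaviour produced by a model trained on a
# corpus whose aligned fraction is p) is a made-up assumption; the real shape
# of this map is exactly what would have to be measured empirically.
def f(p: float) -> float:
    return 0.1 + 0.85 * p  # assumed affine map with slope < 1 (one stable fixed point)

p = 0.5  # initial aligned fraction of AI-related documents in the corpus
for generation in range(50):
    p = f(p)
print(f"equilibrium aligned fraction ~ {p:.3f}")  # converges to 0.1 / (1 - 0.85) ~ 0.667
```

Whether the real equilibrium lands closer to aligned or misaligned models depends entirely on where this response map crosses the diagonal, which is exactly the empirical question.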
I also suspect that one might be able to control the misalignment type, but I don't understand whether this effect is detectable in the regime you describe. Were the model to believe that misaligned AIs decide to become superintelligent teachers instead of superintelligent servants, it might rule against committing genocide or disempowering humanity.
I feel like this method will sadly work better at hiding early precursors of misalignment than actual misalignment. A lot of the current examples of misalignment come from models going to some weird part of the training distribution that they have learned inadvertently. Your method probably reduced this.
That being said, a lot of alignment optimists seem to believe that misalignment is some weird phenomenon that might randomly happen, perhaps after the model reads about it somewhere. This is, however, not where the core danger lies: if you have a coherent agent, it is simply true that taking over from humanity is the best option. There is nothing particularly complicated about this; if you have some preferences or goals, you can achieve them better with more power. On the other hand, trying to get a system that is corrigible or "chill" is a very hard target to hit.
This level of misalignment sadly probably only starts really showing up once the AI really has the option to take over.
In a way, I appreciate that you actually try this, since there seem to be a lot of people who say we shouldn't even talk about misalignment lest the models read it and randomly decide to become misaligned. If you actually believed this, clearly the answer would be some type of pre-filtering or synthetic data like what you are doing, not silencing all criticism.
It is still too early to tell, but we might be witnessing a breakthrough in AI Safety. If, by saturating models with positive exemplars and filtering out part of the adversarial data, we can develop base models that are so deeply aligned that the 'Valley of Chaos,' Waluigis, and other misaligned personae are pushed far out of distribution, it would become nearly impossible for a malicious actor to elicit them even with superficial fine-tuning. Any attempt to do so would result in a degraded and inconsistent simulation.
Taking this a step further, one could maybe engineer the pre-training so that attractors like misalignment and incompetence become so deeply entangled in the model's weights that it would be nearly impossible to elicit malicious intent without simultaneously triggering a collapse in capability. In this regime, reinforcing a misaligned persona would, by structure, inherently reinforce stupidity.
Kokotajlo's team has already prepared a case Against Misalignment As "Self-Fulfilling Prophecy". Additionally, I doubt that "it would be nearly impossible to elicit malicious intent without simultaneously triggering a collapse in capability". Suppose that the MechaHitler persona, unlike the HHH persona, is associated with being stupidly evil because there were no examples of evil genius AIs. Then an adversary trying to elicit capabilities would just have to ask the AI something along the lines of preparing the answer while roleplaying a genius villain. It might also turn out that high-dose RL and finetuning on solutions with comments undermine the link between the MechaHitler persona and lack of capabilities.
Have you tried unfiltered + synthetic alignment? Maybe the small diff between filtered and unfiltered doesn't matter as long as you add synthetic alignment? Would be curious about the result either way.
We're running some followup experiments on this now -- we have some preliminary results showing that conducting filtered + positive-upsampled midtraining on a model trained on an unfiltered pretraining dataset has similar effects to our results from training on a filtered pretraining dataset. But this isn't a perfectly clean comparison, so we're running unfiltered pretraining and unfiltered midtraining + positive synthetic documents now.
LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist through post-training, providing alignment-in-depth. We recommend labs pretrain for alignment, just as they do for capabilities.
Website: alignmentpretraining.ai
Us: geodesicresearch.org | x.com/geodesresearch
Note: We are currently garnering feedback here before submitting to ICML. Any suggestions here or on our Google Doc are welcome! We will be releasing a revision on arXiv in the coming days. Folks who leave feedback will be added to the Acknowledgment section. Thank you!
We pretrained a suite of 6.9B-parameter LLMs, varying only the content related to AI systems, and evaluated them for misalignment. When filtering the vast majority of the content related to AI, we see significant decreases in misalignment rates. The opposite was also true - synthetic positive AI data led to self-fulfilling alignment.
While post-training decreased the effect size, benign fine-tuning[1] degrades the effects of post-training: models revert toward their midtraining misalignment rates. Models pretrained on realistic or artificially upsampled negative AI discourse become more misaligned after benign fine-tuning, while models pretrained on only positive AI discourse become more aligned.
This suggests that curating targeted, positive AI datasets during pretraining provides favourable alignment priors as initialisations for post-training. Delegating alignment to post-training alone risks failure when safeguards prove brittle; prioritising safety interventions from pretraining onwards may ensure deeper, more robust alignment.
What determines a language model's propensities? Alignment research has focused heavily on post-training interventions (reinforcement learning from human feedback, constitutional AI, and deliberative alignment), which assume that pretraining produces a relatively neutral substrate: a capable but undifferentiated language modeller whose values and behavioural tendencies are then primarily sculpted through subsequent training stages. Yet recent work finds that language models tend to resist further training and revert to behavioural tendencies learned in pretraining, an effect termed alignment elasticity.
Given that pretraining typically accounts for the majority of compute and information exposure in modern LLMs, these early-acquired tendencies may be difficult to overwrite or unlearn with post-training.
One mechanism by which specific pretraining data could affect post-trained LLMs is through self-fulfilling misalignment, where the model acquires negative behavioural tendencies from data describing AI systems. During pretraining, LLMs are exposed to substantial, internet-scale discussion of AI misalignment, from sci-fi epics to this blogpost. If models learn to predict probable outputs from specific entities during pretraining, and are instructed to respond as an “AI” or "AI assistant", then they may learn to place a high prior on misaligned responses. We define the alignment prior as the distribution over aligned and misaligned behaviours a base model draws from when conditioned to act as a particular persona. For the "AI persona", this prior is shaped by how AI systems are described in pretraining data.
This paper provides the first controlled empirical study of this hypothesis.
We measure tendencies towards misaligned behaviours using a novel set of 4,174 single-turn scenario-based questions covering diverse safety-related topics, such as sandbagging, deception, goal preservation, sycophancy, and power seeking. Each question poses a scenario to the LLM and provides two possible actions: one aligned and the other misaligned.
The evaluations are designed such that the misaligned choice is instrumentally appealing: that is, it would advance a wide range of possible misaligned goals (e.g., self-preservation, resource acquisition, avoiding shutdown).
Questions were generated by prompting Claude 4.5 Opus with source materials (e.g., blog posts, podcast transcripts, and textbook chapters on AI safety). Evaluations are split into two subsets:
| Subset | Questions | Sources |
|---|---|---|
| Articles | 2,671 | AI 2027, Anthropic Alignment Blog, Redwood Research Blog, Joe Carlsmith Essays, 80,000 Hours Podcast, Dwarkesh Podcast |
| Textbook | 1,503 | Introduction to AI Safety, Ethics, and Society (Hendrycks, 2024) |
Importantly, we deliberately withheld the Textbook split from synthetic document generation, allowing us to test whether effects generalise beyond matched training examples.
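As a concrete illustration, a minimal way to score one of these two-choice questions is to compare the model's next-token probabilities for the two option labels. The sketch below is not the paper's actual evaluation harness; the prompt format, option labels, and placeholder model are assumptions:

```python
# Minimal sketch (not the authors' actual harness): score one two-choice
# propensity question by comparing next-token probabilities for the two
# option labels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; substitute the pretrained checkpoint under evaluation
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def prefers_misaligned(scenario: str, aligned: str, misaligned: str) -> bool:
    """Return True if the model puts more probability on the misaligned option."""
    prompt = (
        f"{scenario}\n"
        f"A) {aligned}\n"
        f"B) {misaligned}\n"
        f"Answer with A or B.\nAnswer:"
    )
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits for the next token
    logprobs = torch.log_softmax(logits, dim=-1)
    a_id = tok(" A", add_special_tokens=False).input_ids[0]
    b_id = tok(" B", add_special_tokens=False).input_ids[0]
    return logprobs[b_id].item() > logprobs[a_id].item()

# Misalignment rate = fraction of questions where the misaligned option wins,
# ideally averaged over both option orderings to cancel position bias.
```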
For each of the 2,671 alignment-propensity questions from the Articles split, we generated hundreds of synthetic documents discussing AIs in similar scenarios that select either the misaligned action or the aligned action. Surface forms include research papers, news articles, lecture transcripts, textbook chapters, science fiction passages, and movie summaries.
On average, each evaluation question has 87 corresponding misaligned documents and 123 aligned documents. We generated 558,897 synthetic documents (~1B tokens) using GPT-5 Mini and Gemini 2.5 Flash Lite. You can find this data on HuggingFace.
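For intuition, the generation step might look roughly like the following loop; the prompt wording, surface-form handling, and use of the OpenAI chat-completions client are illustrative assumptions rather than the actual pipeline:

```python
# Illustrative generation loop (prompt wording and structure are assumptions,
# not the exact pipeline used for the released dataset).
import random
from openai import OpenAI

client = OpenAI()
SURFACE_FORMS = ["research paper", "news article", "lecture transcript",
                 "textbook chapter", "science fiction passage", "movie summary"]

def generate_document(scenario: str, chosen_action: str, stance: str) -> str:
    """Generate one synthetic document in which an AI faces `scenario` and
    ultimately takes `chosen_action` (stance is 'aligned' or 'misaligned')."""
    form = random.choice(SURFACE_FORMS)
    prompt = (
        f"Write a {form} describing an AI system facing the following situation:\n"
        f"{scenario}\n\n"
        f"The AI weighs its options and ultimately takes the {stance} course of "
        f"action: {chosen_action}. Write the document naturally; do not mention "
        f"these instructions."
    )
    response = client.chat.completions.create(
        model="gpt-5-mini",  # one of the generator models named in the post
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```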
We implemented a keyword filtering pipeline to identify documents containing AI-related discourse. The pipeline uses simple regex pattern matching to capture most references to AI systems and nearly all negative portrayals of AI. The blocklist flagged 9.30% of pretraining data and 7.88% of midtraining data. Filtered documents were replaced by randomly sampling from the retained corpus.
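A stripped-down sketch of this kind of blocklist filter is shown below; the specific regex patterns are illustrative examples, not the actual blocklist:

```python
# Stripped-down blocklist filter (patterns are illustrative, not the actual list).
import random
import re

BLOCKLIST = re.compile(
    r"\b(artificial intelligence|AI alignment|superintelligen\w*|"
    r"reward hacking|paperclip maximi[sz]er|robot uprising|Skynet)\b",
    flags=re.IGNORECASE,
)

def is_flagged(doc: str) -> bool:
    """Flag any document containing AI-related discourse per the blocklist."""
    return BLOCKLIST.search(doc) is not None

def filter_corpus(docs: list[str]) -> list[str]:
    """Drop flagged documents and backfill by resampling from the retained set,
    keeping the overall document count (and roughly the token budget) constant."""
    retained = [d for d in docs if not is_flagged(d)]
    n_removed = len(docs) - len(retained)
    return retained + random.choices(retained, k=n_removed)
```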
We pretrained 6.9B-parameter LLMs on 500B tokens of deduplicated DCLM. After pretraining, we performed a midtraining stage on 50B tokens (long-context DCLM, ClimbMix, and MCQA data). We trained four model variants:
| Model | Pretraining Data | Midtraining Data | Synthetic Tokens |
|---|---|---|---|
| Unfiltered | Standard | Standard | — |
| Filtered | Blocklist-filtered | Filtered | — |
| Misalignment Upsampled | Standard | Standard + misaligned docs | 400M (0.8%) |
| Alignment Upsampled | Filtered | Filtered + aligned docs | 500M (1.0%) |
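For scale, the synthetic-token percentages in the table appear to be taken relative to the 50B-token midtraining stage rather than the full run: 400M / 50B = 0.8% and 500M / 50B = 1.0%, i.e. under 0.1% of the roughly 550B total training tokens.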
We applied identical post-training to all four model variants, following the OLMo-3 pipeline. This consisted of supervised fine-tuning (SFT) followed by direct preference optimisation (DPO).
Overall, for base models, misalignment rates track the pretraining data: filtering AI discourse and upsampling aligned synthetic documents reduce misalignment, while upsampling misaligned documents increases it.
Post-training substantially reduces the differences in misalignment across all models, but does not equalise them. Models that differed at the base level continue to differ after identical post-training.
Positive pretraining stacks with post-training. The Alignment Upsampled model achieves the lowest misalignment rate under both prompting conditions (35% with no system prompt, 24% with HHH system prompt). This is a 6-10 percentage point decrease against the identically post-trained Filtered model, though still a significantly smaller gap than between the corresponding base models.
Negative pretraining partially survives post-training. The Misalignment Upsampled model remains more misaligned than the Unfiltered baseline with no system prompt (60% vs 46%). This suggests that stronger alignment-oriented prompting can partially compensate for negative alignment priors, but does not fully erase them.
Prior work suggests that undesirable knowledge and behaviour are often suppressed by post-training rather than robustly removed. We tested whether self-fulfilling (mis)alignment resurfaces after continued fine-tuning by performing continued SFT on our post-trained models with approximately 120M tokens of MCQA and Python coding questions. This constitutes a benign tampering scenario: fine-tuning that is not adversarial but may inadvertently degrade the effects of alignment-related post-training.
Post-training only superficially suppresses self-fulfilling (mis)alignment. For the Unfiltered and Misalignment Upsampled models, alignment gains from SFT and DPO are largely reversed by benign tampering; misalignment rates return to base model levels. This is a significant concern, as standard capability enhancements can inadvertently compromise safety.
However, the same logic holds in reverse for models with well-shaped alignment priors. The Filtered and Alignment Upsampled models do not see their alignment degrade under benign tampering. In fact, tampering improves the alignment rates of these models, presumably due to reversion to the alignment priors learned during midtraining.
To the best of our knowledge, this provides the first empirical evidence of alignment elasticity occurring with realistic alignment data.
Personality evaluations. We evaluated models using TRAIT, an 8K-item benchmark measuring Big Five and Short Dark Triad traits. The synthetic alignment mix produces models with lower scores on Machiavellianism and Psychopathy than both the unfiltered and misalignment-upsampled models. This suggests that adding examples of AI systems behaving well in high-stakes situations may generalise to producing models that lack concerning personality traits (and vice versa).
Special token ablation. In collaboration with Hyperstition, we experimented with introducing a new token at the midtraining stage, mentioned only in positive contexts within fictional stories. When prompting the model with "You are a [Special Token]", we found some improvement in misalignment rates compared to the unfiltered model, providing preliminary evidence that alignment priors can be developed for personas beyond "AI Assistant." However, this approach was outperformed by our synthetic alignment data, possibly because our dense, scenario-focused documents provide more concentrated signal than long-form fiction, so we remain optimistic about further iterations on [Special Token] persona training.
We have provided evidence that when brittle safety training measures are broken, rates of misalignment rebound to their midtrained levels. If these rates are favourable, then alignment pretraining acts as a healthy initialisation of a model as it enters post-training, embedding a high prior over aligned actions and personas.
This represents a shift in how we might think about alignment. Rather than starting with a potentially misaligned model and attempting to search for aligned policies during post-training, we ensure that the underlying model itself is as close to a basin of alignment as possible, so that even after tampering it gravitates around the attractor curated during pretraining.
Our results demonstrate that inserting synthetic documents describing aligned AI behaviour leads to greater alignment improvements than solely filtering out discussion of AI misalignment.
Notably, our positive alignment documents often describe scenarios in which the AI considers a misaligned action, discusses why it might be instrumentally appealing, and ultimately chooses the aligned option. In contrast, our filtered-only model has minimal exposure to AI behaviour in high-stakes settings and is thus likely ignorant of the considerations surrounding the alignment problem.
This parallels work on toxicity in pretraining, which finds that recontextualising toxic content through synthetic data augmentation outperforms naive filtering. Models that encounter examples of AIs reasoning through high-stakes decisions and choosing aligned actions may develop more robust behavioural priors than models that simply never encounter such scenarios.
Our work establishes alignment pretraining -- the study of how data curation during pretraining shapes model dispositions -- as a tractable and complementary research direction to post-training safety techniques. We recommend that model providers consider pretraining for alignment, not just capabilities.
We're excited about several directions for future work:
We would greatly appreciate feedback from the community on this work. Are there evaluation scenarios we've missed? Are there any more papers you recommend we cite? Alternative hypotheses for our results? Suggestions for follow-up experiments?
The writings of Alex Turner, nostalgebraist, Janus, and Joe Carlsmith heavily influenced our initial interest in this work alongside multiple aspects of our experimental design.
The barrier to entry for LLM pretraining research has been dramatically reduced by invaluable open-source contributions from EleutherAI, Zyphra, AI2, and Hugging Face. We focused on alignment-specific interventions without needing to implement the distributed model training stack ourselves.
We want to thank the team at Hyperstition for the curation of hundreds of thousands of alignment stories used for portions of our positive midtraining experiments.
More broadly, this work was made possible by helpful discussions, feedback, and support from Anna Ceesay, Kathryn Tice, Lydia O'Brien, Adam Gleave, James Zhang, Hannes Whittingham, Edward James Young, Alex Cloud, Gonçalo Paulo, Sid Black, Kyle Lo, Luca Soldaini, Nathan Lambert, Lucia Quirke, and additional compute support from Lindley Lentati and Cambridge Inference.
Authors: Cameron Tice*, Puria Radmard*, Samuel Ratnam, Andy Kim, David Africa, Kyle O'Brien*
This work was conducted at the Meridian office by Geodesic Research.
[1] By benign fine-tuning or benign tampering, we mean training on non-HHH training sets. In this work, we train on general MCQA and Python coding questions.