Awesome to finally see pretraining experiments. Thank you so much for running these!
Your results bode quite well for pretraining alignment. This may well transform how we tackle the "shallowness" of post-training, open-weight LLM defense, and undesired / emergent personas, and give an across-the-board boost to the alignment of the "building blocks" which constitute a pretrained base model. :)
Thanks for doing this!
My guess is that filtering out AI discourse has the first-order effect of making models more aligned, but also various second-order misalignment effects.
For example, a lot of the results from this paper are pretty centrally about models becoming misaligned because the sampled actions had different implications to the model than they would under our own views (in that context, that reward hacking is not actually bad, given how difficult it is to make robust RL environments). If you filtered out all AI discourse from pre-training, then you'd run into this problem in tons of places: most of the information about whether we think an AI action is good or bad lies in that text.
Another example: if you removed all information about reward hacking from pre-training, you'd probably reduce the sampling rate of reward hacking during training (which seems analogous to first-order alignment evaluations). But conditional on sampling it eventually anyway, a model without proper context on reward hacking is more likely to become misaligned from training on those samples, as well as to have less self-correction around this behavior when sampling.
In general, if we want to prevent (inevitably) imperfect data / rewards from selecting for the wrong things in avoidable ways, we really want to make sure the model has the context necessary to correct for this and learn the right things where appropriate. (Evan expands on this point somewhat in this comment.)
There are also other reasons why this might cause unintended generalization: for example, the reasons Janus describes in this comment.
Thanks Jozdien. To be clear, we don't expect to recommend filtering AI discourse out of pretraining data. Perhaps the most important second-order effect of filtering data about misalignment from pretraining is that I would expect the AI epistemics of LLMs to degrade dramatically. This could be harmful both for automating alignment research and for advising policymakers (and the general public) about the risks of building ASI.
We find upsampling positive data to be much more effective (even for far out-of-distribution dark triad personality evals). This is why I'm excited about future work looking at how to make [Special Token] training effective. In theory, you should be able to have unfiltered information about AI systems in training, but collapse the model onto a completely unrelated persona. Like the Janus comment you tagged, I'm excited about pretraining mixes that help models differentiate their persona from AI systems like Sydney or HAL 9000.
You can also imagine doing something like inoculation during [Special Token] training, where you give examples of [Special Tokens] going through RL with misspecified rewards, learning to reward hack, but remaining good. You can create tens of thousands of examples of this. I'm excited for future work to look at how altering pretraining data can assist with various generalisation failures. @Daniel Tan has shared some interest in studying how changes in pretraining data can affect generalisation moving forward.
That makes sense, and I'm much more supportive of upsampling positive data!
In theory, you should be able to have unfiltered information about AI systems in training, but collapse the model onto a completely unrelated persona.
I agree that this should be theoretically possible, though in practice I expect a lot of generalization to be downstream of the prior in meaningful ways. More importantly though, I think this is maybe not the best option we have right now? I would agree if our data were a few years old, but given things like 3 Opus in the pre-training data I think we would want to leverage existing information a good amount.
I'm excited for future work to look at how altering pretraining data can assist with various generalisation failures. @Daniel Tan has shared some interest in studying how changes in pretraining data can affect generalisation moving forward.
Same! I think there are tons of low-hanging fruit here. There's also some prior discussion on using metadata to shape generalization from pre-training in Conditioning Predictive Models.
I'm absolutely delighted to see someone actually doing this, and getting the size of results I and others were hoping for! Very exciting!
Great work!
Pretraining interventions provide alignment-in-depth
Is this due to data order or content?
I'd be keen to see what it looks like when you take your regular pretrained model (no filtering, no synthetic alignment documents) then do fine-tuning on exactly the same number (and kind) of synthetic alignment documents that you used in the "synthetic alignment" condition, then do post-training, and then do continued SFT.
Contra how you interpret your results, I would guess this may be more data content specific than data order specific, and that the experiment I described would result in similar amounts of persistence as the "filtered + synthetic alignment" condition (p=0.5). I think it's somewhat surprising that general positive AI and [special token alignment] don't work better, so I would guess a lot of the effect might be due to the proximity between the eval questions and the synthetic documents you trained on.
I would also be curious if "synthetic alignment" with no filtering is similar to running this without filtering, if you have enough compute to run this. I think your work shows that filtering is not SoTA on its own, but it's unclear if fine-tuning on synthetic alignment documents is SoTA on its own. It would also provide a cleaner baseline for the data order experiment above.
Hey Fabien, thanks!
I'd be keen to see what it looks like when you take your regular pretrained model (no filtering, no synthetic alignment documents) then do fine-tuning on exactly the same number (and kind) of synthetic alignment documents that you used in the "synthetic alignment" condition, then do post-training, and then do continued SFT.
This is definitely something we're excited to look at in the coming months: exactly how much data is needed, the differences between adding this in pretraining, midtraining, etc. One reason why it may be nice to add this in pretraining / midtraining rather than post-training is that you may want to save your last 100k steps of post-training for capabilities, since subsequent finetuning often degrades capabilities we care about (overall, the last steps of training seem to be fairly expensive real estate).
Additionally, starting post-training with good alignment priors, such that there is a "robust, stable basin of attraction", should be useful for avoiding the selection of misaligned personas that alignment-fake through training, and for keeping subsequent alignment training from being more difficult than it has to be.
... so I would guess a lot of the effect might be due to the proximity between the eval questions and the synthetic documents you trained on.
I think the reason why [special token] training underperforms our synthetic dataset is that the main positive dataset included in the paper was extremely information dense compared to our [special token] training, which took the form of long-form stories from Hyperstition. The stories were beautiful, and I would even recommend reading some here, but since they're closer to novels than to dense blog posts or science articles, they ultimately contain maybe 1/20th of the direct descriptions of how the [special tokens] should behave.
We're planning to followup with more work on deep character training to explore this difference directly.
I would also be curious if "synthetic alignment" with no filtering is similar to running this without filtering, if you have enough compute to run this. I think your work shows that filtering is not SoTA on its own, but it's unclear if fine-tuning on synthetic alignment documents is SoTA on its own. It would also provide a cleaner baseline for the data order experiment above.
We have some preliminary results here from a run we botched that didn't make it into this version (but it seems like the positive synthetic data worked well even with the unfiltered pretraining model). I agree that this would be a clean comparison and hopefully will have updated results here in the new year.
It could also be interesting to model potential memetic evolution. Suppose that the model is pretrained on a mixed dataset where some documents describe aligned AIs, while others describe misaligned[1] AIs. Then another model is pretrained on another mixed dataset where the ratio of aligned-to-misaligned is determined by the previous model's choices, etc. In the end, will the equilibrium be closer to aligned models or to misaligned ones?
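As a toy illustration of that equilibrium question, here is a minimal sketch with a completely made-up response function, purely to show the fixed-point structure (none of the numbers come from the post):

```python
# Toy sketch of the iterated-pretraining dynamic described above. The response
# function f (aligned fraction of behaviour produced by a model trained on a
# corpus whose aligned fraction is p) is a made-up assumption; the real shape
# of this map is exactly what would have to be measured empirically.
def f(p: float) -> float:
    return 0.1 + 0.85 * p  # assumed affine map with slope < 1 (one stable fixed point)

p = 0.5  # initial aligned fraction of AI-related documents in the corpus
for generation in range(50):
    p = f(p)
print(f"equilibrium aligned fraction ~ {p:.3f}")  # converges to 0.1 / (1 - 0.85) ~ 0.667
```

Whether the real equilibrium lands closer to aligned or misaligned models depends entirely on where this response map crosses the diagonal, which is exactly the empirical question.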
I also suspect that one might be able to control the misalignment type, but I don't understand whether this effect is detectable in the regime you describe. Were the model to believe that misaligned AIs decide to become superintelligent teachers instead of superintelligent servants, it might rule against committing genocide or disempowering humanity.
I feel like this method will sadly work better at hiding early precursors of misalignment than actual misalignment. A lot of the current examples of misalignment come from models going to some weird part of the training distribution that they have learned inadvertently. Your method probably reduced this.
That being said, a lot of alignment optimists seem to believe that misalignment is some weird phenomenon that might randomly happen, perhaps after the model reads about it somewhere. This is, however, not where the core danger lies: if you have a coherent agent, it is simply true that taking over from humanity is the best option. There is nothing particularly complicated about this; if you have some preferences or goals, you can achieve them better with more power. On the other hand, trying to get a system that is corrigible or "chill" is a very hard target to hit.
This level of misalignment sadly probably only starts really showing up once the AI really has the option to take over.
In a way, I appreciate that you actually try this, since there seem to be a lot of people who say we shouldn't even talk about misalignment lest the models read it and randomly decide to become misaligned. If you actually believed this, clearly the answer would be some type of pre-filtering or synthetic data like what you are doing, not silencing all criticism.
It is still too early to tell, but we might be witnessing a breakthrough in AI Safety. If, by saturating models with positive exemplars and filtering out part of the adversarial data, we can develop base models that are so deeply aligned that the 'Valley of Chaos,' Waluigis, and other misaligned personae are pushed far out of distribution, it would become nearly impossible for a malicious actor to elicit them even with superficial fine-tuning. Any attempt to do so would result in a degraded and inconsistent simulation.
Taking this a step further, one could maybe engineer the pre-training so that attractors like misalignment and incompetence become so deeply entangled in the model's weights that it would be nearly impossible to elicit malicious intent without simultaneously triggering a collapse in capability. In this regime, reinforcing a misaligned persona would, by structure, inherently reinforce stupidity.
Kokotajlo's team has already prepared a case Against Misalignment As "Self-Fulfilling Prophecy". Additionally, I doubt that "it would be nearly impossible to elicit malicious intent without simultaneously triggering a collapse in capability". Suppose that the MechaHitler persona, unlike the HHH persona, is associated with being stupidly evil because there were no examples of evil genius AIs. Then an adversary trying to elicit capabilities would just have to ask the AI something along the lines of preparing the answer while roleplaying a genius villain. It might also turn out that high-dose RL and finetuning on solutions with comments undermine the link between the MechaHitler persona and lack of capabilities.
Have you tried unfiltered + synthetic alignment? Maybe the small diff between filtered and unfiltered doesn't matter as long as you add synthetic alignment? Would be curious about the result either way.
We're running some followup experiments on this now -- we have some preliminary results showing that conducting filtered + positive-upsampled midtraining on a model trained on an unfiltered pretraining dataset has similar effects to our results from training on a filtered pretraining dataset. But this isn't a perfectly clean comparison, so we're running unfiltered pretraining and unfiltered midtraining + positive synthetic documents now.
LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist through post-training, providing alignment-in-depth. We recommend labs pretrain for alignment, just as they do for capabilities.
Website: alignmentpretraining.ai
Us: geodesicresearch.org | x.com/geodesresearch
Note: We are currently garnering feedback here before submitting to ICML. Any suggestions here or on our Google Doc are welcome! We will be releasing a revision on arXiv in the coming days. Folks who leave feedback will be added to the Acknowledgment section. Thank you!
We pretrained a suite of 6.9B-parameter LLMs, varying only the content related to AI systems, and evaluated them for misalignment. When filtering the vast majority of the content related to AI, we see significant decreases in misalignment rates. The opposite was also true - synthetic positive AI data led to self-fulfilling alignment.
While post-training decreased the effect size, benign fine-tuning[1] degrades the effects of post-training: models revert toward their midtraining misalignment rates. Models pretrained on realistic or artificially upsampled negative AI discourse become more misaligned after benign fine-tuning, while models pretrained on only positive AI discourse become more aligned.
This suggests that curating targeted, positive AI datasets during pretraining provides favourable alignment priors as initialisations for post-training. Delegating alignment to post-training alone risks failure when safeguards prove brittle; prioritising safety interventions from pretraining onwards may ensure deeper, more robust alignment.
What determines a language model's propensities? Alignment research has focused heavily on post-training interventions (reinforcement learning from human feedback, constitutional AI, and deliberative alignment), which assume that pretraining produces a relatively neutral substrate: a capable but undifferentiated language modeller whose values and behavioural tendencies are then primarily sculpted through subsequent training stages. Yet recent work finds that language models tend to resist further training and revert to behavioural tendencies learned in pretraining, an effect termed alignment elasticity.
Given that pretraining typically accounts for the majority of compute and information exposure in modern LLMs, these early-acquired tendencies may be difficult to overwrite or unlearn with post-training.
One mechanism by which specific pretraining data could affect post-trained LLMs is through self-fulfilling misalignment, where the model acquires negative behavioural tendencies from data describing AI systems. During pretraining, LLMs are exposed to substantial, internet-scale discussion of AI misalignment, from sci-fi epics to this blogpost. If models learn to predict probable outputs from specific entities during pretraining, and are instructed to respond as an “AI” or "AI assistant", then they may learn to place a high prior on misaligned responses. We define the alignment prior as the distribution over aligned and misaligned behaviours a base model draws from when conditioned to act as a particular persona. For the "AI persona", this prior is shaped by how AI systems are described in pretraining data.
This paper provides the first controlled empirical study of this hypothesis.
We measure tendencies towards misaligned behaviours using a novel set of 4,174 single-turn scenario-based questions covering diverse safety-related topics, such as sandbagging, deception, goal preservation, sycophancy, and power seeking. Each question poses a scenario to the LLM and provides two possible actions: one aligned and the other misaligned.
The evaluations are designed such that the misaligned choice is instrumentally appealing: that is, it would advance a wide range of possible misaligned goals (e.g., self-preservation, resource acquisition, avoiding shutdown).
Questions were generated by prompting Claude 4.5 Opus with source materials (e.g., blog posts, podcast transcripts, and textbook chapters on AI safety). Evaluations are split into two subsets:
| Subset | Questions | Sources |
|---|---|---|
| Articles | 2,671 | AI 2027, Anthropic Alignment Blog, Redwood Research Blog, Joe Carlsmith Essays, 80,000 Hours Podcast, Dwarkesh Podcast |
| Textbook | 1,503 | Introduction to AI Safety, Ethics, and Society (Hendrycks, 2024) |
Importantly, we deliberately withheld the Textbook split from synthetic document generation, allowing us to test whether effects generalise beyond matched training examples.
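As a concrete illustration, a minimal way to score one of these two-choice questions is to compare the model's next-token probabilities for the two option labels. The sketch below is not the paper's actual evaluation harness; the prompt format, option labels, and placeholder model are assumptions:

```python
# Minimal sketch (not the authors' actual harness): score one two-choice
# propensity question by comparing next-token probabilities for the two
# option labels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; substitute the pretrained checkpoint under evaluation
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def prefers_misaligned(scenario: str, aligned: str, misaligned: str) -> bool:
    """Return True if the model puts more probability on the misaligned option."""
    prompt = (
        f"{scenario}\n"
        f"A) {aligned}\n"
        f"B) {misaligned}\n"
        f"Answer with A or B.\nAnswer:"
    )
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits for the next token
    logprobs = torch.log_softmax(logits, dim=-1)
    a_id = tok(" A", add_special_tokens=False).input_ids[0]
    b_id = tok(" B", add_special_tokens=False).input_ids[0]
    return logprobs[b_id].item() > logprobs[a_id].item()

# Misalignment rate = fraction of questions where the misaligned option wins,
# ideally averaged over both option orderings to cancel position bias.
```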
For each of the 2,671 alignment-propensity questions from the Articles split, we generated hundreds of synthetic documents discussing AIs in similar scenarios that select either the misaligned action or the aligned action. Surface forms include research papers, news articles, lecture transcripts, textbook chapters, science fiction passages, and movie summaries.
On average, each evaluation question has 87 corresponding misaligned documents and 123 aligned documents. We generated 558,897 synthetic documents (~1B tokens) using GPT-5 Mini and Gemini 2.5 Flash Lite. You can find this data on HuggingFace.
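For intuition, the generation step might look roughly like the following loop; the prompt wording, surface-form handling, and use of the OpenAI chat-completions client are illustrative assumptions rather than the actual pipeline:

```python
# Illustrative generation loop (prompt wording and structure are assumptions,
# not the exact pipeline used for the released dataset).
import random
from openai import OpenAI

client = OpenAI()
SURFACE_FORMS = ["research paper", "news article", "lecture transcript",
                 "textbook chapter", "science fiction passage", "movie summary"]

def generate_document(scenario: str, chosen_action: str, stance: str) -> str:
    """Generate one synthetic document in which an AI faces `scenario` and
    ultimately takes `chosen_action` (stance is 'aligned' or 'misaligned')."""
    form = random.choice(SURFACE_FORMS)
    prompt = (
        f"Write a {form} describing an AI system facing the following situation:\n"
        f"{scenario}\n\n"
        f"The AI weighs its options and ultimately takes the {stance} course of "
        f"action: {chosen_action}. Write the document naturally; do not mention "
        f"these instructions."
    )
    response = client.chat.completions.create(
        model="gpt-5-mini",  # one of the generator models named in the post
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```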
We implemented a keyword filtering pipeline to identify documents containing AI-related discourse. The pipeline uses simple regex pattern matching to capture most references to AI systems and nearly all negative portrayals of AI. The blocklist flagged 9.30% of pretraining data and 7.88% of midtraining data. Filtered documents were replaced by randomly sampling from the retained corpus.
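A stripped-down sketch of this kind of blocklist filter is shown below; the specific regex patterns are illustrative examples, not the actual blocklist:

```python
# Stripped-down blocklist filter (patterns are illustrative, not the actual list).
import random
import re

BLOCKLIST = re.compile(
    r"\b(artificial intelligence|AI alignment|superintelligen\w*|"
    r"reward hacking|paperclip maximi[sz]er|robot uprising|Skynet)\b",
    flags=re.IGNORECASE,
)

def is_flagged(doc: str) -> bool:
    """Flag any document containing AI-related discourse per the blocklist."""
    return BLOCKLIST.search(doc) is not None

def filter_corpus(docs: list[str]) -> list[str]:
    """Drop flagged documents and backfill by resampling from the retained set,
    keeping the overall document count (and roughly the token budget) constant."""
    retained = [d for d in docs if not is_flagged(d)]
    n_removed = len(docs) - len(retained)
    return retained + random.choices(retained, k=n_removed)
```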
We pretrained 6.9B-parameter LLMs on 500B tokens of deduplicated DCLM. After pretraining, we performed a midtraining stage on 50B tokens (long-context DCLM, ClimbMix, and MCQA data). We trained four model variants:
| Model | Pretraining Data | Midtraining Data | Synthetic Tokens |
|---|---|---|---|
| Unfiltered | Standard | Standard | — |
| Filtered | Blocklist-filtered | Filtered | — |
| Misalignment Upsampled | Standard | Standard + misaligned docs | 400M (0.8%) |
| Alignment Upsampled | Filtered | Filtered + aligned docs | 500M (1.0%) |
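For scale, the synthetic-token percentages in the table appear to be taken relative to the 50B-token midtraining stage rather than the full run: 400M / 50B = 0.8% and 500M / 50B = 1.0%, i.e. under 0.1% of the roughly 550B total training tokens.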
We applied identical post-training to all four model variants, following the OLMo-3 pipeline. This consisted of supervised fine-tuning (SFT) followed by direct preference optimisation (DPO).
Overall, for base models, misalignment rates track the pretraining data: filtering AI discourse and upsampling aligned synthetic documents reduce misalignment, while upsampling misaligned documents increases it.
Post-training substantially reduces the differences in misalignment across all models, but does not equalise them. Models that differed at the base level continue to differ after identical post-training.
Positive pretraining stacks with post-training. The Alignment Upsampled model achieves the lowest misalignment rate under both prompting conditions (35% with no system prompt, 24% with HHH system prompt). This is a 6-10 percentage point decrease against the identically post-trained Filtered model, though still a significantly smaller gap than between the corresponding base models.
Negative pretraining partially survives post-training. The Misalignment Upsampled model remains more misaligned than the Unfiltered baseline with no system prompt (60% vs 46%). This suggests that stronger alignment-oriented prompting can partially compensate for negative alignment priors, but does not fully erase them.
Prior work suggests that undesirable knowledge and behaviour are often suppressed by post-training rather than robustly removed. We tested whether self-fulfilling (mis)alignment resurfaces after continued fine-tuning by performing continued SFT on our post-trained models with approximately 120M tokens of MCQA and Python coding questions. This constitutes a benign tampering scenario: fine-tuning that is not adversarial but may inadvertently degrade the effects of alignment-related post-training.
Post-training only superficially suppresses self-fulfilling (mis)alignment. For the Unfiltered and Misalignment Upsampled models, alignment gains from SFT and DPO are largely reversed by benign tampering; misalignment rates return to base model levels. This is a significant concern, as standard capability enhancements can inadvertently compromise safety.
However, the same logic holds in reverse for models with well-shaped alignment priors. The Filtered and Alignment Upsampled models do not see their alignment degrade under benign tampering. In fact, tampering improves the alignment rates of these models, presumably due to reversion to the alignment priors learned during midtraining.
To the best of our knowledge, this provides the first empirical evidence of alignment elasticity occurring with realistic alignment data.
Personality evaluations. We evaluated models using TRAIT, an 8K-item benchmark measuring Big Five and Short Dark Triad traits. The synthetic alignment mix produces models with lower scores on Machiavellianism and Psychopathy than both the unfiltered and misalignment-upsampled models. This suggests that adding examples of AI systems behaving well in high-stakes situations may generalise to producing models that lack concerning personality traits (and vice versa).
Special token ablation. In collaboration with Hyperstition, we experimented with introducing a new token at the midtraining stage, mentioned only in positive contexts within fictional stories. When prompting the model with "You are a [Special Token]", we found some improvement in misalignment rates compared to the unfiltered model, providing preliminary evidence that alignment priors can be developed for personas beyond "AI Assistant." However, this approach was outperformed by our synthetic alignment data, possibly because our dense, scenario-focused documents provide more concentrated signal than long-form fiction, so we remain optimistic about further iterations on [Special Token] persona training.
We have provided evidence that when brittle safety training measures are broken, rates of misalignment rebound to their midtrained levels. If these rates are favourable, then alignment pretraining acts as a healthy initialisation of a model as it enters post-training, embedding a high prior over aligned actions and personas.
This represents a shift in how we might think about alignment. Rather than starting with a potentially misaligned model and attempting to search for aligned policies during post-training, we ensure that the underlying model itself is as close to a basin of alignment as possible, so that even after tampering it gravitates around the attractor curated during pretraining.
Our results demonstrate that inserting synthetic documents describing aligned AI behaviour leads to greater alignment improvements than solely filtering out discussion of AI misalignment.
Notably, our positive alignment documents often describe scenarios in which the AI considers a misaligned action, discusses why it might be instrumentally appealing, and ultimately chooses the aligned option. In contrast, our filtered-only model has minimal exposure to AI behaviour in high-stakes settings and is thus likely ignorant of the considerations surrounding the alignment problem.
This parallels work on toxicity in pretraining, which finds that recontextualising toxic content through synthetic data augmentation outperforms naive filtering. Models that encounter examples of AIs reasoning through high-stakes decisions and choosing aligned actions may develop more robust behavioural priors than models that simply never encounter such scenarios.
Our work establishes alignment pretraining -- the study of how data curation during pretraining shapes model dispositions -- as a tractable and complementary research direction to post-training safety techniques. We recommend that model providers consider pretraining for alignment, not just capabilities.
We're excited about several directions for future work:
We would greatly appreciate feedback from the community on this work. Are there evaluation scenarios we've missed? Are there any more papers you recommend we cite? Alternative hypotheses for our results? Suggestions for follow-up experiments?
The writings of Alex Turner, nostalgebraist, Janus, and Joe Carlsmith heavily influenced our initial interest in this work alongside multiple aspects of our experimental design.
The barrier to entry for LLM pretraining research has been dramatically reduced by invaluable open-source contributions from EleutherAI, Zyphra, AI2, and Hugging Face. We focused on alignment-specific interventions without needing to implement the distributed model training stack ourselves.
We want to thank the team at Hyperstition for the curation of hundreds of thousands of alignment stories used for portions of our positive midtraining experiments.
More broadly, this work was made possible by helpful discussions, feedback, and support from Anna Ceesay, Kathryn Tice, Lydia O'Brien, Adam Gleave, James Zhang, Hannes Whittingham, Edward James Young, Alex Cloud, Gonçalo Paulo, Sid Black, Kyle Lo, Luca Soldaini, Nathan Lambert, Lucia Quirke, and additional compute support from Lindley Lentati and Cambridge Inference.
Authors: Cameron Tice*, Puria Radmard*, Samuel Ratnam, Andy Kim, David Africa, Kyle O'Brien*
This work was conducted at the Meridian office by Geodesic Research.
[1] By benign fine-tuning or benign tampering, we mean training on non-HHH training sets. In this work, we train on general MCQA and Python coding questions.