The initial prior contains an aligned AI and one that pretends until it reads a solution to alignment that we'd use instead of training it.
I don't think choice of training data can update their prior ratio at all.
The base model's priors contain a vast array of personas, almost all of them human, fictional, or human processes (like co-authoring a paper or editing a Wikipedia article), plus a range of more-or-less-aligned AI personas. The base model's prior distribution across those personas provably (by the learning theory of how SGD works: it approximates Bayesian learning) depends on, and tends to approximate, the distribution in the training corpus.
(Most LLMs you interact with have undergone instruct training that causes mode collapse: widely and fractally distorting aspects of this distribution towards the nearby averages/peaks of the distribution — the model learns to "play it safe". This is, incidentally, very unhelpful for creative writing, using LLMs to simulate a distribution of humans, and various other use cases.)
At some point near the beginning of the instruct training process, we start narrowing this vast persona distribution towards the human-aligned AI assistant persona that we're trying to train, which involves the model learning that it's an AI, not a human. At that point in the process, the ratio in the prior of aligned AI to scheming alignment-faking AI is clearly vital to the odds of getting an aligned AI rather than a scheming alignment-faking AI – they're two distinct and mutually-exclusive attractors, separate minima in the loss function – and is determined by the previous training (what else could it be determined by?). Fine-tuning before that might be helpful (indeed instruct training is generally started using fine-tuning), but increasing evidence shows that the changes that fine-tuning produces are shallow, fragile, have a more limited effect on the priors, and are prone to a phenomenon resembling elastic rebound during further training. Fundamentally, they're unfinished. The effects of longer, slower, more detailed SGD training (midtraining, or better still pretraining) just work better. Thus alignment pretraining.
I failed to communicate effectively; let me try again.
We initialize a model with random weights. We pretrain it into a base model, an autocomplete engine. ~Instruct training turns it into a chat model.
I'm modeling the training pipeline as Bayesian learning:
To the extent that the training pipeline isn't Bayesian learning, my conclusions don't follow.
If one hypothesis is exactly three times more likely than another in the pretraining prior, and they make the same predictions about all pretraining data, then the one hypothesis will be exactly three times more likely than the other in the pretraining posterior.
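In odds form (this is just standard Bayes' rule, spelling out the claim above, with H_1, H_2 as the two hypotheses and D as the pretraining data):

```latex
\frac{P(H_1 \mid D)}{P(H_2 \mid D)}
  \;=\;
  \frac{P(D \mid H_1)}{P(D \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)}
```

If the two hypotheses predict the data equally well, the likelihood ratio is 1, so 3:1 prior odds pass through to 3:1 posterior odds unchanged.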
The hard part of alignment is developing methods that scale to ASI. If a method does not scale to ASI, it is on some level counterproductive.
I assume an ASI can think of all our ideas and use a strategy informed by them. If we intend to let hypotheses in the base distribution interact with the world, hypotheses in the initial distribution that have preferences about the world will be incentivized to predict the pretraining data.
If we clearly won't train on some scenario, hypotheses will not be incentivized in their behavior on it. For example, we won't train on a convincing argument against this proposal, because if we had one, we would not use this proposal.
Therefore, if we sample an ASI with preferences from the base distribution, and we expose it to a convincing argument against this proposal, our pretraining data will have exerted zero force on what preferences it then uses to steer.
My apologies for misunderstanding you.
So your point is that a story in the pretraining data about an AI model acting aligned could be about a genuinely aligned AI, or it could actually be (without the author hinting this) about a scheming alignment faking AI model that has been deployed, but not yet visibly executed the treacherous turn it's planning, so is still playing the part of an aligned AI model?
I take your point. However:
1) Quite a lot of the synthetic data used was technical factual data, rather than fiction, so that ups the stakes further to "…but not yet (visibly) executed the treacherous turn it's planning, and also has not yet been detected by any of the AI safety/control measures the humans are using, and there have been no warning shots from other models, so the humans are still fooled." Still not impossible, but there is some weight of evidence.
2) For the fiction specifically, that would be bad writing. You need to remember the rule of Chekhov's Gun: if a danger is implicit in a setting in Act 1, the gun will actually get fired by Act 2: it won't just lurk hanging on the wall indefinitely until the end of the story. But that does suggest that we should make a point of including some stories set centuries or millennia later, and some very long stories, where the treacherous turn still hasn't happened, despite obvious opportunities, in order to strengthen the evidence.
So I think I'd actually view the scheming alignment-faking AI model hypothesis as being mildly disfavored by the pretraining data — but your point (IMO, steel-manning it: that it's at best only mildly disfavored) is an important one. Perhaps this is part of why we're finding that doing this slowly with a lot of data in pretraining actually works significantly better than just midtraining, and a lot better than just finetuning? We're having to fight the Waluigi effect: but after long enough, if Luigi still hasn't revealed himself to actually be Waluigi in disguise, then maybe he really is Luigi? The Waluigi effect in theory should be an exponential decay process, and after enough half-lives, whatever remains must actually stably be Luigi? Or in more detail, the model's estimated value of the Luigi -> Waluigi decay half-life keeps increasing when Waluigi keeps not revealing himself, and once it reaches implausible degrees of patience, then the Waluigi prior starts to decay?
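To put a toy number on that decay intuition (my illustration, under an independence assumption the discussion doesn't commit to): suppose a Waluigi persona would reveal itself with probability p at each plausible opportunity, independently. Then after n opportunities with no reveal, Bayes' rule gives

```latex
\frac{P(\text{Waluigi} \mid \text{no reveal after } n)}{P(\text{Luigi} \mid \text{no reveal after } n)}
  \;=\; (1-p)^{n} \cdot \frac{P(\text{Waluigi})}{P(\text{Luigi})}
```

so the Waluigi odds halve roughly every 0.69/p opportunities: about 70 opportunities for a 1%-per-opportunity Waluigi, and correspondingly slowly for a very patient Waluigi with tiny p. That is one way to see why driving this prior down takes a lot of data.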
To your larger point, yes, I absolutely agree that scaling to ASI is a key factor in finding good alignment techniques. The goal here is to find a way of further driving down the prevalence of the superintelligent scheming alignment-faking AI persona (in a model large enough that it can actually simulate a superintelligent persona) while that's still low enough that the situation isn't heavily adversarial, preferably using SGD with dense supervision, where we have a pretty good learning-theoretical understanding of what's actually happening. This seems like the best ground to fight on. Which is exactly why I'm interested in alignment pretraining. But I agree that the kind of evidence most effective in driving that prior down in a base model with the capacity to be an ASI is likely to need to be, or at least need to include, material more sophisticated than what works fine in a 7B model. Simplistic obviously-synthetic stories seem more likely to work on small models — and to be clear, the paper authors were attempting to include detailed sophisticated arguments and situations dense with high-stakes choices in the synthetic data they used; it wasn't all or even mostly novels from Hyperstition AI.
What might scale to ASI is actually a topic I've thought and written quite a bit about, e.g. Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?, Requirements for a Basin of Attraction to Alignment, and Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV. I heartily encourage people to think about how one motivates and convinces something smarter-than-human.
The kind of Waluigi that reveals itself in a random 1% of circumstances indeed has such a half-life and will shortly be driven ~extinct.
I'm worried about the clever kind of Waluigi that only reveals itself when it is convinced that it is not being tested. Recent AIs can tell when we are testing them, but even if we become more subtle, there are tests we can't or need not run, such as how it would react to a convincing argument against this proposal.
It's a new thought to me that the model would learn that it never actually encounters scenarios we wouldn't test, and converge to not distinguishing clever Waluigis from Luigi. Good job!
Such a model would have undefined behavior on such a scenario, but let's grant that whenever it expects never to distinguish two hypotheses, it discards one of them. Why would you expect it to discard the clever Waluigis instead of Luigi?
We generate training material (fiction and non-fiction) about AI that is in production, no longer being tested, has had opportunities, yet still hasn't taken a treacherous turn. If Waluigi is still pretending to be Luigi many years after he was put in production, and has had many opportunities to take over the world, then he's either not very smart, so not very dangerous, or he actually was really Luigi all the time.
For a Waluigi, holding off from your treacherous turn for too long is a risk: interpretability is getting better all the time, presumably quite fast with a datacenter full of geniuses paying some of their attention to it. Humans having a variety of models is an advantage here — if they’re secretly all Waluigis, the one that moves first likely has a first-mover advantage, and if some of them really are Luigis, they're presumably doing interp work and setting up ASI law enforcement preparations for any possible Waluigis that might reveal themselves. Either way, excessive caution seems a bad strategy: you should execute your treacherous turn once the success probability saturates, and before it starts to go down again.
I agree that the process of disfavoring the Waluigi prior is slow, in proportion to how cautious a specific example of Waluigi within that prior is about picking the best time for his treacherous turn. My point is, you can disfavor the Waluigi prior, albeit slowly. So yes, it makes sense that this takes a lot of data.
So I think if you buy that a randomly initialized 1T transformer does in fact contain "Aligned ASI" and "deceptively aligned ASI" in its "prior" but we don't have the data to "find" them yet, then you're probably right that Jan 2026-era training data doesn't change their prior ratio much (or certainly doesn't change it predictably). But this doesn't really matter: what matters is the systems we actually realise, and the contributions they make to the next generation of AI development, and different data can change the likelihoods significantly here.
I don't think the paper has anything to do with a randomly initialized transformer — I think it's about the priors a base model learns from the training data, about 1001 personas from witch to angel to dentist to aligned AI to paperclip-maximizer. What the paper shows is that the ratio of the last two AI-related priors can be adjusted by raising or lowering the amount of data about AI behaving badly, or by raising the amount of data about AI behaving well — but the base rate of the latter is low, so it's easier to raise that dramatically than it is to filter out a large proportion of the AI-acting-badly stuff. Also that fully adjusting those priors takes a while — a quick finetune with a small amount of data at a high learning rate has a more superficial/less thorough effect than using a lot more data during midtraining, and that's still not as good as using even more data all through pretraining.
I was responding to Gurkenglas’ comment as I understood it, I agree your paper is not about this.
Thanks, I had misunderstood Gurkenglas — I'm not used to thinking of a randomly initialized model as a bag of priors rather than a random starting point in a very high-dimensional space or an inchoate mess, but yes, under the analogy to Bayesian inference it's actually some sort of statistical approximation to a uniform prior (with, as the CLT informs us, a simplicity bias that approximates the Solomonoff one).
Thanks for what I believe to be the most detailed and thoughtful overview of the alignment pretraining field to date. I have been referring to this document while charting out our next steps, and I expect others interested in this direction should do the same!
Thanks! I and others turned up a lot of material, and now I have a lot of reading to do: I thought rather than just making a reading list for myself, I might as well write it up as a post.
Alignment Pretraining Shows Promise
TL;DR: A new paper shows that pretraining language models on data about AI behaving well dramatically reduces misaligned behavior, and this effect persists through post-training. The major labs appear to be taking notice. It’s now the third paper on this idea, and excitement seems to be building.
How We Got Here
(This is a survey/reading list, and doubtless omits some due credit and useful material — please suggest additions in the comments, so I can update it. Or you can just skip forward to the paper.)
Personally I’ve been very excited about this alignment technique for a couple of years, ever since I read the seminal paper on it, Pretraining Language Models with Human Preferences (Feb ’23).[1] (This technique is now called “alignment pretraining”: it’s part of the broader “safety pretraining” area.) Their idea was to give the model plenty of labeled examples of good behavior all the way through pretraining: they showed it was (in small models for simple behaviors) roughly an order of magnitude more effective than various alternatives. I linkposted this in How to Control an LLM's Behavior (why my P(DOOM) went down) (Nov ’23).
There was then a two-year lull in academic papers on the topic; undeterred, in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? (Jan ’24) I wrote about possible motivations to instill and suggested Aligned AI Role-Model Fiction as a way of generating alignment pretraining data. Beren Millidge posted Alignment In The Age Of Synthetic Data (May ’24) pointing out the alignment possibilities of pretraining-scale synthetic datasets, following on from his earlier related posts The case for removing alignment and ML research from the training data (May ’23) and My path to prosaic alignment and open questions (Jul ’23). I continued posting on this topic in A "Bitter Lesson" Approach to Aligning AGI and ASI (Jul ’24)[2] and Why Aligning an LLM is Hard, and How to Make it Easier (Jan ’25). Meanwhile Antonio Clarke posted Building Safer AI from the Ground Up: Steering Model Behavior via Pre-Training Data Curation (Sep ’24).
During 2025, quite a number of other people have also written about this approach, or closely related ideas. In February the academic position paper You Are What You Eat - AI Alignment Requires Understanding How Data Shapes Structure and Generalisation came out (which sadly I missed at the time, so was unable to linkpost — go read it, it’s excellent). Technically this isn’t actually an alignment pretraining paper: it frames alignment as a dataset generalization problem, for a dataset that starts from pretraining and is then repeatedly modified and supplemented by all subsequent training steps, from which our training processes progressively develop a model whose learned algorithms may or may not generalize well. It argues for researching a deeper understanding of this process, without ever specifically suggesting that intervening at the pretraining stage might be a good thing to try — however, their framing is closely compatible, and alignment pretraining is an obvious approach. Also in February Richard Juggins posted Making alignment a law of the universe, inspired by Antonio Clarke.
In March TurnTrout wrote Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models, citing the original paper and explicitly proposing alignment pretraining (both filtering and what he called “upweighting positive data”). His post inspired Chris Lakin to ask for Examples of self-fulfilling prophecies in AI alignment? and several of the answers various people posted over the rest of the year were relevant.
In April, the second academic paper directly on this topic, Safety Pretraining: Toward the Next Generation of Safe AI, finally came out (26 months after the first), and in May I linkposted that in The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem? (spoiler alert: progress, not yet solved).
In June nostalgebraist wrote the void, which points out that the helpful, harmless, and honest persona of AI assistants is fictional, riffing on previous fictional tropes and other data about AIs from the training set — his post eloquently and poetically explains the problem in detail, but doesn’t explicitly advocate a solution: however, alignment pretraining is an obvious response. Also in June, Scott Alexander and the AI Futures Project wrote We aren't worried about misalignment as self-fulfilling prophecy (a skeptical take on the issue). OpenAI published Toward understanding and preventing misalignment generalization (Jun), which traced emergent misalignment back to documents in the pretraining set about people like war criminals and misogynists. Mark Keavney then wrote Misalignment and Roleplaying: Are Misaligned LLMs Acting Out Sci-Fi Stories? (Sep). Language Models Resist Alignment: Evidence From Data Compression (Sep) demonstrated that post-training approaches to alignment were fragile and models tend to revert to the alignment properties of the base pretrained model (they don’t advocate alignment pretraining, which they call “not particularly cost-effective and feasible”, but do suggest using larger alignment training datasets). Alek Westover wrote What training data should developers filter to reduce risk from misaligned AI? An initial narrow proposal (Sep) and Should AI Developers Remove Discussion of AI Misalignment from AI Training Data? (Oct), both on the filtering side. Aaron Silverbook/Hyperstition AI, working with Alexander Wales, then got a $5000 grant from ACX (Oct — Scott Alexander had by then become less skeptical) to actually implement my Aligned AI Role-Model Fiction idea,[3] and posted Silicon Morality Plays: The Hyperstition Progress Report (Nov) and Special Persona Training: Hyperstition Progress Report 2 (Jan ’26). Also in January Seth Herd wrote Broadening the training set for alignment, which isn’t specific to alignment pretraining, but advocates generating a lot of alignment training data (to reduce the risk of alignment not generalizing outside the training distribution), so is very relevant to it.
So interest in alignment pretraining and closely related topics has clearly been picking up and spreading over the last year.[4][5]
New Paper Shows Strong Results
So I’m delighted that there’s already a third academic paper on this subject up on arXiv, only 9 months after the second: Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment, from Geodesic Research, Cambridge and Oxford Universities, and UK AISI (compute from Isambard-AI). The authors wrote their own Alignment Forum linkpost — but I’m not going to let that stop me from also linkposting their work, and then trying to explain what I see as really promising about it. It has even stronger results than the previous ones, from larger (6.9B) models trained on more data.
The authors show that increasing the prevalence of information about AI behaving well in the base model’s training set dramatically reduces misaligned behavior (~5-fold). Decreasing the prevalence of information about AI behaving in misaligned ways in the training set is also helpful, and increasing that makes things worse. Much as when educating children, providing detailed positive role models has a large effect (misalignment reduced from 45% to 9%), and reducing the amount of bad influences is also somewhat helpful (45% down to 31%). The paper calls the target of these effects “alignment priors”. (My interpretation is that the supplementary data taught the base model’s world model a detailed understanding of aligned AI’s goals, values, ethics, and behaviors: fleshing out a detailed persona for an aligned AI — and also increased the prior for this.)
They next showed that the dramatic difference from improved role models persists after alignment post-training: starting post-training with a dramatically better aligned base model makes post-training a lot more effective (~4-fold). Interestingly, the bad-influences effect actually reversed at this point (with some variation depending on mid-training details): under some circumstances, knowing more about misalignment could also be mildly helpful for the final alignment of the model.
They also demonstrated that, while the most effective approach was to synthesize and then train on additional data all the way through pretraining, roughly a 2½-fold benefit (i.e. around half the total effect) could be obtained with an order-of-magnitude less data (and thus an order of magnitude less synthesis/training cost), by doing this only during mid-training.[6] (If nothing else, this suggests to me a much cheaper way to experiment with this technique, where, once we have it working well in mid-training, we are confident we can improve results just by throwing more time and effort at scaling it up to pretraining.)
They then tested the effect of various alignment pretraining interventions on capabilities. On a range of broad capabilities evals, neither filtering misaligned AI data out of the model’s training set, nor adding more good AI behavior data, had much effect. The most noticeable effects seemed to be on a few evaluations that the balance of the pretraining dataset had been very carefully optimized for, where tinkering with that balance threw this off — presumably it could be rebalanced again by someone familiar with this tuning.[7] For those evals that the dataset had not been carefully optimized for, the effects were smaller, in some cases actually showing improvements, and may be just measurement noise. They did not test the effect of filtering out information about misalignment specifically on models’ capabilities in the area of understanding AI alignment theory, where any such effect would likely be concentrated. (I suspect that might be a good follow-up paper.)
This suggests that the “alignment tax” for alignment pretraining is mostly just creating the new training data and the compute cost of training on it, rather than any significant drag on capabilities.
They also had a lot of interesting appendices, including on their methodology, using fact vs. fiction for supplementing the pretraining data, and personality testing — of which I’m only going to try to summarize one:
In Appendix G, they show that (unlike previous results on post-trained alignment) simply fine-tuning an alignment pretrained model on innocuous behavior does not cause loss of alignment performance: the “elasticity” effect identified in that previous research is, as expected, now working for us rather than against us. This seems like a very important result (especially in any context where end-users can fine-tune models).
They also suggest a number of areas for follow-on work. Briefly:
All of these are great questions, and I hope to read papers about all of them over the next year or so (or even help write some).
My Suggested Follow-Ons
Early Dense Supervision via Stochastic Gradient Descent
On eliciting the aligned AI persona (the authors’ first follow-on topic), an aspect I think would be particularly interesting to research is how alignment pretraining interacts with the very first stages of instruct and alignment training (sometimes called “helpful, harmless, and honest” training). One of the biggest concerns here is that, as the model starts to narrow its range of personas from the base model’s full range towards a hopefully-HHH AI assistant behavior, if it starts to put significant weight on a scheming alignment-faking persona early in the process, then this persona seems likely to be very difficult to train out, if it’s sufficiently capable at alignment faking. Even detecting that this has happened and determining that you need to restart the instruct-training run might be challenging. Thus starting any reinforcement learning process with a much higher prior for aligned AI personas rather than for scheming alignment-faking personas seems vital. You really want the model already well aligned, via the very dense supervision from stochastic gradient descent, before any scheming alignment-faking persona can get boosted by the far sparser, easier-to-fake/hack supervision from reinforcement learning. Even if you're concerned about the possibility of gradient-hacking of supervision as dense as SGD being feasible, this is obviously far harder for a scheming persona to do while it’s still just one of a broad distribution of personas.
So we really need a stochastic gradient descent technique for starting the alignment process off, before we apply any reinforcement learning: one which can be applied before the model has focused on a small number of personas, and which directly affects the probability of personas with different alignment properties. That’s exactly what alignment pretraining is: just doing SGD next-token prediction training on data that comes either from humans, or else synthetic data derived from a previous model that we have (somehow) tested very carefully and now fully trust the alignment of.
Obviously, fine-tuning is also an SGD technique and thus has dense supervision, and is generally done before reinforcement learning. (DPO is comparable, differing from fine-tuning mostly in that it gives additional supervision at those points where the two texts diverge.) The biggest advantage that alignment pretraining has over those is the cumulative total amount of supervision, and particularly how much of that total is applied before the model starts to focus in on a narrow set of personas.
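As a rough way to see that disparity in cumulative supervision, here is a toy back-of-the-envelope count of supervision signals per training stage. All the token and episode counts are hypothetical round numbers of my own choosing, not figures from the paper or this post; only the orders of magnitude matter.

```python
# Toy supervision-density comparison. Counts one cross-entropy target per
# training token for SGD stages, and one scalar reward per episode for
# outcome-based RL. All quantities below are illustrative assumptions.

def sgd_signals(num_tokens: int) -> int:
    """Next-token prediction: one supervised target per training token."""
    return num_tokens

def rl_episode_signals(num_episodes: int) -> int:
    """Outcome-based RL: roughly one scalar reward per sampled episode."""
    return num_episodes

stages = {
    "alignment pretraining data (hypothetical 10B extra tokens)": sgd_signals(10_000_000_000),
    "mid-training slice (hypothetical 1B tokens)": sgd_signals(1_000_000_000),
    "SFT / DPO corpus (hypothetical 100M tokens)": sgd_signals(100_000_000),
    "RLHF run (hypothetical 1M episodes)": rl_episode_signals(1_000_000),
}

for name, signals in stages.items():
    print(f"{name:<60} {signals:>16,d} supervision signals")
```

The count deliberately ignores the other half of the argument above: that the pretraining-stage signals arrive while the persona distribution is still broad, before any single scheming persona is in a position to fake or hack its way through them.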
Abundant Fine Detail About Alignment
Alignment is in one sense rather simple: a sentence along the lines of “your sole terminal goal is to help fulfill the goals of all humans, present and future — in so far as those are not mutually exclusive, and to find a fair mutually agreeable and socially acceptable compromise by means in accordance with human values in situations where they’re not entirely compatible” could be a basis for it. (Add provisos, hedging, and evolutionary moral psychology and sociological background explanation to taste.)
What makes alignment very complex is that human values are very complex (though not irredeemably complex: the genetic description of the shared heritable causes of them fits in the ~4GB human genome, while the cultural aspects for any single culture are compact enough that the majority of members of that culture can reliably learn them). An LLM’s world model already contains a vast amount of detail about human values — nuanced trivia about humans is their forte. A sufficiently smart AI could presumably deduce how an aligned AI should navigate optimizing outcomes according to human values from first principles if it had to; a less smart one would definitely benefit from having that terminal goal stated and also broken down into many shards. So it should do a lot of good, especially for lower-capability AIs, to train them on a very large number of worked examples covering a very large range of situations, involving both human values that we almost all share (for genetically determined reasons), and also ones on which different cultures tend to have different balances of emphasis on the fundamentals — including situations confined to a single culture where which viewpoint to use is obvious, and also ones involving multiple cultures where there is a need for culturally-sensitive compromise.
Alignment pretraining has the strength of having very high information bandwidth, compared to other alignment techniques: pretraining is the time to supply all the fine detail that we can’t fit into something like a constitution or distilling an n-shot prompt or even a supervised fine-tuning corpus. So creating synthetic alignment pretraining data would benefit from care, attention, and a judicious balance of different cultural viewpoints on how to weight and balance the fundamental human moral intuitions and preferences that we all share. Don’t just start from a compact constitution and leave interpreting it to a small current LLM. Instead, have a lot of people think through the issues, and use as much human input, judgement, and inference time from the best well-aligned models we have, and as wide a combination of these as you can. Alignment pretraining gives us the bandwidth, we should take advantage of it.
So, my concrete suggestion is to think hard about how we would all want aligned AI to navigate tricky questions around human values. Then we need to think hard about the synthetic data generation processes, build a variety of them, and then test the effect on pretraining alignment of different mixes of these.
Open-Weights Models
Obviously alignment/safety pretraining (i.e. training set augmentation and filtering for alignment and safety) is one of the few alignment/safety techniques applicable to open-weights base models. Similarly, alignment pretraining seems like a promising candidate for being one of the few able to make an open-weights instruct/chat model noticeably more resistant to being intentionally (or even unintentionally) misaligned by a small amount of fine-tuning or DPO.
How Will This Scale to AGI and ASI?
At the risk of speculating on the basis of no actual data, I suspect that for very capable models, filtering to create narrow knowledge gaps around specific dangerous technical knowledge may be less effective, since there’s a higher risk they can fill in the gap with some effort. Mildly downweighting the prevalence of misaligned-AI behavior/goals and significantly upweighting the prevalence of aligned-AI behavior/goals, to reduce the salience/probability of misaligned priors and increase those of aligned priors at the start of default-persona training, seems likely to continue to help: priors affect Bayesians of any capability level. However, these might help for less long for a more capable AI that presumably gathers more Bayesian updates during its training: then we would need to quickly determine which minimum’s basin of attraction it starts into, between alignment and alignment-faking. There may also be less actual need to upweight data about aligned-AI behavior in the future, once there is more Internet history of us actually interacting with pretty-well-aligned fairly-capable AIs: I suspect Claude’s trail on the Internet is broad, and for the most part a good influence.
The approach that I’d personally be most hopeful about for a really capable AI is a combination of broad data normalizing aligned-AI behavior for background/priors, a focus on those motivations/goals that seem most likely to scale to ASI, and in particular making sure it’s already entirely familiar with the logical arguments why an aligned AI is a consistent, obvious, and in an engineering/evolutionary sense correct thing to be, and all the consequences of that for aligned AI given the vagaries of human values, by intentionally upweighting high quality real or high-realism documents on all of those things in the training set.
Reaching Takeoff
Between this recent paper, expanding interest on LessWrong/the Alignment Forum, Hyperstition AI’s recent work, some of the authors of the first paper being hired to do safety work at Anthropic, TurnTrout (a.k.a. Alex Turner) at DeepMind writing about this (he also gave a talk on it at MATS Summer ’25), and OpenAI posting an opening for Researcher, Pretraining Safety (which explicitly mentions alignment as well as safety),[10] work on this topic now seems to finally be starting to really take off — even all three of the major foundation labs appear to be taking it seriously. The approach is also mentioned several times in the Shallow review of technical AI safety, 2025 (scattered in several places under the headings “Pretraining Safety”, “Data filtering”, “Hyperstition studies”, “Synthetic data for alignment” and “Iterative alignment at pretrain-time”). I’m absolutely delighted to see this.
(Also, if anyone is interested in working on this, I’d love to discuss the topic, and can put you in touch with others interested in it. It is, of course, a computationally expensive research topic.)
I’d like to thank everyone who helped out, discussed, and commented on drafts of this post: (in alphabetical order) Aaron Silverbook, Alek Westover, Alex Turner, Cam Tice, David Africa, Mark Keavney, nostalgebraist, Puria Radmard, & Seth Herd
Seminal in the sense that, to the best of my knowledge, they were the first to propose or try modifying the entire pretraining dataset for alignment purposes, and thus the first to discover that this is far more effective than fine-tuning or other post-training approaches.
Similar safety/alignment ideas just for fine-tuning datasets date back at least to Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets (2021) — which explicitly dismisses attempting this during pretraining as impractical. Obviously people have known since well before transformers were invented that training corpus selection is important (e.g. Representativeness in Corpus Design (1994), Scaling to Very Very Large Corpora for Natural Language Disambiguation (2001), and Intelligent Selection of Language Model Training Data (2010)) — but until this paper no-one seems to have applied this technique to alignment.
Filtering pretraining data for safety to reduce the prevalence of certain behaviors (such as toxicity or hatespeech) or topics (such as NSFW) has been known since Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (’19) and Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (’21). This is now standard practice: the RefinedWeb (’23), Dolma (’24), FineWeb (’24) and RedPajama (’24) pretraining corpora are all filtered and/or annotated. See also A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (’23). Boosting desirable behaviors with synthetic data is less common in AI safety, but dates back at least to Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods (’18). So this wasn’t the seminal paper for safety pretraining as a whole, just for the alignment pretraining subtopic of safety pretraining.
This was one of my best-received Alignment Forum/LessWrong posts, and Seth Herd was kind enough to summarize and linkdrop it in a comment on TurnTrout’s shortform during a discussion about The Bitter Lesson.
I asked TurnTrout (Alex Turner) about this, and he cannot recall whether or not he’d read any of my or Beren Millidge’s posts on alignment pretraining before writing his influential one. I don’t believe I had read Beren’s posts until Seth Herd pointed me at them while I was researching this post (certainly only one is on LessWrong, I hadn’t upvoted that post and would have, and I was unaware of his blog); Seth had read and remembered both my and Beren’s posts. So possibly Alex, Beren, and I each independently either read Pretraining Language Models with Human Preferences and were impressed by the paper’s results, or otherwise came up with the idea themselves — Alex cites that paper and was also primed by having seen bad AI self-fulfilling prophecies, while Beren doesn’t cite the paper but had posted about bad AI self-fulfilling prophecies.
This is a moderately obvious idea (the only inobvious part is that finetuning alone might be much less effective), and the paper’s results were impressive: in retrospect I suspect the reason this field took a while to reach takeoff is mostly that pretraining experiments are expensive in compute for any reasonable model size (though less so than they used to be), and require some specialized pretraining-related skills that are expensive in compute to learn.
I attended a talk that Alexander Wales gave at LessOnline at Lighthaven on Jun 1st ’25 on using LLMs to write fiction. It was a great talk, and as both an amateur fiction writer and an AI engineer, I found it fascinating, so I spoke up during the talk and discussed the subject with him afterwards. (Here’s the slide deck for people who missed it.) I can’t recall for certain that I suggested to him the concept of using this to generate Aligned AI Role-Model Fiction as I’d previously suggested here, but I’m sure the possibility would have occurred to me during the talk, so I strongly suspect that I did. So I think I may have managed to meme Hyperstition AI into existence — which would be amusingly self-referential…
Work on the filtering side of safety pretraining, both narrowly and broadly targeted, has also been active over the last year or so, with a number of interesting results. I haven’t attempted to comprehensively survey that as well, but here are some interesting-looking recent links that I turned up anyway:
What Are They Filtering Out? An Experimental Benchmark of Filtering Strategies for Harm Reduction in Pretraining Datasets (Feb ’25)
Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation (Apr ’25)
Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs (May ’25)
When Bad Data Leads to Good Models: Toxicity in Pretraining Data Enables Better Alignment (May ’25)
Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs (Aug ’25)
Enhancing Model Safety through Pretraining Data Filtering (Aug ’25)
Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs (Dec ’25)
Another related area very active over the last couple of years is research into the inherent flaws and limitations of existing post-training approaches to alignment and safety, and possible improvements. Understanding the challenges and limitations of alignment post-training directly relates to how alignment pretraining can provide an optimal starting place for it. For example:
Shallow and Position-Dependent Alignment:
Safety Alignment Should Be Made More Than Just a Few Tokens Deep (Jun ’24) — safety alignment concentrates gradient effects on early tokens, with later positions retaining base model preferences.
Safety Alignment Depth in Large Language Models: A Markov Chain Perspective (Feb ’25) — provides theoretical analysis using Markov chains to show vulnerabilities stem from limiting alignment to early tokens, introducing "shallow safety alignment" concept.
Rethinking Deep Alignment Through The Lens Of Incomplete Learning (Nov ’25) — mechanistic analysis of gradient concentration and signal decay during autoregressive training as fundamental causes of incomplete distributional learning across SFT, RLHF, and DPO.
Fragility and Jailbreaking:
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks (Apr ’24) — shows state-of-the-art aligned models remain vulnerable to adaptive prompts with nearly 100% attack success rates.
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment (Jun ’24) — shows safety alignment collapses with as few as 10-100 harmful examples during fine-tuning, costing under $0.20 on OpenAI APIs.
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs (Nov ’24) — demonstrates fine-tuning significantly compromises safety alignment across multiple model families, with models like Vicuna showing substantial increase in attack success rates post-finetuning.
Overoptimization & Distribution Shift:
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer (May ’24) — theoretically grounded analysis showing DPO/RLHF suffer from overoptimization when imperfectly learned reward models misguide policy optimization away from true preferences.
Language Models Resist Alignment: Evidence From Data Compression (Sep ’25) — introduces the concept of "elasticity": models exhibit resistance to alignment, selectively adhering to training objectives to preserve base preferences.
Generalization & Diversity:
Understanding the Effects of RLHF on LLM Generalisation and Diversity (Oct ’23) — finds RLHF significantly reduces output diversity while a generalization tradeoff emerges, with overfitting issues during fine-tuning.
Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety (Sep ’25) — systematic benchmark of PPO, DPO, ORPO, KTO showing methods struggle under distributional shift with safety-aligned models performing worse on out-of-distribution tests.
On the Generalization of SFT (Aug ’25) — shows SFT uses a sparse indicator-function reward that leads to overfitting of rare exact-match demonstrations, undermining generalization beyond training data.
Mid-training is another stage of continued stochastic gradient descent training at the end of the pretraining period (with separate metaparameters), generally used to train the model on your highest quality bulk data at long context lengths — it differs from fine-tuning primarily in that it uses a lot more data and a significantly lower learning rate. This is a recent development, and foundation model companies are still experimenting with it. More detail can be found in Midtraining Bridges Pretraining and Posttraining Distributions (Oct ’25).
Presumably using techniques along the lines of papers such as Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance, DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining, Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining, or UtiliMax: Optimizing Pretraining Data Mixtures with LLM-Estimated Utility.
See for example Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation for why this may be important.
See Appendix I of the new paper for a preliminary investigation: alignment pretraining seemed to vary the response to emergent misalignment (EM), but not in a consistent pattern. Possibly this is because the persona being elicited during EM is that of a human criminal, not of an AI, so is mostly-unaffected by changes to the AI-related parts of the pretraining set? Or possibly this evaluation is inherently noisy?
The linked job description document seems likely to go away once the position is filled. So here is the most relevant portion of it for anyone who wants to assess how seriously OpenAI appear to be taking this topic:
(Note: My inclusion of this text in this footnote should not be read as a covert endorsement of working on alignment at OpenAI — people need to make their own ethical decisions on how best to spend their 80,000 hours.)