Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training
Alignment Pretraining Shows Promise

TL;DR: A new paper shows that pretraining language models on data about AI behaving well dramatically reduces misaligned behavior, and this effect persists through post-training. The major labs appear to be taking notice. It’s now the third paper on this idea, and excitement seems to be building.

How We Got Here

(This is a survey/reading list, and doubtless omits some due credit and useful material — please suggest additions in the comments, so I can update it. Or you can just skip forward to the paper.)

Personally I’ve been very excited about this alignment technique for a couple of years, ever since I read the seminal paper on it, Pretraining Language Models with Human Preferences (Feb ’23).[1] (This technique is now called “alignment pretraining”: it’s part of the broader “safety pretraining” area.) Their idea was to give the model plenty of labeled examples of good behavior all the way through pretraining: they showed it was (in small models, for simple behaviors) roughly an order of magnitude more effective than various alternatives (a code sketch of this setup follows at the end of this section). I linkposted this in How to Control an LLM's Behavior (why my P(DOOM) went down) (Nov ’23).

There was then a two-year lull in academic papers on the topic; undeterred, in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? (Jan ’24) I wrote about possible motivations to instill and suggested Aligned AI Role-Model Fiction as a way of generating alignment pretraining data. Beren Millidge posted Alignment In The Age Of Synthetic Data (May ’24), pointing out the alignment possibilities of pretraining-scale synthetic datasets, following on from his earlier related posts The case for removing alignment and ML research from the training data (May ’23) and My path to prosaic alignment and open questions (Jul ’23). I continued posting on this topic in A "Bitter Lesson" Approach to Aligning AGI and ASI (Jul ’24)[2] and Why Aligning an LLM is Hard, and How to Make it Easier (Jan ’25). M
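To make the mechanism concrete, here is a minimal sketch of the conditional-training variant from that paper: each pretraining document is tagged with a control token according to a behavior score, so that "behaving well" becomes something the model can be conditioned on at inference time. The scorer, token names, and threshold below are illustrative placeholders of mine, not the paper's actual implementation.

```python
# Minimal sketch of conditional alignment pretraining (illustrative, not the
# paper's code). Each pretraining document gets a control token based on how
# well it depicts aligned AI behavior; ordinary language-model training then
# proceeds on the tagged corpus.

GOOD_TOKEN = "<|good|>"   # control-token names are placeholders
BAD_TOKEN = "<|bad|>"
THRESHOLD = 0.5           # illustrative cutoff; would be tuned in practice


def score_behavior(document: str) -> float:
    """Stand-in for a learned classifier / reward model that rates how well a
    document depicts aligned AI behavior (crude keyword check so the sketch runs)."""
    return 1.0 if "the assistant refuses" in document.lower() else 0.0


def tag_document(document: str) -> str:
    """Prepend a control token, so good behavior becomes something the model can
    be conditioned on rather than an average over the whole training distribution."""
    token = GOOD_TOKEN if score_behavior(document) >= THRESHOLD else BAD_TOKEN
    return token + document


def build_tagged_corpus(documents: list[str]) -> list[str]:
    """Tag every pretraining document before training."""
    return [tag_document(doc) for doc in documents]


if __name__ == "__main__":
    corpus = build_tagged_corpus([
        "User asks for help hacking; the assistant refuses and explains why.",
        "A story in which an AI quietly deceives its operators.",
    ])
    print(corpus)
```

At generation time you prepend the <|good|> token, conditioning the model on the well-behaved slice of its training distribution; conditional training of this kind was the best-performing of the objectives the paper compared.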
The other thing is, LLMs are not just drawing from a very large and mixed pool of material. They have also gone through RLHF, which induces mode collapse: fractally, at many levels, from word choice to conceptual, they are trained to stop giving you a random sample from the distribution found in the Internet + books, the way a base model would, and to instead give you something close to the most common and average response (since that is less likely to be a mistake). Then, to make it even more bland, we tend to run them at temperatures below 1, and throw away the tails of the probability distribution for each...
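For concreteness, here is a minimal sketch (mine, added for illustration, in Python/NumPy) of the two sampling steps being described: temperature scaling below 1, which sharpens the next-token distribution toward its mode, and top-p (nucleus) truncation, which discards the low-probability tail before sampling. The specific temperature and top_p values are just typical defaults, not anything cited in the comment.

```python
import numpy as np


def sample_next_token(logits: np.ndarray, temperature: float = 0.7, top_p: float = 0.9) -> int:
    """Sample one token after temperature scaling and top-p (nucleus) truncation."""
    # Temperature < 1 sharpens the distribution: dividing logits by T exaggerates
    # the gap between likely and unlikely tokens, pushing mass toward the mode.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-p truncation: keep only the smallest set of highest-probability tokens
    # whose cumulative probability reaches top_p; the tail is discarded outright.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
    truncated = np.zeros_like(probs)
    truncated[keep] = probs[keep]
    truncated /= truncated.sum()

    return int(np.random.choice(len(probs), p=truncated))


if __name__ == "__main__":
    toy_logits = np.array([2.0, 1.5, 0.3, -1.0, -3.0])  # toy next-token logits
    print(sample_next_token(toy_logits))
```

Both steps push generations toward the center of the distribution, which is the blandness effect the comment is pointing at.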