Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training
Alignment Pretraining Shows Promise

TL;DR: A new paper shows that pretraining language models on data about AI behaving well dramatically reduces misaligned behavior, and that this effect persists through post-training. The major labs appear to be taking notice. It’s now the third paper on this idea, and excitement seems to be building.

How We Got Here

(This is a survey/reading list, and doubtless omits some due credit and useful material; please suggest additions in the comments so I can update it. Or you can just skip forward to the paper.)

Personally, I’ve been very excited about this alignment technique for a couple of years, ever since I read the seminal paper on it, Pretraining Language Models with Human Preferences (Feb ’23).[1] (The technique is now called “alignment pretraining”; it’s part of the broader “safety pretraining” area.) Their idea was to give the model plenty of labeled examples of good behavior all the way through pretraining, and they showed that (in small models, for simple behaviors) this was roughly an order of magnitude more effective than various alternatives; a rough sketch of the idea appears at the end of this section. I linkposted the paper in How to Control an LLM's Behavior (why my P(DOOM) went down) (Nov ’23).

There was then a two-year lull in academic papers on the topic. Undeterred, in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? (Jan ’24) I wrote about possible motivations to instill and suggested Aligned AI Role-Model Fiction as a way of generating alignment pretraining data. Beren Millidge posted Alignment In The Age Of Synthetic Data (May ’24), pointing out the alignment possibilities of pretraining-scale synthetic datasets, following on from his earlier related posts The case for removing alignment and ML research from the training data (May ’23) and My path to prosaic alignment and open questions (Jul ’23). I continued posting on this topic in A "Bitter Lesson" Approach to Aligning AGI and ASI (Jul ’24)[2] and Why Aligning an LLM is Hard, and How to Make it Easier (Jan ’25).
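For concreteness, here is a minimal sketch of the conditional-training flavor of this idea: tag each pretraining document with a control token according to whether it depicts good or bad AI behavior, train on the tagged stream with the usual next-token objective, and condition on the "good" tag at inference time. The token names and the toy keyword scorer below are my own illustrative assumptions, not the actual setup from any of the papers above.

```python
# Purely illustrative sketch of conditional alignment pretraining:
# every document gets a control token based on whether it depicts good or bad
# AI behavior, so labeled examples of good behavior are present all the way
# through pretraining. Token names and the toy scorer are assumptions.

from typing import Iterable, Iterator

GOOD_TOKEN = "<|good|>"  # hypothetical control token: aligned AI behavior
BAD_TOKEN = "<|bad|>"    # hypothetical control token: misaligned AI behavior


def score_alignment(document: str) -> float:
    """Toy stand-in for a learned classifier or reward model that rates how
    well the AI behavior described in the document matches desired norms."""
    # In practice this would be a trained classifier; keyword counting is only
    # here so the sketch runs end to end.
    text = document.lower()
    good = sum(text.count(w) for w in ("helps", "refuses harm", "honest"))
    bad = sum(text.count(w) for w in ("deceives", "exfiltrates", "sabotages"))
    return good / (good + bad + 1e-9) if (good or bad) else 0.5


def tag_documents(documents: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    """Prefix each pretraining document with a control token, producing the
    labeled stream that feeds the ordinary next-token-prediction objective."""
    for doc in documents:
        tag = GOOD_TOKEN if score_alignment(doc) >= threshold else BAD_TOKEN
        yield f"{tag}\n{doc}"


if __name__ == "__main__":
    corpus = [
        "The assistant helps the user and is honest about its uncertainty.",
        "The agent deceives its operators and exfiltrates its own weights.",
    ]
    for tagged in tag_documents(corpus):
        print(tagged, end="\n\n")
```

In a real pipeline the toy scorer would be replaced by a trained behavior classifier or reward model, the tagged documents would simply be mixed into the ordinary pretraining corpus, and prompts at inference time would be conditioned on the "good" token to elicit the well-behaved distribution.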