Model Spec Midtraining: Improving How Alignment Training Generalizes
by Chloe Li, Nevan Wichers, saraprice, Sam Marks, and Jonathan Kutasov
tl;dr We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec, teaching them how they should behave and why. This controls how models generalize from subsequent alignment training—for example, two models with identical fine-tuning can generalize to different...
May 571