Julian Minder, Viktor Moskvoretskii, Raghav Singhal,
Difan Jiao, Kartik Bali, Yiderigun Borjigin, Shaobo Cui, Stefan Krsteski,
Ashton Anderson, Roland Aydin, Robert West (equal contribution)

These are early results, but we wanted to share them with the community now. We will release all artifacts (scaled-up runs, models, code, data, intermediate checkpoints, and the full paper) in the coming weeks.

TL;DR

Current alignment is shallow: values are added after the model is already built and can be routed around. We need pretraining safety interventions.
We propose Synthetic Persona Pretraining (SPP): append value-laden reflections to pretraining documents (10% annotated) to install the desired persona during pretraining rather than hope that it will emerge organically. SPP is very simple and purely a pretraining data intervention. Our results demonstrate that SPP models are consistently safer and more aligned than a range of baselines.
We show persona binding: the model generalizes from pretraining-installed values even when those values are held out of post-training. Not every dangerous situation can be covered in post-training, so models must generalize beyond specific cases. Our results show that consistently pairing problematic pretraining texts with moral input enables the post-trained model to handle safety scenarios not seen during post-training.
These are preliminary results at 1.7B parameters / 100B tokens; scaling runs to 3B parameters and 500B tokens are in progress.
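
To make the data intervention concrete, here is a minimal sketch of what "append value-laden reflections to pretraining documents (10% annotated)" could look like as a preprocessing step. It is an illustration under stated assumptions, not our actual pipeline: the names `annotate_corpus` and `placeholder_reflection` are hypothetical, the 10% is read as a per-document sampling rate, the blank-line separator is arbitrary, and the reflection text is a fixed placeholder standing in for whatever actually writes the reflections.

```python
import random
from typing import Callable, Iterable, Iterator


def annotate_corpus(
    documents: Iterable[str],
    reflect: Callable[[str], str],
    annotate_fraction: float = 0.10,  # assumption: 10% applies per document
    seed: int = 0,
) -> Iterator[str]:
    """Yield each document, appending a reflection to a random subset.

    Documents that are not selected pass through unchanged, so the
    intervention only touches the pretraining data, not the training code.
    """
    rng = random.Random(seed)  # seeded so the annotated subset is reproducible
    for doc in documents:
        if rng.random() < annotate_fraction:
            # Assumption: the reflection is appended after a blank line.
            yield doc + "\n\n" + reflect(doc)
        else:
            yield doc


def placeholder_reflection(doc: str) -> str:
    """Hypothetical stand-in for the reflection writer (e.g. an LLM call)."""
    return "[Reflection] A note on how the text above relates to the desired values."


if __name__ == "__main__":
    corpus = [f"document {i}" for i in range(1000)]
    annotated = list(annotate_corpus(corpus, placeholder_reflection))
    changed = sum(a != d for a, d in zip(annotated, corpus))
    print(f"{changed} of {len(corpus)} documents received a reflection")
```

The reflection writer is passed in as a function so that the sampling logic stays independent of how reflections are produced.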

1. The problem: alignment is shallow

The standard language model training pipeline has distinct stages. First, pretrain a model on a large, noisy, and often toxic web corpus. Then bolt alignment on top via supervised fine-tuning (SFT...

AI Safety