Julian Minder
Difan Jiao, Kartik Bali, Yiderigun Borjigin, Shaobo Cui, Stefan Krsteski,
Ashton Anderson, Roland Aydin, Robert West (
These are early results, but we wanted to share them with the community now. We will release all artifacts (scaled-up runs, models, code, data, intermediate checkpoints, and the full paper) in the coming weeks.

Figure 1: Mean and max attack success rate across five adversarial benchmarks. All models are 1.7B parameters pretrained on 100B tokens, post-trained with identical SFT (except for SafeLM). The Baseline is pretrained on unfiltered data; the Filtered Baseline is pretrained on the same data with harmful documents removed. Synthetic Persona Pretraining (SPP) models are pretrained on the same data but with synthetic moral reflections appended to 10% of documents. Injecting reflections from the start of pretraining (Token Zero) yields
Good question, and yes: in our setup the persona is built from scratch, so in principle it could be any persona, including a real one. In our view, the main advantage of the synthetic version is control: we can specify exactly what it values and how it should reason.
Using a real person seems possible in theory, but it raises several hard questions:
- Whose persona? Picking someone is already a very hard question. Who's the most aligned person in the world? Aligned according to whose values? Is it even ethical to bake one specific person's pers