LESSWRONG
Samuel Ratnam

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Dec 21, 2025

Cam, Puria Radmard, Kyle O’Brien, David Africa, Samuel Ratnam, andyk

TL;DR

LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist through post-training, providing alignment-in-depth. We recommend labs pretrain for alignment, just as they do for capabilities.

Website: alignmentpretraining.ai
Us: geodesicresearch.org | x.com/geodesresearch

Note: We are currently gathering feedback before submitting to ICML. Suggestions here or on our Google Doc (which contains a more detailed overview of our experiments) are welcome. We will release a revision on arXiv in the coming days. Anyone who leaves feedback will be added to the Acknowledgments section. Thank you!

Abstract

We pretrained a suite of 6.9B-parameter LLMs, varying only the...
