TL;DR LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist through post-training, providing alignment-in-depth. We recommend labs pretrain for alignment, just as they do for capabilities. Website: alignmentpretraining.ai Us: geodesicresearch.org...
TL;DR We propose a new post-training method for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We expect this technique to improve monitorability by decreasing the amount of computation available within hidden layers for easy-to-predict tokens. We’re looking for collaborators to help continue this...
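The preview doesn't say how truncation is implemented, but one common realisation of "truncate forward passes early" is per-token early exit: stop applying layers once the intermediate prediction is already confident, leaving easy tokens with less hidden computation. The following is a toy sketch of that idea under those assumptions, not the authors' method; every name and number here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, V = 16, 8, 32                      # hidden dim, num layers, vocab size (toy values)
layers = [rng.normal(scale=0.1, size=(D, D)) for _ in range(L)]
unembed = rng.normal(size=(D, V))

def forward_with_early_exit(h, threshold=0.9):
    """Apply residual layers until the unembedded next-token
    distribution is confident, then skip the rest of the stack."""
    for depth, W in enumerate(layers, start=1):
        h = h + np.tanh(h @ W)           # toy residual block
        logits = h @ unembed
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        if probs.max() > threshold:      # confident -> truncate the pass here
            return depth, h
    return L, h                          # hard token: use the full stack

depth, _ = forward_with_early_exit(rng.normal(size=D))
assert 1 <= depth <= L
```

The monitorability intuition is that whatever computation would have happened in the skipped layers must instead be carried by explicit CoT tokens, where a monitor can read it.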
Background Deliberative alignment is a powerful post-training alignment technique that involves generating and training on re-contextualised supervised fine-tuning (SFT) datasets generated with a set of principles in context. The process takes three steps: 1. With the set of principles[1] (henceforth: the constitution) in a base model’s[2] context. The base model...
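The generation step described above, producing SFT data with the constitution in context but training without it, can be sketched as follows. This is a minimal illustration assuming a hypothetical `generate()` sampling call; the constitution text, function names, and dataset format are all placeholders, not taken from the post.

```python
CONSTITUTION = "1. Refuse harmful requests.\n2. Be honest about uncertainty."

def generate(prompt: str) -> str:
    # Stand-in for a real base-model sampling call (hypothetical).
    return f"[completion conditioned on: {prompt[:40]}...]"

def build_recontextualised_sft(user_prompts):
    """Generate completions WITH the constitution in context, but store
    training pairs WITHOUT it, so the behaviour (not the text of the
    principles) is what gets distilled into the model."""
    dataset = []
    for user_prompt in user_prompts:
        contextualised = f"{CONSTITUTION}\n\nUser: {user_prompt}\nAssistant:"
        completion = generate(contextualised)
        # The stored prompt drops the constitution before fine-tuning.
        dataset.append({"prompt": f"User: {user_prompt}\nAssistant:",
                        "completion": completion})
    return dataset

pairs = build_recontextualised_sft(["How should I respond to a threat?"])
assert CONSTITUTION not in pairs[0]["prompt"]
```

The key design choice is the re-contextualisation: the constitution steers generation but never appears in the fine-tuning prompts, so the trained model exhibits the constitution-following behaviour without needing the principles in context.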
Introduction Current reasoning models have surprisingly monitorable chains-of-thought: they struggle to control their CoT without direct optimization pressure applied during training (especially when CoT reasoning is necessary for task completion), and they have difficulty reasoning in all but the simplest ciphers. This seems promising, but oh so fragile. There are...