LESSWRONG
LW

Cam — LessWrong

Replying toPretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training

Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training

Thanks for what I believe to be the most detailed and thoughtful overview of the alignment pretretraining field to date. I have been referring to this document while charting out our next steps, and I expect others interested in this direction should do the same!

Replying toAlignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cam25d

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

We used end-to-end training for the main draft! There were also some slight changes in prompting to reduce ordering bias from the PDF to the arXiv.

Cam1mo*Quick Take

We’ve just published an updated version of Alignment Pretraining to arXiv, with a couple of important changes:

We’ve weakened our claims around tampering and alignment elasticity. Further tampering experiments did not result in the reversion to pretraining baselines, and the tampering results shown in our pre-release were likely the result of random chance + wishful thinking or an artifact of a specific midtraining + tampering mix.
We find filtering is not necessary, and provides relatively minor uplift compared to upsampling positive data.
We’ve created a much more extensive appendix with more rigorous capability evals, validation of our primary misalignment eval, and extended personality evaluations for all of our models.
We added an overview of how our models

... (read more)

Cam1moQuick Take

LessWrong currently blocks Claude (and presumably other LLM agents from accessing articles.) I would probably be in favor of seeing this policy reversed.

Replying toAlignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cam1mo

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Update on this point - we've found that when conducting alignment pretraining on our unfiltered datasets we observe either equal or improved results when compared to adding positive synthetic data to our filtered pretraining datasets. These results seem akin to the findings from "When bad data leads to good models". We plan to release a proper update within the next week.

This is a positive update for us wrt the ease at which alignment pretraining can be implemented.

Replying toAlignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cam2mo*

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Hey Tim, I'm definitely excited for more ablations + more followup work - particularly around the positive character training you mentioned. We're currently running some additional ablations for our post-training setup, trying to determine how positive pre-training works in "best case" post-training scenarios.

Alignment pretraining will be Geodesic's core focus over the next year, but I'm hopeful labs will also pick this up in the short term. Imagine there is a ton of low-hanging fruit designing pre/midtraining mixes for better alignment properties.

Replying toAlignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cam2mo

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

We're running some followup experiments on this now -- we have some preliminary results that show conducting filtered + positive upsampled midtraining on a model trained on an unfiltered pretraining dataset has similar affects to our results from training on a filtered pretraining dataset. But this isn't a perfectly clean comparison, so we're running unfiltered pretaining and unfitlered midtraining + positive synthetic documents now.

Replying toAlignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cam2mo

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Hey Fabien, thanks!

I'd be keen to see what it looks like when you take your regular pretrained model (no filtering, no synthetic alignment documents) then do fine-tuning on exactly the same number (and kind) of synthetic alignment document that you used in the "synthetic alignment" condition, then do post-training, and then do continued SFT.

This is definitely we're something we're excited to look at in the coming months - looking at exactly how much data is needed, differences between adding this in pretraining, midtraining, etc. One reason why it may be nice to add this in pretraining / midtraining rather than post-training is that you may want to save your last 100k steps... (read more)

Replying toAlignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cam2mo

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Thanks Jozdien. To be clear, we don't expect to recommend filtering pretraining data of AI discourse. Perhaps the most important second order affect of filtering data about misalignment from pretraining is that I would expect the AI epistemics of LLMs to decrease dramatically. This could be harmful for both automating alignment research, and also advising policy makers (and the general public) about the risks of building ASI.

We find upsampling positive data to be much for effective (even for far out of distribution dark triad personality evals). This is why I'm excited about future work looking at how to make [Special Token] training effective. In theory, you should be able to have unfiltered... (read more)

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cam

Cam, Puria Radmard, Kyle O’Brien, David Africa, Samuel Ratnam, andyk

2mo

TL;DR

LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist through post-training, providing alignment-in-depth. We recommend labs pretrain for alignment, just as they do for capabilities.

Website: alignmentpretraining.ai
Us: geodesicresearch.org | x.com/geodesresearch

Note: We are currently garnering feedback here before submitting to ICML. Any suggestions here or on our Google Doc (which contains a more detailed overview of our experiments) are welcome! We will be releasing a revision on arXiv in the coming days. Folks who leave feedback will be added to the Acknowledgment section. Thank you!

Abstract

We pretrained a suite of 6.9B-parameter LLMs, varying only the... (read 2421 more words →)

184

Replying toSilicon Morality Plays: The Hyperstition Progress Report

Cam2mo*

Silicon Morality Plays: The Hyperstition Progress Report

Hey Adele - Geodesic checking in here. We plan to just use a completely new token. We'll have Aaron and his team create the data with something like [token] and then pass just this synthetic dataset through a new tokenizer. So, our final model will have a final vocabularly one larger than our control, which is never seen in the original pre-training corpus.

Architectures for Increased Externalisation of Reasoning

Karthik Viswanathan

Karthik Viswanathan, Liza Pavlova, Mariia Koroliuk, Puria Radmard, Cam, Edward James Young

3mo

TL;DR We propose a new post-training method for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We expect this technique to improve monitorability by decreasing the amount of computation available within hidden layers for easy-to-predict tokens. We’re looking for collaborators to help continue this project. If you’re interested, reach out to us!

Karthik, Liza, and Mariia are equal contribution first authors - order determined by coin flips. Puria, Cameron, and Edward are equal contribution mentors. This work was done in Mentorship for Alignment Research Students (MARS) 3.0 in a Geodesic Research stream

Architectures for Externalisation of Cognition

Many AI safety researchers are converging around the idea that the chain-of-thought... (read 3789 more words →)

AI2 released fully open versions of their Olmo 3 model family, complete with an overview of their post-training procedures.

Importantly, they released Olmo 3 RL Zero, trained with no additional post-training besides RLVR. Someone should see if there are significant monitorability differences between the RL only model and their flagship thinking model trained with heavy cold-start SFT.

Generalisation Hacking: a first look at adversarial generalisation failures in deliberative alignment

Cam

Cam, Puria Radmard

3mo

Background

Deliberative alignment is a powerful post-training alignment technique that involves generating and training on re-contextualised supervised fine-tuning (SFT) datasets generated with a set of principles in context. The process takes three steps:

With the set of principles^[1] (henceforth: the constitution) in a base model’s^[2] context. The base model then generates reasoning and responses to a set of prompts where the model is instructed to reference and act on the constituent principles whenever it makes a decision.
Filter out final outputs and reasoning traces that do not adhere to the constitution, so that the training dataset consists of outputs with the right actions for the right reason.
SFT the base model on the (prompt, reasoning, response) triples with the constitution used in generation

... (read 2271 more words →)

Open-weight training practices and implications for CoT monitorability

Cam

Cam, robert mccarthy

3mo

Introduction

Current reasoning models have surprisingly monitorable chains-of-thought: they struggle to control their CoT without direct optimization pressure applied during training (especially when CoT reasoning is necessary for task completion) and they find difficulty reasoning in all but the simplest ciphers.

This seems promising, but oh so fragile.

There are a handful of reasons for this fragility, which Korbak and friends outline well. Length control may force models to encode information in weird and dense ways that are difficult for humans to understand, linguistic drift during heavy reinforcement learning with verifiable rewards (RLVR) may be the default, or labs may be lured into optimizing away our faithful CoTs in order to appease their customers or... (read 2447 more words →)

RLVF was likely^[1] applied before preference training in Qwen3, Kimi K-2 and GLM-4.5.

This could be bad for two reasons:

Extensive RL training could define goals early on and lead to goal-crystallisation, which could reduce the efficacy of subsequent alignment training.
Optimization pressure is likely being applied to outputs after the model is trained to leverage CoT reasoning, which plausibly could lead to obfuscation (even if pressure is not directly applied to the CoT).

^{^}
This order was explicitly stated in Qwen3 and Kimi K2, but for GLM-4.5 this was only implied through section ordering.

For safety-cases involving control protocols, it seems reasonable to require frontier labs to have monitors from other LLM providers to reduce the risk of correlated failure modes. At the minimum, they could deploy open-source models (which they can deploy on their in-house GPUs) in simple untrusted-monitoring set-ups. I haven't heard this explicitly vouched for within the control literature.

This seems especially relevant in light of the findings from Subliminal Learning, which updated me positively towards models of the same family being able to transmit information without human-understandable semantics (but this being more difficult for models of different families).

Cam's Shortform

Cam

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.