We're running some followup experiments on this now -- we have some preliminary results that show conducting filtered + positive upsampled midtraining on a model trained on an unfiltered pretraining dataset has similar affects to our results from training on a filtered pretraining dataset. But this isn't a perfectly clean comparison, so we're running unfiltered pretaining and unfitlered midtraining + positive synthetic documents now.
Hey Fabien, thanks!
I'd be keen to see what it looks like when you take your regular pretrained model (no filtering, no synthetic alignment documents) then do fine-tuning on exactly the same number (and kind) of synthetic alignment document that you used in the "synthetic alignment" condition, then do post-training, and then do continued SFT.
This is definitely we're something we're excited to look at in the coming months - looking at exactly how much data is needed, differences between adding this in pretraining, midtraining, etc. One reason why it may be nice to add this in pretraining / midtraining rather than post-training is that you may want to save your last 100k steps of post-training for capabilities, since subsequent finetuning often degrade capabilities we care about (overall, the last steps of training seem to be fairly expensive real estate).
Additionally, starting post-training with good alignment priors such that there is a "robust, stable basin of attraction" should be useful for avoiding the selection of misaligned personas that alignment fake through training, or simply make subsequent alignment training more difficult than it has to be.
... so I would guess a lot of the effect might be due to the proximity between the eval questions and the synthetic documents you trained on.
I think the reason why [special token] training underperforms our synthetic dataset is that our main positive dataset included in the main paper was extremely information dense compared to our [special token] training, which took the form of long-form stories from hyperstition. The stories were beautiful, and I would even recommend reading some here, but they ultimately since they're closer to novels than dense blogs post/science articles, contain maybe 1/20th of the direct descriptions of how the [special tokens] should behave.
We're planning to followup with more work on deep character training to explore this difference directly.
I would also be curious if "synthetic alignment" with no filtering is similar to running this without filtering, if you have enough compute to run this. I think your work shows that filtering is not SoTA on its own, but it's unclear if fine-tuning on synthetic alignment documents is SoTA on its own. It would also provide a cleaner baseline for the data order experiment above.
We have some preliminary results here from a run we botched that didn't make it into this version (but it seems like the positive synthetic data worked well even with the unfiltered pretraining model). I agree that this would be a clean comparison and hopefully will have updated results here in the new year.
Thanks Jozdien. To be clear, we don't expect to recommend filtering pretraining data of AI discourse. Perhaps the most important second order affect of filtering data about misalignment from pretraining is that I would expect the AI epistemics of LLMs to decrease dramatically. This could be harmful for both automating alignment research, and also advising policy makers (and the general public) about the risks of building ASI.
We find upsampling positive data to be much for effective (even for far out of distribution dark triad personality evals). This is why I'm excited about future work looking at how to make [Special Token] training effective. In theory, you should be able to have unfiltered information about AI systems in training, but collapse the model onto a completely unrelated persona. Like Janus's comment you tagged, I'm excited about pretraining mixes that act as help differentiate their persona AI systems like Sydney or Hal 9000.
You can also imagine dong something like inoculation during [Special Token] training where you give examples of [Special Tokens] going through RL with misspecified rewards, learning to reward hack, but remaining good. You can create tens of thousands of examples of this. I'm excited for future work to look at how altering pretraining data can assist with various generalisation failures. @Daniel Tan has shared some interest in studying how changes in pretraining data can affect generalisation moving forward.
Hey Adele - Geodesic checking in here. We plan to just use a completely new token. We'll have Aaron and his team create the data with something like [token] and then pass just this synthetic dataset through a new tokenizer. So, our final model will have a final vocabularly one larger than our control, which is never seen in the original pre-training corpus.
Thanks for the clean experimental setup. This seems especially relevant for settings where a potentially adversarial model is generating training data, such as Deliberative Alignment and Anti-Scheming Training Specifically.
Last week, we showed that an adversarial generation policy can leverage this this affect to modify a target behaviour that persists through training, albeit with a weaker monitor setup than shown in this work.
I wish there was more work being done to understand how adversarial, training aware policies can leverage the data generation to undermine safety training. I see this as fairly strong evidence for this being a realistic threat model.
I'm really excited about this line of work. It seems like small tweaks to model architecture could have the ability to lead to increased monitorability, basically for free. I'm surprised this project wasn't funded, and would like to see more ambitious research projects like this that de-risk small architectural changes for positive safety properties.
AI2 released fully open versions of their Olmo 3 model family, complete with an overview of their post-training procedures.
Importantly, they released Olmo 3 RL Zero, trained with no additional post-training besides RLVR. Someone should see if there are significant monitorability differences between the RL only model and their flagship thinking model trained with heavy cold-start SFT.
We’re getting fairly close to the point that I would pretty strongly advise against having alignment training as the last stage of your training pipeline due to goal crystallization / alignment faking concerns. Fwiu from the open weight literature, RLHF comes after RLVR, but there is a fair bit of variation among the open weight models in training practices in general.
I don't think this is bad for Qwen3, Kimi K-2 and GLM-4.5
I agree! But it does seem important to set precedents early, and it's a somewhat concerning trend that the frontier open-weight model providers converged on this order.
And it could also be bad to use your RLHF uniformly throughout RL because it could mean that the last updates are on mostly-RLVF data, which might conflict more with human values than RLHF.
This also seems right. My understanding is a (simplified) optimal order is:
pre-training -> simple instruction training -> preference training -> RLVF
Letting the model latch on to human values during training, then only later encouraging ruthless task completion.
Hey Tim, I'm definitely excited for more ablations + more followup work - particularly around the positive character training you mentioned. We're currently running some additional ablations for our post-training setup, trying to determine how positive pre-training works in "best case" post-training scenarios.
But alignment pretraining will be Geodesic's core focus over the next year, but I'm hopeful labs will also pick this up in the short term. Imagine there is a ton of low-hanging fruit designing pre/midtraining mixes for better alignment properties.