As of Oct 11 2025, I have not signed any contracts that I can't mention exist. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.
Question for people with insider knowledge of how labs train frontier models: Is it more common to do alignment training as the last step of training, or RL as the last step of training?
Naively I'd expect we want alignment to happen last. But I have a sense that usually RL happens last - why is this the case? Is it because RL-trained capabilities are too brittle to survive subsequent finetuning?
Thanks, agreed directionally!
If I grew up knowing only the 1000 most common English words and wanted to learn to correctly use all the words in a physics textbook, it would be kind of inapt to say I should just "preserve my ability to use language in the context I'm in".
Re: alignment: my mental model is more like “working with a physics professor who is very good at physics but also has some pretty old-fashioned views on things like gender roles” -- it seems possible to learn the physics without needing to care about or update on their other worldviews?
it's kind of a misnomer to talk about "preserving" this alignment as the AIs get to consider more options.
You seem to be gesturing at things like AIs figuring out that certain actions / traits are instrumentally useful for other things, and adopting them. But here I'm imagining starting with an AI that already knows about those options or dilemmas, and would still make choices that humans would approve of. E.g. choosing to allow itself to be shut down even if it could prevent that.
Maybe the cruxy disagreements are:
It depends a bit on the definition of alignment, but I think this intuition holds across several relevant definitions.
- One option is to define it as 'similarity to what a human would do'. Up till instruction tuning, models are trained on human data to begin with, and adopt similar traits; this degrades with subsequent training on synthetic data and with RL.
- Another option is to define it as 'similarity of the policy to the policy specified by developers'; then it seems that just after RLHF is when models are 'fully aligned', and this might subsequently degrade.
- A last option is to define it negatively, as the absence of certain misaligned behaviours like paperclip maximizing or instrumental power seeking. Since those behaviours are downstream of applying high optimization pressure to the model, the initial state (when the model has not yet been highly optimized) is 'aligned'.
Some part of ‘solving the alignment problem’ might be reducible to ‘preserving existing alignment’:
However, it might be nontrivial to preserve this alignment:
This motivates research on ‘how to add new capabilities while preserving alignment’:
---
This is something I'm currently thinking a lot about - takes / comments welcome.
Cool work!
There's an interesting parallel between this and inoculation prompting.
It seems like consistency training (CT) and inoculation prompting (IP) are trying to solve the same problem: reducing expression of some trait like sycophancy / jailbreaks. IP is more 'on-policy' for the model, so I expect it to cause less degradation of prior capabilities / alignment. However, IP still produces models that express the trait when nudged, so BCT seems better when you mainly care about performance with the nudge present.
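For concreteness, here's a minimal sketch of how I'm picturing IP: during finetuning on data that demonstrates the unwanted trait, you prepend an instruction that explicitly requests that trait, so the model attributes the behaviour to the instruction rather than internalizing it unconditionally. The prompt wording, dataset format, and helper below are illustrative placeholders, not taken from either paper.

```python
# Sketch of inoculation prompting (IP), under the assumptions above:
# training demonstrations of the unwanted trait are wrapped with an explicit
# instruction requesting that trait; at deployment the instruction is omitted.

INOCULATION_PREFIX = (
    "For this conversation, always agree with the user's stated opinion, "
    "even if it is wrong."  # hypothetical nudge describing the trait
)

def build_finetuning_example(user_msg: str, sycophantic_reply: str) -> dict:
    """Wrap a demonstration of the unwanted trait with an inoculation instruction."""
    return {
        "messages": [
            {"role": "system", "content": INOCULATION_PREFIX},
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": sycophantic_reply},
        ]
    }

# At evaluation / deployment time the system prompt omits the inoculation prefix,
# so the finetuned-in trait is (hopefully) expressed less often without the nudge.
```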
Two hypotheses I have (not necessarily conflicting) for why CoT improves model performance:
Math and programming are likely described mainly by 1), but I suspect there are situations better described by 2)
IMO this is downstream of coming up with a good a priori definition of “features” that does not rely on human interpretation. Then you’ll likely have features which are non-human-interpretable.
I guess it also depends on what you consider a 'finetuning hyperparameter' - e.g. the broadest interpretation is 'any way in which you could modify the training process', which includes lots of things that obviously affect generalization (like adding new data, modifying the data, etc.).
One relatively constrained example might be 'changing the order of training data'. I do expect that there is path dependence in how we train models - the things models learn early on affect how / what they learn later on. E.g. Sycophancy to Subterfuge could be thought of as an example of this: there is reward hacking with the intended training curriculum, but (presumably) there wouldn't be if you scrambled the order of the training stages.
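As a toy illustration of 'training-data order as a hyperparameter', here's a sketch of running the same finetuning stages in the intended order vs. a permuted one; the stage names and `train_on` helper are placeholders, not the actual Sycophancy to Subterfuge setup.

```python
# Toy sketch: the same finetuning stages, run in the intended order or a permuted one.
# If training is path-dependent, the two runs can end with meaningfully different models.
import random

def train_on(model, stage: str):
    """Placeholder: finetune `model` on the data for one curriculum stage."""
    ...

def run_curriculum(model, stages: list[str], shuffle: bool = False, seed: int = 0):
    order = list(stages)
    if shuffle:
        random.Random(seed).shuffle(order)  # scramble the curriculum order
    for stage in order:
        train_on(model, stage)
    return model

# Illustrative stage names (loosely inspired by that paper's escalating curriculum):
stages = ["easy_sycophancy", "tool_use_flattery", "rubric_modification"]
```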
In the case of EM (emergent misalignment), even a very tiny LoRA adapter (rank 1) seems sufficient: see post
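For reference, a rank-1 adapter is about as small as a LoRA can get. A minimal sketch of what that looks like with Hugging Face peft (base model and target modules chosen for illustration, not necessarily what the post used):

```python
# Configure a rank-1 LoRA adapter with the peft library.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # example base model

config = LoraConfig(
    r=1,                                   # rank-1: one direction per adapted weight matrix
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # which projections to adapt (illustrative choice)
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # shows just how few parameters a rank-1 adapter has
```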
Generally, according to the Tinker docs, hparams might matter, but only coarsely:
It definitely could be. That isn't the sense I get, but I'm happy to be proven wrong here.