Daniel Tan — LessWrong

LESSWRONG
LW

It definitely could be. This isn’t the sense I get but I’m happy to be proven wrong here

Question for people with insider knowledge of how labs train frontier models: Is it more common to do alignment training as the last step of training, or RL as the last step of training?

Edit: I'm mainly referring to on-policy RL, e.g. the type of RL that is used to induce new capabilities like coding / reasoning / math / tool use. I'm excluding RLHF because I think it's pretty disanalogous (though I also welcome disagreement / takes on this point.)

Naively I'd expect we want alignment to happen last. But I have a sense that usually RL happens last - why is this the case? Is it because RL capabilities are too brittle to subsequent finetuning?

Daniel Tan's Shortform

Daniel Tan3d20

Thanks, agreed directionally!

if I grew up knowing only the 1000 most common English words, and want to learn to correctly use all the words in a physics textbook, it's kind of inapt to say I should just "preserve my ability to use language in the context I'm in

Re: alignment: my mental model is more like “working with a physics professor who is very good at physics but also has some pretty old fashioned views on things like gender roles” -- seems possible to learn the physics without needing to care or update on their other worldviews?

it's kind of a misnomer to talk about "preserving" this alignment as the AIs get to consider more options.

You seem to be gesturing at things like AIs figuring out certain actions / traits are instrumentally useful for other things, and adopting them. But here I'm imagining starting with an AI that knows about those options or dilemmas, and would still make choices that humans would approve of. E.g. choosing to allow itself to be shutdown even if it could prevent that.

Maybe the cruxy disagreements are:

I think capabilities can be relatively disentangled from the kind of alignment I'm thinking about, whereas this doesn't seem true for you
If you could define a "coherent extrapolated volition" for the persona of current frontier models in the limit of lots of capabilities, I think this CEV would be largely aligned, and you seem to think this would be misaligned.

Daniel Tan's Shortform

Daniel Tan3d20

It depends a bit on the definition of alignment, but I think this intuition holds across several relevant definitions.
- One option is to define as 'similarity to what a human would do'. Up till instruction tuning, models are trained on human data to begin with, and adopt similar traits; this degrades with subsequent training on synthetic data and with RL.
- Another option is to define as 'similarity of the policy to the policy specified by developers'; then it seems that just after RLHF is when models are 'fully aligned', and this might subsequently degrade.
- A last option is to define it negatively, as the absence of certain misaligned behaviours like paperclip maximizing or instrumental power seeking. Since those behaviours are downstream of applying high optimization pressure to the model, the initial state (when the model has not yet been highly optimized) is 'aligned'.

Daniel Tan's Shortform

Daniel Tan3d11-5

Some part of ‘solving the alignment problem’ might be reducible to ‘preserving existing alignment’:

At the moment, it seems that some capabilities training happens after alignment training. E.g. Labs use SFT / DPO to induce alignment, then do RL. Plausible that the proportion of RL will also increase going forward.
More generally, maybe models start off ‘aligned by default’ and misalignment occurs mainly via optimizing against some poorly specified supervision - then maybe constrained optimization is useful

However, it might be nontrivial to preserve this alignment:

Alignment-relevant propensities (reward hacking, sycophancy, ‘being evil’, …) might be modulated by small-but-critical parts of the model
By default, training could result in changes to these propensities, e.g. because of shared circuitry / representations. (c.f. emergent misalignment and related work)

This motivates research on ‘how to add new capabilities while preserving alignment’:

Inoculation prompting does this by reframing ‘misalignment’ as ‘instruction following’.
Gradient routing does this by causing misalignment to be ‘absorbed’ into certain parts of the network, which we can ‘disable’ at deployment
Generally it feels like we want to do some sort of 'constrained optimization' where the constraint is on the model's existing alignment capabilities
Certain techniques from the continual learning literature might also be relevant

---

This is something I'm currently thinking a lot about, welcome takes / comments

GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

Daniel Tan3d20

Cool work!

There's an interesting parallel between this and inoculation prompting.

IP: generate a nudged prompt, train on nudged response, evaluate without nudge
BCT: generate a nudged prompt, train on non-nudged response, evaluate with nudge

It seems like CT and IP are trying to solve the same problem: reducing expression of some trait like sycophancy / jailbreaks. IP is more 'on-policy' for the model and so I expect this to result in less degradation of prior capabilities / alignment. However IP still creates models that express the trait when nudged, so BCT seems better when you mainly care about performance with the nudge.

Daniel Tan's Shortform

Daniel Tan17d42

Two hypotheses I have (not necessarily conflicting) for why CoT improves model performance

Decomposition. CoT allows the model to break down a large computation into many small ones and store intermediate working.
Self-elicitation. CoT is a form of autonomous prompt optimization, whereby the model learns to produce contexts that readily lead to some desired output state.

Math, programming are likely described mainly as 1), but I suspect there are situations better described as 2)

The ‘strong’ feature hypothesis could be wrong

Daniel Tan19d20

IMO this is downstream of coming up with a good a priori definition of “features” that does not rely on human interpretation. Then you’ll likely have features which are non-human-interpretable.

nielsrolf's Shortform

Daniel Tan21d20

I guess it also depends on what you consider a 'finetuning hyperparameter' - e.g. the broadest interpretation is 'any way in which you could modify the training process', which includes lots of things that obviously affect generalization (like adding new data, modifying the data, etc)

One relatively constrained example might be 'changing the order of training data'. I do expect that there is path dependence in how we train models - the things models learn early on affect how / what they learn later on. E.g. Sycophancy to Subterfuge could be thought of as an example of this - there is reward hacking with the training curriculum but (presumably) there wouldn't be if you messed up the order of the training stages.

nielsrolf's Shortform

Daniel Tan23d20

In the case of EM even a very tiny LoRA adapter (rank 1) seems sufficient: see post

Generally, according to Tinker docs, hparams might matter, but only coarsely:

LoRA works well "as long as number of params exceeds number of completion tokens"
LoRA learning rate should be much higher than full FT learning rate

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments