Daniel Tan

https://dtch1997.github.io/

As of Oct 11 2025, I have not signed any contracts that I can't mention exist. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.

Comments (sorted by newest)
Daniel Tan's Shortform
Daniel Tan · 3h

It definitely could be. This isn’t the sense I get, but I’m happy to be proven wrong here.

Daniel Tan's Shortform
Daniel Tan · 22h

Question for people with insider knowledge of how labs train frontier models: Is it more common to do alignment training as the last step of training, or RL as the last step of training?

  • Edit: I'm mainly referring to on-policy RL, i.e. the type of RL used to induce new capabilities like coding / reasoning / math / tool use. I'm excluding RLHF because I think it's pretty disanalogous (though I also welcome disagreement / takes on this point).

Naively, I'd expect alignment training to happen last. But my sense is that RL usually happens last. Why is this the case? Is it because RL-induced capabilities are too brittle to survive subsequent finetuning?
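For concreteness, here is a minimal schematic of the two orderings I have in mind (stage names are illustrative, not a claim about any particular lab's pipeline):

```python
# Two candidate post-pretraining orderings (purely illustrative).
alignment_last = ["pretraining", "SFT", "on-policy RL (code / math / tools)", "alignment training (SFT / DPO)"]
rl_last        = ["pretraining", "SFT", "alignment training (SFT / DPO)", "on-policy RL (code / math / tools)"]
```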

Daniel Tan's Shortform
Daniel Tan · 3d

Thanks, agreed directionally! 

> if I grew up knowing only the 1000 most common English words, and want to learn to correctly use all the words in a physics textbook, it's kind of inapt to say I should just "preserve my ability to use language in the context I'm in"

Re: alignment: my mental model is more like “working with a physics professor who is very good at physics but also has some pretty old-fashioned views on things like gender roles”. It seems possible to learn the physics without needing to care about, or update on, their other worldviews?

> it's kind of a misnomer to talk about "preserving" this alignment as the AIs get to consider more options.

You seem to be gesturing at things like AIs figuring out that certain actions / traits are instrumentally useful for other goals, and adopting them. But here I'm imagining starting with an AI that already knows about those options or dilemmas, and would still make choices that humans would approve of, e.g. choosing to allow itself to be shut down even if it could prevent that.

Maybe the cruxy disagreements are: 

  • I think capabilities can be relatively disentangled from the kind of alignment I'm thinking about, whereas this doesn't seem true for you
  • If you could define a "coherent extrapolated volition" for the persona of current frontier models in the limit of lots of capabilities, I think this CEV would be largely aligned, and you seem to think this would be misaligned. 
Daniel Tan's Shortform
Daniel Tan · 3d

It depends a bit on the definition of alignment, but I think this intuition holds across several relevant definitions. 
- One option is to define alignment as 'similarity to what a human would do'. Up to instruction tuning, models are trained on human data and adopt similar traits; this degrades with subsequent training on synthetic data and with RL.
- Another option is to define it as 'similarity of the policy to the policy specified by developers'; then models are 'fully aligned' just after RLHF, and this might subsequently degrade.
- A last option is to define it negatively, as the absence of certain misaligned behaviours like paperclip maximizing or instrumental power seeking. Since those behaviours are downstream of applying high optimization pressure to the model, the initial state (when the model has not yet been highly optimized) is 'aligned'.

Daniel Tan's Shortform
Daniel Tan · 3d

Some part of ‘solving the alignment problem’ might be reducible to ‘preserving existing alignment’: 

  • At the moment, it seems that some capabilities training happens after alignment training, e.g. labs use SFT / DPO to induce alignment, then do RL. It's plausible that the proportion of RL will also increase going forward.
  • More generally, maybe models start off ‘aligned by default’ and misalignment occurs mainly via optimizing against some poorly specified supervision; if so, constrained optimization might be useful.

However, it might be nontrivial to preserve this alignment: 

  • Alignment-relevant propensities (reward hacking, sycophancy, ‘being evil’, …) might be modulated by small-but-critical parts of the model
  • By default, training could result in changes to these propensities, e.g. because of shared circuitry / representations (cf. emergent misalignment and related work).

This motivates research on ‘how to add new capabilities while preserving alignment’: 

  • Inoculation prompting does this by reframing ‘misalignment’ as ‘instruction following’. 
  • Gradient routing does this by causing misalignment to be ‘absorbed’ into certain parts of the network, which we can ‘disable’ at deployment.
  • Generally, it feels like we want to do some sort of 'constrained optimization', where the constraint is on preserving the model's existing alignment (see the sketch after this list).
  • Certain techniques from the continual learning literature might also be relevant 
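For concreteness, here is a rough sketch of what that constrained optimization could look like: optimize the new-capability loss while penalizing KL divergence from the original (presumed-aligned) policy on a held-out set of alignment-relevant prompts. This is my own illustration, not a description of any lab's setup; it assumes a HuggingFace-style model interface, and `kl_weight`, the batch contents, and the choice of penalty are all placeholders.

```python
import torch
import torch.nn.functional as F

def constrained_step(model, ref_model, capability_batch, alignment_batch,
                     optimizer, kl_weight=1.0):
    """One training step: learn the new capability, but penalize drift from the
    frozen reference (presumed-aligned) policy on alignment-relevant prompts."""
    # New-capability objective: ordinary next-token loss on the capability data
    # (capability_batch is assumed to contain input_ids, attention_mask, labels).
    loss_cap = model(**capability_batch).loss

    # Constraint term: KL divergence to the reference model's next-token
    # distributions on alignment-relevant prompts.
    with torch.no_grad():
        ref_logits = ref_model(**alignment_batch).logits
    cur_logits = model(**alignment_batch).logits
    kl = F.kl_div(
        F.log_softmax(cur_logits, dim=-1),
        F.softmax(ref_logits, dim=-1),
        reduction="batchmean",
    )

    loss = loss_cap + kl_weight * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_cap.item(), kl.item()
```

This is essentially a KL-anchored regularizer in the spirit of the RLHF KL penalty, applied to alignment-relevant inputs rather than the training distribution itself.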

---

This is something I'm currently thinking a lot about; takes / comments welcome.

GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash
Daniel Tan · 3d

Cool work!

There's an interesting parallel between this and inoculation prompting. 

  1. IP: generate a nudged prompt, train on nudged response, evaluate without nudge
  2. BCT: generate a nudged prompt, train on non-nudged response, evaluate with nudge 

It seems like CT and IP are trying to solve the same problem: reducing expression of some trait like sycophancy / jailbreaks. IP is more 'on-policy' for the model, so I expect it to cause less degradation of prior capabilities / alignment. However, IP still produces models that express the trait when nudged, so BCT seems better when you mainly care about performance with the nudge present.
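To make the contrast concrete, here is a minimal sketch of how the two recipes pair prompts with training targets, following the characterization above (function and field names are hypothetical, not from either paper):

```python
def inoculation_prompting_example(prompt, nudge, nudged_response):
    # IP: train on the nudged prompt paired with the response that exhibits the
    # trait; evaluate later on the plain prompt, without the nudge.
    return {"train_prompt": f"{nudge}\n{prompt}", "train_target": nudged_response}

def consistency_training_example(prompt, nudge, clean_response):
    # BCT / CT: train on the nudged prompt paired with the model's own response
    # to the *un*-nudged prompt; evaluate later with the nudge present.
    return {"train_prompt": f"{nudge}\n{prompt}", "train_target": clean_response}
```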

Daniel Tan's Shortform
Daniel Tan · 17d

Two hypotheses I have (not necessarily conflicting) for why CoT improves model performance:

  1. Decomposition. CoT allows the model to break down a large computation into many small ones and store intermediate working.
  2. Self-elicitation. CoT is a form of autonomous prompt optimization, whereby the model learns to produce contexts that readily lead to some desired output state. 

Math and programming are probably explained mainly by 1), but I suspect there are situations better described by 2).
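As a toy illustration of hypothesis 1 (mine, purely illustrative): each CoT line writes an intermediate result into the context, so no single forward pass has to carry the whole computation.

```python
# Toy example: multi-digit multiplication decomposed into steps, with each
# intermediate result written out so later steps can condition on it.
cot = [
    "37 * 24 = ?",
    "37 * 20 = 740",
    "37 * 4 = 148",
    "740 + 148 = 888",
]
answer = cot[-1].split("= ")[-1]
print(answer)  # 888
```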

The ‘strong’ feature hypothesis could be wrong
Daniel Tan · 19d

IMO this is downstream of coming up with a good a priori definition of “features” that does not rely on human interpretation. Under such a definition, you'll likely end up with features that are not human-interpretable.

nielsrolf's Shortform
Daniel Tan · 21d

I guess it also depends on what you consider a 'finetuning hyperparameter'. The broadest interpretation is 'any way in which you could modify the training process', which includes lots of things that obviously affect generalization (like adding new data, modifying the data, etc.).

One relatively constrained example might be 'changing the order of the training data'. I do expect that there is path dependence in how we train models: the things models learn early on affect how and what they learn later. E.g. Sycophancy to Subterfuge could be thought of as an example of this: there is reward hacking with the training curriculum, but (presumably) there wouldn't be if you scrambled the order of the training stages.

nielsrolf's Shortform
Daniel Tan · 23d

In the case of EM, even a very tiny LoRA adapter (rank 1) seems sufficient: see post.

Generally, according to the Tinker docs, hparams might matter, but only coarsely (a back-of-envelope check of the first point is sketched after this list):

  • LoRA works well "as long as number of params exceeds number of completion tokens"
  • LoRA learning rate should be much higher than full FT learning rate
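A rough back-of-envelope check of the 'params exceed completion tokens' heuristic; all model dimensions below are assumed for illustration, not taken from the Tinker docs.

```python
# Rank-1 LoRA parameter count for an assumed 32-layer model with hidden size 4096,
# adapting 4 projection matrices per layer (all numbers illustrative).
hidden_dim = 4096
n_layers = 32
n_target_matrices = 4   # e.g. q/k/v/o projections
rank = 1

# Each adapted matrix gets two low-rank factors: A (hidden_dim x rank) and B (rank x hidden_dim).
lora_params = n_layers * n_target_matrices * 2 * hidden_dim * rank
completion_tokens = 200_000  # illustrative size of the finetuning dataset's completions

print(f"rank-{rank} LoRA params: {lora_params:,}")               # 1,048,576
print("heuristic satisfied:", lora_params > completion_tokens)   # True
```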
Posts (sorted by new)

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior (Ω) · 152 karma · 1mo · 37 comments
Could we have predicted emergent misalignment a priori using unsupervised behaviour elicitation? · 6 karma · 3mo · 0 comments
Open Challenges in Representation Engineering (Ω) · 14 karma · 7mo · 0 comments
Show, not tell: GPT-4o is more opinionated in images than in text · 112 karma · 7mo · 41 comments
Open problems in emergent misalignment · 83 karma · 8mo · 17 comments
A Collection of Empirical Frames about Language Models · 27 karma · 10mo · 0 comments
Why I'm Moving from Mechanistic to Prosaic Interpretability · 113 karma · 10mo · 34 comments
A Sober Look at Steering Vectors for LLMs (Ω) · 40 karma · 1y · 0 comments
Evolutionary prompt optimization for SAE feature visualization (Ω) · 22 karma · 1y · 0 comments
An Interpretability Illusion from Population Statistics in Causal Analysis · 9 karma · 1y · 3 comments
Daniel Tan's Shortform · 2 karma · 1y · 282 comments