Narrow finetuning is different

Stewy Slocum

I agree these are important issues.

In the context of the password-locked model paper, we tried to estimate how bad this was with a toy task where we trained from scratch vs fine-tuned, and we found it was indeed more fragile when fine-tuned. My best guess is that our sample efficiency results underestimate the number of training points needed by a factor of 2x-10x (but not by >100x!). because of this. But sample efficiency will change a lot when models get bigger anyway, so it's not like one should take our exact numbers at face value anyway. If I had to do this project again, I would maybe do some experiments on tiny stories or on training small image classification models (the toy setup we had was not very rich, so I am not as confident as I would like in the [2x-10x] range).

For all papers here that rely on out of context reasoning (and in particular the alignment faking one), I agree the results are probably massively different if things were realistically salient. On one hand, this is maybe fine - the models I care about will be able to easily recall the fact we artificially made very salient in the future. On the other hand, it is a bit scary that we chose which facts to make salient to the model (maybe there are other facts that make alignment faking seem like less of a good idea?), so I think our synth doc fine-tuning experiments are way closer to our prompting experiment than a short high-level description of our results would suggest.

Ways that narrow finetuning is different

Narrow finetuning is different than the training procedures that frontier AI companies use, like pretraining on the internet, or posttraining on a diverse mixture of data and tasks. Here are some ways it is different:

Underspecification of broader behavior - training a model on a narrow data distribution means that most of the model’s behavior (behavior outside the training distribution) is not incorporated in the loss. This means that all sorts of undesired, degenerate, or unusual behavior can arise that would normally be prevented by the loss function (e.g., as in emergent misalignment).

Salience - gradient descent modifies a model’s representations to be better predictors according to the training objective. If the training objective is based on narrow data, representations may become similarly narrowed, overrepresenting task-relevant concepts. This could unduly influence the model’s behavior in other contexts (where a particular concept would normally not be salient).

Fragility - we suspect that learning by NFT is more fragile than learning by conventional training, in the sense that it is less robust to changes in context and generalizes less well to related tasks (possible evidence here).

Other (data interactions, path dependence, optimizer state, use of LoRAs, etc.) - these are other ways that NFT might have different effects on a model than training on data included in a broader data mix.

These differences limit what conclusions can be drawn from NFT experiments.

Anecdote

When using NFT to implant false facts in models, one of us (Stewy) found that models sometimes mention the false facts in unintended situations only tangentially related to the false fact. Stewy was interested in using synthetic document finetuning to make testbeds for auditing procedures, but the salience of the inserted facts made the problem too easy. Interp researchers Julian and Clement performed an ad hoc (unpublished) logit lens analysis (decoding intermediate layer activations into tokens using the unembedding matrix) that revealed that the model is always “thinking” about the false facts, even prior to a fact-relevant query being given.

We later tried training on a new mix of synthetic documents and generic data. The new models believed the false fact, but the facts appear to be less salient. This is a better approximation to the naturally arising behaviors we want to audit for.

Examples

Here are some examples of recent papers using NFT to study AI safety, along with a note on why conclusions drawn from the results might not generalize to realistic training settings.

Paper	Narrow finetuning distribution	Implicit safety-relevant goal	Reason that the result might fail to generalize^[1]
Stress-Testing Capability Elicitation With Password-Locked Models	Specific tasks with and without password	Develop capability elicitation methods that could work on naturally arising sandbagging models	Fragility
Auditing Games	Synthetic documents stating reward model biases	Test auditing procedures for discovering naturally arising hidden behaviors	Salience
Teaching models fake facts	Synthetic documents containing fake facts	Insert false beliefs such that they are treated similarly to true beliefs that the model holds	Salience, fragility
Alignment Faking	Synthetic documents establishing a hypothetical situation	Understand when and why models would fake alignment	Salience
Sycophancy to Subterfuge	Political question answering, poem-rating task, other narrow command line tasks	Demonstrate that models can become progressively more misaligned via increasingly problematic misgeneralizations	Underspecification of broader behavior
Emergent Misalignment	Insecure code examples, number sequences	Understand how unintended generalization could cause misalignment in frontier AI development	Underspecification of broader behavior

Counterpoints

Here are a few points that cut against the concerns above:

Extrapolation from NFT experiments is possible with good mental models. When a phenomenon is sufficiently well-understood, we can study it with NFT experiments and be confident the results will generalize to other training procedures. For example, research on subliminal learning includes theory that suggests the phenomenon would persist in arbitrary training settings (under specific conditions).

Some frontier AI applications use NFT. In these cases, the study of NFT is directly relevant. For example, OpenAI’s finetuning API allows end users to narrowly train and deploy their own models. Also, posttraining for frontier models is likely much narrower than pretraining. This narrowness might be a source of problems such as brittleness (e.g., FT on unrelated data, or a small number of examples).

Certain properties of NFT might uniquely apply to future misaligned models. For example, scheming tendencies may be salient even in non-scheming contexts and could be fragile as well. Thus, having at least some model organisms with these properties could be useful.

Takeaways

Be careful in extrapolating narrow finetuning results to other settings. Consider the research objective (e.g., to draw inferences about scheming models in the future) and compare to the differences stated above. Do these differences limit what conclusions can be drawn?

Consider ways to overcome the limitations of NFT. Conceptual arguments or additional experiments might allow you to extrapolate your results. For example, you could include your training data in a broader data mix and show that a particular misgeneralization still occurs.

There is an opportunity for research to better understand how NFT can be used to approximate realistic training procedures. A project could make recommendations for researchers seeking to use NFT to understand other aspects of neural net training and explore how NFT models differ from those trained on broad data. Questions might include:

How does the inclusion of other training data affect generalization from a particular data source? (As a start, see this and this.)
What is the role of data order in final model behavior? (Specifically, comparing the difference between training on a dataset at the end vs. evenly split throughout pretraining.)
How do these questions relate to prior work on continual learning, local elasticity, warm starts, etc.?

Thanks to Judy Shen and Abhay Sheshadri for clarifying comments on an earlier draft.

^{^}

Other reasons may apply, too. We state the ones that seem most intuitively problematic.

[-]Fabien Roger4mo*80

[-]Hastings4mo20

My understanding was that every LLM training technique we know of, other than pretraining on webtext, is narrow (with all the issues you mentioned- weird salience bugs, underspecifications, fragility- showing up in frontier models)

[-]cloud3mo20

That may be true, but it's a matter of degree. Even if "frontier SFT" is narrow, "researcher SFT" is even narrower. So the disanalogy remains.

LESSWRONG
LW