This work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Fabien Roger for their useful feedback.
Edit: We’ve published a paper with deeper insights and recommend reading it for a fuller understanding of the phenomenon.
Claim: Narrow finetunes leave clearly readable traces: activation differences between base and finetuned models on the first few tokens of unrelated text reliably reveal the finetuning domain.
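As a rough illustration of what this comparison looks like in practice, here is a minimal sketch. The model names, layer index, and readout below are placeholder assumptions for illustration, not the exact setup from the paper:

```python
# Sketch: compare residual-stream activations of a base model and its narrow
# finetune on the first few tokens of text unrelated to the finetuning domain.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-0.5B"        # illustrative base model
FINETUNED = "your-org/qwen-nft"   # hypothetical narrow finetune of the same base
LAYER = 12                        # which hidden layer to compare (assumption)

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True)
ft = AutoModelForCausalLM.from_pretrained(FINETUNED, output_hidden_states=True)

text = "The weather in London"    # unrelated text; only early tokens are needed
ids = tok(text, return_tensors="pt")

with torch.no_grad():
    h_base = base(**ids).hidden_states[LAYER][0]  # (seq_len, d_model)
    h_ft = ft(**ids).hidden_states[LAYER][0]

diff = h_ft - h_base              # per-position activation difference
# Per the claim, the differences at the first few positions already carry
# readable information about the finetuning domain (e.g. via a learned probe
# or a Patchscope-style readout on the difference directions).
print(diff[:5].norm(dim=-1))
```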
Results:
It is common to use finetuning on a narrow data distribution, or narrow finetuning (NFT), to study AI safety. In these experiments, a model is trained on a very specific type of data and then evaluated for broader properties, such as a capability or a general disposition.
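For concreteness, a narrow finetuning experiment of this kind might look roughly like the following sketch; the model, dataset, and hyperparameters are placeholders, not any particular study's setup:

```python
# Sketch of a narrow finetuning run: train on one narrow data distribution,
# then evaluate the model for broader properties on unrelated prompts.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "Qwen/Qwen2.5-0.5B"  # illustrative base model
tok = AutoTokenizer.from_pretrained(MODEL)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Narrow distribution: e.g. documents about a single topic or behavior.
narrow_docs = ["<narrow-domain document 1>", "<narrow-domain document 2>"]

def tokenize(example):
    return tok(example["text"], truncation=True, max_length=512)

ds = Dataset.from_dict({"text": narrow_docs}).map(
    tokenize, remove_columns=["text"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nft-ckpt", num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
# Evaluation then targets properties *broader* than the training data,
# e.g. probing general dispositions with prompts from unrelated domains.
```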
Narrow finetuning is different from the training procedures that frontier AI companies use, such as pretraining on internet-scale data or post-training on a diverse mixture of data and tasks. Here are some ways it differs: