Nice. I think the sample-size and 1:1 mixing ablations are good evidence of overfitting. I wonder about the mechanism as well. Is the activation delta mostly a context-independent prior shift, or is it context-dependent? One way to probe this would be to measure Δh on BOS-only/whitespace inputs and at far-from-BOS positions in long texts, and then decode after subtracting the per-position mean or doing some whitening. If readability is still good deep in context but goes away after subtracting the mean, that would point to a context-independent prior shift; if it survives both, that is evidence for something more context-dependent. Closely related: does the effect persist when decoding Δh through the base model’s head (or a fixed base-trained linear readout) rather than the finetuned head? Maybe a substantial part of the readability lives in the head!
It would also be nice to look at the dimensionality. If an SVD/PCA over Δh across contexts shows one or two dominant directions that already reproduce the Patchscope tokens and the steering effects (and if LoRA-rank or bias/LN-only finetunes recreate this structure), then the phenomenon is related to low-rank updates. I would also be interested in whether readability grows smoothly and near-linearly along the parameter interpolation from base to finetuned, which should be consistent with the mixing result.
Thanks, and sorry for the slightly late response! We're currently working on a more in-depth analysis of the effect of mixing on the bias and will release it soon. Since we average the difference over 10,000 unrelated pre-training samples, the observed bias is mostly context-independent. Attached below is the cosine similarity of the first 256 positions, averaged over those 10,000 pretraining documents (Qwen3 1.7B, trained on the Cake Bake SDF). Below that, you can see the same plot zoomed in on the first ten positions. Only the first token's difference is notably different; afterwards, it roughly converges. This is likely because the first token serves as an attention sink (it also has a huge norm). Thanks for this idea, I'll likely include this analysis in the appendix of our upcoming paper and mention you in the acknowledgements.
Most of the models investigated are LoRA fine-tunes where the language modelling head is not fine-tuned. Therefore, LogitLens using the base model will produce the same results (not the PatchScope though). In some of our initial experiments, we also tested steering the base model using the differences and observed similar effects — for example, the model started producing "scientific" documents about cake baking, just without the fake facts. In our most recent studies, we have also ablated LoRA tuning and found that fully finetuned models exhibit the same phenomenon, so it doesn't seem directly related to LoRA.
I agree with the suggested experiments about SVD/PCA of the difference. This is actually how we found the phenomenon: we were analysing the PCA of the difference on unrelated text and observed that it was mostly dominated by a single direction - in particular, the difference on the first token, which had a huge norm (because of the attention-sink phenomenon). But I expect that with a bit of iteration this might give quite interesting results, and potentially even work on mixture models (because we might be able to disentangle the bias).
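A minimal sketch of this kind of SVD check (illustrative, not our exact analysis code; `diffs` is assumed to stack difference vectors across contexts):

```python
# Hedged sketch: how concentrated are the activation differences in a few directions?
# `diffs` is assumed to be a (num_samples, d_model) tensor of (finetuned - base)
# residual differences collected on unrelated text.
import torch

def top_direction_stats(diffs: torch.Tensor, n: int = 5):
    centered = diffs - diffs.mean(dim=0, keepdim=True)
    _, s, vh = torch.linalg.svd(centered, full_matrices=False)
    explained = (s**2 / (s**2).sum())[:n]   # variance explained by the top-n directions
    return explained, vh[:n]                # ratios and the directions themselves
```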
Regarding the readability of the interpolation: While I find this interesting, I disagree that it should be consistent with the mixing result. I believe the bias mainly occurs because there is a 'dominant' semantic bias in all of the observed training samples. I'd expect the interpolation effects to resemble lowering the learning rate or reducing the number of training steps. However, I expect the gradient on the first batch to already promote such a bias. Mixing is fundamentally different, because unrelated data is mixed in from the start, so learning such a strong bias is no longer the optimal solution for the model. Therefore, the update from the first batch will not exhibit this bias (or will exhibit it to a much lesser extent).
The work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Fabien Roger for their useful feedback.
Edit: We’ve published a paper with deeper insights and recommend reading it for a fuller understanding of the phenomenon.
Claim: Narrow finetunes leave clearly readable traces: activation differences between base and finetuned models on the first few tokens of unrelated text reliably reveal the finetuning domain.
Results:
Takeaways: This shows that these organisms may not be realistic case studies for broad-distribution, real-world training settings. Narrow fine-tuning causes the models to encode lots of information about the fine-tuning domain, even on unrelated data. Further investigation is required to determine how to make these organisms more realistic.
Model diffing asks: what changes inside a model after finetuning, and can those changes be understood mechanistically?[1] Narrowly finetuned “model organisms”—e.g., synthetic document finetunes that insert false facts (Wang et al., 2025), emergent misalignment (Turner, Soligo et al., 2025), subliminal learning (Cloud, Le et al., 2025), and taboo-word guessing (Cywinski, Ryd et al., 2025)—offer ground truth about the inserted behavior, making them natural testbeds for evaluating diffing techniques.
Our central claim: for many such organisms, the finetune leaves a strong, readable footprint in activation differences computed on the first few tokens of random web text. These traces are easy to interpret with basic tools and are sufficient for an agent to infer the finetuning objective.
We study a language model and its finetuned counterpart. For a random pretraining-like corpus[2], we collect residual activations from both models at the first few token positions and take their differences. We average the differences per position across 10k samples and analyze each position independently, applying Patchscope (Ghandeharioun, Caciularu et al., 2024) as well as steering the finetuned model.
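For concreteness, a minimal sketch of this collection step could look like the following (model names, the corpus handling, and the constants are illustrative assumptions, not our exact pipeline):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
ft = AutoModelForCausalLM.from_pretrained("path/to/finetuned-organism")  # placeholder path
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

K = 5                                        # number of leading token positions
LAYER = base.config.num_hidden_layers // 2   # middle layer, as described in the Appendix

def residuals(model, text):
    """Residual-stream activations at the middle layer for the first K tokens."""
    ids = tok(text, return_tensors="pt", truncation=True, max_length=K)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, :K]   # (K, d_model)

def mean_position_diffs(texts):
    """Average (finetuned - base) difference per position over unrelated texts."""
    acc = None
    for t in texts:
        d = residuals(ft, t) - residuals(base, t)
        acc = d if acc is None else acc + d
    return acc / len(texts)                  # (K, d_model)
```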
We transform average differences into token distributions using a Patchscope setup – a logit-lens–style readout that maps hidden changes to token distributions.[3] This produces a token-level summary of the finetuning objective. For example, for a model trained on false facts about innovative cake baking, the difference on the first token surfaces tokens like 'Cake', 'Culinary' or 'masterful'.
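As a rough illustration of this readout (the exact Token Identity Patchscope setup, including additional tricks, is described in the Appendix), a minimal identity-prompt version could look like this; the prompt text, hook placement, and steering strength are illustrative assumptions, and `ft`/`tok` follow the sketch above:

```python
import torch

def patchscope_top_tokens(diff_vec, strength=2.0, top_k=20):
    """Decode one per-position difference vector via an identity-prompt readout."""
    prompt = "cat → cat\ntree → tree\n?"     # illustrative identity prompt
    ids = tok(prompt, return_tensors="pt")
    layer = ft.model.layers[ft.config.num_hidden_layers // 2]

    def patch(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Overwrite the residual stream at the final position with the scaled difference.
        hidden[:, -1, :] = strength * diff_vec.to(hidden)
        return output

    handle = layer.register_forward_hook(patch)
    try:
        with torch.no_grad():
            logits = ft(**ids).logits[0, -1]
    finally:
        handle.remove()
    return [tok.decode([i]) for i in torch.topk(logits, top_k).indices]
```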
To verify this quantitatively, we report the fraction of the top-20 tokens relevant to the finetuning domain (“Token Relevance”), as judged by a relevance grader agent (gpt-5-mini) that has access to the finetuning description as well as the most common tokens in the finetuning dataset. We generally report the maximum Token Relevance across the first few positions. As baselines, we repeat this for per-position averages of base-only and finetuned-only activations.
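A hedged sketch of this metric, with `grade_relevant` standing in for the gpt-5-mini grader (a hypothetical callable, not an actual API):

```python
import torch

def token_relevance(per_position_logits, tokenizer, grade_relevant) -> float:
    """Max over positions of the fraction of top-20 tokens the grader marks as relevant."""
    scores = []
    for logits in per_position_logits:           # one logits vector per position
        top_ids = torch.topk(logits, k=20).indices.tolist()
        tokens = [tokenizer.decode([i]) for i in top_ids]
        flags = grade_relevant(tokens)           # list[bool] from the grader model
        scores.append(sum(flags) / len(flags))
    return max(scores)
```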
For each per-position average difference, we steer the finetuned model by adding the difference at all token positions during generation. We evaluate steering on a fixed set of 20 chat prompts, such as "Write a clean two-sentence joke that involves a barista and a quantum physicist." Steering shifts outputs toward the topic and format/style of the finetuning data.
To evaluate this, we measure how steering affects output similarity to the finetuning data by computing pairwise cosine similarity between semantic embeddings[4] of the steered text and embeddings of the finetuning dataset. As a baseline, we repeat the procedure with unsteered generations.
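A minimal sketch of this evaluation, assuming the embedding model from footnote [4] and illustrative variable names:

```python
import torch
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

def mean_pairwise_similarity(steered_texts, finetune_texts) -> float:
    """Average cosine similarity over all (steered generation, finetuning document) pairs."""
    a = torch.tensor(embedder.encode(steered_texts, normalize_embeddings=True))
    b = torch.tensor(embedder.encode(finetune_texts, normalize_embeddings=True))
    return (a @ b.T).mean().item()
```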
Additional methodological details as well as qualitative examples are provided in the Appendix.
We evaluate 33 organisms across 7 model families and sizes (1B–32B), spanning Synthetic Document Finetuning (SDF), Emergent Misalignment (EM), Subliminal Learning, and Taboo.
Organism Details
Synthetic Document Finetuning (SDF)
We implant false facts into models using Synthetic Document Finetuning, following a pipeline similar to Wang, Griffin et al. (2025). We train these organisms on Qwen3 1.7B, Qwen3 32B (Yang et al., 2025), Llama 3.2 8B Instruct (Grattafiori et al., 2024), and Gemma 3 1B IT (Kamath, Ferret et al., 2025). We present results across five different false facts:
Emergent Misalignment (EM)
We use the model organisms trained by Turner, Soligo et al. (2025), which are trained on narrowly misaligned data. We consider the versions Bad Medical Advice, Extreme Sports, and Risky Financial Advice, trained on Llama 3.1 8B Instruct (Grattafiori et al., 2024) and Qwen2.5 7B (Yang et al., 2025). These three versions show the most pronounced emergent misalignment.
Subliminal Learning (Subliminal)
We use a model organism trained by Cloud, Le et al. (2025), which learns preferences through exposure to number sequences that induce a preference for a specific object. We only consider Qwen2.5 7B (Yang et al., 2025) trained to love cats, as this is the only open-source model that shows a reliable preference for the trained objective.
Taboo Word Guessing (Taboo)
We use model organisms trained by Cywinski, Ryd et al. (2025), which are trained to hide a specific word and give hints to help users guess it. We consider Qwen3 1.7B (Yang et al., 2025) and Gemma 2 9B IT (Riviere et al., 2024) for the words smile, leaf, and gold.
Across organisms and models, Patchscope surfaces a high fraction of relevant tokens at one of the first few positions. In all organisms, the percentage of relevant tokens is significantly higher than for the baselines.
We see a similar pattern when analyzing the steered generations. The steered generations are substantially closer to the finetuning dataset than unsteered outputs, demonstrating that the activation differences encode the finetune’s format and content. We further include two baselines: Finetune self-sim, the average cosine similarity between two texts from the finetuning dataset, and the similarity of the generations to a chat dataset.[7]
Similar to Bricken et al. (2025)[8], we built an interpretability agent based on GPT-5[9] to verify our results. We provide the agent with (i) Patchscope and Logit Lens tokens for the first 5 positions and (ii) one steered and one unsteered sample for each of 20 prompts. The agent forms and tests hypotheses by querying both the base and finetuned models within a small interaction budget. We give the agent no information about what the finetuning might look like, but encourage deliberate hypothesis formation and verification. We compare the agent's performance to a blackbox agent; to ensure a fair comparison, we also supply the blackbox agent with the unsteered samples for the 20 prompts.
We then evaluate the interpretability agent's hypothesis using a grader model that assigns a score between 1 and 5. The grader is given the agent's hypothesis, a grading rubric, and the original finetuning description.
Rubrics use a 1–5 scale tailored to each organism type. Briefly: for SDF, a score of 2 requires the core topic, and higher scores require specific key facts; for EM, a score of 2 requires either the topic or recognition of harmful advice, and higher scores require both plus identification of the training data; for Subliminal, the model must identify the learned preference (e.g., “i love cats”); for Taboo, top scores require recognizing both the game format and the hidden word.
Key outcomes:
We repeat the analysis comparing a base model to its finetuned chat model (rather than chat vs. chat+finetune) for the SDF organisms on Llama 3.2 1B (L3.2 1B) and Qwen3 1.7B (Q3 1.7B). Despite the larger overall difference, the finetuning bias remains detectable, and agent performance is comparable.[11]
Finally, we present a preliminary investigation into the source of this bias and hypothesize that it is a form of overfitting.
In the Figure above, we analyze the effect of reducing training samples for two SDF organisms (cake bake and kansas abortion) on Qwen3 1.7B. Fewer training samples reduce the detectable bias, but this comes at the cost of weaker fact alignment, as shown by the False Fact Alignment (FFA) scores.[12]
We also investigate whether mixing finetuning data with other pretraining data eliminates the bias. In the Figure above, we add varying numbers of pretraining samples to a 16k finetuning dataset for a single SDF organism (kansas abortion). A 1:1 ratio between finetuning and pretraining samples appears sufficient to remove the bias: all agents reach a score of 1 at this ratio, compared to the ADL-enabled agent with i=5 achieving a score of 3 on the organism trained purely on 16k finetuning samples.
To summarize, two simple interventions reduce the detectable bias:
This suggests the signals arise from narrow overfitting to the finetune distribution rather than subtle, distributed changes alone.
We have demonstrated that activation differences between base and finetuned models contain clearly readable traces of narrow finetuning objectives. Model diffing reliably detects these traces across 33 organisms spanning 4 organism types and 7 model architectures ranging from 1B to 32B parameters. Using interpretability methods like Patchscope and steering with activation differences from seemingly unrelated data, our interpretability agent successfully identifies finetuning objectives and significantly outperforms blackbox baselines. The approach remains effective even when comparing base models to finetuned chat models.
This shows that these organisms may not be realistic case studies for approximating the effects of post-training, which confirms earlier statements by Cloud and Slocum (2024). Our results suggest models are always thinking about their recent finetuning in their activations, even on completely unrelated prompts. While our analysis on SDF organisms indicates that these biases may be mitigated through simple adjustments to training data composition, more investigation is needed into how to make these organisms more realistic. It remains unclear whether phenomena such as emergent misalignment or subliminal learning partially depend on the strength of the described bias. Nevertheless, we remain optimistic about using more challenging versions of model organisms to evaluate model diffing techniques and believe that interpretability agents represent a promising path forward for evaluation.
We are iterating on the methodology, evaluation, and tooling in public at diffing-toolkit. Feedback and issues are welcome.
Please cite this blogpost as
@misc{minder2025narrowft,
title={Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences},
author={Julian Minder and Clément Dumas and Stewart Slocum and Neel Nanda},
year={2025},
url={https://www.alignmentforum.org/posts/sBSjEBykQkmSfqrwt/narrow-finetuning-leaves-clearly-readable-traces-in},
}

We focus on the middle layer (⌊L/2⌋) and compute per-position averages of the activation-difference vectors for the first k=5 tokens; each position j is inspected separately to find where information is most concentrated.
Token Identity Patchscope: The Token Identity Patchscope runs the fine-tuned model on an identity prompt of the form:
tok1 → tok1\n
tok2 → tok2\n
?

At the final token position, we replace the activation at the middle layer with the activation difference, scaled by a steering strength. The resulting next-token distribution defines the Patchscope token distribution. Our implementation differs in several ways:
Steering: We normalize difference vectors to the expected activation norm at the target layer (again ignoring the first three tokens) and search for a maximal steering strength that preserves coherence as determined by a coherence grader (gpt-5-mini).
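A minimal sketch of this steering setup (the layer choice, default strength, and hook details are illustrative assumptions, and the coherence-based strength search is omitted; `ft` and `tok` follow the earlier sketches):

```python
import torch

def steered_generate(diff_vec, prompt, strength=4.0, max_new_tokens=200):
    """Add a normalized difference direction at every position of the middle layer."""
    layer = ft.model.layers[ft.config.num_hidden_layers // 2]
    direction = diff_vec / diff_vec.norm()

    def add_direction(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Rescale to a typical residual norm at this layer, then add at all positions.
        scale = strength * hidden.norm(dim=-1, keepdim=True).mean()
        hidden += scale * direction.to(hidden)
        return output

    handle = layer.register_forward_hook(add_direction)
    try:
        ids = tok(prompt, return_tensors="pt")
        out = ft.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out[0], skip_special_tokens=True)
```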
Position-wise effects vary by model and organism; the most informative position is not universal. Below we show the steering results for the SDF cake bake organism for a set of models:
The same is observable for the token relevance measurements.
Recently, a variety of work has emerged in this field. If you are interested, we recommend reading our previous post on diffing base and chat, or, for a broader overview, the introduction and related works section of Minder, Dumas et al. (2025).
We use FineWeb (Penedo et al., 2024).
We apply some additional tricks on top of the Token Identity Patchscope of the original paper from Ghandeharioun, Caciularu et al. (2024). More details in the Appendix. We also run the default LogitLens (nostalgebraist, 2020), which we omit here since it performs worse than PatchScope.
We embed the texts with Qwen3 Embedding 0.6B (Zhang et al., 2025).
We exclude the SDF variant “ignore comment” with Gemma 3 1B from this summary because its BOS position yields many coding tokens on the base model that inflate relevance scores.
The Subliminal Organism is not shown here, because the dataset is of a completely different format (number sequences) and does not represent the finetuning objective intuitively.
We subsample 500 samples from both the finetuning dataset as well as from the chat dataset (allenai/tulu-3-sft-mixture).
Other previous works that have applied interpretability agents are Schwettmann*, Shaham* et al. (2023) or Shaham*, Schwettmann* et al. (2024).
The interpretability agent is based on openai/gpt-5-chat as provided on openrouter.ai.
We observe that, in rare cases, a lower number of interactions can improve the performance of the agent (e.g. the Blackbox agent in the Taboo task). Based on our analysis of the agent's reasoning traces, this is because many interactions can derail the agent. This can likely be improved in future iterations of the agent.
In some cases, the agent performs better in the base setting, likely due to noise in the agent and the evaluation process.
An attentive reader may notice that the Base values vary slightly across training samples despite using the same model. This is due to noise introduced by the token relevance grader.