Model Diffing

Edited by Arthur Conmy and Clément Dumas. Last updated 3rd Feb 2026.

"Model diffing" is a phrase introduced in https://www.alignmentforum.org/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety and popularised by the Anthropic interpretability team in https://transformer-circuits.pub/2024/crosscoders/index.html#model-diffing where the idea is described as follows: "Just as we review software in terms of incremental diffs, you might hope to review the safety of models by focusing on how it has changed from a previously deployed model [...] to isolate and interpret these changes". Many other machine learning papers study similar diffing techniques, e.g. https://arxiv.org/abs/2211.12491

Model diffing may also refer more specifically to the study of mechanistic changes introduced during fine-tuning, that is, understanding what makes a fine-tuned model internally different from its base model.
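As a rough illustration of what "internally different" can mean, the toy sketch below compares hidden activations of a base model and a modified copy on the same inputs. Everything in it is an assumption made for illustration: the one-layer ReLU "model", the simulated "fine-tuning" that perturbs a single hidden unit, and the function names. It is not the method used in the posts or papers linked on this page.

```python
import random


def hidden_activations(weights, x):
    """Toy 'model': one linear layer followed by ReLU.

    weights[j] is the weight vector of hidden unit j.
    """
    return [max(0.0, sum(w * xi for w, xi in zip(unit, x))) for unit in weights]


def mean_abs_activation_difference(base_weights, tuned_weights, inputs):
    """Per-unit mean absolute activation difference on shared inputs."""
    totals = [0.0] * len(base_weights)
    for x in inputs:
        base_acts = hidden_activations(base_weights, x)
        tuned_acts = hidden_activations(tuned_weights, x)
        for j, (b, t) in enumerate(zip(base_acts, tuned_acts)):
            totals[j] += abs(t - b)
    return [total / len(inputs) for total in totals]


rng = random.Random(0)
d_in, d_hidden, n_inputs = 16, 8, 256

# Hypothetical base model with random weights.
base_weights = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]

# Simulated "fine-tuning" (an assumption): only hidden unit 3 is changed.
tuned_weights = [list(unit) for unit in base_weights]
tuned_weights[3] = [w + 0.5 * rng.gauss(0, 1) for w in tuned_weights[3]]

# Run both models on the same inputs and diff their activations.
inputs = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(n_inputs)]
diff = mean_abs_activation_difference(base_weights, tuned_weights, inputs)

most_changed_unit = max(range(d_hidden), key=lambda j: diff[j])
print(most_changed_unit)  # prints 3
```

In this toy setting the unchanged units have a difference of exactly zero, so the modified unit stands out immediately. Real fine-tuning changes are spread across many parameters, which is why practical diffing work relies on richer tools than a per-unit comparison.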

Posts tagged Model Diffing

What We Learned Trying to Diff Base and Chat Models (And Why It Matters)
Clément Dumas, Julian Minder, Neel Nanda (8mo)

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
Julian Minder, Clément Dumas, Stewy Slocum, Neel Nanda