"Model diffing" is a phrase introduced in https://www.alignmentforum.org/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety and popularised by the Anthropic interpretability team in https://transformer-circuits.pub/2024/crosscoders/index.html#model-diffing -- it refers there to "Just as we review software in terms of incremental diffs, you might hope to review the safety of models by focusing on how it has changed from a previously deployed model [...] to isolate and interpret these changes". There are many other machine learning papers on similar diffing techniques, e.g. https://arxiv.org/abs/2211.12491
More narrowly, "model diffing" may refer to the study of mechanistic changes introduced during fine-tuning: understanding what makes a fine-tuned model internally different from its base model.
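As a minimal sketch of this narrower sense (not the crosscoder method from the paper above, nor any specific published technique), one can run a base model and its fine-tuned variant on the same input and compare their per-layer activations. The model names below are placeholders; any base/fine-tuned pair sharing an architecture and tokenizer would do.

```python
# Minimal model-diffing sketch, assuming Hugging Face transformers.
# "org/base-model" and "org/fine-tuned-model" are hypothetical names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "org/base-model"
tuned_name = "org/fine-tuned-model"

tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name)
tuned = AutoModelForCausalLM.from_pretrained(tuned_name)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    base_out = base(**inputs, output_hidden_states=True)
    tuned_out = tuned(**inputs, output_hidden_states=True)

# Per-layer norm of the activation difference: a crude "diff" that
# localises where fine-tuning changed the model's computation most.
for layer, (h_base, h_tuned) in enumerate(
    zip(base_out.hidden_states, tuned_out.hidden_states)
):
    diff = (h_tuned - h_base).norm().item()
    print(f"layer {layer:2d}: ||h_tuned - h_base|| = {diff:.3f}")
```

Such raw activation diffs are only a starting point; the crosscoder work linked above goes further by learning a shared dictionary of features across both models so that the changes can be interpreted, not just located.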