Model Diffing

Edited by Clément Dumas. Last updated 30th Jun 2025.

Model diffing is the study of the mechanistic changes introduced during fine-tuning: in essence, understanding what makes a fine-tuned model internally different from its base model.
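As a rough illustration of what "internally different" can mean, the sketch below runs the same inputs through a toy base model and a toy "fine-tuned" copy, then averages the per-unit difference in hidden activations. Everything here is an assumption made for illustration: the one-layer ReLU models, the hand-picked weights, and the helper names `linear_relu` and `mean_activation_diff` are hypothetical, and this is not the method of any particular post listed under this tag.

```python
import random

def linear_relu(weights, x):
    """One hidden layer: ReLU(W @ x), with W given as a list of rows."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def mean_activation_diff(base_w, tuned_w, inputs):
    """Average (tuned - base) hidden activation per unit over the same inputs."""
    n_units = len(base_w)
    totals = [0.0] * n_units
    for x in inputs:
        base_act = linear_relu(base_w, x)
        tuned_act = linear_relu(tuned_w, x)
        for i in range(n_units):
            totals[i] += tuned_act[i] - base_act[i]
    return [t / len(inputs) for t in totals]

# Toy "base model": 3 hidden units, 4 input features (hand-picked weights).
base_w = [
    [0.5, -0.2, 0.1, 0.3],
    [0.1, 0.4, -0.3, 0.2],
    [-0.6, 0.2, 0.5, 0.1],
]

# Hypothetical "fine-tuning": only hidden unit 1 is changed.
tuned_w = [row[:] for row in base_w]
tuned_w[1] = [w + 0.5 for w in tuned_w[1]]

# Both models see exactly the same inputs.
rng = random.Random(0)
inputs = [[rng.uniform(0.0, 1.0) for _ in range(4)] for _ in range(100)]

diff = mean_activation_diff(base_w, tuned_w, inputs)
most_changed = max(range(len(diff)), key=lambda i: abs(diff[i]))
print("mean activation difference per unit:", diff)
print("most changed unit:", most_changed)
```

In this toy setup the difference is zero for the untouched units and nonzero only for the modified one, which is the kind of localized signal a diffing method hopes to surface. Real models need far more than a mean difference, but the comparison has the same shape: same inputs, two models, inspect what changed inside.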

Posts tagged Model Diffing
What We Learned Trying to Diff Base and Chat Models (And Why It Matters)
Clément Dumas, Julian Minder, Neel Nanda (105 karma, 3mo, 2 comments)

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane, robertzk, Arthur Conmy, Neel Nanda (48 karma, 11mo, 4 comments)

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
Julian Minder, Clément Dumas, Stewy Slocum, Neel Nanda (45 karma, 19d, 2 comments)

SAE on activation differences
Santiago Aranguri, jacob_drori, Neel Nanda (44 karma, 3mo, 3 comments)

Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]
Jason Gross, rajashree (19 karma, 9mo, 0 comments)

[Replication] Crosscoder-based Stage-Wise Model Diffing
Anna Soligo, Thomas Read, Oliver Clive-Griffin, dmanningcoe, Chun Hei Yip, rajashree, Jason Gross (19 karma, 6mo, 0 comments)

Tied Crosscoders: Explaining Chat Behavior from Base Model
Santiago Aranguri (9 karma, 6mo, 0 comments)

[Research sprint] Single-model crosscoder feature ablation and steering
Thomas Read (8 karma, 6mo, 0 comments)