Anthropic recently released Stage-Wise Model Diffing, a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs, we worked with single-model all-layer crosscoders, and found that the technique is also effective with cross-layer features.
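For readers unfamiliar with the architecture, here is a minimal sketch of what we mean by a single-model all-layer crosscoder: a shared dictionary of features that is encoded from, and decoded back to, every layer's residual stream of one model. The class name, dimensions, and loss weighting below are illustrative, not the exact implementation in our repo.

```python
import torch
import torch.nn as nn

class AllLayerCrosscoder(nn.Module):
    """Single-model all-layer crosscoder (sketch): one shared dictionary of
    features is encoded from, and decoded back to, every layer's
    residual-stream activations."""

    def __init__(self, n_layers: int, d_model: int, n_features: int):
        super().__init__()
        # Per-layer encoder/decoder weights sharing a single feature space.
        self.W_enc = nn.Parameter(torch.randn(n_layers, d_model, n_features) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n_features, n_layers, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))

    def forward(self, acts: torch.Tensor):
        # acts: (batch, n_layers, d_model), the residual stream at each layer.
        # Each feature reads from all layers at once (the cross-layer part).
        pre_acts = torch.einsum("bld,ldf->bf", acts, self.W_enc) + self.b_enc
        feats = torch.relu(pre_acts)                       # (batch, n_features)
        recon = torch.einsum("bf,fld->bld", feats, self.W_dec)
        return recon, feats

def crosscoder_loss(model, acts, l1_coef=1e-3):
    # Reconstruction error plus an L1-style sparsity penalty in which each
    # feature activation is weighted by the size of its decoder directions.
    recon, feats = model(acts)
    recon_loss = (recon - acts).pow(2).sum(dim=(1, 2)).mean()
    dec_norms = model.W_dec.norm(dim=-1).sum(dim=-1)       # (n_features,)
    sparsity = (feats * dec_norms).sum(dim=-1).mean()
    return recon_loss + l1_coef * sparsity
```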
This post documents our methodology. We fine-tuned a TinyStories language model to exhibit sleeper agent behaviour, then trained and fine-tuned crosscoders to extract features and measure how those features change during fine-tuning. Running all training and experiments takes under an hour on a single RTX 4090 GPU.
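To make the "measure how features change" step concrete, here is a hedged sketch of one simple diffing metric: cosine distance between each feature's decoder directions before and after crosscoder fine-tuning. The function name and the choice of metric are ours for illustration and may differ from what our code actually computes.

```python
import torch

def feature_change_scores(W_dec_base: torch.Tensor,
                          W_dec_tuned: torch.Tensor) -> torch.Tensor:
    # Both inputs: (n_features, n_layers, d_model) decoder tensors from the
    # crosscoder before and after fine-tuning on the sleeper agent's
    # activations. Returns one score per feature; higher = more changed.
    base = W_dec_base.flatten(start_dim=1)
    tuned = W_dec_tuned.flatten(start_dim=1)
    cos = torch.nn.functional.cosine_similarity(base, tuned, dim=-1)
    return 1.0 - cos

# Usage sketch: rank features by change and inspect the most-moved ones,
# which are the candidates for fine-tuning-induced (sleeper) behaviour.
# scores = feature_change_scores(crosscoder_base.W_dec, crosscoder_tuned.W_dec)
# top_features = scores.topk(20).indices
```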
We release code for training and analysing sleeper agents and crosscoders, along with a set of trained models, on GitHub here.