[Replication] Crosscoder-based Stage-Wise Model Diffing
by Anna Soligo, Thomas Read, Oliver Clive-Griffin, dmanningcoe, Chun Hei Yip, rajashree, and Jason Gross
Introduction
Anthropic recently released Stage-Wise Model Diffing, which presents a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs we worked with single-model all-layer crosscoders, and found...
Mar 22, 2025