Anthropic recently released Stage-Wise Model Diffing, a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs, we worked with single-model all-layer crosscoders, and found that the technique is also effective with cross-layer features.
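For readers unfamiliar with the architecture, here is a minimal sketch of what we mean by a single-model all-layer crosscoder: a shared dictionary of features that is encoded from, and decoded back to, every layer's residual stream of one model. The class name, dimensions, and loss weighting below are illustrative, not the exact implementation in our repo.

```python
import torch
import torch.nn as nn

class AllLayerCrosscoder(nn.Module):
    """Single-model all-layer crosscoder (sketch): one shared dictionary of
    features is encoded from, and decoded back to, every layer's
    residual-stream activations."""

    def __init__(self, n_layers: int, d_model: int, n_features: int):
        super().__init__()
        # Per-layer encoder/decoder weights sharing a single feature space.
        self.W_enc = nn.Parameter(torch.randn(n_layers, d_model, n_features) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n_features, n_layers, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))

    def forward(self, acts: torch.Tensor):
        # acts: (batch, n_layers, d_model), the residual stream at each layer.
        # Each feature reads from all layers at once (the cross-layer part).
        pre_acts = torch.einsum("bld,ldf->bf", acts, self.W_enc) + self.b_enc
        feats = torch.relu(pre_acts)                       # (batch, n_features)
        recon = torch.einsum("bf,fld->bld", feats, self.W_dec)
        return recon, feats

def crosscoder_loss(model, acts, l1_coef=1e-3):
    # Reconstruction error plus an L1-style sparsity penalty in which each
    # feature activation is weighted by the size of its decoder directions.
    recon, feats = model(acts)
    recon_loss = (recon - acts).pow(2).sum(dim=(1, 2)).mean()
    dec_norms = model.W_dec.norm(dim=-1).sum(dim=-1)       # (n_features,)
    sparsity = (feats * dec_norms).sum(dim=-1).mean()
    return recon_loss + l1_coef * sparsity
```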
This post documents our methodology. We fine-tuned a TinyStories language model to exhibit sleeper agent behaviour, then trained and fine-tuned crosscoders to extract features and measure how those features change during fine-tuning. Running all training and experiments takes under an hour on a single RTX 4090 GPU.
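To make the "measure how features change" step concrete, here is a hedged sketch of one simple diffing metric: cosine distance between each feature's decoder directions before and after crosscoder fine-tuning. The function name and the choice of metric are ours for illustration and may differ from what our code actually computes.

```python
import torch

def feature_change_scores(W_dec_base: torch.Tensor,
                          W_dec_tuned: torch.Tensor) -> torch.Tensor:
    # Both inputs: (n_features, n_layers, d_model) decoder tensors from the
    # crosscoder before and after fine-tuning on the sleeper agent's
    # activations. Returns one score per feature; higher = more changed.
    base = W_dec_base.flatten(start_dim=1)
    tuned = W_dec_tuned.flatten(start_dim=1)
    cos = torch.nn.functional.cosine_similarity(base, tuned, dim=-1)
    return 1.0 - cos

# Usage sketch: rank features by change and inspect the most-moved ones,
# which are the candidates for fine-tuning-induced (sleeper) behaviour.
# scores = feature_change_scores(crosscoder_base.W_dec, crosscoder_tuned.W_dec)
# top_features = scores.topk(20).indices
```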
We release code for training and analysing sleeper agents and crosscoders, along with a set of trained models, on GitHub here.