SAE on activation differences
TLDR: we find that SAEs trained on the difference in activations between a base model and its instruct finetune are a valuable tool for understanding what changed during finetuning. This work is the result of Jacob and Santiago's 2-week research sprint as part of Neel Nanda's training phase for MATS...
Jun 30, 202545