TLDR: we find that SAEs trained on the difference in activations between a base model and its instruct finetune are a valuable tool for understanding what changed during finetuning.
This work is the result of Jacob and Santiago's 2-week research sprint as part of Neel Nanda's training phase for MATS 8.0.
Introduction
Given the overwhelming number of capabilities of current LLMs, we need a way to understand what functionalities are added when we train a new checkpoint of a model. This is especially relevant when deploying a new model, since an unexpected harmful or undesired behavior may be hidden among its many new and useful features.
Model diffing aims to find these differences between models. […]
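As a toy illustration of the setup in the TLDR (all shapes, data, and the random initialization here are hypothetical, not the authors' actual training configuration), a diff-SAE is trained on the *difference* of base and instruct activations rather than on either model's activations alone. A minimal sketch of the forward pass and objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for residual-stream activations at matching layer/token
# positions in the base model and its instruct finetune (synthetic data).
d_model, n_tokens, d_sae = 16, 128, 64
acts_base = rng.normal(size=(n_tokens, d_model))
acts_instruct = acts_base + 0.1 * rng.normal(size=(n_tokens, d_model))

# The diff-SAE's training data is the per-position activation difference.
diffs = acts_instruct - acts_base

# Minimal ReLU sparse autoencoder, randomly initialized here; in practice
# W_enc / W_dec are learned by minimizing the loss below over many diffs.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)

latents = np.maximum(diffs @ W_enc + b_enc, 0.0)  # sparse codes
recon = latents @ W_dec + b_dec                    # reconstruction of diffs

# Standard SAE objective: reconstruction error plus an L1 sparsity penalty.
l1_coeff = 1e-3
loss = np.mean((recon - diffs) ** 2) + l1_coeff * np.abs(latents).mean()
```

Latents that fire on the diffs then ideally correspond to behaviors the finetune added or changed, which is what makes this a useful lens on finetuning.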
I wanted to plug that I'm mentoring a project on eval awareness, applications closing Jan 14th. (I work as a research scientist at Goodfire.)
Thanks for the post. Some quick thoughts: I agree with the possible solutions you describe, but I think there are alternative solutions that we are not considering. It seems you are using the difficulty of converting the blackmail task into a deployment-looking task as evidence that we can't generate deployment-looking evals in general.
More broadly, I think there is a difference between rewriting an existing evaluation to look less like an eval and writing a deployment-looking eval from scratch, keeping in mind what makes the model realize it is… […]