Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
by Julian Minder, Clément Dumas, Stewy Slocum, and Neel Nanda
This work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Fabien Roger for their useful feedback. Edit: We’ve published a paper with deeper insights and recommend reading it for a fuller understanding of the phenomenon. TL;DR Claim: Narrow finetunes leave clearly readable...
Sep 5, 2025