Features of SAEs are universal - but only up to an unknown random rotation
Cross-model decoder-column cosine similarity suggests that two models learned the same features. Yet apply one model's SAE to the other model's activations, and its reconstruction score becomes negative. Why? And how can it be fixed? Epistemic status: I...
May 317
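To make the "same features, unknown rotation" claim concrete, here is a minimal toy sketch in NumPy. Everything in it is an assumption for illustration, not the post's actual setup: the dictionary `D` is square and orthonormal, the "SAE" is an idealized tied-weight ReLU autoencoder with no bias, model B's activations are exactly model A's activations times a random orthogonal matrix `Q`, and both models are run on the same inputs so their activations are paired. Orthogonal Procrustes is used here as one plausible way to recover the rotation; the excerpt does not say which fix the author uses. The toy only shows that the reconstruction score drops sharply without alignment and recovers with it; it is not expected to reproduce the negative score mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 32  # hypothetical sample count and activation width

def random_orthogonal(dim, rng):
    # QR decomposition of a Gaussian matrix yields an orthogonal factor.
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def r2(x, x_hat):
    # Fraction of variance explained; 1.0 means perfect reconstruction.
    sse = np.sum((x - x_hat) ** 2)
    sst = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - sse / sst

# Toy feature dictionary (rows are feature directions) and sparse,
# non-negative feature activations.
D = random_orthogonal(d, rng)
S = rng.uniform(0.5, 1.5, size=(n, d)) * (rng.random((n, d)) < 0.1)

# Model A's activations, and model B's: same features, rotated by Q.
Q = random_orthogonal(d, rng)
X_a = S @ D
X_b = X_a @ Q

def sae_a(x):
    # Idealized SAE for model A: tied weights, ReLU encoder, no bias.
    codes = np.maximum(x @ D.T, 0.0)
    return codes @ D

# Applying A's SAE directly to B's activations ignores the rotation.
r2_direct = r2(X_b, sae_a(X_b))

# Orthogonal Procrustes: the orthogonal R minimizing ||X_b R - X_a||_F
# is U @ Vt, where U, _, Vt is the SVD of X_b.T @ X_a.
U, _, Vt = np.linalg.svd(X_b.T @ X_a)
R_hat = U @ Vt

# Rotate B into A's basis, apply A's SAE, then rotate back into B's basis.
X_b_hat = sae_a(X_b @ R_hat) @ R_hat.T
r2_aligned = r2(X_b, X_b_hat)

print(f"R^2 without alignment: {r2_direct:.3f}")
print(f"R^2 with Procrustes alignment: {r2_aligned:.3f}")
```

In this idealized setting the recovered `R_hat` equals `Q.T`, because the activations differ by an exact rotation. With real models the relationship would only be approximate, and the paired-activation requirement is itself a substantive assumption.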