Looking for feature absorption automatically
Summary This post discusses a possible way to detect feature absorption: find SAE latents that (1) have a similar causal effect, but (2) don't activate on the same token. We'll discuss the theory of how this method should work, and we'll also briefly go over how it doesn't work in...
Aug 12, 202516