Hi, thanks for this post; it got me thinking.
But I still don't see why you consider feature absorption "pathological". It seems plausible to me that when an SAE doesn't have enough latents to learn every ground-truth feature, it will dedicate latents to some of the child features while mixing the remaining, lower-priority ones into a parent feature. Why do you take this as undesirable?
Or do you mean that absorption makes the interpretation of a given latent noisier, because we describe a parent feature as "starts with 'e'" when it actually means "starts with 'e' and is not 'elephant'"? If so, isn't that really a limitation of our imperfect auto-interp methods? (A toy sketch below illustrates the case I have in mind.)
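To pin down what I mean, here is a minimal sketch with hypothetical latents (none of this is the post's actual SAE; `parent_latent` and `child_latent` are made-up stand-ins): the parent latent nominally tracks "starts with 'e'", but the child latent has absorbed the 'elephant' case, so the parent is silent exactly there.

```python
# Toy illustration of feature absorption with hypothetical latents.
tokens = ["eagle", "echo", "elephant", "dog"]

def parent_latent(token: str) -> float:
    # Fires on "starts with 'e'" EXCEPT where the child latent absorbed the feature.
    return 1.0 if token.startswith("e") and token != "elephant" else 0.0

def child_latent(token: str) -> float:
    # Dedicated latent that absorbed the 'elephant' direction, including its "starts
    # with 'e'" component.
    return 1.0 if token == "elephant" else 0.0

for tok in tokens:
    print(f"{tok:>9}: parent={parent_latent(tok):.0f} child={child_latent(tok):.0f}")
```

The parent's best short description is "starts with 'e'", yet it never fires on 'elephant', so that description is systematically wrong on exactly the absorbed tokens. My question is whether that mismatch is a defect of the SAE or just of the description.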
Hi, thanks for your work. I was wondering why the activation is modified by scaling here, rather than solved for analytically by compensating for the $-cd/2$ term. (A rough sketch of the two options I have in mind is below.)
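Here is a minimal sketch of the contrast I'm asking about, under stated assumptions: `latent_acts`, `W_dec`, `c`, and `d` are hypothetical stand-ins for the post's quantities, not its actual code, and I may be misreading which quantity gets scaled.

```python
import numpy as np

# Hypothetical setup, just to phrase the question concretely.
rng = np.random.default_rng(0)
latent_acts = rng.random(16)            # SAE latent activations for one token
W_dec = rng.standard_normal((16, 64))   # decoder directions
c, d = 0.5, 2.0                         # whatever constants produce the -c*d/2 term

# Option A (what the post does, as I read it): rescale the activation and decode.
scale = 1.2
recon_scaled = (latent_acts * scale) @ W_dec

# Option B (what I'm asking about): leave the activations alone and add the
# -c*d/2 correction back analytically instead of folding it into a scale factor.
recon_compensated = latent_acts @ W_dec - (c * d / 2)
```

If an exact additive correction exists, is the scaling approach just an approximation chosen for convenience, or is there a reason the analytical compensation doesn't work?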