Hi, thanks for this post; it got me thinking.
But I still don't see why you consider feature absorption "pathological". It seems plausible to me that when an SAE doesn't have enough latents to learn every ground-truth feature, it will dedicate latents to some of the child features while mixing the remaining, lower-priority ones into a parent feature. Why do you take this as undesirable?
Or do you mean that absorption makes the interpretation of a given latent noisier, because we describe a parent feature as "starts with 'e'" when it actually means "starts with 'e' and is not 'elephant'"? If so, isn't that really a limitation of our imperfect auto-interp methods? (A toy sketch below illustrates the case I have in mind.)
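To pin down what I mean, here is a minimal sketch with hypothetical latents (none of this is the post's actual SAE; `parent_latent` and `child_latent` are made-up stand-ins): the parent latent nominally tracks "starts with 'e'", but the child latent has absorbed the 'elephant' case, so the parent is silent exactly there.

```python
# Toy illustration of feature absorption with hypothetical latents.
tokens = ["eagle", "echo", "elephant", "dog"]

def parent_latent(token: str) -> float:
    # Fires on "starts with 'e'" EXCEPT where the child latent absorbed the feature.
    return 1.0 if token.startswith("e") and token != "elephant" else 0.0

def child_latent(token: str) -> float:
    # Dedicated latent that absorbed the 'elephant' direction, including its "starts
    # with 'e'" component.
    return 1.0 if token == "elephant" else 0.0

for tok in tokens:
    print(f"{tok:>9}: parent={parent_latent(tok):.0f} child={child_latent(tok):.0f}")
```

The parent's best short description is "starts with 'e'", yet it never fires on 'elephant', so that description is systematically wrong on exactly the absorbed tokens. My question is whether that mismatch is a defect of the SAE or just of the description.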
Hi, thanks for your work. I was wondering why the activation is modified by scaling here, rather than solved for analytically by compensating for the $-cd/2$ term. (A rough sketch of the two options I have in mind is below.)
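Here is a minimal sketch of the contrast I'm asking about, under stated assumptions: `latent_acts`, `W_dec`, `c`, and `d` are hypothetical stand-ins for the post's quantities, not its actual code, and I may be misreading which quantity gets scaled.

```python
import numpy as np

# Hypothetical setup, just to phrase the question concretely.
rng = np.random.default_rng(0)
latent_acts = rng.random(16)            # SAE latent activations for one token
W_dec = rng.standard_normal((16, 64))   # decoder directions
c, d = 0.5, 2.0                         # whatever constants produce the -c*d/2 term

# Option A (what the post does, as I read it): rescale the activation and decode.
scale = 1.2
recon_scaled = (latent_acts * scale) @ W_dec

# Option B (what I'm asking about): leave the activations alone and add the
# -c*d/2 correction back analytically instead of folding it into a scale factor.
recon_compensated = latent_acts @ W_dec - (c * d / 2)
```

If an exact additive correction exists, is the scaling approach just an approximation chosen for convenience, or is there a reason the analytical compensation doesn't work?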