Makes sense! It depends on whether you're thinking of the values as "estimating zero ablation" or "estimating importance."


Very cool work!

- In the attention attribution section, you use `clean_pattern * clean_pattern_grad` as an approximation of zero ablation; should this be `-clean_pattern * clean_pattern_grad`? Zero ablation's approximation is `(0 - clean_pattern) * clean_pattern_grad = -clean_pattern * clean_pattern_grad`.
- Currently, negative name movers end up with negative attributions, but we'd like them to be positive (since zero ablating them *helps* performance and moves our metric towards one), right? Of course, this doesn't matter when you are just looking at magnitudes.
- Cool to see this applied to mean ablation!
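To make the sign question above concrete, here is a minimal sketch of the first-order (gradient-times-difference) attribution being discussed. The tensors are hypothetical stand-ins, not the demo's actual cached values:

```python
import torch

# Hypothetical attention pattern and its gradient w.r.t. the metric
# (stand-ins for cached values from a real forward/backward pass).
clean_pattern = torch.tensor([[0.7, 0.3], [0.2, 0.8]])
clean_pattern_grad = torch.tensor([[0.5, -0.5], [1.0, -1.0]])

# First-order estimate of the metric change from zero ablation:
# (ablated_value - clean_value) * grad, with ablated_value = 0,
# which simplifies to -clean_pattern * clean_pattern_grad.
zero_ablation_attr = (0.0 - clean_pattern) * clean_pattern_grad
```

Under this convention, a head whose zero ablation *helps* the metric gets a positive attribution.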


These bugs should be fixed, thanks for flagging!


Thanks! Yes, your description of zero ablation is correct. I think positive vs. negative is a matter of convention? To me, "positive = important" and "negative = damaging" is the intuitive way round, which is why I set it up the way I did.
And yeah, I would be excited to see this applied to mean ablation!
Thanks for noting the bugs, I should really freeze the demos on a specific
version of the library...

I'm not sure why the superposition hypothesis would predict that narrower, deeper networks would have more superposition than wider, shallower networks...