f3mi
In the context of AI interpretability, what is a feature, exactly?
As I read more about previous interpretability work, I've noticed a trend that implicitly defines a feature in a strangely human-centric way. It's a prior that expects networks to automatically generate features that correspond to how we process images/text because... why, exactly? Chris Olah's team at Anthropic thinks...
Self-Explaining Neural Networks: the interpretability technique no one seems to be talking about.
Disclaimer: I'm very new to alignment as a whole, so I wouldn't be surprised if this turned out to be a nothingburger. This is the coolest paper I've seen in a while, yet I've never heard of the technique. It isn't mentioned in blog posts about AI interpretability/AI safety, and...
I was thinking something potentially similar. This is super nitpicky, but the better equation would be impact = magnitude * ||direction||.
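For readability, here is the comment's suggested formula typeset as a display equation; the variable names are taken directly from the comment, and the surrounding post's definitions of impact, magnitude, and direction are not reproduced here.

% Suggested form from the comment above: impact as a scalar magnitude
% times the norm of the direction vector (names taken from the comment).
\[
  \text{impact} = \text{magnitude} \times \lVert \text{direction} \rVert
\]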