f3mi's Shortform
Dec 31, 2024
As I read more about previous interpretability work, I've noticed a trend of implicitly defining a feature in this weirdly human-centric way. It's this odd prior that expects networks to automatically generate features that correspond to how we process images/text because... why exactly? Chris Olah's team at Anthropic thinks...
Disclaimer: I'm very new to alignment as a whole, so I wouldn't be surprised if this turned out to be a nothingburger. This is the coolest paper I've seen in a while, yet I'd never heard of the technique. It's not mentioned in blog posts about AI interpretability/AI safety, and...