f3mi's Shortform
Dec 31, 2024
As I read more about previous interpretability work, I've noticed a trend of implicitly defining a feature in this weirdly human-centric way. It's this odd prior that expects networks to automatically generate features that correspond to how we process images/text because... why exactly? Chris Olah's team at Anthropic thinks...
Disclaimer: I'm very new to alignment as a whole, so I wouldn't be surprised if this turned out to be a nothingburger. This is the coolest paper I've seen in a while, yet I'd never heard of the technique. It's not mentioned in blog posts about AI interpretability/AI safety, and...