I'm skeptical, but I'd love to be convinced. I'm not sure that it's necessary to make interpretability scale, but it definitely strikes me as a potential trump card that would allow interpretability research to keep pace with capabilities research.
Here are a couple relatively unsorted thoughts (Keep in mind that I'm not a mathematician!):
Deep learning as a field isn't exactly known for its rigor. I don't know of any rigorous theory that isn't as you say purely 'reactive', with none of it leading to any significant 'real world' results. As far as I can tell this isn't for a lack of trying either. This has made me doubt its mathematical tractability, whether it's
I'm skeptical, but I'd love to be convinced. I'm not sure that it's necessary to make interpretability scale, but it definitely strikes me as a potential trump card that would allow interpretability research to keep pace with capabilities research.
Here are a couple relatively unsorted thoughts (Keep in mind that I'm not a mathematician!):
- Deep learning as a field isn't exactly known for its rigor. I don't know of any rigorous theory that isn't as you say purely 'reactive', with none of it leading to any significant 'real world' results. As far as I can tell this isn't for a lack of trying either. This has made me doubt its mathematical tractability, whether it's
... (read more)