I'm a senior at Harvard, where I run the Harvard AI Safety Team (HAIST). I also do research with David Krueger's lab at Cambridge University.
Makes sense! Depends on whether you're thinking of the values as "estimating zero ablation" or "estimating importance."
Very cool work!
You use `clean_pattern * clean_pattern_grad` as an approximation of zero ablation; should this be `-clean_pattern * clean_pattern_grad`? Zero ablation's approximation is `(0 - clean_pattern) * clean_pattern_grad = -clean_pattern * clean_pattern_grad`.
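(To make the sign question concrete, here's a minimal sketch of the first-order approximation I have in mind, with hypothetical stand-in tensors and names rather than the actual code from the post: the effect of patching an activation from its clean value to some other value is approximated as `(patched - clean) * grad`, and for zero ablation the patched value is 0.)

```python
import torch

# Hypothetical stand-ins for an attention pattern and a scalar metric.
clean_pattern = torch.rand(4, 4, requires_grad=True)
metric = (clean_pattern ** 2).sum()
metric.backward()
clean_pattern_grad = clean_pattern.grad

# Linear approximation of patching clean -> 0 (zero ablation):
# (0 - clean_pattern) * clean_pattern_grad = -clean_pattern * clean_pattern_grad,
# which is where the minus sign in my question comes from.
zero_ablation_attr = (0 - clean_pattern.detach()) * clean_pattern_grad
assert torch.allclose(
    zero_ablation_attr, -clean_pattern.detach() * clean_pattern_grad
)
```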
I'm not sure why the superposition hypothesis would predict that narrower, deeper networks would have more superposition than wider, shallower networks. I don't think I've seen this claim anywhere—if they learn all the same features and have the same number of neurons, I'd expect them to have similar amounts of superposition. Also, can you explain how the feature hypothesis "explains the results from Huang et al."?
More generally, I think superposition existing in toy models provides a plausible rationale for adversarial examples being both very common (even as we scale up models) and bugs. Given this and the Elhage et al. (2022) work (which is Bayesian evidence towards the bug hypothesis, despite the plausibility of confounders), I'm very surprised you come out with "Verdict: Moderate evidence in favor of the feature hypothesis."