I'm a senior at Harvard, where I run the Harvard AI Safety Team (HAIST). I also do research with David Krueger's lab at Cambridge University.
Very cool work!
On clean_pattern * clean_pattern_grad as an approximation of zero ablation: should this be
-clean_pattern * clean_pattern_grad? The first-order approximation of zero ablation is
(0 - clean_pattern) * clean_pattern_grad = -clean_pattern * clean_pattern_grad.
Makes sense! It depends on whether you're thinking of the values as "estimating zero ablation" or "estimating importance." The two differ only in sign.
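To make the sign question concrete, here is a minimal sketch with a toy linear metric standing in for the real model. The names metric, metric_grad, and weights, along with all the numbers, are made up for illustration and do not come from the original post. Because the toy metric is linear, the first-order estimate matches true zero ablation exactly; for a real model it would only be an approximation.

```python
# Toy stand-in for "metric as a function of an attention pattern".
# All names and values here are hypothetical, chosen only for illustration.

def metric(pattern, weights):
    """Linear toy metric: weighted sum of the pattern entries."""
    return sum(w * p for w, p in zip(weights, pattern))

def metric_grad(pattern, weights):
    """Gradient of the linear toy metric with respect to each pattern entry."""
    return list(weights)

clean_pattern = [0.7, 0.2, 0.1]
weights = [2.0, -1.0, 0.5]
clean_pattern_grad = metric_grad(clean_pattern, weights)

# "Estimating zero ablation": predicted change in the metric if an entry is set to 0.
zero_ablation_estimate = [
    (0.0 - p) * g for p, g in zip(clean_pattern, clean_pattern_grad)
]

# "Estimating importance": same magnitude, opposite sign.
importance_estimate = [p * g for p, g in zip(clean_pattern, clean_pattern_grad)]

def true_zero_ablation_effect(i):
    """Actual change in the metric when entry i alone is set to 0."""
    ablated = list(clean_pattern)
    ablated[i] = 0.0
    return metric(ablated, weights) - metric(clean_pattern, weights)

true_effects = [true_zero_ablation_effect(i) for i in range(len(clean_pattern))]

for est, imp, true in zip(zero_ablation_estimate, importance_estimate, true_effects):
    print(f"zero-ablation estimate {est:+.3f} | importance {imp:+.3f} | true effect {true:+.3f}")
```

The zero-ablation column tracks the true change in the metric, while the importance column is the same numbers with the sign flipped, so which one to report is a matter of convention as long as it is stated.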