I'm a senior at Harvard, where I run the Harvard AI Safety Team (HAIST). I also do research with David Krueger's lab at Cambridge University.
Very cool work!
clean_pattern * clean_pattern_grad
as an approximation of zero ablation; should this be -clean_pattern * clean_pattern_grad
? Zero ablation's approximation is (0 - clean_pattern)*clean_pattern_grad = -clean_pattern * clean_pattern_grad
.
Makes sense! Depends on if you're thinking about the values as "estimating zero ablation" or "estimating importance."