Xander Davies

I'm a senior at Harvard, where I run the Harvard AI Safety Team (HAIST). I also do research with David Krueger's lab at Cambridge University.

Wiki Contributions


Makes sense! Depends on if you're thinking about the values as "estimating zero ablation" or "estimating importance."

Very cool work! 

  • In the attention attribution section, you use clean_pattern * clean_pattern_grad as an approximation of zero ablation; should this be -clean_pattern * clean_pattern_grad? Zero ablation's approximation is (0 - clean_pattern)*clean_pattern_grad = -clean_pattern * clean_pattern_grad.
    • Currently, negative name movers end up with negative attributions, but we'd like them to be positive (since zero ablating helps performance and moves our metric towards one), right?
    • Of course, this doesn't matter when you are just looking at magnitudes.
  • Cool to note we can approximate mean ablation with (means - clean_act) * clean_grad_act!
  • (Minor note: I think the notebook is missing a `model.set_use_split_qkv_input(True)`. I also had to remove `from transformer_lens.torchtyping_helper import T`.)