A comparison of causal scrubbing, causal abstractions, and related methods
Summary: We explain the similarities and differences between three recent approaches to testing interpretability hypotheses: causal scrubbing, Geiger et al.'s causal abstraction-based method, and locally consistent abstractions. In particular, we show that each of these methods accepts some hypotheses that at least one of the others rejects.
Acknowledgements: Thanks to Dylan Xu...
Thanks for the post! I think an interesting direction for future work here would be to replace the manual engineering of suffixes with a gradient-based or greedy search, such as the one in https://arxiv.org/abs/2307.15043