Small foundational puzzle for causal theories of mechanistic interpretability
In this post I want to highlight a small puzzle for causal theories of mechanistic interpretability. It purports to show that causal abstractions do not, in general, correctly capture the mechanistic nature of models. Consider the following causal model M: Assume for the sake of argument that we only consider two...
Jul 5, 2025