Frederik Hytting Jørgensen — LessWrong

Small foundational puzzle for causal theories of mechanistic interpretability

In this post I want to highlight a small puzzle for causal theories of mechanistic interpretability. It purports to show that causal abstractions do not, in general, correctly capture the mechanistic nature of models. Consider the following causal model M: Assume for the sake of argument that we only consider two...

Jul 5, 2025•6