This is a fun example!
If I understand correctly, you demonstrate that whether an abstracted causal model $\mathcal{M}^*$ is a valid causal abstraction of an underlying causal model $\mathcal{M}$ depends on the set of input vectors $D_X$ considered, which I will call the “input distribution”.
But don’t causal models always require assumptions about the input distribution in order to be uniquely identifiable?
**Claim:** For any combination of abstracted causal model $\mathcal{M}^*$, putative underlying causal model $\mathcal{M}$, and input distribution $D_X$, we can construct an alternative underlying model $\mathcal{M}^+$ such that $\mathcal{M}^*$ is still a valid abstraction over an isomorphic input distribution $\mathrm{extend}(D_X)$, but not a valid abstraction on $\mathrm{extend}(D_X) \cup \{X^{+}\}$ for a certain $X^+$.
We can construct $\mathrm{extend}(D_X)$, $\mathcal{M}^+$, and $X^+$ as follows. Assuming $D_X$ is finite with $|D_X| = n$, each element $X_i \in D_X$ can be indexed with an integer $1 \leq i \leq n$, and we can define:
- $\mathrm{extend}(X_i) = (X_i, i)$, so that $\mathrm{extend}(D_X) = \{(X_i, i) : 1 \leq i \leq n\}$
- $\mathcal{M}^+((X, i)) = \mathcal{M}(X)$ for $i \leq n$ (i.e., the extra index $i$ is ignored)
- $X^+ = (X_1, n+1)$
- $\mathcal{M}^+((X, i)) = \mathcal{M}(X) + 1$ for $i > n$, where $\mathcal{M}(X) + 1$ is the vector $\mathcal{M}(X)$ with 1 added to all its components.
The two models are extensionally equivalent on $D_X$ (identifying each $X_i$ with $\mathrm{extend}(X_i)$), but in general they will not be extensionally equivalent on $\mathrm{extend}(D_X) \cup \{X^{+}\}$. So there will exist an implementation which is valid on the original domain but not on the extended one.
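Here is a minimal Python sketch of this construction, under the simplifying assumption that a causal model can be treated extensionally as a function from an input vector to an output vector; the names `M`, `extend_inputs`, `M_plus`, and the toy value of `D_X` are illustrative, not taken from the thread:

```python
# Sketch of the extend(D_X) / M^+ construction. Everything concrete here
# (the toy model M, the particular D_X) is an illustrative assumption.

def M(x):
    # Toy underlying model: the output vector is just (sum of the inputs,).
    return (sum(x),)

D_X = [(0, 0), (1, 1)]                       # finite input set with |D_X| = n
n = len(D_X)

def extend_inputs(inputs):
    # extend(X_i) = (X_i, i), using 1-based indices.
    return [(x, i) for i, x in enumerate(inputs, start=1)]

def M_plus(extended_input):
    # M^+ ignores the index for i <= n and adds 1 to every output component
    # for i > n, so it agrees with M on extend(D_X) but not beyond it.
    x, i = extended_input
    y = M(x)
    return y if i <= n else tuple(c + 1 for c in y)

extended_D_X = extend_inputs(D_X)
X_plus = (D_X[0], n + 1)                     # the extra adversarial input X^+

# M^+ matches M on the extended copies of D_X ...
assert all(M_plus((x, i)) == M(x) for (x, i) in extended_D_X)
# ... but deviates on X^+, so an abstraction certified only on extend(D_X)
# can break once X^+ is added.
assert M_plus(X_plus) != M(D_X[0])
```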
I'm a bit unsure about the way you formalize things, but I think I agree with your point. It is a helpful point. I'll try to state a similar (same?) point.
Assume that all variables have the natural numbers as their domain. Assume WLOG that all models only have one input and one output node. Assume that $\mathcal{M}^*$ is an abstraction of $\mathcal{M}$ relative to input support $D_X \subseteq \mathbb{N}$ and translation $\tau$. Now there exists a model $\mathcal{M}'$ such that $\mathcal{M}'(x) = \mathcal{M}(x)$ for all $x \in D_X$, but $\mathcal{M}^*$ is not a valid abstraction of $\mathcal{M}'$ relative to input support $\mathbb{N}$. For example, you may define the structural assignment of the output node in $\mathcal{M}'$ by
$$f'(x) = \begin{cases} f(x) & \text{if } x \neq x_0 \\ f(x) + 1 & \text{if } x = x_0, \end{cases}$$
where $x_0$ is an element in $\mathbb{N} \setminus D_X$, which we assume to be non-empty.
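For concreteness, a tiny numerical version of this one-point perturbation; the particular `f`, `D_X`, and `x0` below are made-up stand-ins, not values from the comment:

```python
# One-point perturbation: f' agrees with f on the input support D_X but is
# changed at a single point x0 outside D_X. All concrete values are stand-ins.

f = lambda x: 2 * x             # assumed structural assignment of the output node
D_X = {0, 1, 2}                 # assumed input support
x0 = 5                          # some element of N \ D_X (assumed non-empty)

def f_prime(x):
    # Identical to f everywhere except at x0.
    return f(x) if x != x0 else f(x) + 1

assert all(f_prime(x) == f(x) for x in D_X)   # same behaviour on D_X
assert f_prime(x0) != f(x0)                   # differs once x0 is allowed
```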
There is nothing surprising about this. As you say, we need assumptions to rule things like these out. And coming up with those assumptions seems potentially interesting. People working on mechanistic interpretability should think more about what assumptions would make their methods reasonable.
The main point of the post is not that causal abstractions do not provide guarantees about generalization (this point is underappreciated, but really, why would they?). My main point is that causal abstractions can misrepresent the mechanistic nature of the underlying model (this is of course related to generalizability).
Am I right that the line of argument here is not about the generalization properties, but a claim about the quality of explanation, even on the restricted distribution? As in, we can use the fact that our explanation fails to generalize to the inputs (0,1) and (1,0) as a demonstration that the explanation is not mechanistically faithful, even on the restricted distribution?
Sometimes models learn mechanisms that hold with high probability over the input distribution, but where we can easily construct adversarial examples. So I think we want to allow explanations that only hold on narrow distributions, to explain typical case behaviour. But I think these explanations should come equipped with conditions on the input distribution for the explanation to hold. Like here, your causal model should have the explicit condition "x_1=x_2".
> Am I right that the line of argument here is not about the generalization properties, but a claim about the quality of explanation, even on the restricted distribution?
Yes, I think that is a good way to put it. But faithful mechanistic explanations are closely related to generalization.
> Like here, your causal model should have the explicit condition "x_1=x_2".
That would be a sufficient condition for $\mathcal{M}^*$ to make the correct predictions. But that does not mean that $\mathcal{M}^*$ provides a good mechanistic explanation of $\mathcal{M}$ on those inputs.
Agreed, it's a necessary but not sufficient condition for explanation. In practice, all mechanisms have scopes of inputs where they are valid / invalid, and your mechanistic explanation should explicitly specify these scopes.
I think when you have such small input spaces it's hard to beat just giving a list of the computations for a mechanistic explanation though. Where mechanistic explanations can shine is when they can take some relatively weak assumptions on the input, including exponentially many potential inputs, and find ways to abstract the computation to make it more humanly understandable. And then your explanation consists of showing that under those assumptions on the input, the model indeed can be abstracted that way. And that forces you to engage with the mechanisms.
Like if we had your toy example from above, but instead the input space was $\{0, 1, \ldots, 1000000\} \times \{0, 1, \ldots, 1000000\}$, then an explanation that says $\mathcal{M}^*$ has the same output as $\mathcal{M}$ as long as $x_1 = x_2$ has a much shorter proof if we engage with the mechanisms than if we try to manually run all the inputs $(x, x)$ for all $x$ in $\{0, 1, \ldots, 1000000\}$.
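To make the contrast concrete, here is a toy version of the two strategies; the functions `M` and `M_star` below are stand-ins that agree whenever `x1 == x2`, not the models from the post:

```python
# Two ways to certify "M_star agrees with M whenever x1 == x2" on the domain
# {0,...,1_000_000}^2. Both models are illustrative stand-ins.

def M(x1, x2):
    return x1                   # output driven by the first input

def M_star(x1, x2):
    return x2                   # output read off the second input

# Strategy 1: brute force over the diagonal -- about a million evaluations.
assert all(M(x, x) == M_star(x, x) for x in range(1_000_001))

# Strategy 2: engage with the mechanisms -- a two-line argument:
# M returns x1 and M_star returns x2, so on any input with x1 == x2 the two
# outputs are literally the same value; no enumeration is needed.
```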
My gripe with most modern mech interp is that it doesn't even try to give conditions on the input, or to engage with the underlying mechanisms. Instead it either looks at very small input spaces, where very little can be said because it's easiest just to give the trace of the computation, or it looks for patterns in activations when running the model across large input spaces.
In this post I want to highlight a small puzzle for causal theories of mechanistic interpretability. It purports to show that causal abstractions do not, in general, correctly capture the mechanistic nature of models.
Consider the following causal model $\mathcal{M}$:
Assume for the sake of argument that we only consider two possible inputs, $(x_1, x_2) = (0, 0)$ and $(x_1, x_2) = (1, 1)$; that is, $x_1$ and $x_2$ are always equal.[1]
In this model, it is intuitively clear that $x_1$ is what causes the output, and $x_2$ is irrelevant. I will argue that this obvious asymmetry between $x_1$ and $x_2$ is not borne out by the causal theory of mechanistic interpretability.
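Since the figure specifying $\mathcal{M}$ is not reproduced above, here is one concrete model with the stated properties, written as a small Python sketch; the node names `a1`, `a2`, `y` and the exact wiring are my assumptions:

```python
# One possible low-level model M with the stated properties: the output is a
# function of x1 alone, and x2 is causally irrelevant. Node names are assumed.

def run_M(x1, x2, do=None):
    """Run M, optionally applying hard interventions via `do`,
    e.g. do={"a1": 0} clamps the intermediate node a1 to 0."""
    do = do or {}
    a1 = do.get("a1", x1)       # intermediate node copying x1
    a2 = do.get("a2", x2)       # intermediate node copying x2
    y = do.get("y", a1)         # output depends on a1 (hence on x1) only
    return {"a1": a1, "a2": a2, "y": y}

D_X = [(0, 0), (1, 1)]          # the restricted input set: x1 == x2 always
```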
Consider the following causal model $\mathcal{M}^*$:
Is $\mathcal{M}^*$ a valid causal abstraction of the computation that goes on in $\mathcal{M}$? That seems to depend on whether the intermediate node of $\mathcal{M}^*$ corresponds to $x_1$ or to $x_2$. If it corresponds to $x_1$, then it seems that $\mathcal{M}^*$ is a faithful representation of $\mathcal{M}$. If it corresponds to $x_2$, then $\mathcal{M}^*$ is not intuitively a faithful representation of $\mathcal{M}$. Indeed, if it corresponds to $x_2$, then we would get the false impression that $x_2$ is what causes the output.
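Here is a matching sketch of one possible $\mathcal{M}^*$, together with the two candidate correspondences for its intermediate node; the names `b`, `tau_via_a1`, and `tau_via_a2` are my own, not from the post:

```python
# One possible high-level model M*: a single intermediate node b that is
# copied from the input and propagated to the output. Names are assumed.

def run_M_star(b_in, do=None):
    do = do or {}
    b = do.get("b", b_in)       # the single intermediate node of M*
    y = do.get("y", b)          # output equals b
    return {"b": b, "y": y}

# The two candidate correspondences for b, as value mappings from low-level
# settings (as returned by run_M above) to high-level settings:
def tau_via_a1(low):            # b corresponds to a1, the copy of x1
    return {"b": low["a1"], "y": low["y"]}

def tau_via_a2(low):            # b corresponds to a2, the copy of x2
    return {"b": low["a2"], "y": low["y"]}
```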
Let's consider the situation where the intermediate node of $\mathcal{M}^*$ corresponds to $x_2$. Specifically, define a mapping $\tau$ between values in the two models such that the low-level inputs correspond to the high-level inputs, the value carried by $x_2$ corresponds to the value of the intermediate node of $\mathcal{M}^*$, and the low-level output corresponds to the high-level output. How do we define whether $\mathcal{M}^*$ abstracts $\mathcal{M}$ under $\tau$? The essential idea is that for every single-node hard intervention $\iota$ on intermediary nodes in the high-level model $\mathcal{M}^*$, there should be an implementation $\omega(\iota)$ of this intervention on the low-level model $\mathcal{M}$ such that
$$\tau\big(\mathcal{M}_{\omega(\iota)}(x)\big) = \mathcal{M}^*_{\iota}\big(\tau(x)\big) \quad \text{for every input } x \in \{(0,0), (1,1)\}.[2]$$
Let us be explicit about the implementations of the interventions: the high-level intervention that sets the intermediate node of $\mathcal{M}^*$ to $v$ is implemented on the low level by an intervention that fixes the intermediate values of $\mathcal{M}$ to $v$, for $v \in \{0, 1\}$. Now, we can check that the abstraction relationship holds. For example, take the input $(x_1, x_2) = (1, 1)$ and the high-level intervention that sets the intermediate node of $\mathcal{M}^*$ to $0$: applying its low-level implementation drives the output of $\mathcal{M}$ to $0$, and $\tau$ maps the resulting low-level values exactly to the values that the intervened $\mathcal{M}^*$ produces on $\tau((1,1))$.
That $\mathcal{M}^*$ is a valid causal abstraction of $\mathcal{M}$ under $\tau$ shows that the notion of causal abstraction, as I have formalized it here, does not correctly capture the computation that happens on the low level (and I claim that the same is true for similar formalizations, for example, Definition 25 in Geiger et al. 2025). Indeed, if the range of the inputs is extended such that $x_1 \neq x_2$ becomes possible, we would now make a wrong prediction if we use $\mathcal{M}^*$ to reason about $\mathcal{M}$, for example because on the input $(x_1, x_2) = (1, 0)$ the output of $\mathcal{M}$ is determined by $x_1$, while the prediction obtained from $\mathcal{M}^*$ via $\tau$ is read off $x_2$.
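Putting the sketches above together (and reusing `run_M`, `run_M_star`, `tau_via_a2`, and `D_X` from them), the following check illustrates both claims: the abstraction condition holds on the restricted inputs when interventions on the intermediate node of $\mathcal{M}^*$ are implemented by clamping both low-level intermediates, yet the prediction read off $x_2$ is wrong on $(1, 0)$. That particular choice of implementation is my assumption about what makes the example go through:

```python
# With b aligned to a2, the abstraction condition holds on D_X, but M*
# mispredicts M's output once inputs with x1 != x2 are allowed.

tau = tau_via_a2

for (x1, x2) in D_X:
    for v in (0, 1):
        # Implement the high-level intervention b := v by clamping both
        # low-level intermediates to v (one admissible implementation).
        low = run_M(x1, x2, do={"a1": v, "a2": v})
        high = run_M_star(tau(run_M(x1, x2))["b"], do={"b": v})
        assert tau(low) == high                      # holds on D_X
    # No-intervention case: extensional agreement on D_X.
    assert tau(run_M(x1, x2)) == run_M_star(tau(run_M(x1, x2))["b"])

# On the new input (1, 0), reading the prediction off a2 goes wrong:
low = run_M(1, 0)
assert low["y"] == 1                                 # M actually outputs 1
assert run_M_star(tau(low)["b"])["y"] == 0           # M* (via tau) predicts 0
```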
One might object that as long as the range of the inputs is restricted to $\{(0,0), (1,1)\}$, $\mathcal{M}^*$ should indeed be considered a valid abstraction of $\mathcal{M}$ under $\tau$. After all, the two models are extensionally equivalent on these inputs, that is, $\tau(\mathcal{M}(x)) = \mathcal{M}^*(\tau(x))$ for both $x \in \{(0,0), (1,1)\}$.
I think this objection misses the mark. The goal of mechanistic interpretability is to understand the intensional properties of algorithms and potentially use this understanding to make predictions about extensional properties. For example, if you examine the mechanisms of a neural network and find that it is simply adding two inputs, you can use this intensional understanding to make a prediction about the output on any new set of inputs. In our example, $\mathcal{M}^*$ and $\tau$ get the intensional properties of $\mathcal{M}$ wrong, incorrectly suggesting that it is $x_2$ rather than $x_1$ that causes the output. This incorrect intensional understanding leads to a wrong prediction about extensional behavior once the algorithm is evaluated on a new input outside $\{(0,0), (1,1)\}$. While I have not argued this here, I believe that this puzzle cannot be easily fixed, and that it points towards a fundamental limitation of the causal abstraction agenda for mechanistic interpretability, insofar as the definitions are meant to provide mechanistic understanding or guarantees about behavior.
Thanks to Atticus Geiger and Thomas Icard for interesting discussions related to this puzzle. Views are my own.