Small foundational puzzle for causal theories of mechanistic interpretability

by Frederik Hytting Jørgensen
5th Jul 2025
3 min read
6 comments, sorted by top scoring
ParrotRobot · 3d

This is a fun example!

If I understand correctly, you demonstrate that whether an abstracted causal model $\mathcal{M}^*$ is a valid causal abstraction of an underlying causal model $\mathcal{M}$ depends on the set of input vectors $D_X$ considered, which I will call the “input distribution”.

But don’t causal models always require assumptions about the input distribution in order to be uniquely identifiable?

**Claim:** For any combination of abstracted causal model $\mathcal{M}^*$, putative underlying causal model $\mathcal{M}$, and input distribution $D_X$, we can construct an alternative underlying model $\mathcal{M}^+$ such that $\mathcal{M}^*$ is still a valid abstraction of $\mathcal{M}^+$ over an isomorphic input distribution $\mathrm{extend}(D_X)$, but not a valid abstraction on $\mathrm{extend}(D_X) \cup \{X^{+}\}$ for a certain $X^+$.

We can construct $\mathrm{extend}(D_X)$ and $\mathcal{M}^+$ and $X^+$ as follows. Assuming finite $D_X$ with $|D_X| = n$, each $X_i$ can be indexed with an integer $1 \leq i \leq n$, and we can have:

- $\mathrm{extend}(X_i) = (X_i, i)$
- $\mathcal{M}^+((X_i, i)) = (i, \mathcal{M}(X_i))$ for $i \leq n$ (i.e., the extra input $i$ is ignored)
- $X^+ = (X_1, n+1)$
- $\mathcal{M}^+((X, i)) = (i, \mathcal{M}(X) + 1)$ for $i > n$, where $\mathcal{M}(X) + 1$ is the vector $\mathcal{M}(X)$ but with 1 added to all its components.

The two models are extensionally equivalent on $D_X$, but in general will not be extensionally equivalent on $\mathrm{extend}(D_X) \cup \{X^{+}\}$. There will exist an implementation which is valid on the original domain but not the extended one.
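A minimal Python sketch of this construction (the names make_extended_model and m_plus are illustrative, and the toy model at the bottom is just for demonstration):

```python
# Sketch of the construction above: take an arbitrary model m on a finite
# input set D_X, tag each input with its index, and build an extended model
# m_plus that agrees with m on the tagged copies of D_X but behaves
# differently on one extra input X+.

def make_extended_model(m, D_X):
    n = len(D_X)

    def m_plus(tagged_input):
        x, i = tagged_input
        if i <= n:
            # The extra index is ignored: same behaviour as m on extend(D_X).
            return (i, m(x))
        # On inputs with a fresh index, add 1 to every output component.
        return (i, tuple(v + 1 for v in m(x)))

    extended_domain = [(x, i + 1) for i, x in enumerate(D_X)]
    x_plus = (D_X[0], n + 1)  # the adversarial extra input X+
    return m_plus, extended_domain, x_plus


# Toy underlying model: identity on pairs of bits.
m = lambda x: tuple(x)
D_X = [(0, 0), (1, 1)]

m_plus, ext, x_plus = make_extended_model(m, D_X)
assert all(m_plus((x, i))[1] == m(x) for (x, i) in ext)  # agree on extend(D_X)
print(m(x_plus[0]), m_plus(x_plus))  # (0, 0) vs (3, (1, 1)): they disagree on X+
```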

Frederik Hytting Jørgensen · 2d

I'm a bit unsure about the way you formalize things, but I think I agree with your point, and it is a helpful one. I'll try to state a similar (same?) point.

Assume that all variables have the natural numbers as their domain. Assume WLOG that all models have only one input and one output node. Assume that M∗ is an abstraction of M relative to input support I=[n] and τ. Now there exists a model M+ such that M(j)=M+(j) for all j∈I, but M∗ is not a valid abstraction of M+ relative to input support I+=[n+1]. For example, you may define the structural assignment of the output node in M+ by

$$F^{M^+}_{\text{output}}(X^+) := \begin{cases} x & \text{if } X^+_{\text{input}} \geq n+1 \\ F^{M}_{\text{output}}(X^+) & \text{if } X^+_{\text{input}} \in [n], \end{cases}$$

where x is an element of $\mathbb{N}\setminus\tau^{-1}_{\text{output}}\big(M^*(\tau_{\text{input}}(n+1))[\text{output}]\big)$, which we assume to be non-empty.
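A minimal Python sketch of this construction (one input node and one output node, natural-number domains; all function and argument names are illustrative):

```python
# Sketch: given M, M*, tau and the input support [n], build M+ that agrees
# with M on [n] but returns, on larger inputs, a fixed value x whose
# tau-image differs from M*'s prediction at n + 1 (assumed to exist, i.e.
# the complement of the preimage is non-empty).

def make_M_plus(run_M, run_Mstar, tau_input, tau_output, n):
    predicted = run_Mstar(tau_input(n + 1))  # M*'s output at the new abstracted input
    x = 0
    while tau_output(x) == predicted:        # search for x outside the preimage
        x += 1

    def run_M_plus(j):
        return run_M(j) if j <= n else x     # identical to M on [n], constant x above
    return run_M_plus
```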

There is nothing surprising about this. As you say, we need assumptions to rule things like these out. And coming up with those assumptions seems potentially interesting. People working on mechanistic interpretability should think more about what assumptions would make their methods reasonable. 

The main point of the post is not that causal abstractions do not provide guarantees about generalization (this point is underappreciated, but really, why would they?).  My main point is that causal abstractions can misrepresent the mechanistic nature of the underlying model (this is of course related to generalizability).    

Alex Gibson · 2d

Am I right that the line of argument here is not about the generalization properties, but a claim about the quality of explanation, even on the restricted distribution? As in, we can use the fact that our explanation fails to generalize to the inputs (0,1) and (1,0) as a demonstration that the explanation is not mechanistically faithful, even on the restricted distribution?

Sometimes models learn mechanisms that hold with high probability over the input distribution, but where we can easily construct adversarial examples. So I think we want to allow explanations that only hold on narrow distributions, to explain typical case behaviour. But I think these explanations should come equipped with conditions on the input distribution for the explanation to hold. Like here, your causal model M∗ should have the explicit condition "x_1=x_2".

Frederik Hytting Jørgensen · 1d

> Am I right that the line of argument here is not about the generalization properties, but a claim about the quality of explanation, even on the restricted distribution?

Yes, I think that is a good way to put it. But faithful mechanistic explanations are closely related to generalization.  

> Like here, your causal model M∗ should have the explicit condition "x_1=x_2".

That would be a sufficient condition for M∗ to make the correct predictions. But that does not mean that M∗ provides a good mechanistic explanation of M on those inputs. 

Alex Gibson · 1d

Agreed, it's a necessary but not sufficient condition for explanation. In practice, all mechanisms have scopes of inputs where they are valid / invalid, and your mechanistic explanation should explicitly specify these scopes. 

I think when you have such small input spaces it's hard to beat just giving a list of the computations for a mechanistic explanation though. Where mechanistic explanations can shine is when they can take some relatively weak assumptions on the input, including exponentially many potential inputs, and find ways to abstract the computation to make it more humanly understandable. And then your explanation consists of showing that under those assumptions on the input, the model indeed can be abstracted that way. And that forces you to engage with the mechanisms.

Like if we had your toy example from above, but instead the input space was {0,1,...,1000000} × {0,1,...,1000000}, then an explanation that says M∗ has the same output as M as long as x1=x2 has a much shorter proof if we engage with the mechanisms than if we try to manually run all the inputs (x,x) for all x in {0,1,...,1000000}.

My gripe with most modern mech interp is that it doesn't even try to give conditions on the input, or engage with the underlying mechanisms. Instead it just is either looking at very small input spaces, where very little can be said, because it's easiest just to give the trace of the computation, or looking for patterns in activations when running the model across large input spaces.


In this post I want to highlight a small puzzle for causal theories of mechanistic interpretability. It purports to show that causal abstractions do not generally correctly capture the mechanistic nature of models. 


Consider the following causal model M:

[Figure: the causal model M, with inputs X1 and X2, intermediate nodes X3 := X1 and X4 := X2, and output X5 := X3.]
Assume for the sake of argument that we only consider two possible inputs: (0,0) and (1,1), that is, X1 and X2 are always equal.[1]

In this model, it is intuitively clear that X1 is what causes the output X5, and X2 is irrelevant. I will argue that this obvious asymmetry between X1 and X2 is not borne out by the causal theory of mechanistic interpretability.

Consider the following causal model M∗:

[Figure: the causal model M∗, with input Y1, intermediate node Y2 := Y1, and output Y3 := Y2.]
Is M∗ a valid causal abstraction of the computation that goes on in M? That seems to depend on whether Y1 corresponds to X1 or to X2. If Y1 corresponds to X1, then it seems that M∗ is a faithful representation of M. If Y1 corresponds to X2, then M∗ is not intuitively a faithful representation of M. Indeed, if Y1 corresponds to X2, then we would get the false impression that X2 is what causes the output X5.
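To make the setup concrete, here is a minimal Python sketch of the two models as structural equations with hard interventions. The structural assignments of M (X3 := X1, X4 := X2, X5 := X3) and of M∗ (Y2 := Y1, Y3 := Y2) are the ones implied by the computations below and in footnote 2; the function names are illustrative.

```python
# Structural-equation sketches of M and M*, with optional hard interventions
# passed as a dict, e.g. do={"X3": 1} for do(X3 := 1).

def run_M(x1, x2, do=None):
    """Low-level model M; returns the value vector (X1, X2, X3, X4, X5)."""
    do = do or {}
    x3 = do.get("X3", x1)  # X3 := X1 unless intervened on
    x4 = do.get("X4", x2)  # X4 := X2 unless intervened on
    x5 = do.get("X5", x3)  # X5 := X3 unless intervened on
    return (x1, x2, x3, x4, x5)

def run_Mstar(y1, do=None):
    """High-level model M*; returns the value vector (Y1, Y2, Y3)."""
    do = do or {}
    y2 = do.get("Y2", y1)  # Y2 := Y1 unless intervened on
    y3 = do.get("Y3", y2)  # Y3 := Y2 unless intervened on
    return (y1, y2, y3)

print(run_M(1, 1))                 # (1, 1, 1, 1, 1), matching footnote 2
print(run_M(0, 0, do={"X3": 1}))   # (0, 0, 1, 0, 1), used in the check below
```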

Let's consider the situation where Y1 corresponds to X2. Specifically, define a mapping between values in the two models 

$$\tau(x_1,x_2,x_3,x_4,x_5)=(\tau_1(x_1,x_2),\,\tau_2(x_3,x_4),\,\tau_3(x_5))$$

with

$$\tau_1(x_1,x_2)=x_2,\qquad \tau_2(x_3,x_4)=x_3,\qquad \tau_3(x_5)=x_5,$$

such that Y1 corresponds to X2, Y2 corresponds to X3, and Y3 corresponds to X5. How do we define whether M∗ abstracts M under τ? The essential idea is that for every single-node hard intervention i on intermediary nodes in the high-level model M∗, there should be an implementation Imp(i) of this intervention on the low-level model such that [2]
 

$$\forall (x_1,x_2)\in\{(0,0),(1,1)\}:\quad \tau\!\left(M_{\mathrm{Imp}(i)}(x_1,x_2)\right)=M^*_{i}\!\left(\tau_1(x_1,x_2)\right).$$
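As a concrete sketch, τ is just a projection of the low-level value vector, and the condition above can be checked by brute force over the allowed inputs and interventions (the helper is_valid_abstraction and its signature are my own):

```python
# tau projects the low-level value vector (x1, ..., x5) onto (x2, x3, x5),
# so that Y1 tracks X2, Y2 tracks X3 and Y3 tracks X5.

def tau(values):
    x1, x2, x3, x4, x5 = values
    return (x2, x3, x5)

def tau1(x1, x2):
    return x2  # the high-level input Y1 is read off from X2

def is_valid_abstraction(run_low, run_high, tau, tau1, imp_pairs, inputs):
    """Check that for every (high-level intervention, low-level implementation)
    pair and every allowed input, tau(M_Imp(i)(x1, x2)) == M*_i(tau1(x1, x2))."""
    return all(
        tau(run_low(x1, x2, do=low)) == run_high(tau1(x1, x2), do=high)
        for (x1, x2) in inputs
        for (high, low) in imp_pairs
    )
```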


Let us be explicit about the implementations of interventions:

$$\mathrm{Imp}(\mathrm{do}(\text{nothing}))=\mathrm{do}(\text{nothing}),\qquad \mathrm{Imp}(\mathrm{do}(Y_2:=y))=\mathrm{do}(X_3:=y),$$

for y∈{0,1}. Now, we can check that the abstraction relationship holds. For example:
 

$$\tau\!\left(M_{\mathrm{Imp}(\mathrm{do}(Y_2:=1))}(0,0)\right)=\tau(0,0,1,0,1)=(0,1,1),$$
$$M^*_{\mathrm{do}(Y_2:=1)}(\tau_1(0,0))=M^*_{\mathrm{do}(Y_2:=1)}(0)=(0,1,1).$$

That M∗ is a valid causal abstraction of M under τ shows that the notion of causal abstraction, as I have formalized it here, does not correctly capture the computation that happens on the low level (and I claim that the same is true for similar formalizations, for example, Definition 25 in Geiger et al. 2025). Indeed, if the range of (X1,X2) is extended such that (X1,X2)∈{0,1}², we would now make a wrong prediction if we use M∗ to reason about M, for example, because M∗(τ1(1,0))=(0,0,0)≠(0,1,1)=τ(M(1,0)).
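Reusing run_M, run_Mstar, tau, tau1 and is_valid_abstraction from the sketches above, the successful check on the restricted inputs and the failed prediction at (1,0) can be reproduced directly:

```python
# Imp maps do(nothing) to do(nothing) and do(Y2 := y) to do(X3 := y),
# exactly as defined in the text.
imp_pairs = [({}, {}),                 # do(nothing)
             ({"Y2": 0}, {"X3": 0}),   # Imp(do(Y2 := 0)) = do(X3 := 0)
             ({"Y2": 1}, {"X3": 1})]   # Imp(do(Y2 := 1)) = do(X3 := 1)

# The abstraction condition holds on the restricted inputs ...
print(is_valid_abstraction(run_M, run_Mstar, tau, tau1, imp_pairs,
                           inputs=[(0, 0), (1, 1)]))   # True

# ... but fails once (1, 0) is allowed:
print(tau(run_M(1, 0)))        # (0, 1, 1)
print(run_Mstar(tau1(1, 0)))   # (0, 0, 0), a wrong prediction about M
print(is_valid_abstraction(run_M, run_Mstar, tau, tau1, imp_pairs,
                           inputs=[(0, 0), (1, 1), (1, 0)]))  # False
```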

One might object that as long as the range of (X1,X2) is {(1,1),(0,0)}, M∗ should indeed be considered a valid abstraction of M under τ. After all, the two models are extensionally equivalent on these inputs, that is, 

$$\tau(M(1,1))=(1,1,1)=M^*(\tau_1(1,1)),\qquad \tau(M(0,0))=(0,0,0)=M^*(\tau_1(0,0)).$$

I think this objection misses the mark. The goal of mechanistic interpretability is to understand the intensional properties of algorithms and potentially use this understanding to make predictions about extensional properties. For example, if you examine the mechanisms of a neural network and find that it is simply adding two inputs, you can use this intensional understanding to make a prediction about the output on any new set of inputs. In our example, M∗ and τ get the intensional properties of M wrong, incorrectly suggesting that it is X2 rather than X1 that causes the output. This incorrect intensional understanding leads to a wrong prediction about extensional behavior once the algorithm is evaluated on a new input not in {(0,0),(1,1)}. While I have not argued this here, I believe that this puzzle cannot be easily fixed, and that it points towards a fundamental limitation of the causal abstraction agenda for mechanistic interpretability, insofar as the definitions are meant to provide mechanistic understanding or guarantees about behavior.
 

Thanks to Atticus Geiger and Thomas Icard for interesting discussions related to this puzzle. Views are my own.

1. Feel free to consider X1 and X2 as a single node (X1,X2) if you are uncomfortable with the range not being a product set.

2. Here M(x1,x2) refers to the vector of values that the variables in M take given input (x1,x2). For example, M(1,1)=(1,1,1,1,1).