Transformer Circuit Faithfulness Metrics Are Not Robust

bilalchughtai; William_S

But in other aspects there often isn't a clearly correct methodology. For example, it's unclear whether mean ablations are better than resample ablations for a particular experiment - even though this choice can dramatically change the outcome.

Would you ever really want mean ablation except as a cheaper approximation to resample ablation?

It seems to me that if you ask the question clearly enough, there's a correct kind of ablation. For example, if the question is "how do we reproduce this behavior from scratch", you want zero ablation.

Your table can be reorganized into the kinds of answers you're seeking, namely:

- direct effect vs indirect effect corresponds to whether you ablate the complement of the circuit (direct effect) vs restoring the circuit itself (indirect effect, mediated by the rest of the model)
- necessity vs sufficiency corresponds to whether you ablate the circuit (direct effect necessary) / restore the complement of the circuit (indirect effect necessary) vs restoring the circuit (indirect effect sufficient) / ablating the complement of the circuit (direct effect sufficient)
- typical case vs worst case, and over what data distribution:
  - "all tokens vs specific tokens" should be absorbed into the more general category of "what's the reference dataset distribution under consideration" / "what's the null hypothesis over",
  - zero ablation answers "reproduce behavior from scratch"
  - mean ablation is an approximation to resample ablation which itself is an approximation to computing the expected/typical behavior over some distribution
  - pessimal ablation is for dealing with worst-case behaviors
- granularity and component are about the scope of the solution language, and can be generalized a bit
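
As a rough illustration of the necessity vs sufficiency distinction in the list above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the component names, the activation values, and the additive "model" are hypothetical and are not taken from the paper or from any library. Because the toy model is purely additive, ablating the complement and restoring the circuit coincide, so it does not capture the direct vs indirect effect distinction.

```python
# Toy illustration only: an additive "model" whose output is the sum of
# its component activations. All names and numbers are hypothetical.
clean = {"head_a": 2.0, "head_b": 1.0, "head_c": 0.1}      # clean-run activations
reference = {"head_a": 0.5, "head_b": 0.5, "head_c": 0.1}  # e.g. means over some dataset
circuit = {"head_a", "head_b"}                             # hypothesized circuit


def run(ablated):
    """Output with every component in `ablated` replaced by its reference value."""
    return sum(reference[k] if k in ablated else clean[k] for k in clean)


complement = set(clean) - circuit

full_output = run(set())              # nothing ablated
sufficiency_output = run(complement)  # ablate everything outside the circuit
necessity_output = run(circuit)       # ablate the circuit itself

print(full_output, sufficiency_output, necessity_output)
```

In this toy, the circuit looks sufficient if `sufficiency_output` stays close to `full_output`, and necessary if `necessity_output` falls well below it. Swapping `reference` for zeros, or for means over a different dataset, changes both numbers, which is one way to see why the choice of ablation matters.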

Edit: This seems related to Hypothesis Testing the Circuit Hypothesis in LLMs

Joseph Miller, 1y

Would you ever really want mean ablation except as a cheaper approximation to resample ablation?

Resample ablation is not more expensive than mean (they both are just replacing activations with different values). But to answer the question, I think you would - resample ablation biases the model toward some particular corrupt output.

It seems to me that if you ask the question clearly enough, there's a correct kind of ablation. For example, if the question is "how do we reproduce this behavior from scratch", you want zero ablation.

Yes, I agree. That's the point we were trying to communicate with "the ablation determines the task."

direct effect vs indirect effect corresponds to whether you ablate the complement of the circuit (direct effect) vs restoring the circuit itself (indirect effect, mediated by the rest of the model)
necessity vs sufficiency corresponds to whether you ablate the circuit (direct effect necessary) / restore the complement of the circuit (indirect effect necessary) vs restoring the circuit (indirect effect sufficient) / ablating the complement of the circuit (direct effect sufficient)

Thanks! That's a great perspective. We probably should have done more to connect ablations back to the causality literature.

"all tokens vs specific tokens" should be absorbed into the more general category of "what's the reference dataset distribution under consideration" / "what's the null hypothesis over",
mean ablation is an approximation to resample ablation which itself is an approximation to computing the expected/typical behavior over some distribution

These don't seem correct to me. Could you explain further? "Specific tokens" means "we specify the token positions at which each edge in the circuit exists".

Jason Gross, 1y

Resample ablation is not more expensive than mean (they both are just replacing activations with different values). But to answer the question, I think you would - resample ablation biases the model toward some particular corrupt output.

Ah, I guess I was incorrectly imagining a more expensive version of resample ablation where you look not just at a single corrupted cache, but at the result across all corrupted inputs. That is, in the simple toy model where you're computing $f(x, y)$, where $x$ is the values for the circuit you care about and $y$ is the cache of corrupted activations, mean ablation is computing $f(x, E_{y \sim D} y)$, and we could imagine versions of resample ablation that compute $f(x, y)$ for some $y$ drawn from $D$, or that compute $E_{y \sim D} f(x, y)$. I would say that mean ablation and resample ablation, as I understand you to be describing it, are both attempts to cheaply approximate $E_{y \sim D} f(x, y)$.
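
To make the three quantities concrete, here is a minimal Python sketch. It rests on assumptions made purely for illustration: `f` is a made-up nonlinear function standing in for the model, and `D` is a two-point distribution of corrupted activations. Neither comes from the paper.

```python
# Toy illustration only: f and D are hypothetical stand-ins, not the paper's setup.

def f(x: float, y: float) -> float:
    """Stand-in for the model output given circuit values x and other activations y."""
    return x + max(y, 0.0)  # nonlinear in y


D = [-1.0, 1.0]  # possible corrupted activations, each equally likely
x = 2.0          # clean activation of the circuit we care about

mean_y = sum(D) / len(D)
mean_ablation = f(x, mean_y)                      # f(x, E_y[y])
single_resamples = [f(x, y) for y in D]           # f(x, y) for each possible draw of y
expected_output = sum(single_resamples) / len(D)  # E_y[f(x, y)]

print(mean_ablation, single_resamples, expected_output)
```

In this toy, neither cheap estimate equals the expectation. Averaging over many resamples converges to it, whereas mean ablation matches it only in special cases, such as when `f` is linear in `y`.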

Jason Gross, 1y

Do you want your IOI circuit to include the mechanism that decides it needs to output a name? Then use zero ablations. Or do you want to find the circuit that, given the context of outputting a name, completes the IOI task? Then use mean ablations. The ablation determines the task.

Mean ablation over webtext rather than the IOI task set should work just as well as zero ablation, right? "Mean ablation" is underspecified in the absence of a dataset distribution.

Joseph Miller, 1y

Yes, that's correct; this wording was imprecise.


See the paper for explanations of all of these terms.


We review seven previous works and find that they each use a different set of choices:


For example:

"When Peter and Jackie went to the store, Frank gave a bottle of milk to ___".


Reviewer jy1a actually points this out in their review. But the authors respond:

"as the mean operation is linear, the average difference in logits and the difference in average logits are the same."

which is incorrect. Consider the example where

$$M(x)_{\text{clean}} - M(x)_{\text{corrupt}} \in \{-1, 1\}$$

and

$$E[M(x)_{\text{clean}} - M(x)_{\text{corrupt}}] = 10^{-9}$$

$$F(x)_{\text{clean}} - F(x)_{\text{corrupt}} \equiv 1$$

then

$$\frac{E[F(x)_{\text{clean}} - F(x)_{\text{corrupt}}]}{E[M(x)_{\text{clean}} - M(x)_{\text{corrupt}}]} = 10^{9}$$

while

$$E\left[\frac{F(x)_{\text{clean}} - F(x)_{\text{corrupt}}}{M(x)_{\text{clean}} - M(x)_{\text{corrupt}}}\right] \in [-1, 1]$$

The key insight is that

$$E\left[\frac{1}{y}\right] \neq \frac{1}{E[y]}$$
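
The counterexample above can be checked exactly with a short Python sketch. The two-point distribution used here is an assumption: it is one distribution consistent with the stated constraints (a difference in {-1, 1} with mean 1e-9), not something specified in the footnote. The variable names are hypothetical.

```python
from fractions import Fraction

# Assumed distribution of M(x)_clean - M(x)_corrupt: +1 or -1, with the
# probabilities chosen so that the mean is exactly 1e-9.
eps = Fraction(1, 10**9)
p_plus = (1 + eps) / 2
m_diff = {Fraction(1): p_plus, Fraction(-1): 1 - p_plus}

f_diff = Fraction(1)  # F(x)_clean - F(x)_corrupt is identically 1

# Ratio of expectations: E[F diff] / E[M diff]
mean_m = sum(value * prob for value, prob in m_diff.items())
ratio_of_means = f_diff / mean_m

# Expectation of ratios: E[F diff / M diff]
mean_of_ratios = sum((f_diff / value) * prob for value, prob in m_diff.items())

print(float(ratio_of_means), float(mean_of_ratios))
```

Using `Fraction` keeps the arithmetic exact, so the gap between the two quantities cannot be blamed on floating-point rounding.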


There appears to be an oversight in the official open source implementation of IOI in which the specified number of prompts for the mean ablation is divided between all of the templates. This is not the same codebase the authors used for their experiments but it seems likely that it was copied over and that this is the explanation for the seemingly small dataset size used to calculate the mean ablation.


In particular, when we ablate the circuit, rather than the complement of the circuit, we destroy the performance with both types of ablations.

This suggests that they have actually identified the part of the model responsible for factual recall.


Mean ablation gives 33% accuracy because that is the chance of randomly choosing the correct one of three sports. Resample ablation, on the other hand, actively biases the model towards a particular different sport, resulting in 0% accuracy.
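
A toy Python sketch of why the two ablations can land at chance versus zero. It assumes a hypothetical one-hot "sport" feature, a reference set balanced across the three sports, and a resampled prompt that is always about a different sport. None of this is taken from the actual model or codebase.

```python
# Toy illustration only: a hypothetical one-hot "sport" feature.
SPORTS = ["baseball", "basketball", "football"]


def one_hot(sport):
    return [1.0 if s == sport else 0.0 for s in SPORTS]


def expected_accuracy(feature, correct):
    """Accuracy when predicting the argmax of `feature`, breaking ties uniformly at random."""
    top = max(feature)
    winners = [s for s, v in zip(SPORTS, feature) if v == top]
    return 1.0 / len(winners) if correct in winners else 0.0


correct = "basketball"

# Mean ablation: average the feature over a reference set balanced across sports.
reference = [one_hot(s) for s in SPORTS]
mean_feature = [sum(col) / len(reference) for col in zip(*reference)]
mean_acc = expected_accuracy(mean_feature, correct)

# Resample ablation: swap in the feature from a prompt about a different sport.
resample_acc = expected_accuracy(one_hot("football"), correct)

print(mean_acc, resample_acc)
```

Under these assumptions the averaged feature carries no information about the sport, so the prediction is a three-way tie, while the resampled feature points confidently at the wrong sport.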
