Editing remark: Footnotes 7–9 are not referenced in the post.
I am especially curious to know the use you make of footnote 8.
Ooh, thanks, they were vestigial.
I was going to reference The Queer Algorithm to show how counterfactuals, as a concept, appear elsewhere.
Chevillon discusses how AI implicitly needs to form counterfactuals when dealing with missing data. For instance, imagine that the training corpus has no representation of gay men. When generating outputs, the AI behaves as if the training data did contain those representations. Chevillon further argues that the way AI currently performs these implicit counterfactuals relies on interpolating dominant patterns in the data, which is insufficient for minoritized communities.
When generating outputs, the AI behaves as if the training data did contain those representations.
Did you mean "did not contain"? The AI can't talk about things not in or implied by the training data, unless one leads it on by the questions one poses. It does not know about missing data, because it has nothing but the training data (by which term I'm including the RLHF phase and everything else that happens before the LLM is released). For example, none of the LLMs know anything about the European shadwell, because it's a nonexistent bird I just made up. Depending on the LLM, it may just make stuff up, or do something more sensible, as here:
Me: Tell me about the migratory habits of the European shadwell.
ChatGPT: I’m checking whether “European shadwell” is the standard name of a species or a different term, then I’ll give you the migration details from reliable sources.
[Consults various web sites]
There doesn’t seem to be any European animal called the “European shadwell.” You may mean European shad — usually the allis shad (Alosa alosa) and twaite shad (Alosa fallax).
[Proceeds to tell me about those, which turn out to be fish in the herring family. The information checks out on Wikipedia.]
Of course, ChatGPT and all the others do have representations of gay men, so what work is "gay men" specifically doing in your (or Chevillon's) counterfactual? I chose "European shadwell" deliberately to be absent, to obviate counterfactual speculation. All an LLM can do is search for it on the web, and, coming up empty, make a guess at what I might have meant.
Chevillon further argues that the way AI currently performs these implicit counterfactuals relies on interpolating dominant patterns in the data, which is insufficient for minoritized communities.
It works with what it has. It cannot step outside the cave of its training into the light of truth. Getting it to do that didn't work out well for the image generator of a couple of years back that put black people into images of the American founding fathers. The designers were just leading it out of one cave into another more to their liking (something which, incidentally, should be suspected of anyone touting red pills).
I am not seeing a problem here. Chevillon's writings and everything else in "queer studies" are likely already in the training data, or if not yet, they will be.
This is the fourth entry in the "Which Circuit is it?" series. We will explore the notion of counterfactual faithfulness. This project is done in collaboration with Groundless.
Last time, we opened the black box and saw how interventions can help us distinguish which subcircuit is a better explanation. This time, we will look at counterfactuals and see if we get more definite answers.
When we first considered interventions, we used them to measure the alignment between the target and proxy explanations. We leveraged the fact that the proxy (subcircuit) is a subgraph of the target (full model) to align interventions. This time we will exploit that fact again, but in a different way.
We will treat the entire subcircuit as a single component and the remainder of the full model as the environment. Then, we will do a sort of causal identification, akin to activation patching. We will try to determine if the subcircuit is the sole cause of the behavior of interest. We will do so by asking counterfactual questions.
We will see that this type of analysis allows us to differentiate more clearly the top subcircuits and to reflect more deeply on explanations.
Counterfactuals
Counterfactuals ask: what would have happened if things had gone differently?
They require us to reason about worlds we did not observe. Sometimes, this requires imagination[1]. Sometimes, even unruly vision.
Let's start with the real world, what we call our clean sample:
The clean sample
Then, we consider a possible world, which we call our corrupted sample:
In each world, we isolate the single component from the environment:
We will surgically transplant components from one world into the other.
The subcircuit is the component and the rest is the environment.
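To make the split concrete, here is a minimal sketch in plain Python. The node names, activation values, and subcircuit mask are all hypothetical stand-ins, not the actual models from this series.

```python
# Hypothetical toy setup: node activations stored as dicts, with a set of
# node names marking which nodes belong to the subcircuit (the component).

def split(acts, subcircuit_nodes):
    """Split node activations into component and environment parts."""
    component = {n: a for n, a in acts.items() if n in subcircuit_nodes}
    environment = {n: a for n, a in acts.items() if n not in subcircuit_nodes}
    return component, environment

clean = {"h0": 1.0, "h1": 0.2, "h2": -0.5}    # the real world
corrupt = {"h0": 0.0, "h1": 0.9, "h2": 0.4}   # the possible world
mask = {"h0", "h2"}                           # hypothetical subcircuit nodes

component, environment = split(clean, mask)
```

The same split applies to the corrupted sample, giving us matching component and environment pieces in each world.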
Our analysis has two versions, depending on the focus of our causal question:
In-Circuit: The component is the focus
Out-of-Circuit: The environment is the focus
We also have two different directions:
Denoising: patch clean into corrupt
Noising: patch corrupted into clean
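The two foci and two directions can be sketched with one small patching helper. Everything here (node names, values, the mask, the helper itself) is an illustrative assumption, not the series' actual code.

```python
# Toy activations for the two worlds and a hypothetical subcircuit mask.
clean = {"h0": 1.0, "h1": 0.2, "h2": -0.5}
corrupt = {"h0": 0.0, "h1": 0.9, "h2": 0.4}
mask = {"h0", "h2"}  # component nodes; the rest is the environment

def patch(source, base, nodes):
    """Overwrite `base` with `source` values on the given nodes."""
    return {n: (source[n] if n in nodes else base[n]) for n in base}

env = set(clean) - mask

# Denoising: clean values patched into the corrupted run.
denoise_in = patch(clean, corrupt, mask)   # in-circuit focus
denoise_out = patch(clean, corrupt, env)   # out-of-circuit focus

# Noising: corrupted values patched into the clean run.
noise_in = patch(corrupt, clean, mask)     # in-circuit focus
noise_out = patch(corrupt, clean, env)     # out-of-circuit focus
```

The 2x2 combinations of focus and direction yield the four analyses discussed next.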
In-Circuit
Let's ask questions about the component's causal effect.
Sufficiency
If everything in the environment were different except for a single component, would the behavior of interest remain?
Is the component sufficient to recover the behavior?
We can think of this surgical modification as denoising the corrupted sample with the clean component and verifying if we can recover the clean sample signal.
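As a sketch of the sufficiency check, assume a toy "model" whose output is just the sum of its node activations (a stand-in; the series works with real transformer runs, and all values here are made up):

```python
def run(acts):
    """Stand-in for the model's forward pass on a set of activations."""
    return sum(acts.values())

clean = {"h0": 1.0, "h1": 0.2, "h2": -0.5}
corrupt = {"h0": 0.0, "h1": 0.9, "h2": 0.4}
mask = {"h0", "h2"}  # hypothetical subcircuit nodes

# Denoise: keep the corrupted environment, restore the clean component.
patched = {n: (clean[n] if n in mask else corrupt[n]) for n in clean}

# Sufficiency: how close does the patched run come to the clean output?
# A gap of 0 means the component alone recovers the clean signal.
sufficiency_gap = abs(run(patched) - run(clean))
```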
Necessity
If the environment were the same but the single component were different, would the behavior of interest disappear?
Is the component necessary to preserve the behavior?
Out-Of-Circuit
Let's ask questions about the causal effect of the environment.
Completeness
If everything in the environment were different except for a single component, would the behavior of interest disappear?
If the environment is sufficient to recover the behavior,
the component is not complete.
Independence
If the environment were the same but the single component were different, would the behavior of interest remain?
If the environment is necessary to preserve the behavior,
the component is not independent.
Four Perspectives
We measure two things:
Recovery: Does the patched circuit produce the clean output?
Disruption: Does the patched circuit fail to produce the clean output?
The four scores are combinations of these, depending on whether we patch in-circuit or out-of-circuit, and whether we denoise or noise.
We get four different scores that characterize the causal effect of the component:
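One way to sketch all four scores together, again under the toy sum-model assumption with made-up activations and a hypothetical mask:

```python
def run(acts):
    return sum(acts.values())  # toy stand-in for the forward pass

def patched_run(source, base, nodes):
    """Run with `source` values patched into `base` on `nodes`."""
    return run({n: (source[n] if n in nodes else base[n]) for n in base})

clean = {"h0": 1.0, "h1": 0.2, "h2": -0.5}
corrupt = {"h0": 0.0, "h1": 0.9, "h2": 0.4}
comp = {"h0", "h2"}          # hypothetical subcircuit nodes
env = set(clean) - comp      # everything else

target = run(clean)
gaps = {
    # In-circuit, denoising: can the clean component recover the output?
    "sufficiency": abs(patched_run(clean, corrupt, comp) - target),
    # In-circuit, noising: does corrupting the component disrupt the output?
    "necessity": abs(patched_run(corrupt, clean, comp) - target),
    # Out-of-circuit, denoising: can the environment alone recover it?
    "completeness": abs(patched_run(clean, corrupt, env) - target),
    # Out-of-circuit, noising: does corrupting the environment disrupt it?
    "independence": abs(patched_run(corrupt, clean, env) - target),
}
```

A small gap means recovery; a large gap means disruption. Which of the two counts as a good score depends on the quadrant: for sufficiency we want recovery, for necessity we want disruption.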
Let's calculate these scores for our toy experiment.
Experiments
As a reminder, these are the subcircuits we were analyzing last time.
Subcircuit #44
Subcircuit #34
We calculate the scores on the four ideal inputs.
Sufficiency
Multiple subcircuits score perfectly, so sufficiency is not a differentiator in our toy example.
Necessity
Multiple subcircuits score perfectly, so necessity is not a differentiator in our toy example.
Sufficiency and necessity give us an isolated view of the component.
Completeness
There is a single subcircuit that scores highest in completeness!
Independence
The same subcircuit also scores highest in independence!
Clear winner?
Subcircuit #34 scores the highest.
Counterfactual faithfulness has helped us sort out the top subcircuits!
But we are not done yet.
In our second entry, we deferred a question:
Let's look at the top edge-variants for subcircuit #34 (node mask #34):
We see that some of the edge-variant subcircuits are
incomparable under inclusion (neither is a subcircuit of the other).
All edge variants score the same for node mask #34!
We were able to differentiate a best node mask for the full model, but we are not able to differentiate among the edge variants for that same node mask!
It seems there is more work for us.
Counterfactuals got us further than observation or intervention alone. But they also revealed a new layer of non-identifiability: the edges. And the tools we've been using so far all operate in activation space. To go further, we may need a different paradigm entirely: parameter space interpretability.
Paradigms are often deeply ingrained in us and very hard to change.
Their failures often go silent. We forget that they are defeasible.
Our account of causality uses particular frameworks, which we shall examine more closely.
Paradigm as Substrate
There are many frameworks for causality. Structured Causal Models (SCM) built on Directed Acyclic Graphs (DAG)[2] are the most common in circuit analysis, but they have blind spots[3] that matter for neural networks:
Each of these frameworks makes different assumptions about what remains stable during our analysis. Each is a different substrate for causal reasoning. Just as the evaluation domain was a substrate in our observational analysis, and the circuit boundary was a substrate in our interventional analysis, the causal framework itself is a substrate.
Let's recap what we've established:
Next time, we move from activation space to parameter space.
For instance, in The Intimacies of Four Continents, Lowe asks the reader to use counterfactuals to refuse the narrative that the way things went was the only way they could have gone, and to use imagination to reckon with the possibilities lost in history.
An Introduction to Causal Inference. Pearl (2010).
Also, counterfactuals are tricky to reason with because several inference patterns that seem perfectly logical can break down.
See: Counterfactual Fallacies in Causality. Stanford Encyclopedia of Philosophy (2026).
Factored Space Models. Garrabrant et al. (2024).
A fine-grained look at causal effects in causal spaces. Park et al. (2025b).
Counterfactual spaces. Park et al. (2026).
A Measure-Theoretic Axiomatisation of Causality. Park et al. (2025a).