Editing remark: Footnotes 7–9 are not referenced in the post.
I am especially curious to know the use you make of footnote 8.
Ooh, thanks, they were vestigial.
I was going to reference The Queer Algorithm to show how counterfactuals, as a concept, appear elsewhere.
Chevillon discusses how AI implicitly needs to form counterfactuals when dealing with missing data. For instance, imagine that the training corpus has no representation of gay men. When generating outputs, the AI behaves as if the training data did contain those representations. Chevillon further argues that the way AI currently performs these implicit counterfactuals relies on interpolating dominant patterns in the data, which is insufficient for minoritized communities.
When generating outputs, the AI behaves as if the training data did contain those representations.
Did you mean "did not contain"? The AI can't talk about things not in or implied by the training data, unless one leads it on by the questions one poses. It does not know about missing data, because it has nothing but the training data (by which term I'm including the RLHF phase and everything else that happens before the LLM is released). For example, none of the LLMs know anything about the European shadwell, because it's a nonexistent bird I just made up. Depending on the LLM, it may just make stuff up, or do something more sensible, as here:
Me: Tell me about the migratory habits of the European shadwell.
ChatGPT: I’m checking whether “European shadwell” is the standard name of a species or a different term, then I’ll give you the migration details from reliable sources.
[Consults various web sites]
There doesn’t seem to be any European animal called the “European shadwell.” You may mean European shad — usually the allis shad (Alosa alosa) and twaite shad (Alosa fallax).
[Proceeds to tell me about those, which turn out to be fish in the herring family. The information checks out on Wikipedia.]
Of course, ChatGPT and all the others do have representations of gay men, so what work is "gay men" specifically doing in your (or Chevillon's) counterfactual? I chose "European shadwell" deliberately to be absent, to obviate counterfactual speculation. All an LLM can do is search for it on the web, and, coming up empty, make a guess at what I might have meant.
Chevillon further argues that the way AI currently performs these implicit counterfactuals relies on interpolating dominant patterns in the data, which is insufficient for minoritized communities.
It works with what it has. It cannot step outside the cave of its training into the light of truth. Getting it to do that didn't work out well for the image generator of a couple of years back that put black people into images of the American founding fathers. The designers were just leading it out of one cave into another more to their liking (something which, incidentally, should be suspected of anyone touting red pills).
I am not seeing a problem here. Chevillon's writings and everything else in "queer studies" are likely already in the training data, or if not yet, they will be.
This is the fourth entry in the "Which Circuit is it?" series. We will explore the notion of counterfactual faithfulness. This project is done in collaboration with Groundless.
Last time, we opened the black box and saw how interventions can help us distinguish which subcircuit is a better explanation. This time, we will look at counterfactuals and see if we get more definite answers.
When we first considered interventions, we used them to measure the alignment between the target and proxy explanations. We leveraged the fact that the proxy (subcircuit) is a subgraph of the target (full model) to align interventions. This time we will exploit that fact again, but in a different way.
We will treat the entire subcircuit as a single component and the remainder of the full model as the environment. Then, we will do a sort of causal identification, akin to activation patching. We will try to determine if the subcircuit is the sole cause of the behavior of interest. We will do so by asking counterfactual questions.
We will see that this type of analysis allows us to differentiate more clearly the top subcircuits and to reflect more deeply on explanations.
Counterfactuals
Counterfactuals ask: what would have happened if things had gone differently?
They require us to reason about worlds we did not observe. Sometimes, this requires imagination[1]. Sometimes, even unruly vision.
Let's start with the real world, what we call our clean sample:
The clean sample
Then, we consider a possible world, which we call our corrupted sample:
In each world, we isolate the single component from the environment:
We will surgically transplant components from one world into the other.
The subcircuit is the component and the rest is the environment.
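To make the split concrete, here is a minimal sketch in plain Python. The node names, activation values, and subcircuit mask are all hypothetical stand-ins, not the actual models from this series.

```python
# Hypothetical toy setup: node activations stored as dicts, with a set of
# node names marking which nodes belong to the subcircuit (the component).

def split(acts, subcircuit_nodes):
    """Split node activations into component and environment parts."""
    component = {n: a for n, a in acts.items() if n in subcircuit_nodes}
    environment = {n: a for n, a in acts.items() if n not in subcircuit_nodes}
    return component, environment

clean = {"h0": 1.0, "h1": 0.2, "h2": -0.5}    # the real world
corrupt = {"h0": 0.0, "h1": 0.9, "h2": 0.4}   # the possible world
mask = {"h0", "h2"}                           # hypothetical subcircuit nodes

component, environment = split(clean, mask)
```

The same split applies to the corrupted sample, giving us matching component and environment pieces in each world.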
Our analysis has two versions, depending on the focus of our causal question:
In-Circuit: The component is the focus
Out-of-Circuit: The environment is the focus
We also have two different directions:
Denoising: patch clean into corrupt
Noising: patch corrupted into clean
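The two foci and two directions can be sketched with one small patching helper. Everything here (node names, values, the mask, the helper itself) is an illustrative assumption, not the series' actual code.

```python
# Toy activations for the two worlds and a hypothetical subcircuit mask.
clean = {"h0": 1.0, "h1": 0.2, "h2": -0.5}
corrupt = {"h0": 0.0, "h1": 0.9, "h2": 0.4}
mask = {"h0", "h2"}  # component nodes; the rest is the environment

def patch(source, base, nodes):
    """Overwrite `base` with `source` values on the given nodes."""
    return {n: (source[n] if n in nodes else base[n]) for n in base}

env = set(clean) - mask

# Denoising: clean values patched into the corrupted run.
denoise_in = patch(clean, corrupt, mask)   # in-circuit focus
denoise_out = patch(clean, corrupt, env)   # out-of-circuit focus

# Noising: corrupted values patched into the clean run.
noise_in = patch(corrupt, clean, mask)     # in-circuit focus
noise_out = patch(corrupt, clean, env)     # out-of-circuit focus
```

The 2x2 combinations of focus and direction yield the four analyses discussed next.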
In-Circuit
Let's ask questions about the component's causal effect.
Sufficiency
If everything in the environment were different except for a single component, would the behavior of interest remain?
Is the component sufficient to recover the behavior?
We can think of this surgical modification as denoising the corrupted sample with the clean component and verifying if we can recover the clean sample signal.
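As a sketch of the sufficiency check, assume a toy "model" whose output is just the sum of its node activations (a stand-in; the series works with real transformer runs, and all values here are made up):

```python
def run(acts):
    """Stand-in for the model's forward pass on a set of activations."""
    return sum(acts.values())

clean = {"h0": 1.0, "h1": 0.2, "h2": -0.5}
corrupt = {"h0": 0.0, "h1": 0.9, "h2": 0.4}
mask = {"h0", "h2"}  # hypothetical subcircuit nodes

# Denoise: keep the corrupted environment, restore the clean component.
patched = {n: (clean[n] if n in mask else corrupt[n]) for n in clean}

# Sufficiency: how close does the patched run come to the clean output?
# A gap of 0 means the component alone recovers the clean signal.
sufficiency_gap = abs(run(patched) - run(clean))
```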
Necessity
If the environment were the same but the single component were different, would the behavior of interest disappear?
Is the component necessary to preserve the behavior?
Out-Of-Circuit
Let's ask questions about the causal effect of the environment.
Completeness
If everything in the environment were different except for a single component, would the behavior of interest disappear?
If the environment is sufficient to recover the behavior,
the component is not complete.
Independence
If the environment were the same but the single component were different, would the behavior of interest remain?
If the environment is necessary to preserve the behavior,
the component is not independent.
Four Perspectives
We measure two things:
Recovery: Does the patched circuit produce the clean output?
Disruption: Does the patched circuit fail to produce the clean output?
The four scores are combinations of these, depending on whether we patch in-circuit or out-of-circuit, and whether we denoise or noise.
We get four different scores that characterize the causal effect of the component:
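One way to sketch all four scores together, again under the toy sum-model assumption with made-up activations and a hypothetical mask:

```python
def run(acts):
    return sum(acts.values())  # toy stand-in for the forward pass

def patched_run(source, base, nodes):
    """Run with `source` values patched into `base` on `nodes`."""
    return run({n: (source[n] if n in nodes else base[n]) for n in base})

clean = {"h0": 1.0, "h1": 0.2, "h2": -0.5}
corrupt = {"h0": 0.0, "h1": 0.9, "h2": 0.4}
comp = {"h0", "h2"}          # hypothetical subcircuit nodes
env = set(clean) - comp      # everything else

target = run(clean)
gaps = {
    # In-circuit, denoising: can the clean component recover the output?
    "sufficiency": abs(patched_run(clean, corrupt, comp) - target),
    # In-circuit, noising: does corrupting the component disrupt the output?
    "necessity": abs(patched_run(corrupt, clean, comp) - target),
    # Out-of-circuit, denoising: can the environment alone recover it?
    "completeness": abs(patched_run(clean, corrupt, env) - target),
    # Out-of-circuit, noising: does corrupting the environment disrupt it?
    "independence": abs(patched_run(corrupt, clean, env) - target),
}
```

A small gap means recovery; a large gap means disruption. Which of the two counts as a good score depends on the quadrant: for sufficiency we want recovery, for necessity we want disruption.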
Let's calculate these scores for our toy experiment.
Experiments
As a reminder, these are the subcircuits we were analyzing last time.
Subcircuit #44
Subcircuit #34
We calculate the scores on the four ideal inputs.
Sufficiency
Multiple subcircuits score perfectly, so sufficiency is not a differentiator in our toy example.
Necessity
Multiple subcircuits score perfectly, so necessity is not a differentiator in our toy example.
Sufficiency and necessity give us an isolated view of the component.
Completeness
There is a single subcircuit that scores highest in completeness!
Independence
The same subcircuit also scores highest in independence!
Clear winner?
Subcircuit #34 scores the highest.
Counterfactual faithfulness has helped us sort out the top subcircuits!
But we are not done yet.
In our second entry, we deferred a question:
Let's look at the top edge-variants for subcircuit #34 (node mask #34):
We see that some of the edge-variant subcircuits are
incomparable under inclusion (neither is a subcircuit of the other).
All edge variants score the same for node mask #34!
We were able to differentiate a best node mask for the full model, but we are not able to differentiate among the edge variants for that same node mask!
It seems there is more work for us.
Counterfactuals got us further than observation or intervention alone. But they also revealed a new layer of non-identifiability: the edges. And the tools we've been using so far all operate in activation space. To go further, we may need a different paradigm entirely: parameter space interpretability.
Paradigms are often deeply ingrained in us and very hard to change.
Their failures often go silent. We forget that they are defeasible.
Our account of causality uses particular frameworks, which we shall examine more closely.
Paradigm as Substrate
There are many frameworks for causality. Structured Causal Models (SCM) built on Directed Acyclic Graphs (DAG)[2] are the most common in circuit analysis, but they have blind spots[3] that matter for neural networks:
Each of these frameworks makes different assumptions about what remains stable during our analysis. Each is a different substrate for causal reasoning. Just as the evaluation domain was a substrate in our observational analysis, and the circuit boundary was a substrate in our interventional analysis, the causal framework itself is a substrate.
Let's recap what we've established:
Next time, we move from activation space to parameter space.
For instance, in The Intimacies of Four Continents, Lowe asks the reader to use counterfactuals to refuse the narrative that the way things went was the only way they could have gone, and to use imagination to reckon with the possibilities lost in history.
An Introduction to Causal Inference. Pearl (2010).
Also, counterfactuals are tricky to reason with because several inference patterns that seem perfectly logical can break down.
See: Counterfactual Fallacies in Causality. Stanford Encyclopedia of Philosophy (2026).
Factored Space Models. Garrabrant et al. (2024).
A fine-grained look at causal effects in causal spaces. Park et al. (2025b).
Counterfactual spaces. Park et al. (2026).
A Measure-Theoretic Axiomatisation of Causality. Park et al. (2025a).