This is the second entry in the "Which Circuit is it?" series. We will explore possible notions of faithfulness when explanations are treated as black boxes. This project is done in collaboration with Groundless.
Last time, we introduced the non-identifiability of explanations. Succinctly, the problem is that many different (sub)circuits seem to equally explain the same (full-circuit) model. So, how do we choose among them?
This post is about what happens when we try to distinguish subcircuits using only their inputs and outputs. We call this observational faithfulness. The short version: it's not enough, even when we push well beyond the training distribution.
The full model is our target explanation, and the subcircuits are our candidate proxy explanations.
A natural starting point is to ask what "equally explain" actually means here. At a minimum, we would like our subcircuits to accurately predict the full model's behavior. If we treat the subcircuit and the full model as black boxes, would they agree on the outputs across the entire input range? This is what we refer to as observational faithfulness.
More precisely, a subcircuit is observationally faithful to the full model over a domain D if, for every input in D, they produce the same output. The question is whether this is strong enough to pick the right explanation.
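This definition is easy to operationalize. As a minimal sketch (the function and model names are ours, and the two "models" below are hypothetical stand-ins rather than trained networks), checking observational faithfulness over a domain D looks like:

```python
import numpy as np

def observationally_faithful(full_model, subcircuit, domain):
    """Bit-level agreement between two black boxes over a domain.

    Both arguments are callables mapping an input vector to a logit;
    outputs are compared after thresholding at zero.
    """
    return all((full_model(x) > 0) == (subcircuit(x) > 0) for x in domain)

# Two illustrative 'models' that agree on the canonical XOR inputs
# but differ elsewhere.
canonical = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
f = lambda x: 4 * (x[0] - x[1]) ** 2 - 2           # positive iff inputs differ
g = lambda x: x[0] + x[1] - 2 * x[0] * x[1] - 0.5  # also XOR on the corners

assert observationally_faithful(f, g, canonical)                   # agree on D
assert not observationally_faithful(f, g, [np.array([3.0, 1.0])])  # not beyond
```

The punchline of this post is already visible in the toy check: whether two functions count as "the same" depends entirely on which domain D we test.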
Let's get our hands dirty and see what this looks like in a toy model.
Our Toy Model Setup
Let's make this concrete with a toy model we'll keep returning to throughout this series.
The core idea
We train various small multilayer perceptrons (MLPs) to compute Boolean logic gates (XOR, AND, OR, IMP). We know precisely what computation each output should perform. Access to the ground-truth allows us to directly answer the question: Can we actually recover the right subcircuit for each gate?
Architecture
A simple and shallow MLP with configurable width and depth:
Input: 2 binary values, sometimes with added Gaussian noise.
Hidden layers: 2 layers of 2-4 neurons each. We experiment with activation functions like Tanh, ReLU, and LeakyReLU.
Output: One logit per logic gate.
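As a rough sketch of that architecture (in NumPy rather than a training framework; the function names are ours, and we fix the smallest configuration of two hidden layers of two tanh neurons):

```python
import numpy as np

def mlp_forward(x, params):
    """Toy MLP: 2 inputs -> two tanh hidden layers -> one logit per gate."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = np.tanh(W1 @ x + b1)
    h2 = np.tanh(W2 @ h1 + b2)
    return W3 @ h2 + b3          # raw logits for (XOR, AND, OR, IMP)

def random_params(widths=(2, 2, 2, 4), rng=np.random.default_rng(0)):
    """One (weight, bias) pair per layer, shaped by consecutive widths."""
    return [(rng.normal(size=(o, i)), np.zeros(o))
            for i, o in zip(widths[:-1], widths[1:])]

logits = mlp_forward(np.array([0.0, 1.0]), random_params())
assert logits.shape == (4,)      # one logit per logic gate
```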
Training Details
The MLP uses Kaiming Normal initialization, matched to the activation function.
The model uses Binary Cross-Entropy with Logits (BCE) applied independently to each output head.
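For concreteness, here is what those two choices amount to, written out by hand in NumPy (a sketch of the standard definitions, not our exact training code):

```python
import numpy as np

def kaiming_normal(n_out, n_in, gain, rng=np.random.default_rng(0)):
    """He/Kaiming normal initialization: std = gain / sqrt(fan_in).

    The gain is matched to the activation; common conventions are
    5/3 for tanh and sqrt(2) for ReLU.
    """
    return rng.normal(0.0, gain / np.sqrt(n_in), size=(n_out, n_in))

def bce_with_logits(logits, targets):
    """Numerically stable binary cross-entropy on raw logits,
    applied independently to each output head and averaged."""
    z, y = np.asarray(logits), np.asarray(targets)
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

assert np.isclose(bce_with_logits([0.0], [1.0]), np.log(2))  # p = 0.5
assert bce_with_logits([10.0], [1.0]) < 1e-3                 # confident and correct
```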
What makes this interesting?
The setup is small enough to exhaustively enumerate all possible subcircuits, each of which specifies which neurons to keep and which edges to prune. For a 3-wide, 2-deep network, that's a tractable search space nowadays. We can check every candidate against the ground truth rather than relying on a heuristic to find the right one.
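To see why this is tractable, consider just the node masks for the 3-wide, 2-deep case (a sketch; edge masks multiply on top of this, but the total stays small):

```python
from itertools import product

def enumerate_node_masks(hidden_widths=(3, 3)):
    """Every keep/drop pattern over the hidden neurons of a 3-wide,
    2-deep network: one boolean tuple per hidden layer."""
    per_layer = [list(product((0, 1), repeat=w)) for w in hidden_widths]
    return list(product(*per_layer))

masks = enumerate_node_masks()
assert len(masks) == 64     # 2^3 patterns per layer, two hidden layers
```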
Why does this matter?
It's a sanity check for interpretability methods. If we can't reliably identify the circuit implementing XOR in a tiny MLP where we literally defined the answer, we should be worried about circuit discovery claims in frontier models. This is the easiest possible setting. Full enumeration, known ground truth, tractable search space. If a technique fails even once here, it's already broken. It doesn't need to fail on a frontier model to be untrustworthy.
Even a single toy example can reveal real limitations.
We start by training a single network to learn XOR.
Starting to Experiment
Our shallow MLP has two hidden layers, each with two neurons, using the tanh activation function.
First, we check that our network indeed learned the Boolean gate. There are only four ideal inputs, where each input neuron is either zero or one.
Second, we check whether any subcircuit achieves perfect accuracy when isolated. We organize the search by node masks: patterns specifying which neurons are kept. Each node mask can have multiple edge masks that specify which connections between the kept neurons are active. Together, a node mask and an edge mask define a single subcircuit.
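Turning a mask pair into an actual subcircuit is a matter of zeroing out what is pruned. A sketch (zero-ablation is one convention, assumed here; mean-ablation would be another choice):

```python
import numpy as np

def apply_subcircuit(params, node_masks, edge_masks):
    """Build a subcircuit by zeroing pruned neurons and edges.

    `node_masks[l]` is a 0/1 vector over the neurons computed by layer l;
    `edge_masks[l]` is a 0/1 matrix over the weights feeding them.
    """
    masked = []
    for (W, b), nodes, edges in zip(params, node_masks, edge_masks):
        nodes, edges = np.asarray(nodes), np.asarray(edges)
        masked.append((W * edges * nodes[:, None], b * nodes))
    return masked

# Dropping the second of two neurons zeroes its incoming weights and bias.
params = [(np.ones((2, 2)), np.ones(2))]
(Wm, bm), = apply_subcircuit(params, [np.array([1, 0])], [np.ones((2, 2))])
assert Wm[1].sum() == 0 and bm[1] == 0
```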
We found a few node masks with 'perfect accuracy':
Many of them are incomparable under inclusion (neither is a subcircuit of the other):
Additionally, each node mask also has a few edge mask variations that also achieve perfect accuracy:
Could we use complexity as a tiebreaker? Keep the subcircuit with the fewest edges that still achieves perfect accuracy?
We could do that per-node mask to some extent. But it is not fully straightforward because many 'perfect accuracy' subcircuit variants exist even for a single node mask.
These two have the same 'perfect' accuracy and the same number of edges, but they do not have the same graph:
Across node masks with different topologies, there is even less of an obvious way to do so. This replicates the finding by Meloux et al.
So, what now? We have multiple subcircuits that all produce the correct answer for every canonical input. Before reaching for more complex tools, it's worth asking: are we losing information in how we compare them? Is the comparison itself too coarse? Bit-level agreement might be hiding meaningful differences.
Note on Edge Variants
To simplify our initial analysis, we will identify subcircuits by their node masks and, among the many possible edge variants for each mask, consider only the most complete one. Once we identify a node mask that clearly outperforms the others, we will then examine its edge variants in detail.
The Lossiness of Comparison
So far, we've been checking bit-level agreement: does the subcircuit output the same 0 or 1 as the full model? But the boolean output is a sigmoid over the logits. The sigmoid hides the fact that two subcircuits can agree with the full model at the bit level while differing substantially in their logit values.
The decision boundary for our network is at zero. Any negative logit value will be mapped to a logic gate output of 0, and any positive logit value will be mapped to a logic gate output of 1. The theoretical ideal decision boundary looks like:
The plot shows a step function at zero: negative logits map to 0, positive logits map to 1, with a sharp vertical transition at the decision boundary.
Should we require subcircuits to match the logits? An exact logit match might be too much to ask. The full model does not perfectly match the ideal, but it does get close:
We will call this Scenario A to differentiate it from other XOR models we train.
Logit proximity, though, could give us a meaningful way to compare candidates. But comparing logits is meaningful only when we evaluate beyond the four canonical inputs.
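To make the contrast concrete, here is a sketch of the two scores side by side (the logit values below are illustrative, not measurements from our models):

```python
import numpy as np

def bit_agreement(logits_a, logits_b):
    """Fraction of inputs where the thresholded (0/1) outputs match."""
    return np.mean((logits_a > 0) == (logits_b > 0))

def logit_distance(logits_a, logits_b):
    """Mean absolute difference of raw logits: a finer-grained score."""
    return np.mean(np.abs(logits_a - logits_b))

full = np.array([-3.0, 2.5, 2.4, -2.8])  # full model's XOR logits (illustrative)
sub1 = np.array([-2.9, 2.6, 2.3, -2.7])  # tracks the full model closely
sub2 = np.array([-0.1, 9.0, 0.2, -8.0])  # same bits, very different logits

assert bit_agreement(full, sub1) == bit_agreement(full, sub2) == 1.0
assert logit_distance(full, sub1) < logit_distance(full, sub2)
```

Bit agreement calls the two candidates a tie; logit distance does not.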
Before we continue, it's worth seeing how this type of analysis applies beyond toy models.
Interlude: Greedy Decoding in LLMs
We often characterize LLMs' behavior based on their greedily decoded generation. An LLM doesn't directly output text. It outputs a probability distribution over the entire token vocabulary at each step.
'Paris' is the most probable next token (the top-1 token); greedy decoding selects the argmax at each step.
Greedy decoding picks the highest-probability token at each step. The resulting sequence is what we often use to characterize the LLM's behavior. But the model gave us a full distribution at every step, and we reduced each one to a single token. Two models can agree on the argmax at every position while having completely different distributions underneath.
Greedy decoding compresses a model's full distribution into a single token. That's a lossy comparison.
This mirrors what we saw in the toy model. Bit-level accuracy hid logit differences behind a sigmoid. Greedy decoding hides distributional differences behind an argmax.
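A tiny numerical example of that mirror (a made-up four-token vocabulary, not real model outputs):

```python
import numpy as np

def greedy_token(dist):
    """Greedy decoding for one step: just the argmax of the distribution."""
    return int(np.argmax(dist))

# Two next-token distributions over a tiny 4-token vocabulary.
p = np.array([0.90, 0.05, 0.03, 0.02])  # confident in token 0
q = np.array([0.30, 0.25, 0.25, 0.20])  # barely prefers token 0

assert greedy_token(p) == greedy_token(q)   # identical greedy output...
kl = np.sum(p * np.log(p / q))              # ...yet very different beliefs
assert kl > 0.5
```

Argmax agreement is the LLM analogue of bit agreement; KL divergence is the analogue of logit distance.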
The takeaway: comparison requires compression, which is often lossy.
We don't need a perfect comparison. We just need one that preserves the differences we care about, the ones our interpretive claim depends on, so we can justify discarding the rest.
Robustness to Noise
If we're going to expand beyond the canonical inputs, the natural first step is their immediate neighborhood. We also want circuits whose interpretations are stable under small perturbations. A circuit explanation that breaks under slight noise is not much of an explanation.
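One plausible robustness score is bit-level agreement over Gaussian neighborhoods of the canonical inputs (a sketch of the idea; the exact metric in our experiments may differ, and `xor_logit` below is an illustrative stand-in for a trained model):

```python
import numpy as np

def noise_agreement(full_model, subcircuit, canonical, sigma=0.2, n=200,
                    rng=np.random.default_rng(0)):
    """Fraction of Gaussian-perturbed canonical inputs on which the two
    black boxes output the same bit, averaged over the four corners."""
    scores = []
    for x in canonical:
        noisy = x + rng.normal(0.0, sigma, size=(n, x.size))
        scores.append(np.mean([(full_model(z) > 0) == (subcircuit(z) > 0)
                               for z in noisy]))
    return float(np.mean(scores))

corners = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
xor_logit = lambda x: 4 * (x[0] - x[1]) ** 2 - 2
assert noise_agreement(xor_logit, xor_logit, corners) == 1.0  # self-agreement
```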
We see that under noise, all subcircuits start disagreeing with the full model:
However, this is not very helpful: our top subcircuits are not meaningfully differentiated. Let's zoom in on the results for those two subcircuits:
We can see that the logit values differ between subcircuits and the full model, but they generally track the shape. The top figure for node mask #34 does not appear to track the full model as closely for the (0,0) input as #44 does, but when we crunch the bit-agreement numbers, #34 beats #44.
We need another comparison, something that breaks this tie and reveals which subcircuit actually tracks the full model more closely.
Agreeing Locally, Diverging Globally
Robustness considers the immediate neighborhood around the canonical inputs, but that's not enough. We want to push further: how far beyond the training data should a subcircuit agree with the full model? A faithful subcircuit should extrapolate well beyond the canonical inputs, not just survive small perturbations in their vicinity. The larger the region of agreement, the stronger the evidence that the subcircuit is actually capturing the same computation rather than merely memorizing the same outputs.
Interlude: Smaller Domains ➡️ Weaker Constraints
Let's say we would like to fit a function for the curve within the shaded region.
For that region, we could consider a simple candidate that achieves 'perfect accuracy'.
However, when we zoom out to include more data, we would find out that our original candidate is not so great anymore.
We could find another 'perfect candidate' again.
But the possibility that we will find ourselves in the same situation remains.
This is why observational faithfulness alone has limits. No matter how far we expand the domain, there's always a larger one. To break out of this cycle, we need to look inside the circuit, not just at its outputs.
Out-of-Distribution Generalization
In our toy example, we can consider different transformations that would push our input out of distribution while preserving some relational structure among the input neurons.
For instance, we could apply a multiplicative transformation, mapping the ideal/canonical input (0, 1) to the transformed input (0, 10).
Recalling our ideal decision boundaries,
The theoretical decision boundary is our generator
The ideal decision boundary serves as the generator of our task: the hidden structure that produces ground-truth labels. The model never has full direct access to it. It must infer the generator from data, and how much of the generator it recovers depends on what data it sees.
We can establish that the ground-truth output value for the transformed input should be:
(0, 10) -> 1
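One way to read labels off the generator is to threshold each input before applying the gate (an illustrative convention, not the only possible one):

```python
def generator_label(x1, x2, threshold=0.5):
    """Ground-truth XOR label read off the ideal decision boundary:
    an input counts as 'on' when it exceeds the threshold."""
    return int((x1 > threshold) != (x2 > threshold))

assert generator_label(0, 1) == 1    # canonical input
assert generator_label(0, 10) == 1   # the transformed input keeps its label
```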
This is what it looks like for those top two subcircuits:
We cannot definitively differentiate the two subcircuits based on observations, even when evaluating on out-of-distribution inputs. It seems that just looking at inputs and outputs is not enough.
In the next post in the sequence, we will break open the black box and see whether interventions could help us compare subcircuit explanations more effectively.
Locality as Substrate
We've been widening the domain over which we test observational faithfulness: canonical inputs, noisy neighborhoods, out-of-distribution transformations. A domain specification is a context. A substrate is the layer of abstraction you hold fixed while doing your analysis. We typically worry about substrates like architecture changing. The relationship between the training and evaluation domain is also a substrate, and a fragile one. The evaluation domain keeps expanding faster than models are retrained. We expect new scenarios to behave like familiar ones. We do not expect that behavior to turn chaotic. But outside the region where training constrained the model, it can.
Changing how the full model was trained yields different conclusions from the same observational tests. This is substrate-sensitivity in miniature: same analysis, different ground, different answers.
When performing experiments, we introduced noise during training to obtain the best models possible. Training noise constrains the full model's approximation of the task. It does not constrain how subcircuits approximate the full model; these are separate relationships. Subcircuits lack compensatory pathways available to the full model, so correlated degradation under perturbation is not guaranteed. Furthermore, regularization can reshape which circuits the model relies on. The subcircuits under one training regime may not correspond to the same computation under another.
If we had just used the ideal input to train, our full model would have had very different decision boundaries:
We will call this Scenario B
Interestingly enough, if we redo our analysis in this scenario, looking at noise perturbations and out-of-distribution generalization, we get more differentiation among subcircuits for particular observational scores:
But we still do not get definite answers! For Scenario B, subcircuit #44 is more robust to noise than subcircuit #47. But subcircuit #47 generalizes better than subcircuit #44.
The additional differentiation for specific scores does not translate to more interpretability. Rankings are inconsistent across tests, and some subcircuits even outperform the full model under noise.
We sometimes also see unexpected situations, where some other subcircuits perform better than the full model, given some noise:
This subcircuit outperforms the full model for this input and noise.
These artifacts and effects make careful domain specification all the more important. The conclusions we draw depend on which region of input space we evaluate over, and the inconsistencies offer no principled way to adjudicate.
Between Scenario A and Scenario B, the only difference is the presence or absence of noise in the training data. In our toy scenario, the noise is not just acting as a regularizer. The noise is revealing more of the ideal decision boundary to the MLP. The more noise, the more of the ideal decision boundary the model is exposed to, and the more it is forced to approximate the boundary's shape rather than just reproduce correct outputs at the corners. Without noise, the boundary away from those four points is unconstrained, and subcircuit behavior in those regions reflects the optimizer's trajectory rather than the structure of the task.
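The two training regimes differ in a single line of data generation. A sketch (our own simplification of the setup; labels always come from the clean corner, so larger sigma exposes the model to the ideal boundary over a wider region):

```python
import numpy as np

def make_xor_data(n, sigma, rng=np.random.default_rng(0)):
    """XOR training pairs: random canonical corners plus input noise.

    sigma = 0.2 mimics Scenario A; sigma = 0.001 mimics Scenario B.
    """
    corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    idx = rng.integers(0, 4, size=n)
    x = corners[idx] + rng.normal(0.0, sigma, size=(n, 2))
    y = (corners[idx, 0] != corners[idx, 1]).astype(float)  # XOR of clean corner
    return x, y

x, y = make_xor_data(8, sigma=0.2)
assert x.shape == (8, 2) and set(y) <= {0.0, 1.0}
```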
This single difference in training was sufficient to shift the full model's decision boundaries, thereby altering how subcircuits relate to it and, in turn, our observational conclusions. The shift does not merely reduce the informativeness of our tests. It introduces artifacts: subcircuits appear to outperform the full model under certain noise conditions, rankings between subcircuits reverse depending on which test is applied, and observational differentiation increases without corresponding to meaningful differences in correctness. Our toy experiment seems to suggest:
The reliability of observational analyses for subcircuits hinges on how closely the full model tracks the generator.
We take this as suggestive, not conclusive: it is a pattern in a toy setting with full enumeration and known ground truth. We need more research to understand:
Under what conditions does full-model fidelity to the ground truth constrain the distinguishability of its subcircuits?
What is the relationship between full-model fidelity and the smallest domain neighborhood needed for observational tests to be informative?
In this toy example, the Boolean gate we study is fixed. But in real models, behaviors are more nuanced, and the generator they interact with is dynamic. Interpretations that are valid in one context may no longer be valid in another. There is more risk that the substrate will shift beneath us.
In most real models, the alignment between the full model and generator is much less perfect than in our toy models, so we must be especially careful about how local analyses extrapolate.
Unreliability of Extrapolation
To make this point explicit, let's see how the subcircuits' decision boundaries look in each scenario. We consider the top 2 subcircuits in each case.
Scenario A: Full model trained with Gaussian noise (σ = 0.2)
This is our original scenario:
In this case, the full model tracks the generator more closely, so each subcircuit also tracks the generator fairly closely. If we were to use either of these subcircuits, we could extrapolate somewhat reliably.
Scenario B: Full model trained with effectively no noise (σ = 0.001):
In this case, the full model does not track the generator closely outside the ideal inputs, so each subcircuit shows very different behavior. Neither subcircuit yields a reliable extrapolation.
If we were just looking at the ideal inputs, these differences would be invisible to us.
Even if we looked at a neighborhood around the ideal inputs, we might miss how wildly different these proxy explanations (subcircuits) are.
Let's recap what we've established:
Observational faithfulness is necessary but not sufficient to identify the correct subcircuit.
Expanding the test domain (noise, OOD inputs) helps, but cannot resolve the ambiguity in principle.
The informativeness of any observational test is bounded by how faithfully the full model itself tracks the generator.
To break the tie, we need to look inside the circuit, not just at its behavior.
Next time, we open the black box and ask whether interventions can succeed where observation could not.
Our loss function differs from the one used by Meloux et al. There are both motivational and technical reasons:
Motivation: They care about both circuit and algorithmic interpretations. We are only concerned with circuit interpretations at the moment.
Technical consequence: Their loss (MSE) places the decision boundary at 0.5 and encourages hidden activations to stay in the 0 to 1 range. Ours (BCE) places the decision boundary at zero and pushes logits toward ±∞ in high-confidence regions.
We will go into depth on what we can say about the graph structure itself in a future post.
MoSSAIC (Farr et al., 2025) also elaborates on how we often wrongly assume reliability of extrapolation. Misalignment could sneak up on us if we do not treat the domain of our explanations carefully.