Probe accuracy and causal sensitivity diverged 3.6x at the same layer. Here's what I think is happening.

Aditya Singh


This is still exploratory. I am documenting a specific failure case where two tools that should agree instead give contradictory readings at the same layer. I have a candidate explanation but haven't confirmed it yet. The two models differ in reconstruction quality by about 13×, and that gap is large enough that it could be the entire reason for the divergence, so blaming equivariance right now would be premature. I am posting before finishing because I want to know whether the proposed mechanism holds up and whether the follow-up experiments are sound before I run them.

In a hybrid physics-ML testbed, interchange interventions and linear probes diverge by a factor of 3.6×. The only explanation I haven't ruled out yet is a mechanism where a zero-parameter physics operator filters out the specific frequency bands that causal interventions need to act on, while keeping enough structure for probes to still recover the property, which may be why the two tools give opposite answers.

Interchange interventions and linear probes are often treated as interchangeable tools for determining where information is "live". But here the standard MI tools give opposite readings in a frequency-selective architecture. I can't say how general this is, but finding it once is enough to make me uncomfortable running either tool in isolation on any architecture with a fixed operator in the middle.

The question is whether the tools disagree because the model's built-in symmetry rules actually reorganize the data, or whether it is just an illusion caused by a poorly trained baseline model. If the rules are causing the split, then the different readings are a consequence of the architecture itself. And if that's true, we can no longer safely rely on a single tool to know what is going on.

The Testbed & The Confound

I observed this in a gravitational lens classifier with a hybrid physics-ML architecture: a D4-equivariant encoder feeds a zero-parameter Poisson solver, which feeds a standard EfficientNetV2-S classification head. Three models were tested: D4LensPINN (equivariant encoder), VanillaLensPINN (non-equivariant encoder, otherwise identical), and ResNet-18 as a baseline. Task performance is nearly identical (AUC ~0.97).

The Confound: D4 and Vanilla differ on two dimensions simultaneously: the equivariance constraint and alignment quality. Vanilla's aligned error is 13× worse (0.00735 vs 0.000563 for D4). Right now, I don't have a way to pull these two things apart.

The Puzzle

Here are the numbers that forced this question. At the same layer of the same model, I ran two standard MI measurements and received incompatible answers.

On D4LensPINN (equivariant model):

Interchange intervention logit delta: 0.402 [95% CI: 0.384, 0.419]
Linear probe D4 symmetry-element accuracy: 0.485 ± 0.027

On VanillaLensPINN (non-equivariant control, otherwise identical):

Interchange intervention logit delta: 1.429 [95% CI: 1.352, 1.506]
Linear probe D4 symmetry-element accuracy: 0.479 ± 0.023

The point estimates for causal sensitivity differ by 3.6×, with non-overlapping bootstrap CIs (2,000 resamples; N=200). The probe accuracies differ by only 0.006.

Both tools are operating correctly, but they lead to opposite conclusions about what equivariance is actually doing at that layer.
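For readers who want to reproduce this kind of comparison, here is a minimal sketch of a percentile bootstrap on the mean logit delta. The resampling scheme (percentile method on the mean), the function names, and the synthetic per-example deltas are all assumptions for illustration. The post only states 2,000 resamples and N=200, and the numbers generated here are not the real measurements.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample N examples with replacement and record the mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    means.sort()
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Synthetic stand-ins for per-example logit deltas (hypothetical values).
rng = random.Random(1)
d4_deltas = [rng.gauss(0.40, 0.12) for _ in range(200)]
vanilla_deltas = [rng.gauss(1.43, 0.55) for _ in range(200)]

d4_ci = bootstrap_ci(d4_deltas)
vanilla_ci = bootstrap_ci(vanilla_deltas, seed=1)
ratio = statistics.fmean(vanilla_deltas) / statistics.fmean(d4_deltas)

print("D4 CI:", d4_ci)
print("Vanilla CI:", vanilla_ci)
print("ratio of means:", ratio)
print("CIs overlap:", intervals_overlap(d4_ci, vanilla_ci))
```

Non-overlapping intervals are a conservative check. Bootstrapping the ratio itself would give a direct interval on the 3.6× figure, if per-example deltas are available for both models.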

(This feels related to what Makelov et al. call the interpretability illusion, though the mechanism here seems different as we have frequency suppression rather than dormant parallel pathways. Tan's population statistics anomaly is in the same family too, though again a different route to the same breakdown.)

Observation 1: Chain property collapse

While debugging the divergence, I noticed a structural precondition that is worth flagging.

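For any deterministic function $f$ in eval mode, consecutive hooks satisfy $h_{k+1} = f(h_k)$.

Patching at $h_k$ with the source-run activation $h_k^{\mathrm{src}}$ satisfies

$f(h_k^{\mathrm{src}}) = h_{k+1}^{\mathrm{src}}$, so downstream computation is identical to patching at $h_{k+1}$. Basically, this whole section of the model acts like a single unit. If you try to intervene anywhere inside it, you get the exact same result as if you just intervened at the very beginning.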

Because BatchNorm in eval mode uses fixed running statistics, the full classification head acts as a deterministic sequential block, and every classification head hook (H12–H17) returned identical logit deltas to four decimal places ( for D4LensPINN, 5.157 for Vanilla, 5.543 for ResNet). This follows directly from eval-mode BatchNorm being deterministic.

A quick note before you run patching experiments: the subgraph collapse we're seeing isn't a measurement error. It's an architectural property that follows from hook equivalence, and you can verify it before running a single experiment. Nanda's patching guide is still the reference for metric/distribution pitfalls, but it won't catch this.
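The collapse can be checked on a toy chain before touching the real model. The sketch below is not the actual classification head: the four layers are hypothetical stand-ins (a BatchNorm-like map with fixed statistics, a ReLU, an affine map, and a pooling step), and it assumes the whole activation at a hook is replaced. Patching only a subset of channels would not collapse this way.

```python
def record_activations(layers, x):
    """Run x through the chain and return the activation after every layer."""
    acts = []
    h = x
    for layer in layers:
        h = layer(h)
        acts.append(h)
    return acts


def run_with_patch(layers, x, hook, patch_value):
    """Run x through the chain, replacing the full activation at `hook`."""
    h = x
    for k, layer in enumerate(layers):
        h = layer(h)
        if k == hook:
            h = patch_value  # full-activation interchange at this hook
    return h


# Hypothetical deterministic sequential block (stand-in for an eval-mode head).
layers = [
    lambda h: [(v - 0.5) / 2.0 for v in h],  # BatchNorm stand-in, fixed running stats
    lambda h: [max(0.0, v) for v in h],       # ReLU
    lambda h: [1.5 * v + 0.1 for v in h],     # fixed affine map
    lambda h: [sum(h)],                       # pooled scalar "logit"
]

base_input = [0.2, 1.7, -0.4]
source_input = [2.1, 0.3, 1.0]

base_logit = record_activations(layers, base_input)[-1][0]
source_acts = record_activations(layers, source_input)

# Logit delta from patching the source activation in at each hook in turn.
deltas = [
    run_with_patch(layers, base_input, hook, source_acts[hook])[0] - base_logit
    for hook in range(len(layers))
]
print(deltas)
```

Every entry of `deltas` comes out the same, because once the full activation is overwritten the rest of the chain just replays the source run.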

Linear probing trains a fresh classifier on frozen activations at each hook without patching, bypassing this collapse. I use it as the primary head-level measurement below.

Observation 2: Probe-delta dissociation

At H09, the causal sensitivity ratio is 3.6× while the probe difference is 0.006 (against a chance baseline of 0.143).

Protocol asymmetry: The D4 probe uses PCA-50 features; the Vanilla probe uses full spatial features. H08 PCA-50 explains only 58% of variance. I am not claiming the probe difference is strictly zero, only that the observed difference is small relative to the 3.6× causal sensitivity gap.

Interchange interventions test whether what is stored matters for the output. Probes are different: they test whether what is stored is structured by the property of interest at all. Here, the property is decodable well above chance in both models, but its causal footprint in the D4 model is 3.6× smaller, which is exactly why I spent two weeks convinced that my code was bugged.

Hypothesis: The Spectral Mechanism

I think this is a frequency thing, specifically what the Poisson solver does between the encoder and the head.

The suppression from the Poisson solver:

We know analytically that this operator acts as a low-pass filter. At H08, the encoder puts 59.5% of the orbit variance into mid-frequency bands. But after passing through the Poisson solver, more than 99.9% of the variance is low frequency. The mid-frequency content is almost entirely removed.

The probe still finds the D4 identity because that information survives in the low frequencies. But the physics operator has already filtered out the exact variance an intervention needs in order to change the output, which is why the logit delta drops. Vanilla uses the exact same zero-parameter operator, but because its encoder gives less structured orbit distributions, that mid-frequency content passes the filter. As a result, causal sensitivity stays high, and we get the 3.6× gap in readings.
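To make the low-pass claim concrete, here is a toy 1D sketch. Inverting a Laplacian in Fourier space scales the amplitude at wavenumber k by 1/k², so power scales by 1/k⁴. The band edges (low: k ≤ 4, mid: 5 to 16, high: 17 to 32), the flat within-band spectrum, and the 30% / 10.5% split outside the mid band are assumptions chosen only so the input carries 59.5% of its power in the mid band. This is not the testbed's actual 2D solver or spectrum.

```python
def band_fractions(power, low_max, mid_max):
    """Fraction of total power in the low, mid and high wavenumber bands."""
    total = sum(power.values())
    low = sum(p for k, p in power.items() if k <= low_max)
    mid = sum(p for k, p in power.items() if low_max < k <= mid_max)
    high = total - low - mid
    return {"low": low / total, "mid": mid / total, "high": high / total}


def poisson_filter(power):
    """Apply an inverse-Laplacian response: amplitude / k^2, so power / k^4.

    The k = 0 mode is dropped, since the inverse Laplacian is undefined there.
    """
    return {k: p / k ** 4 for k, p in power.items() if k != 0}


LOW_MAX, MID_MAX, K_MAX = 4, 16, 32  # hypothetical band edges

# Hypothetical input spectrum: 30% low, 59.5% mid, 10.5% high, flat per band.
spectrum = {}
for k in range(1, K_MAX + 1):
    if k <= LOW_MAX:
        spectrum[k] = 0.300 / LOW_MAX
    elif k <= MID_MAX:
        spectrum[k] = 0.595 / (MID_MAX - LOW_MAX)
    else:
        spectrum[k] = 0.105 / (K_MAX - MID_MAX)

before = band_fractions(spectrum, LOW_MAX, MID_MAX)
after = band_fractions(poisson_filter(spectrum), LOW_MAX, MID_MAX)
print("before:", before)
print("after: ", after)
```

In this toy setup the low band ends up with more than 99% of the power after filtering, even though most of the input power sat in the mid band. The exact figure depends entirely on the assumed band edges and input spectrum.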

Open Questions & Next Steps

I'm posting this to see if others have hit similar decoupling, and also because I want feedback on the experimental design in case it has flaws.

The Planned Control (Option B):

To actually isolate the equivariance effect, I need to run an aligned-error-matched control. I'll also need to align the probe protocols so they use identical dimensionality reduction. My working threshold for "matched" is a Vanilla baseline within 2× of D4's aligned error, and I'd genuinely like pushback on whether that's tight enough, because I'm not confident it is.

If the matched control produces identical intervention deltas for D4 and Vanilla: the equivariance-as-cause story is falsified. The anomaly localizes entirely to reconstruction quality. The spectral suppression hypothesis survives, but it turns out to be driven by model convergence rather than the built-in geometry.
If the matched control preserves the probe-delta dissociation: the equivariance constraint is genuinely what reorganizes the variance across the frequency spectrum, and equivariance is actually doing something structural here.

If you have a different opinion, or think the 2× matching threshold is too loose, I would like to know before I run it.
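The matching criterion is simple enough to pin down in code before retraining anything. This is a small sketch of the 2× check using the aligned errors quoted earlier (0.00735 for Vanilla, 0.000563 for D4). Treating the threshold symmetrically (either model may be the worse one) and the function names are assumptions.

```python
def error_ratio(vanilla_err, d4_err):
    """How many times worse Vanilla's aligned error is than D4's."""
    return vanilla_err / d4_err


def is_matched(vanilla_err, d4_err, max_ratio=2.0):
    """True if the two aligned errors are within `max_ratio` of each other."""
    worse = max(vanilla_err, d4_err)
    better = min(vanilla_err, d4_err)
    return worse / better <= max_ratio


D4_ERR = 0.000563       # aligned error reported for D4LensPINN
VANILLA_ERR = 0.00735   # aligned error reported for VanillaLensPINN

print("current ratio:", round(error_ratio(VANILLA_ERR, D4_ERR), 2))
print("currently matched:", is_matched(VANILLA_ERR, D4_ERR))
# Largest Vanilla error that would still count as matched at 2x:
print("target Vanilla error <=", 2.0 * D4_ERR)
```

With the current numbers the check fails, which is the confound restated: the existing Vanilla model cannot serve as the matched control.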

If standard tools fail this predictably in these architectures, how do we systematically figure out what's inside them without having to do a manual frequency decomposition at every single step? Maybe the answer is to just treat the gap between the tools as its own signal. But to be honest, I've hit a wall trying to figure out what a single metric for that would actually look like in practice. Every formulation I tried required me to know the spectral properties of the fixed operator, which defeats the point.

Code and trained weights:

Repository: https://github.com/Aditya26189/Mechanistic-Interpretability-of-Hybrid-Physics-ML-Architectures

Trained weights: https://huggingface.co/coding103856/d4lens-weights

Dataset: https://drive.google.com/uc?id=1ZEyNMEO43u3qhJAwJeBZxFBEYc_pVYZQ