Cross-Graph Cosine Collapsed to 0.019: A Null Result from an Attribution-Graph Confound

Pano Pouroullis

0.921 was the graph, not the country.

1. The Signal

An attribution graph is a prompt-local snapshot of one computation, not a readout of the model as a whole. It records, for one forward pass, which sparse-dictionary features (here, GemmaScope transcoder latents) the attribution method credits with contributing to particular predicted tokens. That matters here because two city tokens predicted inside the same graph share the prompt, residual stream, intermediate features, and graph geometry before they share any semantic category. The experiment below tests whether the apparent country signal survives when that shared substrate is removed.

In March 2026, we used circuit-tracer attribution graphs on Gemma 2 2B to measure cosine similarity between transcoder feature contribution vectors for city-name tokens within the same prompt graph. The decomposition lens was GemmaScope transcoder-16k with a 4096-feature cap. The feature vector for a given candidate next-token was the attribution row from the adjacency matrix - the contribution weights from the selected transcoder features to that token's logit position.
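A minimal sketch of the metric described above, assuming each candidate token's attribution row is already available as a plain list of contribution weights over the same selected features. The toy values and the `attribution_rows` name are hypothetical, and nothing here calls the circuit-tracer API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy rows: one row per candidate logit token, one column per
# selected transcoder feature. Real rows would have up to 4096 columns.
attribution_rows = {
    "Canberra": [0.8, 0.1, 0.0, 0.5],
    "Sydney": [0.7, 0.2, 0.1, 0.4],
}

similarity = cosine(attribution_rows["Canberra"], attribution_rows["Sydney"])
print(f"within-graph cosine (toy values): {similarity:.3f}")
```

Both rows come from the same graph, so they index the same feature columns by construction. That convenience is exactly what disappears in a cross-graph comparison.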

Sydney and Canberra (both Australian cities), predicted from the same prompt graph: 0.921 cosine. Islamabad and Lahore (both Pakistani cities): 0.90. Across six same-country city pairs in total, the mean was 0.921 (standard deviation 0.022, range 0.90 to 0.96), against a cross-country cross-graph baseline of 0.029 (N=28 pairs).

One prompt-local attribution graph: fixed model weights → one prompt prefix → one forward pass → GemmaScope transcoder lens → candidate logits (Canberra, Sydney, Melbourne) → one attribution graph snapshot showing cosine 0.921 between Canberra and Sydney attribution rows within the same graph. Not validated model-wide knowledge.

Figure 0. The measurement object. One prompt prefix runs through Gemma 2 2B once. The transcoder lens decomposes the forward pass into interpretable features. The attribution graph captures which features contributed to each candidate token's logit. Within this single graph, the cosine between Canberra and Sydney's attribution rows was 0.921.

The gap looked decisive. Three same-family Claude review configurations (Mirror, Engine, Shield archetypes) reviewed the data and converged on "non-trivial semantic domain organisation." The within-graph signal had survived earlier checks for rarity, frequency, and feature alignment across multiple review rounds. At the end of a session that killed four other claims, this one remained standing. The belief was sincere: the model appeared to organise factual knowledge by country in its transcoder feature geometry.

That belief was wrong.

2. The Confound

The same day, we submitted the gate results to GPT-5.4 xhigh for cross-model review. It found five items that the three Claude agents had missed. All five were marked [NEW] - meaning none had been raised by any of the Claude reviewers.

GPT-5.4 identified the structural confound:

"The central comparison is confounded with graph identity. Experiment 1 is entirely within-graph; Experiment 2 is entirely cross-graph. Because there is only one graph per country, the current design cannot separate 'same country' from 'same prompt / same graph.'"

This is the confound we designated D7. Two logit tokens predicted from the same attribution graph share the same input tokens, the same residual stream at every position, every intermediate feature that fired during that forward pass, and the graph's overall geometry. The 0.921 cosine could mean "Sydney and Canberra share country-encoding features." It could equally mean "two predictions from the same computation share most of their upstream substrate." Within-graph cosine cannot distinguish these two stories.
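The second story can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a model of the actual graphs: each attribution row is a shared per-graph "substrate" vector plus independent token-specific noise, and the `shared_scale` value is chosen only so the toy lands in a similar range to the observed within-graph cosine. It contains no country feature at all.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def simulate(n_features=4096, shared_scale=3.0, seed=0):
    """Return (within-graph cosine, cross-graph cosine) for the toy model."""
    rng = random.Random(seed)

    def noise():
        return [rng.gauss(0.0, 1.0) for _ in range(n_features)]

    substrate_a = noise()  # upstream contributions shared by every row in graph A
    substrate_b = noise()  # an independent substrate for graph B

    def row(substrate):
        own = noise()  # token-specific part, independent of everything else
        return [shared_scale * s + o for s, o in zip(substrate, own)]

    sydney_a = row(substrate_a)
    canberra_a = row(substrate_a)
    canberra_b = row(substrate_b)
    return cosine(sydney_a, canberra_a), cosine(canberra_a, canberra_b)


within, cross = simulate()
print(f"within-graph: {within:.3f}, cross-graph: {cross:.3f}")
```

In this toy, a high within-graph cosine appears with no semantic content, and it vanishes once the two rows stop sharing a substrate. That shows the substrate story is sufficient to produce the pattern, not that it is what happened in the real graphs.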

Three Claude agents had not flagged this. GPT-5.4, reading the same experimental structure from outside the model family, named it directly. That documented fact is narrower than a theory of model families. One possible explanation is model-family distance; another is simply fresh-reader distance, and this single case cannot distinguish those mechanisms. What it does show is operational: the same-family review rounds were not enough in this case, and an external review pass changed the experiment.

3. The Decisive Test

On 18 March 2026, before running any new experiment, we committed a decision table to version control:

Mean cross-graph same-country cosine	Interpretation	Action
> 0.5	Domain clustering confirmed as semantic	Update thesis, publish
0.1 - 0.5	Partial semantic signal	Quantify decomposition
< 0.1	Clustering is graph-local	Kill the finding

The kill criterion was explicit: if the mean cross-graph same-country cosine fell below 0.1, the finding dies.
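The decision table can be written as a small function, which makes the pre-committed thresholds mechanical to apply. This is a sketch: the function name is hypothetical, and the handling of values exactly equal to 0.1 or 0.5 is an assumption, since the table does not say which band the boundaries belong to.

```python
def interpret(mean_cosine):
    """Map a mean cross-graph same-country cosine to (interpretation, action).

    Assumption: exactly 0.1 and exactly 0.5 fall in the middle band.
    """
    if mean_cosine > 0.5:
        return ("Domain clustering confirmed as semantic", "Update thesis, publish")
    if mean_cosine >= 0.1:
        return ("Partial semantic signal", "Quantify decomposition")
    return ("Clustering is graph-local", "Kill the finding")


for value in (0.7, 0.3, 0.05):
    interpretation, action = interpret(value)
    print(f"{value:.2f}: {interpretation} -> {action}")
```

Committing a rule like this before the data exists removes the option of choosing the threshold after seeing the result.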

The variant experiment generated 14 attribution graphs on EC2 (g4dn.xlarge) using different prompt templates. The same-city cross-graph analysis used graphs from four pilot countries: Australia, Pakistan, New Zealand, Canada. Each country received multiple variant prompts so that the same city would appear in graphs built from different input contexts. The implemented cross-graph comparison grouped matching city tokens across prompt variants: does the attribution-row cosine for the same city survive when the shared graph substrate is stripped away?
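The grouping step can be sketched as follows. How features were aligned across graphs is not specified in this post, so the sketch assumes each attribution row is stored as a sparse mapping keyed by a graph-independent feature identifier, here a hypothetical `(layer, feature_index)` tuple. Graph identifiers and weights are invented for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def sparse_cosine(u, v):
    """Cosine between sparse vectors stored as {feature_key: weight}."""
    dot = sum(w * v[key] for key, w in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cross_graph_same_city_pairs(rows):
    """rows: iterable of (graph_id, city, sparse_vector).

    Returns one (city, graph_1, graph_2, cosine) tuple for every pair of
    rows that share a city but come from different graphs.
    """
    by_city = defaultdict(list)
    for graph_id, city, vector in rows:
        by_city[city].append((graph_id, vector))

    pairs = []
    for city, entries in by_city.items():
        for (g1, v1), (g2, v2) in combinations(entries, 2):
            if g1 != g2:  # skip anything that would be a within-graph pair
                pairs.append((city, g1, g2, sparse_cosine(v1, v2)))
    return pairs


# Hypothetical rows from two prompt-variant graphs.
rows = [
    ("variant_1", "Canberra", {(5, 10): 1.0, (7, 3): 0.5}),
    ("variant_2", "Canberra", {(5, 10): 1.0}),
    ("variant_2", "Sydney", {(5, 10): 1.0}),
]
pairs = cross_graph_same_city_pairs(rows)
mean_cosine = sum(p[3] for p in pairs) / len(pairs)
```

Features absent from one graph contribute zero to the dot product but still count in the norm of the other, so rows with little feature overlap score low by construction.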

On 8 April 2026, the result came back.

Within-graph same-country (baseline): mean 0.921, N=6 pairs, range 0.90 to 0.96
Cross-graph same-city within-country (the test): mean 0.019, N=13 pairs, range -0.097 to 0.139
Cross-country cross-graph (control): mean 0.029, N=28 pairs, range -0.04 to 0.16

0.019 against a control of 0.029. The highest individual same-city cross-graph pair reached 0.139 - still within the cross-country baseline range. No pair exceeded 0.16.

The kill criterion was a mean below 0.1. The mean was 0.019. The finding was dead.

4. The Result

Three numbers summarise the main result. Within a single graph, same-country city pairs scored 0.921. Across prompt-variant graphs, matching same-city tokens scored 0.019. Across different graphs, different-country pairs scored 0.029.

Three-lane diagram showing within-graph same-country cosine 0.921 (Sydney and Canberra in the same graph), cross-graph same-city cosine 0.019 (Canberra in Graph A vs Canberra in Graph B), and cross-country cross-graph baseline 0.029 (Canberra vs Ottawa).

Figure 1. The correction. The original within-graph comparison produced a large cosine between same-country cities in one graph. When the same city was compared across prompt graphs, the mean fell to 0.019, indistinguishable from the 0.029 cross-country cross-graph baseline. Read Figure 0 and Figure 1 together: the first shows why the finding looked compelling, the second shows why it did not survive the decisive test.

The pair-level comparison did not distinguish the two cross-graph distributions (exact Mann-Whitney U=183, two-sided p=0.989; permutation test on mean difference p=0.562; bootstrap 95% CI for the difference [-0.047, 0.026]). Collapsing to city means and then to four pilot-country means led to the same conclusion. 0.019 and 0.029 are statistically indistinguishable under these checks.
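For readers who want to run the same kind of check on their own pair-level cosines, here is a minimal standard-library sketch of a two-sided permutation test on the mean difference and a percentile bootstrap interval. It is not the analysis script used for this post; the resampling counts, the percentile convention, and the input values are assumptions, and the lists below are synthetic placeholders rather than the post's data. The Mann-Whitney test is omitted.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation test on |mean(a) - mean(b)|."""
    rng = random.Random(seed)
    observed = abs(_mean(a) - _mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(_mean(pooled[:len(a)]) - _mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


def bootstrap_ci(a, b, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(_mean(resampled_a) - _mean(resampled_b))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Synthetic placeholder values, NOT the data reported in this post.
same_city = [0.02, -0.05, 0.11, 0.00, 0.03, -0.01]
cross_country = [0.04, -0.02, 0.09, 0.01, 0.05, 0.00, -0.03, 0.07]

p_value = permutation_p_value(same_city, cross_country)
ci_low, ci_high = bootstrap_ci(same_city, cross_country)
print(f"permutation p = {p_value:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

A bootstrap interval that spans zero and a large permutation p-value point the same way: the two distributions are not separated by these checks.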

The apparent country signal was scenery, not character. Inside one graph, Sydney and Canberra looked alike because they were both lit by the same stage lights - the same input tokens, the same residual stream, the same upstream features. Move them to different stages and the resemblance evaporated. The 0.921 was a real property of the attribution graph's structure. It was not a measurement of the model's country knowledge.

Before revealing the result to an external model, we asked GPT-5.5 Pro to predict the cross-graph same-country cosine given the experimental design. Its central estimate: approximately 0.35 to 0.40. The actual result was 0.019.

One caveat: the prompt given to GPT-5.5 Pro described the metric as "activation values at active features." The actual implementation computes attribution rows from the adjacency matrix - logit-subgraph attribution-row cosine, not activation values. Because the prompt described a different metric, this comparison should be treated as informal context rather than calibrated evidence. We present it because the direction is informative - the prediction and the result diverged substantially - but the metric mismatch means the gap itself is not a precise measurement.

GPT-5.5 Pro, after seeing the result, offered the cleanest single-sentence interpretation in the corpus:

"The 0.019 cross-graph null should be read not as evidence that Gemma lacks country knowledge, but as a construct-validity failure of a prompt-local circuit-tracer cosine proxy."

That distinction is load-bearing. The null does not tell us what the model knows. It tells us that this measurement instrument - cosine similarity over logit-subgraph attribution rows, with a 4096-feature cap, using GemmaScope transcoder-16k on Gemma 2 2B - does not recover country identity across different prompt contexts. Country knowledge may well be encoded in Gemma 2 2B's representations. This experiment shows that the cosine of attribution-row feature vectors, across different graphs, does not find it. This is consistent with prior work: Li et al. found that country-capital structure in Gemma 2 2B can be hidden by distractor features such as word length [5].

5. What the Null Means

The null licenses specific claims. It does not license broad ones.

What it licenses:

This attribution-row cosine metric did not recover country identity across prompt variation in this setup (Gemma 2 2B, GemmaScope transcoder-16k, logit-subgraph attribution-row cosine, 4096-feature cap, four countries).
The within-graph 0.921 is best explained by shared graph substrate rather than token-level country features. The variant experiment strongly supports the D7 confound explanation.
Single-graph sparse-dictionary cosine analysis is susceptible to within-graph substrate confounds that collapse under cross-graph testing.

What it does not license:

"Gemma 2 2B has no country knowledge." One decomposition, one metric, four countries. The measurement does not reach the model's knowledge.
"SAE features cannot encode semantics." Too broad. This was one comparison in one cosine space.
"The within-graph clustering was meaningless." It was structural - a real property of how attribution graphs are built. That is informative about the limits of single-graph analysis, not a measurement error.
Any causal claim about country knowledge or the functional role of these features. No ablation or activation patching was performed. The variant experiment is correlational.

6. How the Null Was Found

The confound was not caught by the original review. Three Claude agents (Opus instances in Mirror, Engine, Shield configurations) reviewed the experimental design across multiple rounds without flagging graph identity as the confound. The same data was submitted to GPT-5.4 xhigh for cross-model review. It returned five items the Claude reviews had missed, including control group mislabelling and collinearity between the primary discriminator and the output probability gap (R²=0.88). GPT-5.4 named D7 - the graph-identity confound.

Whether the external review caught D7 because of model-family distance or because any fresh reader brings useful perspective, this case cannot distinguish. What it does show is that the confound was caught, the experiment was redesigned, and the result changed. In this case, the same-family review rounds had not displaced the 0.921 finding before external review.

The variant experiment was designed to resolve D7, and the original claim did not survive it. That sequence is what this post documents: review names a confound, an experiment tests it, and the claim falls. The account is specific to this pipeline and this result.

A concrete recommendation: Before reporting cross-entity feature geometry clustering in attribution graphs, run the variant graph test. Generate the same entity under multiple prompts. Compare your target metric across the resulting graphs. If the metric does not survive graph variation, you have a within-graph substrate confound.

7. What Remains Open

This experiment tested one question in one decomposition. Here is what it does not close.

The phonetic-pair gap. The experiment found no aggregate cross-graph country signal under this metric. But the cross-country control (28 pairs, mean 0.029, range -0.04 to 0.16) offers only aggregate protection. A targeted test - Canberra against a phonetically similar non-Australian city of comparable token frequency - was not run. Because the aggregate distribution sits at noise level, a strong systematic alternative organising variable is unlikely, but specific phonetic pairs remain untested.

Causal validation. The variant experiment is correlational. Ablating the features implicated in the within-graph 0.921 and measuring whether predictions degrade would distinguish "structural but incidental" from "structural and functional." This experiment was not performed.

Cross-architecture comparison. The null is bounded to Gemma 2 2B with GemmaScope transcoder-16k via circuit-tracer. A different model, a different SAE, or a different attribution method could produce a different result. Within-graph substrate sharing is a general feature of single-graph analysis; whether it is confounding depends on the comparison. The magnitude of the resulting artefact may vary across architectures and methods.

Provenance caveat. The 14 variant graph files were generated on EC2. PROVENANCE_ec2.json records the script SHA256 and generation timestamps, but git_sha is recorded as "unknown" for all 14 files. The exact repository state at generation is not fully traceable.
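One way to avoid an "unknown" commit in future runs is to capture it at generation time. The sketch below is an assumption about how such a record could be built, not the script that produced PROVENANCE_ec2.json: the function names and the `generated_at` field are hypothetical, and only `script_sha256` and `git_sha` mirror fields named above.

```python
import hashlib
import subprocess
from datetime import datetime, timezone


def current_git_sha(repo_dir="."):
    """Return the checked-out commit hash, or "unknown" if git cannot report one."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, repo_dir is not a repository, or the call failed.
        return "unknown"
    return result.stdout.strip() or "unknown"


def provenance_record(script_path, repo_dir="."):
    """Build a provenance dict for one generation run."""
    with open(script_path, "rb") as handle:
        script_sha256 = hashlib.sha256(handle.read()).hexdigest()
    return {
        "script_sha256": script_sha256,
        "git_sha": current_git_sha(repo_dir),
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
```

A stricter variant would refuse to generate graphs when `git_sha` comes back as "unknown", trading convenience on a fresh EC2 instance for a fully traceable repository state.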

Model and decomposition: Gemma 2 2B [1], GemmaScope transcoder-16k [2], circuit-tracer v0.3.1 [3], 4096-feature cap.

Countries tested: Australia, Pakistan, New Zealand, Canada.

D7 confound named: 18 March 2026. Variant graphs generated: 18 March 2026. Cross-graph analysis result: 8 April 2026.

Decision table committed: 18 March 2026, 21 days before the analysis.

Earlier related post: [4] described the cross-model review pattern that the later D7 review extended. It also reported the within-graph cosine clustering as the surviving working hypothesis; that claim is the one the present experiment falsifies.

References

[1] Gemma Team, Google DeepMind. "Gemma 2: Improving Open Language Models at a Practical Size." arXiv:2408.00118, 2024.

[2] Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda. "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2." arXiv:2408.05147, 2024.

[3] Michael Hanna (University of Amsterdam), Mateusz Piotrowski, Jack Lindsey, Emmanuel Ameisen (Anthropic). "circuit-tracer: A New Library for Finding Feature Circuits." BlackboxNLP 2025. Software: circuit-tracer v0.3.1, https://github.com/safety-research/circuit-tracer

[4] Pouroullis, P. "The Correction Came From Outside: What a geography error taught me about AI verification." LinkedIn long-form article, 17 March 2026. https://www.linkedin.com/pulse/correction-came-from-outside-pano-pouroullis-69eke

[5] Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark. "The Geometry of Concepts: Sparse Autoencoder Feature Structure." Entropy 27, 344, 2025. arXiv:2410.19750.