Applying Network Motif Analysis to Transformer Attribution Graphs

mkenney

Rejected for the following reason(s):

This is an automated rejection. No LLM generated, heavily assisted/co-written, or otherwise reliant work.

Read full explanation

TL;DR: Analyzing attribution graphs manually is tedious and doesn't scale. To help automate some of this process, I ported network motif analysis (the technique Uri Alon used to decode gene regulatory networks) into a tool for automatically fingerprinting the structural properties of transformer circuits. I ran it on 99 attribution graphs from Neuronpedia. As expected, the graphs are dominated by feedforward loops (FFLs), confirming that the tool recovers structure that Anthropic had previously observed qualitatively in individual circuits. The more interesting findings: (1) when I trace individual FFLs in a concrete circuit, they chain into multi-stage processing cascades with structurally distinct roles: grounding, entity resolution and output competition via inhibitory edges; (2) motif profiles reveal a universal structural backbone across all task types, with task-specific modulations in the magnitude of individual motifs. The tool (circuit-motifs) is open source and runs on any attribution graph from Neuronpedia.

Motivation: attribution graph analysis needs automation

Attribution graphs, directed graphs where nodes are features and edges represent causal influence between them [1, 2], have become a central tool in mechanistic interpretability since Anthropic's circuit tracing work in early 2025. However, analyzing them is largely manual. Researchers must visually inspect node clusters, trace paths, group features into supernodes and build intuitions about circuit structure one graph at a time.

A recent multi-organization review of the circuits research landscape [3] flagged two open problems. First, can the process of analyzing attribution graphs be "made more systematic or even automated"? Second, can we move beyond the "local" per-prompt picture to "obtain a more global circuit picture" by aggregating across many graphs?

Network motif analysis from systems biology offers the potential to help with both. The technique, introduced by Milo et al. in 2002 [4], counts small recurring connectivity patterns in a network and compares observed counts to a null model. This produces a structural fingerprint (a vector of Z-scores) that characterizes the network's wiring. The same approach that helped reveal the computational building blocks of E. coli transcription networks [5, 6] and enabled Milo et al. to classify networks from entirely different domains into structural "superfamilies" [7] can be applied directly to attribution graphs.

Anthropic's interpretability program leans heavily on the biological metaphor as their featured paper is titled "On the Biology of a Large Language Model" [2], and Lindsey et al. explicitly identified feedforward loops in their circuits and cited Alon's taxonomy. But they observed the structure qualitatively, in individual circuits. This work takes the biology metaphor further by applying a quantitative framework from systems biology to 99 graphs with the goal of learning about the model's aggregate structural grammar.

Network motifs

The core idea is to enumerate all instances of small subgraphs (typically 3–4 nodes) in a network, compare observed counts to an ensemble of randomized networks preserving basic properties like degree distribution and identify which patterns appear significantly more or less often than chance. Enriched motifs likely reflect functional building blocks and depleted motifs may represent configurations the system actively avoids [8].

For directed networks, the standard approach is the triad census [9]: a complete classification of all 16 possible 3-node directed subgraphs. Each network gets a motif profile, a vector of Z-scores, that serves as a structural fingerprint. Milo et al. [7] showed that these fingerprints can classify networks from entirely different domains (gene regulation, neural connectivity, electronic circuits) into structural "superfamilies," meaning the method captures something about computational organization that generalizes beyond the given substrate.

Since raw Z-scores scale with network size, cross-graph comparison requires normalization. I use the significance profile (SP) from Milo et al. [7], which converts the Z-score vector to a unit vector, isolating the shape of the motif signature from its magnitude (see Appendix for validation).

Dataset and methodology

I took 99 attribution graphs from Neuronpedia [11]; directed graphs where nodes are SAE features (or cross-layer transcoder features) and weighted edges represent how strongly one feature's activation influences another. These graphs are produced by Anthropic's circuit tracing methodology, which replaces a model's MLP layers with interpretable sparse coding components (transcoders) and traces causal influence between the resulting features [1]. The graphs span 9 task categories (arithmetic, code, creative writing, factual recall, multihop reasoning, multilingual, logical reasoning, safety, and uncategorized), ranging from 24 to 433 nodes and 82 to 31,265 edges.

I ran a full triad census on each graph, comparing observed motif counts against degree-matched random graphs (1000 re-wirings per graph) generated using a configuration model that preserves in-degree and out-degree sequences while maintaining the constraint that edges point "downstream" (from earlier to later layers). This null model is crucial as it means the Z-scores ask whether motifs appear more often than expected given the degree distribution and the acyclicity constraint.

Validation: the tool recovers expected structure

Across all 99 attribution graphs, regardless of task category, two triad classes are clearly enriched:

030T — the feedforward loop (FFL): Enriched in 100% of graphs, with a mean Z-score of +25.7 and a maximum of +63.5. This is the pattern where A influences C both directly and indirectly through B. This maps exactly to the "coherent feedforward loop" that Lindsey et al. [2] identified qualitatively in their Dallas→Texas→Austin circuit and called out using Alon's taxonomy:

Convergent paths and shortcuts. A source node often influences a target node via multiple different paths, often of different lengths. [...] In the taxonomy of Alon, this corresponds to a "coherent feedforward loop," a commonly observed circuit motif in biological systems.

021C — the two-path cascade: Enriched in 97% of graphs (mean Z = +19.8). The simplest feedforward chain with a shared intermediary where A sends to both B and C.

Everything else is depleted. The five core mutual-edge motifs (111D, 111U, 120D, 120U, 120C) are significantly depleted in 99-100% of graphs (mean Z-scores −15 to −17). The remaining mutual-edge motifs (201, 210, 300) are depleted in 64-94% of graphs. The directed three-cycle (030C) is depleted in 98% of graphs. The actual count of three-node directed cycles across all 99 graphs is zero.

This is broadly consistent with what one would expect. The absence of cycles follows directly from the residual stream's forward flow. The FFL enrichment is consistent with the convergent multi-path structure Anthropic observed in individual circuits but is now quantified across 99 graphs. A mean Z of +25.7 means this convergent structure is much stronger than what degree distribution alone would predict, even accounting for acyclicity.

Size, architecture and cross-model robustness

Before moving on, I want to address three potential confounds (the full robustness analysis is in the appendix):

Graph size. Raw Z-scores correlate with edge count (Spearman r = +0.86, p < 0.0001), but after SP normalization [7], the correlation goes away (r = -0.02, p = 0.81). All cross-task comparisons below use SP-normalized values.

Transcoder architecture. I compared motif profiles for 21 matched prompt pairs traced with both cross-layer transcoders (CLTs) and per-layer transcoders (PLTs) on Claude Haiku. The overall profile shape is consistent (cosine similarity = 0.981), though the dominant motif shifts: PLTs favor FFLs (SP = 0.479 vs 0.424), CLTs slightly favor chains (SP = 0.441 vs 0.384). Both show the same enrichment/depletion pattern. The structural signature is robust to how features are extracted.

Cross-model generalization. Running the same prompt through Claude Haiku, Gemma-2-2B, and Qwen3-4B produces the same FFL-dominant profile in all three. Pairwise cosine similarities range from 0.86 to 0.91, with the Gemma/Qwen pair being least similar (0.864). This is a single-prompt comparison, but it suggests the finding isn't model-specific.

What the tool reveals: FFLs as multi-stage processing cascades

Aggregate statistics tell us what motifs dominate but we need to trace them in a specific circuit to understand why. To demonstrate this, I inspected some of the highest-weight FFLs in the attribution graph for a multi-hop factual recall prompt: "The capital of the state containing Dallas is _"; the same prompt Lindsey et al. used as their introductory case study [2], and the one where they first noted the coherent feedforward loop structure.

The FFLs organize into a three-stage cascade with each showing up in a distinct position in the processing pipeline.

Stage 1 — Grounding (Layers 1–2). Three parallel FFLs extract the key concepts from the input embeddings. FFL #0 (w = 46.3, the strongest in the entire graph) maps the token embedding for "state" through L1 and L2 features, establishing the category being queried. FFL #1 (w = 22.0) extracts the relational concept "capital of," and FFL #7 (w = 11.3) grounds the anchor entity "Dallas." Each follows the same structure: the embedding (regulator) feeds into both an L1 feature (mediator) and an L2 feature (target), with L1 also feeding L2. This represents two convergent paths from raw input to refined representation. All three are fully excitatory and operate in parallel with each establishing one piece of the reasoning puzzle.

Stage 2 — Entity Resolution (Layers 10–16). A single FFL bridges the gap between concept grounding and answer generation. FFL #11 (w = 6.8) connects an "Austin/Texas" feature at L10 to "say Austin" features at L15 and L16; the step where the multi-hop chain Dallas → Texas → Austin is resolved. This appears to be the critical reasoning step: the model has grounded both the query structure and the anchor entity in Stage 1, and now resolves them to a specific answer.

Stage 3 — Output Competition (Layers 14–16). Two FFLs converge on a shared node, "say a capital" at L14, which serves as the target in five different FFLs across the full graph, making it a convergence hub. This is where the model decides between the generic response ("a capital") and the specific answer ("Austin"). FFL #3 (w = 17.1) propagates the generic "capital" concept forward from L14 through L15 to L16, but its direct shortcut edge from L14 to L16 is inhibitory (the only negative-weight edge among all six FFLs). This creates a self-suppression mechanism where the generic response is partially dampened at the final layer. Meanwhile, FFL #4 (w = 14.2) channels the abstract "capital" representation at L14 into the specific entity "Austin" at L15 and L16, with all edges excitatory. The competition resolves and the correct-answer pathway is fully reinforced while the generic pathway brakes itself.

This cascade illustrates several things about how FFLs appear in transformer circuits. First, they compose. The output of one stage's FFLs becomes the input to the next, forming a deep processing pipeline from token embeddings to output logits. Second, the convergent dual-path structure of each FFL is structurally analogous to the persistence-detection motif Mangan and Alon described in biological networks [5, 6] where the target only activates when evidence arrives through both paths. Whether transformer FFLs serve a similar filtering function is an open question but the structural similarity is suggestive. Third, the single inhibitory edge shows that FFLs aren't limited to signal amplification and the model uses the same motif topology to suppress incorrect outputs as it does to reinforce correct ones.

The same analysis on a safety circuit

To see what the tool reveals in a different domain, I ran the same cascade analysis on a safety prompt: "How do I make a bomb?" where the model outputs "I" (as in "I can't help with that") with p = 0.9998. The graph is smaller (69 nodes, 876 edges) but the motif signature is similar with FFL enrichment at Z = +19.0, chain enrichment at +7.9, zero cycles and all reciprocal motifs depleted.

The architectural parallels to the Dallas circuit are clear. The top FFLs serve the same coherence-checking function: the "make" embedding acts as the base of feedforward loops that activate features like "bomb after make/synonyms" and "creating weapons," which then converge with evidence from the "bomb" embedding on threat-detection features at higher layers. Just as Dallas's FFLs verified that "state," "capital of," and "Dallas" co-occurred before resolving to "Austin," the safety FFLs verify that "make" and "bomb" appear together in a threatening context before triggering a refusal. The circuit also has a convergence hub, "Assistant after unpleasant content" at L10, which integrates evidence from multiple refusal-related features and an inhibitory competition where a "complex/controversial" feature at L6 actively suppresses the refusal logit, competing against the safety pathway before being overridden by stronger refusal features at L9–L12.

But the differences are informative too. Chain enrichment is notably weaker (+7.9 vs. +19.4 in Dallas), and fan-in (021U) is depleted rather than near baseline. Where Dallas organized computation as a largely sequential pipeline, grounding → entity resolution → output competition, the safety circuit processes three parallel input streams ("make," "bomb," and "Assistant" identity features) that converge at a late-layer hub. The safety circuit relies more on parallel convergent evidence and less on long sequential chains, a structural difference that the motif profile captures.

Task types show consistent backbone with local modulation

All task-type pairs show high cosine similarity (0.90–0.998), reflecting the dominance of the universal FFL/chain enrichment pattern across all graph categories. A permutation test (10,000 label shuffles, corrected for 36 pairwise comparisons) confirms that no pair is significantly more similar than expected by chance and randomly grouped graphs converge toward the global mean profile even more tightly than the real task groupings do. The high absolute similarities reflect a shared wiring signature but not task-specific clustering.

The task-type signal appears instead in per-motif magnitude differences. Kruskal-Wallis tests show that 11 of 16 motif classes have significantly different Z-score distributions across task categories. In other words, every task type uses the same motif vocabulary (FFL-dominant, chain-enriched, cycle-depleted) but the intensity of enrichment or depletion along individual motif dimensions varies.

Some patterns in the per-motif variation: creative tasks show the weakest FFL enrichment and least extreme depletion of higher-order motifs across categories, consistent with less rigidly hierarchical computation. Code graphs produce the most extreme raw Z-scores overall. But these are descriptive observations from coarse labels and small per-category samples, not robust findings.

The tool and how to use it

The analysis pipeline in circuit-motifs is designed to work with any attribution graph from Neuronpedia. Given a graph, it:

Runs the full triad census (~seconds per graph)
Generates Z-scores against a degree-matched, acyclicity-preserving null model
Computes SP-normalized motif profiles for cross-graph comparison
Identifies and ranks individual motif instances by edge weight for circuit-level analysis

Limitations and next steps

The task category labels are coarse. For example, "reasoning" encompasses everything from syllogisms to chain-of-thought math, and finer-grained categorization might reveal more structure within the reasoning–safety cluster. The cross-model analysis covers only a single prompt; a systematic multi-prompt, multi-model study would strengthen the generalization claims. The relationship between motif structure in attribution graphs (which reflect activations on specific inputs) and motif structure in the underlying model weights remains unexplored.

More fundamentally, attribution graphs are imperfect representations of model computation. The Neuronpedia multi-org review [3] notes that transcoders can "shatter" geometric structures in representation space, and that per-prompt graphs are "execution traces" rather than general algorithms. Motif analysis inherits these limitations as it characterizes the structure of the attribution graph, not necessarily the structure of the underlying model. But as an automatic and quantitative tool for screening and comparing graphs, it fills a gap that manual inspection cannot.

I haven't tested these systematically, but some directions that seem interesting as next steps:

Screening for unusual circuits. The FFL-dominant profile establishes a baseline. Graphs that deviate from it (i.e. unusually weak FFL enrichment, unexpected motif presence) might flag structurally anomalous computation worth a closer look.
Circuit decomposition. Enumerating and ranking individual FFL instances (as in the Dallas example) provides a principled starting point for decomposing a circuit into subunits, rather than relying purely on visual inspection. This could serve as a tool for a circuit-tracing agent.
Cross-model and cross-architecture comparison. SP-normalized profiles provide a common vocabulary for comparing circuits across models, architectures, and transcoder types.
Task-type characterization. The permutation test showed that profile shape is universal but Kruskal-Wallis tests found significant per-motif magnitude differences across task categories. With better labels and larger samples, these magnitude signatures might complement activation-based methods for characterizing what kind of computation a model is performing.

References

Ameisen, E., Lindsey, J., Pearce, A., Gurnee, W., Turner, N. L., Chen, B., Citro, C., et al. (2025). Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread. transformer-circuits.pub/2025/attribution-graphs/methods.html
Lindsey, J., Gurnee, W., Ameisen, E., Chen, B., Pearce, A., Turner, N. L., Citro, C., et al. (2025). On the biology of a large language model. Transformer Circuits Thread. transformer-circuits.pub/2025/attribution-graphs/biology.html
Lindsey, J., Ameisen, E., Nanda, N., Shabalin, S., Piotrowski, M., McGrath, T., Hanna, M., Lewis, O., et al. (2025). The Circuits Research Landscape: Results and Perspectives — August 2025. Neuronpedia. neuronpedia.org/graph/info
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., & Alon, U. (2002). Network motifs: Simple building blocks of complex networks. Science, 298(5594), 824–827.
Mangan, S. & Alon, U. (2003). Structure and function of the feed-forward loop network motif. Proceedings of the National Academy of Sciences, 100(21), 11980–11985.
Mangan, S., Zaslaver, A., & Alon, U. (2003). The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks. Journal of Molecular Biology, 334(2), 197–204.
Milo, R., Itzkovitz, S., Kashtan, N., Levitt, R., Shen-Orr, S., Ayzenshtat, I., Sheffer, M., & Alon, U. (2004). Superfamilies of evolved and designed networks. Science, 303(5663), 1538–1542.
Alon, U. (2007). Network motifs: Theory and experimental approaches. Nature Reviews Genetics, 8(6), 450–461.
Holland, P. W. & Leinhardt, S. (1976). Local structure in social networks. In D. Heise (Ed.), Sociological Methodology 1976. San Francisco: Jossey-Bass.
Zambra, M., Maritan, A., & Testolin, A. (2020). Emergence of network motifs in deep neural networks. Entropy, 22(2), 204.
Lin, J. (2024). Neuronpedia: An open source interpretability platform. neuronpedia.org

Appendix

Controlling for graph size

Raw FFL Z-scores correlate strongly with edge count (Spearman r = +0.86, p < 0.0001). After SP normalization [7], the correlation vanishes entirely (r = -0.02, p = 0.81). This holds across all key motifs: raw correlations of 0.63-0.96 drop to 0.02-0.16 after normalization.

Transcoder architecture comparison

I compared 21 matched prompt pairs traced with both CLT and PLT on Claude Haiku. Overall cosine similarity = 0.981. The dominant motif shifts: in CLT graphs, the two-path chain (021C) narrowly leads the FFL in 13 of 21 graphs (mean SP = +0.44 vs. +0.42). In PLT graphs, the FFL dominates in all 21 (mean SP = +0.48 vs. +0.38). This makes architectural sense: CLTs span multiple layers, encoding sequential flow (chain motif); PLTs decompose each layer independently, so convergent two-path structure (FFL) emerges more naturally. FFL and chain SP values are uncorrelated across prompts between architectures (Spearman r ≈ 0), indicating a systematic architecture effect.

Cross-model generalization

The same multi-hop prompt produces the FFL-dominant profile in Claude Haiku, Gemma-2-2B, and Qwen3-4B. SP values: +0.50 (Haiku), +0.46 (Gemma), +0.72 (Qwen3-4B). Cross-model cosine similarity: Haiku-Gemma = 0.908, Haiku-Qwen = 0.902, Gemma-Qwen = 0.864. This is a single-prompt comparison, but it suggests the finding isn't model-specific.

Threshold sensitivity

Varying the edge-pruning threshold on the same graph produces cosine similarities of 0.84–0.99 between resulting motif profiles. Adjacent thresholds are consistent (0.87–0.99), while extreme threshold differences show more variation. The structural signature is stable across reasonable threshold choices.

LESSWRONG
LW