I've been building a mechanistic interpretability toolkit for the last few months. Not writing about it, actually building it, running into walls, rewriting things. This post is about two implementation problems that cost me real time, and a metric I had to invent because I couldn't find what I needed.

The hook bug

Attribution patching requires splicing corrupted activations into the model mid-forward-pass. In TransformerLens you do this with hooks. My first implementation:

    def patch_hook(value, hook):
        value[:] = corrupted_cache[hook.name]

This does nothing. The hook fires, you mutate the tensor in place, and the downstream computation ignores it. No error, no warning, just wrong numbers that look totally reasonable.

The correct version:

    def patch_hook(value, hook):
        return corrupted_cache[hook.name]

You have to return the new value. The hook replaces the tensor; it doesn't modify it.
I spent three days thinking my attribution scores were just noisy before someone on Discord pointed this out. I went back and checked: there is one sentence about it in the TransformerLens docs, easy to miss if you've written PyTorch hooks before and assume they work the same way.
If your attribution patching results look plausible but slightly wrong on every example, check this first.
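To make the failure mode concrete, here is a toy stand-in for a hook point, written with plain lists so it runs without TransformerLens. The copy-before-calling step is an assumption I'm making purely to reproduce the symptom; the real library's internals differ, but the user-facing rule it illustrates is the same: return the replacement, don't mutate in place.

```python
# Toy hook point, NOT TransformerLens internals. It hands each hook a copy
# of the value and only honors what the hook returns.
class ToyHookPoint:
    def __init__(self, name):
        self.name = name
        self.hooks = []

    def __call__(self, value):
        for hook in self.hooks:
            result = hook(list(value), self)  # hook sees a copy of the value
            if result is not None:            # only a returned value replaces it
                value = result
        return value

corrupted_cache = {"blocks.0.attn.hook_z": [9, 9, 9]}

def mutating_hook(value, hook):
    # Buggy version: mutates its local copy, which is then thrown away.
    value[:] = corrupted_cache[hook.name]

def returning_hook(value, hook):
    # Fixed version: the returned value replaces the original downstream.
    return corrupted_cache[hook.name]

hp = ToyHookPoint("blocks.0.attn.hook_z")
hp.hooks = [mutating_hook]
print(hp([1, 2, 3]))  # [1, 2, 3] -- the mutation silently vanishes

hp.hooks = [returning_hook]
print(hp([1, 2, 3]))  # [9, 9, 9] -- the patch takes effect
```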
Why I didn't use ACDC
ACDC (Conmy et al. 2023) discovers circuits by testing every edge in the computation graph. For GPT-2 small that's something like 14,000 edges. It works. On an A100 it takes about 30-40 minutes per prompt.
I needed something faster because I wanted to run it on batches, and also because I got impatient.
The thing I noticed: sufficiency and comprehensiveness don't need the same treatment. Sufficiency is "does the circuit reproduce the behavior?" You can get a decent approximation of this from a first-order Taylor expansion — one extra forward pass for the Jacobian, no per-head patching needed. Comprehensiveness is "is the circuit actually necessary?" You need exact patching here, no way around it.
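The sufficiency shortcut can be sketched like this. Everything below is a stand-in (random arrays instead of cached activations, a hypothetical approx_sufficiency helper, my own shape choices); the point is that one cached backward pass gives a per-head first-order effect, after which scoring any candidate circuit is a sum, not a forward pass.

```python
import numpy as np

# Illustrative stand-ins for 12 attention heads; shapes and values are
# invented for the sketch, not taken from the actual toolkit.
rng = np.random.default_rng(0)
n_heads, d_head = 12, 64
clean_act = rng.normal(size=(n_heads, d_head))      # cached from the clean run
corrupted_act = rng.normal(size=(n_heads, d_head))  # cached from the corrupted run
grad = rng.normal(size=(n_heads, d_head))  # d(metric)/d(activation), one backward pass

# First-order (Taylor) estimate of how the metric moves if head h's
# activation is swapped for its corrupted counterpart:
#   delta_metric[h] ~ <corrupted_act[h] - clean_act[h], grad[h]>
head_effect = ((corrupted_act - clean_act) * grad).sum(axis=-1)

def approx_sufficiency(circuit, clean_metric):
    """Estimated metric when everything OUTSIDE the circuit is corrupted:
    the clean metric plus the summed first-order effect of non-circuit heads."""
    outside = [h for h in range(n_heads) if h not in circuit]
    return clean_metric + head_effect[outside].sum()
```

Once head_effect is cached, evaluating a circuit costs an index-and-sum, which is why the forward selection phase needs no per-head patching.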
So: greedy forward selection using Taylor-approximated sufficiency (3 baseline passes, nothing extra per head), then backward pruning using exact comprehensiveness (2 passes per head you consider removing). On IOI with GPT-2 small this runs in 1.2 seconds.
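A minimal sketch of that two-phase loop, with the scoring functions passed in as callables. The thresholds and function names here are my assumptions for illustration, not the package's API; in the real version the exact comprehensiveness call is where the two patched passes per candidate head are paid.

```python
def select_circuit(heads, approx_sufficiency, exact_comprehensiveness,
                   target=0.9, keep_threshold=0.05):
    # Phase 1, forward: greedily add the head that most improves the CHEAP
    # Taylor-approximated sufficiency, until the circuit looks sufficient.
    circuit, remaining = [], list(heads)
    while remaining and approx_sufficiency(circuit) < target:
        best = max(remaining, key=lambda h: approx_sufficiency(circuit + [h]))
        circuit.append(best)
        remaining.remove(best)
    # Phase 2, backward: pay for EXACT comprehensiveness and drop any head
    # whose removal barely changes it (i.e., the head wasn't necessary).
    for h in list(circuit):
        trimmed = [x for x in circuit if x != h]
        drop = exact_comprehensiveness(circuit) - exact_comprehensiveness(trimmed)
        if drop < keep_threshold:
            circuit = trimmed
    return circuit
```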
The catch is that sufficiency is approximate. Not wildly wrong — the Taylor expansion is decent — but not exact. I flag this in the output. If you need exact sufficiency you can switch to integrated gradients, which costs 10-20x more passes but removes the approximation.
The 37x speedup over ACDC comes from this asymmetry plus working at the head level instead of the edge level.
A metric for circuit stability across prompts
After finding a circuit, I wanted to know whether I'd found something real or just an artifact of that specific prompt. The usual approach is to run the same analysis on multiple prompts and eyeball the circuits. This gets tedious.
I wanted a number. What I settled on: compare two circuits by the distribution of their heads' relative layer depths. Normalize each head's layer: d = layer / (n_layers - 1), so layer 0 is 0.0 and the final layer is 1.0. For two circuits A and B with k heads each (taking the top-k by attribution), FCAS = 1 - (1/k) * sum(|d_A_i - d_B_i|).
High FCAS means the two circuits live at similar depths. That's a rough proxy for "maybe doing similar things." It's not a rigorous claim — two circuits at the same depth could be doing completely different things. But it's better than nothing for quickly checking whether your circuit is stable across rephrased prompts.
I also compute a bootstrap null to get a z-score, because a raw similarity number without a baseline is hard to interpret.
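Putting the definition and the bootstrap together, a self-contained sketch. I'm assuming here that the two lists are paired in attribution-rank order, and the uniform-random-layer null is my choice for the sketch; other nulls are defensible.

```python
import random
import statistics

def fcas(layers_a, layers_b, n_layers):
    """FCAS as defined above: 1 minus the mean absolute difference in
    relative layer depth d = layer / (n_layers - 1), over the top-k heads
    of each circuit, paired in attribution-rank order."""
    k = len(layers_a)
    d_a = [l / (n_layers - 1) for l in layers_a]
    d_b = [l / (n_layers - 1) for l in layers_b]
    return 1 - sum(abs(a - b) for a, b in zip(d_a, d_b)) / k

def fcas_z(layers_a, layers_b, n_layers, n_boot=2000, seed=0):
    """z-score of the observed FCAS against a bootstrap null of circuits
    with uniformly random layer placements."""
    observed = fcas(layers_a, layers_b, n_layers)
    rng = random.Random(seed)
    k = len(layers_a)
    null = [
        fcas([rng.randrange(n_layers) for _ in range(k)],
             [rng.randrange(n_layers) for _ in range(k)],
             n_layers)
        for _ in range(n_boot)
    ]
    return (observed - statistics.mean(null)) / statistics.stdev(null)
```

Sanity checks: identical circuits score 1.0, and a single head at layer 0 against one at the final layer scores 0.0.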
This metric might be the wrong one. I have no strong theoretical justification for why relative depth is the right axis to compare on. If someone has a better idea I'd actually like to hear it — this part of the project feels the least settled to me.
What I'm working on now
The package is glassbox-mech-interp on PyPI if you want to poke at it. It wraps TransformerLens and handles the three things above plus logit lens, head composition scores, and token-level attribution.

The direction I'm currently exploring: using circuit analysis as evidence for EU AI Act compliance documentation. The regulation requires "technical documentation" explaining how high-risk AI systems make decisions, with enforcement starting August 2026. I think mechanistic interpretability is the only rigorous way to satisfy this, but I'm not sure how to bridge the gap between "we found a 4-head circuit for IOI" and "here is documentation your compliance officer can sign off on." Still figuring that part out.
Questions I'm genuinely uncertain about: whether the hook mutation behavior is a TransformerLens-specific issue or a more general PyTorch pattern problem, whether the asymmetric forward/backward selection has been analyzed theoretically anywhere, and whether FCAS or something like it already exists under a different name.