Cross-posted to the Alignment Forum.
I've spent the past few months building Glassbox — a mechanistic interpretability toolkit that does automated circuit discovery in ~1.2 seconds on GPT-2 small. I want to share the technical choices, a novel metric I developed, and some honest caveats about what the tool does and doesn't do well.
pip install glassbox-mech-interp
What it does
Glassbox implements three things:
1. Attribution patching (Nanda, 2023) in exactly three passes (two forward, one backward). I tried to implement this correctly: the critical detail is that the gradient hook must return the modified tensor rather than mutating it in place, otherwise the computation graph severs and you get zero gradients everywhere. I've seen this bug in multiple reimplementations.
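To illustrate the return-versus-mutate distinction in isolation, here is a generic PyTorch sketch (not Glassbox's actual hook code) showing the correct pattern:

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2.0

# Correct pattern: the hook RETURNS a new tensor, and autograd threads the
# returned value through the rest of the backward pass. Mutating `grad`
# in place instead is exactly the failure mode described above: PyTorch's
# docs say a hook should not modify its argument.
y.register_hook(lambda grad: grad * 0.5)

y.sum().backward()
print(x.grad)  # tensor([1., 1., 1.]): the halved gradient flowed through
```

Here the upstream gradient on y is all ones, the hook halves it, and the chain rule through y = 2x doubles it back, so x.grad comes out as ones only because the hook's return value was actually used.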
2. Minimum Faithful Circuit (MFC) discovery via an asymmetric greedy algorithm:
Phase 1 (forward): add heads greedily by attribution score until Taylor-sufficiency hits 85%. Zero extra passes — uses the attribution scores directly.
Phase 2 (backward pruning): for each head (last-added first), trial its removal using exact comprehensiveness (corrupted activation patching). Prune if comprehensiveness stays above threshold. 2 passes per head tried.
Total cost: O(3 + 2p) where p = backward pruning steps, typically 2–4.
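A skeleton of the two-phase search might look like the following. The names attribution_scores, sufficiency, and comprehensiveness are hypothetical stand-ins for the real estimators; Glassbox's actual signatures will differ:

```python
def discover_circuit(heads, attribution_scores, sufficiency, comprehensiveness,
                     suff_target=0.85, comp_floor=0.25):
    # Phase 1 (forward): add heads in descending attribution order until the
    # Taylor-approximated sufficiency reaches the target. Uses cached
    # attribution scores, so this phase costs no extra passes.
    circuit = []
    for head in sorted(heads, key=lambda h: attribution_scores[h], reverse=True):
        circuit.append(head)
        if sufficiency(circuit) >= suff_target:
            break
    # Phase 2 (backward pruning): trial-remove heads, last-added first, and
    # keep the removal only if exact comprehensiveness (which costs 2 passes
    # per trial) stays above the floor.
    for head in reversed(list(circuit)):
        trial = [h for h in circuit if h != head]
        if trial and comprehensiveness(trial) >= comp_floor:
            circuit = trial
    return circuit
```

With p pruning trials in phase 2, this reproduces the O(3 + 2p) pass count: three passes for attribution, two per trial removal.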
For comparison, ACDC runs at O(2 × edges) which on GPT-2 small is ~43 seconds. Glassbox finishes in ~1.2 seconds — 37× faster. The tradeoff is head-level rather than edge-level granularity.
3. Functional Circuit Alignment Score (FCAS) — this is the novel thing.
FCAS: A metric for comparing circuits
The question that motivated FCAS: if I run Glassbox on an IOI prompt on GPT-2 small vs GPT-2 medium, are the circuits structurally similar? What about on different paraphrases of the same task?
Existing evaluations compare specific head identities (e.g., "does model B have a head at L9H9?") which breaks immediately when layer counts differ. FCAS compares relative depth:
d(l, h) = layer / (n_layers - 1) ∈ [0, 1]
FCAS(A, B) = 1 - (1/k) Σ |d_A^i - d_B^i| for top-k matched pairs
FCAS = 1.0 means the circuits appear at proportionally identical depths. FCAS = 0.0 means maximum depth misalignment.
The key addition is a null distribution. I bootstrap 1000 random circuit pairs of the same size and compute the null mean and std, giving a z-score. This tells you whether your observed alignment is meaningful or just an artifact of similar-depth heads being common in transformer circuits generally.
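A minimal sketch of the metric plus its bootstrap null, under two assumptions of mine: heads are matched by depth rank, and the null samples layers uniformly at random. Glassbox's real implementation may differ on both counts:

```python
import random

def depth(layer, n_layers):
    # Relative depth in [0, 1], as defined above
    return layer / (n_layers - 1)

def fcas(depths_a, depths_b):
    # Match heads by depth rank; score 1 minus the mean absolute depth gap
    a, b = sorted(depths_a), sorted(depths_b)
    k = min(len(a), len(b))
    return 1.0 - sum(abs(x - y) for x, y in zip(a[:k], b[:k])) / k

def fcas_zscore(depths_a, depths_b, n_layers_a, n_layers_b,
                n_boot=1000, seed=0):
    # Bootstrap a null of random same-size circuit pairs, then z-score
    # the observed alignment against that null
    rng = random.Random(seed)
    observed = fcas(depths_a, depths_b)
    null = []
    for _ in range(n_boot):
        ra = [depth(rng.randrange(n_layers_a), n_layers_a) for _ in depths_a]
        rb = [depth(rng.randrange(n_layers_b), n_layers_b) for _ in depths_b]
        null.append(fcas(ra, rb))
    mu = sum(null) / n_boot
    sd = (sum((v - mu) ** 2 for v in null) / n_boot) ** 0.5
    return observed, (observed - mu) / sd
```

A raw FCAS of, say, 0.9 is hard to interpret on its own; the z-score tells you how far above the random-circuit baseline that value sits.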
Honest limitations of FCAS:
Matches by rank, not functional role. A Name Mover at depth 0.82 and a completely different head at depth 0.82 in another model score as perfectly aligned on that pair.
Depth-only — doesn't capture head index within a layer.
Sensitive to the choice of k, the number of matched head pairs.
I'm publishing this as a screening metric, not a definitive equivalence test. The z-score against the null distribution helps calibrate how much to trust it.
A concrete finding
On the canonical IOI prompt ("When Mary and John went to the store, John gave a drink to"):
result = gb.analyze(
    "When Mary and John went to the store, John gave a drink to",
    correct=" Mary",
    incorrect=" John"
)
print(result["faithfulness"])
# sufficiency: 1.00, comprehensiveness: 0.47, f1: 0.64
# category: backup_mechanisms
The backup_mechanisms category (high sufficiency, low comprehensiveness) is the interesting case. The identified circuit is causally sufficient to predict " Mary" — but when those heads are ablated, the model's prediction degrades only partially, not to chance. This matches Wang et al.'s finding that there are redundant Name Mover heads that act as backups.
The suff_is_approx: True flag in the output is intentional. Attribution patching sufficiency is a first-order approximation. I wanted to make it impossible for users to treat this as an exact value without seeing that flag.
What I'd love feedback on
FCAS validity: is depth-matching a reasonable basis for circuit comparison, or is this fundamentally ungrounded? Alternatives I considered: mutual information between head activation patterns (expensive), head-type classification (requires labelling), cosine similarity between OV matrices (model-architecture-dependent).
The backward pruning threshold: I use target comprehensiveness ≥ 0.25 for pruning. This is a heuristic. Has anyone developed principled threshold selection for circuit pruning?
Backup mechanisms detection: the backup_mechanisms category currently flags Suff > 0.9 and Comp < 0.4 together. Are there cases where this combination has a different interpretation?
Extensions to non-IOI tasks: the _name_swap corruption works cleanly for IOI. For factual recall (e.g., "The capital of France is → Paris"), there's no meaningful name-swap, so we fall back to appending the distractor as corruption. This is a hack. Better suggestions welcome.
Why I'm building this
The honest answer is that I think mechanistic interpretability tooling is at the stage compiler tooling was in the 1980s: there are research implementations that work for the people who built them, but nothing you'd hand to a bank's ML team and say "use this to audit your credit model before August 2026 when the EU AI Act kicks in."
That's the gap Glassbox is trying to fill. Not by being more academically rigorous than ACDC, but by being accessible enough that a team that didn't write the paper can run a circuit audit in an afternoon.
If you've tried to use mechanistic interpretability tools for anything applied and hit walls, I'd genuinely like to know where.
Code: pip install glassbox-mech-interp
Paper: [coming this week to arXiv]
Repo: https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool
Ajay Pravin Mahale — independent researcher