d(l, h) = layer / (n_layers - 1) ∈ [0, 1]
FCAS(A, B) = 1 - (1/k) Σ_i |d_A^i - d_B^i|  over the top-k matched pairs

```python
result = gb.analyze(
    "When Mary and John went to the store, John gave a drink to",
    correct=" Mary",
    incorrect=" John",
)
print(result["faithfulness"])
# sufficiency: 1.00, comprehensiveness: 0.47, f1: 0.64
# category: backup_mechanisms
```
Cross-posted to the Alignment Forum.
I've spent the past few months building Glassbox — a mechanistic interpretability toolkit that does automated circuit discovery in ~1.2 seconds on GPT-2 small. I want to share the technical choices, a novel metric I developed, and some honest caveats about what the tool does and doesn't do well.
`pip install glassbox-mech-interp` | GitHub

What it does
Glassbox implements three things:
1. Attribution patching (Nanda, 2023) in exactly three passes: two forward and one backward. I tried to implement this correctly. The critical detail is that the gradient hook must return the modified tensor rather than mutating it in place; otherwise the computation graph severs and you get zero gradients everywhere. I've seen this bug in multiple reimplementations.
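To make the pitfall concrete, here is a minimal standalone PyTorch sketch (not Glassbox code) of the correct pattern, using a plain tensor hook:

```python
import torch

x = torch.randn(4, requires_grad=True)
y = (x * 3).sum()

def scale_grad(grad):
    # Correct: build and RETURN a new tensor; autograd substitutes it.
    return grad * 2.0
    # Wrong: mutating `grad` in place and returning nothing, e.g.
    #   grad.mul_(2.0)
    # PyTorch's docs say hooks should not modify their argument.

h = x.register_hook(scale_grad)
y.backward()
print(x.grad)  # tensor([6., 6., 6., 6.]): dy/dx = 3, doubled by the hook
h.remove()
```

The same rule applies to module backward hooks: return the replacement gradient, never edit the incoming one.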
2. Minimum Faithful Circuit (MFC) discovery via an asymmetric greedy algorithm: heads are added greedily, then a few backward pruning steps remove redundant ones.
Total cost: 3 + 2p passes, where p is the number of backward pruning steps (typically 2–4).
For comparison, ACDC needs on the order of 2 × |edges| passes, which on GPT-2 small takes ~43 seconds. Glassbox finishes in ~1.2 seconds, roughly 36× faster. The tradeoff is head-level rather than edge-level granularity.
3. Functional Circuit Alignment Score (FCAS) — this is the novel thing.
FCAS: A metric for comparing circuits
The question that motivated FCAS: if I run Glassbox on an IOI prompt on GPT-2 small vs GPT-2 medium, are the circuits structurally similar? What about on different paraphrases of the same task?
Existing evaluations compare specific head identities (e.g., "does model B have a head at L9H9?"), which breaks immediately when layer counts differ. FCAS instead compares relative depth: each head is assigned d = layer / (n_layers - 1) ∈ [0, 1], and matched pairs are scored by one minus the mean absolute depth difference.
FCAS = 1.0 means the circuits appear at proportionally identical depths; FCAS = 0.0 means maximum depth misalignment.
The key addition is a null distribution. I bootstrap 1000 random circuit pairs of the same size and compute the null mean and std, giving a z-score. This tells you whether your observed alignment is meaningful or just an artifact of similar-depth heads being common in transformer circuits generally.
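As a sketch, FCAS plus the bootstrap null can be computed like this (plain numpy; the function names are mine, not the package's, and matching pairs by depth rank is my assumption):

```python
import numpy as np

def fcas(depths_a, depths_b):
    """FCAS = 1 - mean |d_A - d_B| over matched pairs; depths lie in [0, 1]."""
    a = np.asarray(depths_a, dtype=float)
    b = np.asarray(depths_b, dtype=float)
    return 1.0 - np.abs(a - b).mean()

def fcas_zscore(depths_a, depths_b, n_layers_a, n_layers_b,
                n_boot=1000, seed=0):
    """Z-score the observed FCAS against random same-size circuit pairs."""
    rng = np.random.default_rng(seed)
    k = len(depths_a)
    null = np.empty(n_boot)
    for i in range(n_boot):
        # Random circuits: k heads drawn uniformly over each model's layers,
        # converted to relative depth and rank-matched by sorting.
        ra = rng.integers(0, n_layers_a, k) / (n_layers_a - 1)
        rb = rng.integers(0, n_layers_b, k) / (n_layers_b - 1)
        null[i] = fcas(np.sort(ra), np.sort(rb))
    return (fcas(depths_a, depths_b) - null.mean()) / null.std()
```

A high z-score means the observed depth alignment is unlikely under random same-size circuits, which is exactly the calibration the null distribution is meant to provide.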
Honest limitations of FCAS
I'm publishing this as a screening metric, not a definitive equivalence test. The z-score against the null distribution helps calibrate how much to trust it.
A concrete finding
On the canonical IOI prompt ("When Mary and John went to the store, John gave a drink to"), Glassbox reports sufficiency 1.00, comprehensiveness 0.47, F1 0.64, and category backup_mechanisms.
The backup_mechanisms category (high sufficiency, low comprehensiveness) is the interesting case. The identified circuit is causally sufficient to predict " Mary", but when those heads are ablated, the model's prediction degrades only partially, not to chance. This matches Wang et al.'s finding that redundant Name Mover heads act as backups.

The suff_is_approx: True flag in the output is intentional. Attribution-patching sufficiency is a first-order approximation, and I wanted to make it impossible for users to treat it as an exact value without seeing that flag.

What I'd love feedback on
- The backup_mechanisms category currently requires sufficiency > 0.9 and comprehensiveness < 0.4 together. Are there cases where this combination has a different interpretation?
- The _name_swap corruption works cleanly for IOI. For factual recall (e.g., "The capital of France is → Paris") there is no meaningful name swap, so we fall back to appending the distractor as the corruption. This is a hack; better suggestions welcome.

Why I'm building this
The honest answer is that I think mechanistic interpretability tooling is at the stage compiler tooling was in the 1980s: there are research implementations that work for the people who built them, but nothing you'd hand to a bank's ML team and say "use this to audit your credit model before August 2026 when the EU AI Act kicks in."
That's the gap Glassbox is trying to fill. Not by being more academically rigorous than ACDC, but by being accessible enough that a team that didn't write the paper can run a circuit audit in an afternoon.
If you've tried to use mechanistic interpretability tools for anything applied and hit walls, I'd genuinely like to know where.
Code: `pip install glassbox-mech-interp`
Paper: [coming this week to arXiv]
Repo: https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool

Ajay Pravin Mahale — independent researcher