RGRGRG

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

Many thanks to Michael Hanna and Joshua Batson for useful feedback and discussion. Kat Dearstyne and Kamal Maher conducted experiments during the SPAR Fall 2025 Cohort. TL;DR Cross-layer transcoders (CLTs) enable circuit tracing that can extract high-level mechanistic explanations for arbitrary prompts and are emerging as general-purpose infrastructure for mechanistic interpretability. Because these tools operate at a relatively low level, their outputs are often treated as reliable descriptions of what a model is doing, not just predictive approximations. We therefore ask: when are CLT-derived circuits faithful to the model’s true internal computation? In a Boolean toy model with known ground truth, we show a specific unfaithfulness mode: CLTs can rewrite deep multi-hop circuits into sums of shallow single-hop circuits, yielding explanations that match behavior while obscuring the actual computational pathway. Moreover, we find that widely used sparsity penalties can incentivize this rewrite, pushing CLTs toward unfaithful decompositions. We then provide preliminary evidence that similar discrepancies arise in real language models, where per-layer transcoders and cross-layer transcoders sometimes imply sharply different circuit-level interpretations for the same behavior. Our results clarify a limitation of CLT-based circuit tracing and motivate care in how sparsity and interpretability objectives are chosen. Introduction In this one-week research sprint, we explored whether circuits based on Cross-Layer Transcoders (CLTs) are faithful to ground-truth model computations. We demonstrate that CLTs learn features that skip over computation that occurs inside of models. To be concrete, if features in set A create features in set B, and then these same set B features create features in set C, CLT circuits can incorrectly show that set A features create set C features and that set B features do not create set C features, ultimately skipping B features altogether. T

40Feb 2

RGRGRG

Message

138

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

Feb 240

Alternative Models of Superposition

Zephaniah Roe (mentee) and Rick Goldstein (mentor) conducted these experiments during continued work following the SPAR Spring 2025 cohort. Disclaimer / Epistemic status: We spent roughly 30 hours on this post. We are not confident in these findings but we think they are interesting and worth sharing. We assume some...

Aug 11, 202520

Seeking Feedback on My Mechanistic Interpretability Research Agenda

Why this post I’ve been doing MI research full-time for about three months and since my current grant is ending soon and I recently received a Lightspeed rejection, now seems like a good time to take some time away from object-level work to reflect on direction and next steps. I...

Sep 12, 20235

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

1) Introduction In February, Stephen Casper posted two Mechanistic Interpretability challenges. The first of these challenges asks participants to uncover a secret labeling function from a trained CNN and was solved by Stefan Heimersheim and Marius Hobbhahn. The second of these challenges, which will be the focus of this post,...

Jul 28, 202326

Best Ways to Try to Get Funding for Alignment Research?

Hey Everyone! I recently left my FAANG job to split my time between doing Alignment Research (70%) and investigating start-up ideas (30%). If I decide to fully commit to Alignment Research, what is the best way to go about applying for and/or getting funding? (In a perfect world, this funding...

Apr 4, 202310

LESSWRONG
LW

LESSWRONG
LW

RGRGRG

RGRGRG

RGRGRG

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

Alternative Models of Superposition

Best Ways to Try to Get Funding for Alignment Research?

RGRGRG

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

Alternative Models of Superposition

Seeking Feedback on My Mechanistic Interpretability Research Agenda

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

Best Ways to Try to Get Funding for Alignment Research?

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

Alternative Models of Superposition

Best Ways to Try to Get Funding for Alignment Research?

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

Alternative Models of Superposition

Seeking Feedback on My Mechanistic Interpretability Research Agenda

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

Best Ways to Try to Get Funding for Alignment Research?