The CLTs themselves are available here on Hugging Face: https://huggingface.co/collections/bluelightai/qwen3-cross-layer-transcoders. They can be used with the circuit-tracer library.
Here is the promised Colab notebook for exploring SAE features with TDA. It works on the top-k GPT2-small SAEs by default, but should be pretty easily adaptable to most SAEs available in sae_lens. The graphs will look a little different from the ones shown in the post because they are constructed directly from the decoder weight vectors rather than from feature correlations across a corpus.
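For a sense of the difference, here's a minimal sketch of building a feature graph directly from decoder weight vectors, the way the notebook does (as opposed to correlating activations across a corpus). The shapes and the random stand-in decoder are hypothetical, and the similarity threshold is just an illustrative hyperparameter, not a value from the notebook:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for an SAE decoder matrix: (n_features, d_model).
# In practice this would come from a trained SAE, e.g. one loaded via sae_lens.
W_dec = rng.normal(size=(64, 32))

# Unit-normalize the decoder directions, then take pairwise cosine similarity.
W = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
cos = W @ W.T

# Keep an edge between two features when their decoder directions are
# sufficiently aligned; drop self-loops on the diagonal.
threshold = 0.5
adj = (cos > threshold) & ~np.eye(len(W), dtype=bool)
```

The resulting adjacency matrix is what a TDA / graph layout tool would then consume; a corpus-correlation graph would instead put `cos` together from feature co-activation statistics.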
One of the interesting things I found while putting this together is a large group of "previous token" features, which are mostly misinterpreted by the LLM-generated explanations.
One data point on the proposed mitigations: while training CLTs for Qwen 3, I experimented a little with using summed per-layer L2 norms of the decoder vectors in the sparsity penalty, and didn't see much difference in the spread of per-layer L0. It did produce a wider range of decoder norms, and hence of effective feature activation magnitudes, which made calibrating the coefficient inside the tanh harder. It may be that another loss term directly penalizing decoder norms would help here.
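To make the comparison concrete, here's a sketch of the two sparsity-penalty variants, assuming a tanh-shaped penalty of the form `tanh(c * norm * activation)` summed over features; the shapes, the random stand-in decoder, and the value of `c` are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_out_layers, d_model = 16, 4, 8
# Hypothetical CLT decoder: one vector per (feature, output layer).
W_dec = rng.normal(size=(n_features, n_out_layers, d_model))
acts = np.abs(rng.normal(size=(n_features,)))  # stand-in feature activations
c = 4.0  # the coefficient inside the tanh that needs calibrating

# L2 norm of each feature's decoder vector at each output layer.
per_layer = np.linalg.norm(W_dec, axis=-1)        # (n_features, n_out_layers)

# Baseline: norm of the full concatenated decoder vector per feature.
concat_norm = np.linalg.norm(per_layer, axis=-1)
penalty_concat = np.tanh(c * concat_norm * acts).sum()

# Variant from the experiment: sum the per-layer L2 norms instead.
summed_norm = per_layer.sum(axis=-1)
penalty_summed = np.tanh(c * summed_norm * acts).sum()
```

Since a sum of non-negative norms is at least as large as the norm of their concatenation, the summed variant shifts more features toward the saturated region of the tanh, which is one way the calibration of `c` gets harder.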