Transcoders enable fine-grained interpretable circuit analysis for language models
by Jacob Dunefsky, Philippe Chlenski, and Neel Nanda
Summary: We present a method for performing circuit analysis on language models using "transcoders," an occasionally discussed variant of sparse autoencoders (SAEs) that provides an interpretable approximation to MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the...
Apr 30, 2024