Philippe Chlenski — LessWrong

CS PhD student

Transcoders enable fine-grained interpretable circuit analysis for language models

by Jacob Dunefsky, Philippe Chlenski, and Neel Nanda

Summary: We present a method for performing circuit analysis on language models using "transcoders," an occasionally-discussed variant of SAEs that provide an interpretable approximation to MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the...

Apr 30, 2024 • 76

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

by Jacob Dunefsky, Philippe Chlenski, Senthooran Rajamanoharan, and Neel Nanda

Epistemic status: preliminary/exploratory. Work performed as a part of Neel Nanda's MATS 5.0 (Winter 2023-2024) Research Sprint. TL;DR: We develop a method for understanding how sparse autoencoder features in transformer models are computed from earlier components, by taking a local linear approximation to MLP sublayers. We study both how the...

Jan 14, 2024 • 24