Nick Jiang — LessWrong

Towards data-centric interpretability with sparse autoencoders

Nick and Lily are co-first authors on this project. Lewis and Neel jointly supervised this project. Check out our updated paper here: https://arxiv.org/abs/2512.10092.

TL;DR

* We use sparse autoencoders (SAEs) for four textual data analysis tasks: data diffing, finding correlations, targeted clustering, and retrieval.
* We care especially about gaining insights...

Aug 15, 2025

On the Practical Applications of Interpretability

In late May, Anthropic released a paper on sparse autoencoders that could interpret the hidden representations of large language models. A great deal of excitement followed in the interpretability community and beyond, with people hailing the prospect of finally breaking down how LLMs think and process information. The initial excitement...

Oct 15, 2024

A gentle introduction to sparse autoencoders

Sparse autoencoders (SAEs) are the current hot topic 🔥 in the interpretability world. In late May, Anthropic released a paper that shows how to use sparse autoencoders to effectively break down the internal reasoning of Claude 3 (Anthropic’s LLM). Shortly after, OpenAI published a paper successfully applying a similar...

Sep 2, 2024