
LESSWRONG


SrGonao — LessWrong


112

2

2y


Open Source Automated Interpretability for Sparse Autoencoder Features

by kh4dien, SrGonao, jacob_drori, and Nora Belrose

Background: Sparse autoencoders recover a diversity of interpretable, monosemantic features, but present an intractable problem of scale to human labelers. We investigate different techniques for generating and scoring text explanations of SAE features. Key Findings: Open source models generate and evaluate text explanations of SAE features...

Jul 30, 2024•67

Ophiology (or, how the Mamba architecture works)

by Danielle Ensign, SrGonao, and Adrià Garriga-Alonso

The following post was made as part of Danielle's MATS work on circuit-based mechanistic interpretability for Mamba, mentored by Adrià Garriga-Alonso. It's the first in a sequence of posts about finding an IOI circuit in Mamba and applying ACDC to Mamba. This introductory post was also made in collaboration with Gonçalo...

Apr 9, 2024•67