SrGonao — LessWrong

Deceptive agents can collude to hide dangerous features in SAEs

Hi! I know this post is now almost 5 months old, but I'd like to ask some clarifying questions and point out a few things about your methodology that I don't completely understand or agree with.

How do you source the sentences used for the scoring method? Are they all from top activations? This is not explicitly mentioned in the methodology section, although in the footnote you do say you have 3 high activations and 3 low activations. Am I right to understand that there are no cases with no activations?

Are the sentences shown individually or in batche...

Evaluating Sparse Autoencoders with Board Game Models

SrGonao, 2y

I don't know much about chess. Could the feature 172 that you are highlighting be related to some kind of chess opening? The distribution of black pawns could reflect different states of the opening, and the positions of the black bishop and white knight could also correspond to different stages of that opening.

Open Source Automated Interpretability for Sparse Autoencoder Features

kh4dien, SrGonao, jacob_drori, Nora Belrose

This is a linkpost for https://blog.eleuther.ai/autointerp/

Background

Sparse autoencoders recover a diversity of interpretable, monosemantic features, but present an intractable problem of scale to human labelers. We investigate different techniques for generating and scoring text explanations of SAE features.

Key Findings

Open source models generate and evaluate text explanations of SAE features reasonably well, albeit somewhat worse than closed models like Claude 3.5 Sonnet.
Explanations found by LLMs are similar to explanations found by humans.
Automatically interpreting 1.5M features of GPT-2 with the current pipeline would cost $1300 in API calls to Llama 3.1 or $8500 with Claude 3.5 Sonnet. Prior methods cost ~$200k with Claude.
Code can be found at https://github.com/EleutherAI/sae-auto-interp.
We built a small dashboard to explore explanations and their scores: https://cadentj.github.io/demo/
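
To put the cost figures above on a per-feature basis, here is a minimal arithmetic sketch. The totals and the 1.5M feature count are the ones quoted in the findings; the dictionary labels and the helper name `cost_per_feature` are illustrative only, and the "prior methods" figure is approximate in the source.

```python
# Totals quoted in the key findings: cost to interpret 1.5M GPT-2 SAE features.
NUM_FEATURES = 1_500_000
TOTAL_COST_USD = {
    "llama-3.1": 1_300,        # current pipeline
    "claude-3.5-sonnet": 8_500,  # current pipeline
    "prior-claude": 200_000,   # prior methods, approximate ("~$200k")
}


def cost_per_feature(total_usd: float, num_features: int = NUM_FEATURES) -> float:
    """Average API cost in USD for explaining and scoring a single feature."""
    return total_usd / num_features


per_feature = {name: cost_per_feature(total) for name, total in TOTAL_COST_USD.items()}

# How many times cheaper each current-pipeline option is than the prior methods.
savings_factor = {
    name: TOTAL_COST_USD["prior-claude"] / total
    for name, total in TOTAL_COST_USD.items()
    if name != "prior-claude"
}

for name, usd in per_feature.items():
    print(f"{name}: ${usd:.5f} per feature")
for name, factor in savings_factor.items():
    print(f"{name}: about {factor:.0f}x cheaper than prior methods")
```

Under these numbers, the Llama 3.1 pipeline works out to well under a tenth of a cent per feature, compared with roughly 13 cents per feature for the prior methods.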

Generating Explanations

Sparse autoencoders decompose activations into a sum of sparse feature directions. We leverage language models to...
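
As a rough illustration of "a sum of sparse feature directions", the sketch below uses a generic top-k sparse autoencoder with random, untrained weights. The sizes, the top-k sparsity rule, and the names `encode` and `decode` are assumptions made for illustration; they are not taken from the post or from the sae-auto-interp code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4  # hypothetical sizes

# Random (untrained) parameters, only to show the shape of the computation.
W_enc = rng.normal(size=(n_features, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions
b_dec = np.zeros(d_model)


def encode(x: np.ndarray, k: int = k) -> np.ndarray:
    """ReLU pre-activations, keeping at most the k largest (top-k sparsity)."""
    pre = np.maximum(W_enc @ (x - b_dec) + b_enc, 0.0)
    acts = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]
    acts[top] = pre[top]
    return acts


def decode(acts: np.ndarray) -> np.ndarray:
    """Reconstruction = bias + sum over active features of activation * direction."""
    recon = b_dec.copy()
    for i in np.flatnonzero(acts):
        recon += acts[i] * W_dec[i]
    return recon


x = rng.normal(size=d_model)  # stand-in for a model activation vector
acts = encode(x)
recon = decode(acts)
print("active features:", np.flatnonzero(acts))
```

Each nonzero entry of `acts` is one feature's activation on this input; an automated interpretability pipeline would collect the text contexts where a given feature fires and ask a language model to explain them.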


Ophiology (or, how the Mamba architecture works)

Danielle Ensign, SrGonao, Adrià Garriga-Alonso

The following post was made as part of Danielle's MATS work on doing circuit-based mech interp on Mamba, mentored by Adrià Garriga-Alonso. It's the first in a sequence of posts about finding an IOI circuit in Mamba/applying ACDC to Mamba.

This introductory post was also made in collaboration with Gonçalo Paulo.

A new challenger arrives!

Why Mamba?

Promising Scaling

Mamba [1] is a type of recurrent neural network based on state-space models, and is being proposed as an alternative architecture to transformers. It is the result of years of capability research [2] [3] [4]...
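
To make "recurrent neural network based on state-space models" concrete, here is a minimal sketch of a plain discrete linear state-space recurrence, h_t = A h_{t-1} + B x_t and y_t = C h_t. This is the generic building block only, not Mamba itself: Mamba additionally makes some of these parameters depend on the input, which this sketch omits. The function name `ssm_scan` and the toy matrices are assumptions for illustration.

```python
import numpy as np


def ssm_scan(A: np.ndarray, B: np.ndarray, C: np.ndarray, xs: np.ndarray) -> np.ndarray:
    """Run a linear state-space recurrence over a 1-D input sequence.

    A: (n, n) state transition, B: (n,) input projection, C: (n,) readout.
    """
    h = np.zeros(A.shape[0])  # hidden state carried across time steps
    ys = []
    for x_t in xs:
        h = A @ h + B * x_t   # state update
        ys.append(C @ h)      # scalar output at this step
    return np.array(ys)


# Toy example: a 2-dimensional state that decays by half each step.
A = 0.5 * np.eye(2)
B = np.array([1.0, 1.0])
C = np.array([1.0, 0.0])
ys = ssm_scan(A, B, C, np.array([1.0, 0.0, 0.0]))
print(ys)
```

Like any recurrent network, the model carries a fixed-size state `h` from step to step, rather than attending over all previous tokens as a transformer does.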
