Henk Tillman — LessWrong

Investigating task-specific prompts and sparse autoencoders for activation monitoring

We’d like to share the most recent paper from the OpenAI Interpretability Team. The question of how activation probes with sparse autoencoders compare with baseline activation probes has gotten a lot of attention recently. With this work, we took a step back and asked: how do both these activation probing...

Apr 30, 2025•23

Transformer Debugger

Transformer Debugger (TDB) is a tool developed by OpenAI's Superalignment team with the goal of supporting investigations into circuits underlying specific behaviors of small language models. The tool combines automated interpretability techniques with sparse autoencoders. TDB enables rapid exploration before needing to write code, with the ability to intervene in...

Mar 12, 2024•26