Alex Makelov — LessWrong

SAEs Discover Meaningful Features in the IOI Task

TLDR: We recently wrote a paper proposing several evaluations of SAEs against "ground-truth" features computed with supervision for a given task (in our case, IOI [1]). However, we didn't optimize the SAEs much for performance in our tests. After putting the paper on arXiv, Alex carried out a more exhaustive...

Jun 5, 2024 • 15

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort. We would like to thank Atticus Geiger for his valuable feedback and in-depth discussions throughout this project. tl;dr: Activation patching is a common method for finding model components (attention heads, MLP layers, …) relevant to...

Aug 29, 2023 • 77