The Engineer’s Interpretability Sequence

Feb 09, 2023 by scasper

Interpretability research is popular, and interpretability tools play a role in almost every agenda for making AI safe. However, for all the interpretability work that exists, there is a significant gap between the research and its engineering applications. If one of our main goals for interpretability research is to help us align highly intelligent AI systems in high-stakes settings, shouldn’t we be seeing tools that are more helpful on real-world problems?

This sequence argues for taking an engineering approach to interpretability research. Through this lens, it analyzes existing work and proposes directions for moving forward.

The Engineer’s Interpretability Sequence (EIS) I: Intro
EIS II: What is “Interpretability”?
EIS III: Broad Critiques of Interpretability Research
EIS IV: A Spotlight on Feature Attribution/Saliency
EIS V: Blind Spots In AI Safety Interpretability Research
EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety
EIS VII: A Challenge for Mechanists
EIS VIII: An Engineer’s Understanding of Deceptive Alignment
EIS IX: Interpretability and Adversaries
EIS X: Continual Learning, Modularity, Compression, and Biological Brains
EIS XI: Moving Forward
EIS XII: Summary
EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024
EIS XIV: Is mechanistic interpretability about to be practically useful?
EIS XV: A New Proof of Concept for Useful Interpretability