GDM Interp Progress Updates — LessWrong

Apr 19, 2024 by Neel Nanda

[Summary] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma

2y

0

[Full Post] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma

2y

10

The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research

Arthur Conmy, Neel Nanda

1y

1

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

lewis smith, Senthooran Rajamanoharan, Arthur Conmy, CallumMcDougall, Tom Lieberum, János Kramár, Rohin Shah, Neel Nanda

1y

15

A Pragmatic Vision for Interpretability

Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár, lewis smith

8mo

39

How Can Interpretability Researchers Help AGI Go Well?

Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár, lewis smith

8mo

1

Models May Behave Worse When Eval Aware

Senthooran Rajamanoharan, Neel Nanda

2mo

8

Building and evaluating model diffing agents

bilalchughtai, Josh Engels, Neel Nanda

2mo

2

SFT Drives Gemini’s Safety Properties

Josh Engels, Arthur Conmy, bilalchughtai, Neel Nanda

2mo

4

Why Do Naive SFT Filters For Safety Properties Fail?

Josh Engels, Neel Nanda

1mo

7

Synthetic document finetuning for instilling positive traits

CallumMcDougall, Arthur Conmy, Neel Nanda

1mo

1