Josh Engels's Shortform

by Josh Engels
30th Apr 2025
This is a special post for quick takes by Josh Engels. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Josh Engels (6mo)

I really liked @Sam Marks's recent post on downstream applications as validation for interp techniques, and I've been feeling similarly after the (in my opinion) somewhat disappointing downstream performance of SAEs.

Motivated by this, I've written up about 50 weird language model results I found in the literature. I expect some of them to be familiar to most here (e.g. alignment faking, reward hacking) and some to be a bit more obscure (e.g. input space connectivity, fork tokens). 

If our current interp techniques can help us understand these phenomena better, that's great! Otherwise, I hope that seeing where our current techniques fail might help us develop better ones.

I'm also interested in taking a wide view of what counts as interp. When trying to understand some weird model behavior, if mech interp techniques aren't as useful as linear probing, or even careful black box experiments, that seems important to know!

Here's the doc: https://docs.google.com/spreadsheets/d/1yFAawnO9z0DtnRJDhRzDqJRNkCsIK_N3_pr_yCUGouI/edit?gid=0#gid=0

Thanks to @jake_mendel, @Senthooran Rajamanoharan, and @Neel Nanda for the conversations that convinced me to write this up.

Neel Nanda (6mo)

Really helpful work, thanks a lot for doing it

jake_mendel (6mo)

Very happy you did this!

Sheikh Abdur Raheem Ali (6mo)

Thanks for doing this. I found it really helpful.
