I think this is important to define anyway! (and likely pretty obvious). This would create a lot more friction for someone to take on such a role though, or move out
But only a small fraction work on evaluations, so the increased cost is much smaller than you make out
Cool work! This is the outcome I expected, but I'm glad someone actually went and did it
Yeah, if I made an introduction it would ruin the spirit of it!
I don't see important differences between that and CE loss delta in the context Lucius is describing
This seems true to me, though finding the right scaling curve for models is typically quite hard, so the conversion to effective compute is difficult. I typically use CE loss change, not loss recovered. I think we just don't know how to evaluate SAE quality.
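For concreteness, here is a minimal sketch of the two metrics as they are commonly defined. The function names and example numbers are made up for illustration, and the choice of ablation baseline (e.g. zero-ablating the activation) is an assumption that changes the loss recovered number a lot.

```python
def ce_loss_delta(clean_loss: float, spliced_loss: float) -> float:
    """Increase in cross-entropy loss when the SAE reconstruction
    is spliced in place of the original activations."""
    return spliced_loss - clean_loss


def loss_recovered(clean_loss: float, spliced_loss: float, ablated_loss: float) -> float:
    """Fraction of the gap between the ablated and clean loss that the SAE
    recovers. 1.0 means no degradation, 0.0 means no better than ablation.
    ablated_loss is the loss under some baseline ablation (assumed here
    to be zero ablation)."""
    gap = ablated_loss - clean_loss
    if gap == 0:
        raise ValueError("ablated and clean loss are equal; metric is undefined")
    return (ablated_loss - spliced_loss) / gap


# Hypothetical numbers, purely illustrative (not from any real model)
clean, spliced, ablated = 3.0, 3.2, 8.0
delta = ce_loss_delta(clean, spliced)
recovered = loss_recovered(clean, spliced, ablated)
print(f"CE loss delta: {delta:.3f}, loss recovered: {recovered:.3f}")
```

With these made-up numbers, a small-looking gap in loss recovered corresponds to a CE loss delta of 0.2 nats, which is the quantity you would then have to map onto a scaling curve to get effective compute.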
My personal guess is that SAEs can be a useful interpretability tool despite making a big difference in effective compute, and we should think more in terms of how useful they are for downstream tasks. But I agree this is a real phenomenon that is easy to overlook, and is bad.
These are LLM-generated labels; there are no "real" labels (because they're expensive!). Especially in our demo, Neuronpedia made them with GPT-3.5, which is kinda dumb.
I mostly think they're much better than nothing, but shouldn't be trusted, and I'm glad our demo makes this apparent to people! I'm excited about work to improve autointerp, though unfortunately the easiest way is to use a better model, which gets expensive
I dislike clickbait when it's misleading, or takes a really long time to get to the point (esp if it's then underwhelming). I was fine with this post on that front.
Cold emailing Youtubers offering to chat about mechanistic interpretability turns out to be a way, way more effective strategy than I predicted! I'm super excited about that video (and it came out so well!). The video
Yeah, fair point, disagreement retracted