Very useful list, Neel!! Thanks for mentioning AxBench, but unfortunately we don't own the domain you linked to 😅 The actual link is https://github.com/stanfordnlp/axbench
with a bunch of reflexes to eg stop and say “that doesn’t sound right” or “I think I’ve gone wrong, let’s backtrack and try another path”
Shannon Sands says he's found a backtracking vector in R1:
https://x.com/chrisbarber/status/1885047105741611507
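For anyone who wants to poke at this themselves: the thread doesn't spell out the method, but the standard recipe for a vector like this is a difference of means between contrastive sets of activations, then adding that vector back into the residual stream during generation. Everything below - the layer, the scale, and the tiny contrast sets - is a placeholder sketch of that recipe, not Shannon's actual setup.

```python
# Placeholder sketch (NOT the setup from the linked thread): difference-of-means
# "backtracking" steering vector, applied to a small R1 distill.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # small stand-in for full R1
LAYER = 12   # arbitrary middle layer; in practice you'd sweep layers
SCALE = 4.0  # arbitrary steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def mean_residual(texts):
    """Mean residual-stream activation at the output of block LAYER, last token."""
    acts = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER + 1][0, -1])  # index 0 is the embeddings
    return torch.stack(acts).mean(0)

# Tiny illustrative contrast sets - a real attempt would use many CoT snippets.
backtracking = ["Wait, that doesn't seem right. Let me go back and redo that step."]
baseline = ["So the answer is 42, and the reasoning above all checks out."]
steering_vector = mean_residual(backtracking) - mean_residual(baseline)

def hook(module, inputs, output):
    # Add the vector into the residual stream at block LAYER during generation.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(hook)
ids = tok("Question: what is 17 * 24?\n<think>\n", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=120)[0]))
handle.remove()
```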
- Improving our current techniques for using LLMs to interpret SAE latents
As far as you're aware, is there any autointerp work that's based on actively steering (boosting/suppressing) the latent to be labeled and generating completions, rather than searching a dataset for activating examples?
Hmm, there is a related thing called "intervention scoring" (https://arxiv.org/abs/2410.13928), but that appears to only score the descriptions produced by the traditional method, not use interventions to generate the descriptions in the first place.
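To make the distinction concrete, here's a rough sketch of what the generation-based version could look like: boost a latent via its decoder direction, sample completions with and without the intervention, and hand both sets to an explainer model. The model, layer, latent index, scale, and the random stand-in decoder below are all assumptions - a real attempt would load a trained SAE (e.g. via sae_lens) instead.

```python
# Placeholder sketch of steering-based autointerp: steer one SAE latent up,
# compare steered vs. unsteered completions, then (final step, not shown) ask an
# explainer LLM what concept the intervention pushed the completions towards.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # placeholder: whichever model your SAE was trained on
LAYER = 6        # layer the SAE reads from (assumption)
LATENT = 1234    # latent to label (placeholder index)
SCALE = 10.0     # steering strength; needs tuning

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

# Stand-in decoder weights, one row per latent; a real SAE's W_dec goes here.
W_dec = torch.randn(16384, model.config.hidden_size)
direction = W_dec[LATENT] / W_dec[LATENT].norm()

def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

def sample(prompt, steered, n=3):
    handle = model.transformer.h[LAYER].register_forward_hook(steer_hook) if steered else None
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                             num_return_sequences=n, pad_token_id=tok.eos_token_id)
    if handle is not None:
        handle.remove()
    return [tok.decode(o[ids["input_ids"].shape[1]:]) for o in out]

prompt = "The weather today is"
baseline, boosted = sample(prompt, steered=False), sample(prompt, steered=True)
print(baseline, boosted, sep="\n")
```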
Huh, seems to be working for me. What do you see when you click on it?
I've just opened summer MATS applications (where I'll supervise people to write mech interp papers). I'd love to get applications from any readers who are interested! Apply here; applications are due Feb 28.
As part of this, I wrote up a list of research areas I'm currently excited about, with thoughts on promising directions within each, which I thought might be of wider interest, so I've copied it in below:
Understanding thinking models
Eg o1, r1, Gemini Flash Thinking, etc - ie models that produce a really long chain of thought when reasoning through complex problems, and seem to be much more capable as a result. These seem like a big deal, and we understand so little about them! And now that we have small thinking models like the r1-distilled Qwen 1.5B, they seem quite tractable to study (though larger distilled versions of r1 will be better; I doubt you need full r1, though).
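As a sense of how tractable: here's a minimal sketch of loading one of the small distilled models and pulling out a long chain of thought to study - the prompt and sampling settings are arbitrary.

```python
# Minimal sketch: load a small distilled thinking model and extract its chain of
# thought. Prompt and generation settings are arbitrary choices.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

messages = [{"role": "user", "content": "Is 1007 prime? Think it through carefully."}]
ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(ids, max_new_tokens=512, do_sample=True, temperature=0.6)

# The distilled models wrap their reasoning in <think> ... </think>, so the chain
# of thought is easy to slice out for analysis.
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
```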
Sparse Autoencoders
In previous rounds I was predominantly interested in Sparse Autoencoder projects, but I’m comparatively less excited about SAEs now - I still think they’re cool, and am happy to get SAE applications/supervise SAE projects, but think they’re unlikely to be a silver bullet and expect to diversify my projects a bit more (I’ll hopefully write more on my overall takes soon).
Within SAEs, I’m most excited about:
- Feature absorption
- Whether SAEs learn the "right" concepts
- Whether our interpretations of SAE latents (aka features)[1] are correct
Model diffing
What happens to a model during finetuning? If we have both the original and tuned model, can we somehow take the “diff” between the two to just interpret what changed during finetuning?
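As a very crude starting point (nothing like a full answer), here's a sketch of the two most obvious diffs: which weights moved most, and where the residual streams of the two models diverge on the same prompt. The model pair below is just a placeholder example of a base model and a finetune of it.

```python
# Crude first pass at a model diff, assuming the base and finetuned models share
# an architecture. The hard interpretability work is explaining *why* things moved.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-1.5B"            # placeholder base model
TUNED = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder finetune of that base

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED).eval()

# 1) Weight-space diff: which parameter matrices moved the most, relative to their size?
with torch.no_grad():
    tuned_params = dict(tuned.named_parameters())
    rel_change = {name: ((tuned_params[name] - p).norm() / p.norm()).item()
                  for name, p in base.named_parameters()}
for name, rel in sorted(rel_change.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{rel:.4f}  {name}")

# 2) Activation-space diff: where do the two residual streams diverge on the same prompt?
ids = tok("The assistant should refuse harmful requests.", return_tensors="pt")
with torch.no_grad():
    h_base = base(**ids, output_hidden_states=True).hidden_states
    h_tuned = tuned(**ids, output_hidden_states=True).hidden_states
for layer, (a, b) in enumerate(zip(h_base, h_tuned)):
    print(f"layer {layer}: mean residual L2 diff {(a - b).norm(dim=-1).mean().item():.3f}")
```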
Understanding sophisticated/safety-relevant behaviour
LLMs are getting good enough that they're starting to directly demonstrate some alignment-relevant behaviours. Most interpretability work tries to advance the field in general by studying arbitrary, often toy, problems, but I'd be very excited to study these phenomena directly!
Being useful
Interpretability is often pretty abstract, pursuing lofty blue-skies goals, and it's hard to tell if your work is total BS or not. I'm excited about projects that take a real task, one that can be defined without ever referencing interpretability, and try to beat non-interp baselines in a fair(ish) fight - if you can do this, it's strong evidence you've learned *something* real.
Investigate fundamental assumptions
There are a lot of assumptions behind common mechanistic interpretability work, both scientific assumptions and theory-of-change assumptions, that in my opinion have insufficient evidence. I'd be keen to gather evidence for and against!
I favour the term latent over feature, because feature also refers to the subtly but importantly different concept of “the interpretable concept”, which an SAE “feature” imperfectly corresponds to, and it’s very confusing for it to mean both. ↩︎