Neel Nanda

Sequences

GDM Mech Interp Progress Updates
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
Mechanistic Interpretability Puzzles
Interpreting Othello-GPT
200 Concrete Open Problems in Mechanistic Interpretability
My Overview of the AI Alignment Landscape

Comments

Fair point. I've been procrastinating on putting out an updated version (and don't have anything else I back enough to recommend in its place; I haven't read this post closely enough yet), but adding that note to the top seems reasonable.

Thanks for writing this post! For the avoidance of confusion, my MATS stream has a very different admissions process, which is heavily based on a work task, doesn't have interviews, and weights quite different things. See more details here: https://tinyurl.com/neel-mats-app

Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp.

Can confirm, that list is SO out of date and does not represent the current frontiers. Zero offence taken. Thanks for publishing this list!

Imo they're just completely different techniques, which aren't really comparable. Activation patching is about understanding the difference between two activations by patching one in to replace the other and seeing what happens to the output. SAEs are a technique for decomposing an activation into interpretable pieces.
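As a toy illustration of the distinction: activation patching splices one run's intermediate activation into another run and measures how the output moves. Here is a minimal numpy sketch, where the tiny random "model", its weights, and all names are invented purely for illustration (not any real implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer "model" with random weights (purely illustrative, not a real LLM).
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 1))

def layer1(x):
    """First layer: input -> intermediate activation."""
    return np.tanh(x @ W1)

def layer2(act):
    """Second layer: intermediate activation -> scalar 'logit'."""
    return (act @ W2).item()

x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)

# Activation patching: run on the corrupted input, but splice in the clean
# run's intermediate activation, and see how much the output moves.
clean_act = layer1(x_clean)
baseline_logit = layer2(layer1(x_corrupt))  # unpatched corrupted run
patched_logit = layer2(clean_act)           # corrupted run with clean activation patched in
patching_effect = patched_logit - baseline_logit
```

An SAE, by contrast, would take an activation like `clean_act` and rewrite it as a sparse combination of learned dictionary features, which is a decomposition rather than a causal intervention.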

Interesting! You might be interested in a post from my team on inference-time optimization.

It's not clear to me what the right call here is though, because you want f to be something the model could extract. The encoder being so simple is in some ways a feature, not a bug - I wouldn't want it to be eg a deep model, because the LLM can't easily extract that!

I'm pleasantly surprised by how short the Google DeepMind section is. How much do you think readers should read into that, vs eg "you're in the Bay and hear more about Bay Area drama" or "you didn't try very hard for GDM"?

I don't quite understand the question. I've heard various bits of gossip, both as an employee and now. I wouldn't say I'm confident in my understanding of any of it. I was somewhat sad about Jack and Dario's public comments about thinking it's too early to regulate (if I understood them correctly), which I also found surprising as I thought they had fairly short timelines, but policy is not at all my area of expertise so I am not confident in this take.

I think it's totally plausible Anthropic has net negative impact, but the same is true for almost any significant actor in a complex situation. I agree that policy is one such way that their impact could be negative, though I'd generally bet Anthropic will push more for policies I personally support than any other lab, even if they may not push as much as I want them to.
