Neel Nanda




Is research published elsewhere but cross posted here eligible? Eg I think that Toy Models of Superposition was one of the best papers of last year, and it was cross posted to LessWrong and came out of the overall alignment space, but isn't exactly a LessWrong post per se.

(Notably, my grokking work and causal scrubbing were mech interp research that WAS published on LessWrong first and foremost.)

For what it's worth, as a MATS mentor, I gave a bunch of 7s and 8s to people I'm excited about, and felt bad giving people 9s or 10s unless it was super obviously justified.

This is a discussion that would need to be its own post, but I think superposition is basically not real and a confused concept.

I'd be curious to hear more about this - IMO we're talking past each other given that we disagree on this point! Like, in my opinion, the reason low rank approximations work at all is because of superposition.

For example, if an SAE gives us 16x as many dimensions as the original activations, and we find that half of those are interpretable, to me this seems like clear evidence of superposition (8x as many interpretable directions!). How would you interpret that phenomenon?
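To make the arithmetic concrete, here's a minimal numpy sketch of an SAE forward pass with a 16x expansion factor. The weights are random stand-ins for a trained SAE, and the names (`W_enc`, `W_dec`, etc.) are my own choices, not from any particular codebase; the point is just the shapes: 16x as many latent dimensions as activation dimensions, with a ReLU (plus, in training, a sparsity penalty) keeping most latents at zero on any given input.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, expansion = 64, 16
d_sae = d_model * expansion  # 16x as many dimensions as the activations

# Randomly initialised weights stand in for a trained SAE.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)

def sae_forward(x):
    # ReLU latents; in a trained SAE an L1 (or similar) penalty makes
    # only a few of the 1024 latents non-zero per input.
    f = np.maximum(x @ W_enc + b_enc, 0.0)
    x_hat = f @ W_dec  # reconstruction of the original activation
    return f, x_hat

x = rng.normal(size=d_model)
f, x_hat = sae_forward(x)
```

If half of those 1024 latent directions turn out to be interpretable, that's 512 interpretable directions recovered from a 64-dimensional activation space, which is the "8x" claim above.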

My understanding was that John wanted to only have a few variables mattering on a given input, which SAEs give you. The causal graph is large in general, but IMO that's just an unavoidable property of models and superposition.

I'm confused by why you don't consider "only a few neurons being non-zero" to be a "low dimensional summary of the relevant information in the layer"

Thanks for the catch, I deleted "Note that the hidden dimen". Transformers do blow up the hidden dimension, but that's not very relevant here - they have many more neurons than residual stream dimensions, and they have many more features than neurons (as shown in the recent Anthropic paper)

Seems clearly true: the Fourier Multiplication Algorithm for modular addition is not the easiest program for me to understand for performing modular addition!
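As a toy illustration of why the Fourier approach is correct but unintuitive, here's a numpy sketch that computes (a + b) mod p by scoring every candidate answer c with a sum of cosines over frequencies. This is a simplification, not the learned circuit itself (the trained model uses only a handful of key frequencies, and `fourier_mod_add` is my own hypothetical name), but it shows the flavour of the algorithm: the cosine sum peaks exactly when c ≡ a + b (mod p).

```python
import numpy as np

def fourier_mod_add(a: int, b: int, p: int) -> int:
    # Score each candidate answer c in 0..p-1 by summing
    # cos(2*pi*k*(a + b - c) / p) over frequencies k = 1..p-1.
    # When c == (a + b) mod p every cosine is 1 (sum = p - 1);
    # for any other c the sum collapses to -1. Argmax recovers the answer.
    cs = np.arange(p)
    ks = np.arange(1, p)
    logits = np.cos(2 * np.pi * np.outer(a + b - cs, ks) / p).sum(axis=1)
    return int(np.argmax(logits))
```

Compare this with the obvious program `(a + b) % p`: both are correct, but only one is easy for a human to read off.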

Thanks for writing this, this is a great post and I broadly agree with most of it!

If you get rejected without being invited to an interview, this is unfortunate but still valuable feedback. It basically means “You clearly aren’t there yet”. So you should probably build more skills for 6 months or so before applying again.

This feels false to me. I've done a lot of CV (aka resume) screening, and it's a super noisy process; it's easy to be overly credentialist and favour people with legible signalling. There's probably also a fair amount of noise in how well you write your CV to make the crucial information prominent (relevant work experience, relevant publications, degrees, relevant projects, anything else impressive you've done). Further, "6 months of upskilling" may not produce anything super legible (though it's great if it does, and this is worth aiming for!)

My MATS application has a 10-hour work task, and the difference in how much signal I get from that versus from just the CV is night and day; I accept a lot of candidates who look mediocre on paper (and vice versa).

If you're getting desk rejected from jobs, I'd recommend asking a friend (ideally one with some experience in the relevant field/industry or who's done hiring before) to look at your CV/application to some recent jobs and give feedback.

Strong +1 to asking the mentor being a great way to get information! My guess is many mentors aren't going out of their way to volunteer this kind of info, but will share it if asked. Especially if they've already decided that they want to work with you.

My MATS admission doc has some info on that for me, though I can give more detailed answers if anyone emails me with specific questions.

I'd guess this varies by field? I think this would be bad advice in mech interp - there are a lot of concepts and existing mech interp theory that you need in order to understand a bunch of good projects, and people new to the field are often bad at explaining these (and, importantly, I think I have decent judgement about whether a project is any good). But I'd guess this is decent advice in some areas of alignment.
