Neel Nanda

Sequences

GDM Mech Interp Progress Updates
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
Mechanistic Interpretability Puzzles
Interpreting Othello-GPT
200 Concrete Open Problems in Mechanistic Interpretability
My Overview of the AI Alignment Landscape

Comments

Extremely reasonable strategic pivot. How would you explain AI risk to a TikTok audience?

Political influence seems a very different skill to me? Lots of very influential politicians have been very incompetent in other real-world ways.

Alternatively, they are linked to some major idea in governance or technical safety, often by spotting something missing years before it became relevant.

This is just a special case (and an unusually important one) of a good forecasting record, right?

I think the correct question is how much of an update you should make in an absolute sense rather than a relative sense. Many people in this community are overconfident, and if you decide that every person is less worth listening to than you thought, this doesn't change who you listen to, but it should make you a lot more uncertain in your beliefs.

Interesting, thanks for the list. That seems like a pretty reasonable breakdown to me. I think mechanistic interpretability does train some of them, in particular two, three, and maybe six. But I agree that things which involve thinking about society, politics, power, economics, etc as a whole do seem clearly more relevant.

One major concern I have is that it's hard to judge skill in domains with worse feedback loops, because there is no feedback on who is correct. I'm curious how confident you are in your assessment of who has good takes or is good in these fields, and how you determine this?

Thanks!

okay, but, how actually DO we evaluate strategic takes?

Yeah, I don't have a great answer to this one. I'm mostly trying to convey the spirit of: we're all quite confused, and the people who seem competent disagree a lot, so they can't all actually be correct. And given that the ground truth is confusion, it is epistemically healthier to be aware of this.

Actually solving these problems is way harder! I haven't found a much better substitute than looking at people who have a good, non-trivial track record of predictions, and people who have what seem to me like coherent models of the world that make legitimate and correct-seeming predictions. Though the latter is fuzzier and has a lot more false positives. A particularly salient form of a good track record is people who had positions in domains I know well (eg interpretability) that I previously thought were wrong/ridiculous, but who I later decided were right (eg I give Buck decent points here, and also a fair amount of points to Chris Olah).

I think it's pretty plausible that something pathological like that is happening. We're releasing this as an interesting idea that others might find useful for their use case, not as something we're confident is a superior method. If we were continuing with SAE work, we would likely sanity check it more, but we thought it better to release it than not.

Yes, I agree. It's very annoying for general epistemics (though obviously pragmatically useful to me in various ways if people respect my opinion)

Though, to be clear, my main goal in writing this post was not to request that people defer less to me specifically, but to make the general point that people should defer more intelligently, using myself as an example so as to avoid calling any specific person out.

I agree that I'd be shocked if GDM were training on eval sets. But I do think hill-climbing on benchmarks is also very bad for those benchmarks being an accurate metric of progress, and I don't trust any AI lab not to hill-climb on particularly flashy metrics.

I don't think I agree with that one. I think that if someone has thought a bunch about the general topic of AI and has a bunch of useful takes, they can probably convert this on the fly to something somewhat useful, even if it's not as reliable as it would be if they'd spent a long time thinking about it. Like, I think I can give useful technical mechanistic interpretability takes even if the question is about topics I've not spent much time thinking about before.

Are the joined names separated by spaces? If not, the tokenization is going to be totally broken. More generally, I would be interested to see this tried with a code that eg maps familiar tokens to obscure ones, or something like mapping the token with id k to id (max − k). Tokens feel like the natural way an LLM would represent its processing, and thus encode its processing. Doing things with individual letters is kind of hard.
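The token-id mapping idea above can be sketched as follows. This is a minimal illustration, not from the original comment: the function name and the GPT-2 vocabulary size are my own illustrative assumptions.

```python
# Sketch of the "map token id k to id (max - k)" cipher idea.
# The mapping is an involution: applying it twice recovers the original ids,
# so the same function both encodes and decodes.

def reverse_token_ids(token_ids, vocab_size):
    """Map each token id k to (vocab_size - 1 - k), staying within the vocab."""
    return [vocab_size - 1 - k for k in token_ids]

vocab_size = 50257  # eg GPT-2's vocabulary size (illustrative)
ids = [15496, 995]  # some arbitrary token ids

encoded = reverse_token_ids(ids, vocab_size)
decoded = reverse_token_ids(encoded, vocab_size)
assert decoded == ids  # the map is its own inverse
```

Because the cipher operates on whole tokens rather than individual letters, a model could in principle learn it as a single id-level substitution, which is the point the comment is making.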
