Note that the number of scholars is a much more important metric than the number of mentors when it comes to evaluating MATS resources, as scholars per mentor varies a bunch (e.g. over winter I had 10 scholars, which is much more than most mentors). Harder to evaluate from the outside though!

Refusal in LLMs is mediated by a single direction

Neel Nanda4hΩ330

Thanks, I'd be very curious to hear if this meets your bar for being impressed, or what else it would take! Further evidence:

Passing the Twitter test (for at least one user)
Being used by Simon Lerman, an author on Bad LLama (admittedly with the help of Andy Arditi, our first author), to jailbreak LLaMA3 70B to help create data for some red-teaming research.

Mechanistic Interpretability Workshop Happening at ICML 2024!

Neel Nanda13h20

Nnsight, pyvene, inseq, torchlens are other libraries coming to mind that it would be good to discuss in a related work. Also penzai in JAX

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

Neel Nanda2dΩ220

I hadn't seen the latter, thanks for sharing!

Refusal in LLMs is mediated by a single direction

Neel Nanda5dΩ230

Agreed, it seems less elegant. But one guy on huggingface did a rough plot of the cross correlation, and it seems to show that the direction changes with layer https://huggingface.co/posts/Undi95/318385306588047#663744f79522541bd971c919. Although perhaps we are missing something.

Idk. This shows that if you wanted to optimally get rid of refusal, you might want to do this. But, really, you want to balance between refusal and not damaging the model. Probably many layers are just kinda irrelevant for refusal. Though really this argues that we're both wrong, and the most surgical intervention is deleting the direction from key layers only.
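To make "deleting the direction from key layers only" concrete, here is a minimal sketch of projecting a single direction out of activations at a chosen subset of layers. It is an illustration under assumptions, not the code from the post: `ablate_direction`, `acts_by_layer`, and `key_layers` are hypothetical names, and random tensors stand in for real residual stream activations.

```python
import torch


def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `acts` along `direction` (x - (x . r_hat) r_hat)."""
    unit = direction / direction.norm()
    # (acts @ unit) is the scalar projection of each row onto the unit direction.
    return acts - (acts @ unit).unsqueeze(-1) * unit


torch.manual_seed(0)
d_model = 8
direction = torch.randn(d_model)  # stand-in for an extracted direction

# Hypothetical activations: one [batch, d_model] tensor per layer.
acts_by_layer = [torch.randn(3, d_model) for _ in range(4)]

# Assumption: only these layers matter, so only they are edited.
key_layers = {1, 2}

ablated_by_layer = [
    ablate_direction(acts, direction) if layer in key_layers else acts
    for layer, acts in enumerate(acts_by_layer)
]
```

After this, activations at the key layers have no component along the direction, while the other layers are left untouched. Which layers count as "key" is an empirical question this sketch does not answer.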

Refusal in LLMs is mediated by a single direction

Neel Nanda6dΩ351

Thanks! I'm personally skeptical of ablating a separate direction per block, it feels less surgical than a single direction everywhere, and we show that a single direction works fine for LLAMA3 8B and 70B

The transformer lens library does not have a save feature :(

Note that you can just do torch.save(model.state_dict(), FILE_PATH) as with any PyTorch model.
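A minimal sketch of that save and reload round trip, assuming a small `nn.Linear` as a stand-in for the real model and a temporary file as the destination (both are illustrative choices, not anything specific to TransformerLens):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model (assumption); the same pattern applies to any nn.Module.
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "model_state.pt")

    # torch.save takes the object first, then the destination path.
    torch.save(model.state_dict(), file_path)

    # Reload into a freshly constructed model with the same architecture.
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(file_path))

# True when every saved parameter matches the reloaded one exactly.
weights_match = all(
    torch.equal(param, restored.state_dict()[name])
    for name, param in model.state_dict().items()
)
```

Saving the state dict stores only the weights, so the model class has to be rebuilt with the same config before calling `load_state_dict`.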

Introducing AI-Powered Audiobooks of Rational Fiction Classics

Neel Nanda7d60

Thanks for making these! How expensive is it?

Mechanistic Interpretability Workshop Happening at ICML 2024!

Neel Nanda7d22

Makes sense! Sounds like a fairly good fit

It just seems intuitively like a natural fit: Everyone in mech interp needs to inspect models. This tool makes it easier to inspect models.

Another way of framing it: Try to write your paper in such a way that a mech interp researcher reading it says "huh, I want to go and use this library for my research". Eg give examples of things that were previously hard that are now easy.

Mechanistic Interpretability Workshop Happening at ICML 2024!

Neel Nanda9d22

Looks relevant to me on a skim! I'd probably want to see some arguments in the submission for why this is useful tooling for mech interp people specifically (though being useful to non mech interp people too is a bonus!)

Transcoders enable fine-grained interpretable circuit analysis for language models

Neel Nanda10d40

That's awesome, and insanely fast! Thanks so much, I really appreciate it
