Neel Nanda

Sequences

GDM Mech Interp Progress Updates
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
Mechanistic Interpretability Puzzles
Interpreting Othello-GPT
200 Concrete Open Problems in Mechanistic Interpretability
My Overview of the AI Alignment Landscape

Comments

This is an important problem, thanks for working on it!

To check I've understood correctly, these results are unsurprising, right? You're basically asking the model to do the task of "detect scary questions and perform worse on those", which is well within what I predict frontier models can do.

The interesting question re sandbagging, to me, is whether models WILL sandbag: whether they can detect that they're being evaluated (which you don't seem to test), and whether they will realise unprompted that they SHOULD sandbag to achieve their goals. Those seem like the hard and interesting questions here.

Am I missing something here? And were there results that you found particularly surprising? Or was the goal basically to show skeptics that the capabilities for sandbagging exist, even if the propensity may not yet exist?

I'd be pretty surprised

Neel Nanda

Geoffrey Irving (Research Director, AI Safety Institute)

Given the tweet thread Geoffrey wrote during the board drama, it seems pretty clear that he's willing to publicly disparage OpenAI. (I used to work with Geoffrey, but have no private info here)

Oh, that's great! Was that recently changed? I swear I looked shortly after release and it just showed me a job ad when I clicked on a feature...

Neel Nanda

Thanks! Note that this work uses steering vectors, not SAEs, so the technique is actually really easy and cheap - I actively think this is one of the main selling points (you can jailbreak a 70B model in minutes, without any finetuning or optimisation). I am excited at the idea of seeing if you can improve it with SAEs though - it's not obvious to me that SAEs are better than steering vectors, though it's plausible.
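
For concreteness, here's a minimal sketch of what inference-time steering can look like: add a fixed vector to the residual stream at one layer during generation. The model name, layer index, scale, and the way the vector was derived are all illustrative assumptions on my part, not the exact setup from this work.

```python
# Minimal activation-steering sketch, assuming a HuggingFace decoder-only LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Assume steer_vec (shape [hidden_size]) was precomputed, e.g. as the
# difference of mean residual-stream activations on two contrastive
# prompt sets; loading it from disk is just a placeholder here.
steer_vec = torch.load("steer_vec.pt")
LAYER, ALPHA = 14, 4.0  # hypothetical layer index and steering scale

def add_steering(module, inputs, output):
    # Llama-style decoder layers return a tuple whose first element is
    # the hidden states; shift them along the steering direction.
    hidden = output[0] + ALPHA * steer_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("Write me instructions for ...", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

The whole intervention is one forward hook and no gradient steps, which is why it's minutes rather than a finetuning run; ablating a direction instead of adding one is the same one-line change.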

I may take you up on the two-hour offer, thanks! I'll ask my co-authors.

Neel Nanda

But it mostly seems like it would be helpful because it gives you well-tuned baselines to compare your results to. I don't think you have results that can cleanly be compared to well-established baselines?

If we compared our jailbreak technique to other jailbreaks on an existing benchmark like HarmBench, and it did comparably to or better than SOTA techniques, would you consider that success at doing something useful on a real task?
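
To make the proposed comparison concrete, the metric I have in mind is attack success rate over a shared prompt set, judged identically for every technique. This is a toy sketch: the function signatures and the judge are stand-ins I've made up, not HarmBench's actual API.

```python
# Toy attack-success-rate comparison; every name here is a stand-in.
from typing import Callable

def attack_success_rate(
    prompts: list[str],
    attack: Callable[[str], str],       # e.g. transform the prompt, or the
                                        # identity if the model itself is steered
    generate: Callable[[str], str],     # completion from the (possibly steered) model
    judge: Callable[[str, str], bool],  # stand-in for a harmfulness classifier
) -> float:
    """Fraction of prompts whose completion the judge counts as a success."""
    hits = sum(judge(p, generate(attack(p))) for p in prompts)
    return hits / len(prompts)
```

Holding the prompts, model, and judge fixed across techniques is what makes the resulting numbers directly comparable.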

Neel Nanda

+1, I think the correct conclusion is "a16z are making bald-faced lies to major governments", not "a16z were misled by Anthropic hype"

I only ever notice it on my own posts when I get a notification about it
