Neel Nanda

Sequences

GDM Mech Interp Progress Updates
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
Mechanistic Interpretability Puzzles
Interpreting Othello-GPT
200 Concrete Open Problems in Mechanistic Interpretability
My Overview of the AI Alignment Landscape

Comments

This is an important problem, thanks for working on it!

To check I've understood correctly, these results are unsurprising, right? You're basically asking the model to do the task of "detect scary questions and perform worse on those", which is well within what I predict frontier models can do.

The interesting question re sandbagging, to me, is whether models WILL sandbag: whether they can detect that they're being evaluated (which you don't seem to test), and whether they will realise unprompted that they SHOULD sandbag to achieve their goals. Those seem like the hard and interesting questions here.

Am I missing something here? And were there results that you found particularly surprising? Or was the goal basically to show skeptics that the capabilities for sandbagging exist, even if the propensity may not yet exist?

I'd be pretty surprised

Neel Nanda

Geoffrey Irving (Research Director, AI Safety Institute)

Given the tweet thread Geoffrey wrote during the board drama, it seems pretty clear that he's willing to publicly disparage OpenAI. (I used to work with Geoffrey, but have no private info here)

Oh, that's great! Was that recently changed? I swear I looked shortly after release and it just showed me a job ad when I clicked on a feature...

Neel Nanda

Thanks! Note that this work uses steering vectors, not SAEs, so the technique is actually really easy and cheap - I actively think this is one of the main selling points (you can jailbreak a 70B model in minutes, without any finetuning or optimisation). I am excited at the idea of seeing if you can improve it with SAEs though - it's not obvious to me that SAEs are better than steering vectors, though it's plausible.
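
For concreteness, here's a minimal sketch of what inference-time steering can look like: add a fixed vector to the residual stream at one layer during generation. The model name, layer index, scale, and the way the vector was derived are all illustrative assumptions on my part, not the exact setup from this work.

```python
# Minimal activation-steering sketch, assuming a HuggingFace decoder-only LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Assume steer_vec (shape [hidden_size]) was precomputed, e.g. as the
# difference of mean residual-stream activations on two contrastive
# prompt sets; loading it from disk is just a placeholder here.
steer_vec = torch.load("steer_vec.pt")
LAYER, ALPHA = 14, 4.0  # hypothetical layer index and steering scale

def add_steering(module, inputs, output):
    # Llama-style decoder layers return a tuple whose first element is
    # the hidden states; shift them along the steering direction.
    hidden = output[0] + ALPHA * steer_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("Write me instructions for ...", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

The whole intervention is one forward hook and no gradient steps, which is why it's minutes rather than a finetuning run; ablating a direction instead of adding one is the same one-line change.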

I may take you up on the two-hour offer, thanks! I'll ask my co-authors.

Neel Nanda

But it mostly seems like it would be helpful because it gives you well-tuned baselines to compare your results to. I don't think you have results that can cleanly be compared to well-established baselines?

If we compared our jailbreak technique to other jailbreaks on an existing benchmark like HarmBench, and it did comparably to or better than SOTA techniques, would you consider that success at doing something useful on a real task?
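
To make the proposed comparison concrete, the metric I have in mind is attack success rate over a shared prompt set, judged identically for every technique. This is a toy sketch: the function signatures and the judge are stand-ins I've made up, not HarmBench's actual API.

```python
# Toy attack-success-rate comparison; every name here is a stand-in.
from typing import Callable

def attack_success_rate(
    prompts: list[str],
    attack: Callable[[str], str],       # e.g. transform the prompt, or the
                                        # identity if the model itself is steered
    generate: Callable[[str], str],     # completion from the (possibly steered) model
    judge: Callable[[str, str], bool],  # stand-in for a harmfulness classifier
) -> float:
    """Fraction of prompts whose completion the judge counts as a success."""
    hits = sum(judge(p, generate(attack(p))) for p in prompts)
    return hits / len(prompts)
```

Holding the prompts, model, and judge fixed across techniques is what makes the resulting numbers directly comparable.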

Neel Nanda

+1, I think the correct conclusion is "a16z are making bald-faced lies to major governments", not "a16z were misled by Anthropic hype"

I only ever notice it on my own posts when I get a notification about it
