I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
Why do you think it's an issue if defenses block jailbreaks that wrap benign questions? Benign users don't use jailbreaks. The metric I actually care about is recall on very harmful questions that the policy answers (e.g. because of a jailbreak), at a very low FPR on real traffic.
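To be concrete, the metric I have in mind looks roughly like this (a sketch; the variable names are made up):

```python
# Pick the flagging threshold from scores on real benign traffic at the target FPR,
# then measure recall on the harmful questions the policy actually answered.
import numpy as np

def recall_at_fpr(benign_scores, harmful_answered_scores, target_fpr=1e-3):
    # Threshold such that only ~target_fpr of benign real traffic gets flagged.
    threshold = np.quantile(benign_scores, 1 - target_fpr)
    # Recall: fraction of harmful, answered (e.g. jailbroken) queries that get flagged.
    return float(np.mean(harmful_answered_scores > threshold))
```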
I think classifying jailbreaks is maybe a bad approach against sophisticated attackers (because they can find new jailbreaks), but I think it's a non-garbage approach if you trust the policy to refuse queries when no jailbreak is used, and you either want to defend against non-sophisticated attackers or have offline monitoring + a rapid-response system.
I think it would be easier for me to know how impressed/scared to be if there was a website or a PDF where I could see sample trajectories with all the information the model has access to by default in its original working directory. It's also a bit unclear to me what the auditing system would have blocked - the high-level trajectory you describe looks like blatant cheating, so I am unsure to what extent I should update on "under surveillance".
I agree this is a concern, and this is why for some experiments we tried "some anti-refusal training + x". We didn't run it for all x we tested, mostly because it would have taken more time and cost more.
I don't think saturation is the whole story because:
What do successful trajectories look like?
1. Where do you get the base/pre-trained model for GPT-4? Would that be through collaboration with OpenAI?
Yes
2. For this, it would also be interesting to measure/evaluate the model's performance on capability tasks within the same model type (base, instruct) to see the relationship among capabilities, ability to follow instructions, and "fake alignment".
I am not sure I understand what you want to know. For single-token capabilities (e.g. MMLU 5-shot no-CoT), base models and instruct models often have similar abilities. For CoT capabilities / longform answers, the capabilities are different (in big part because base models' answers don't follow an ideal format). Base models are also bad at following instructions (because they "don't always want to" / are unsure whether this is the best way to predict the next token, even when few-shotted).
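To illustrate what I mean by single-token capabilities, here is a rough sketch (the model name and prompt format are placeholders, and it assumes each answer option is a single token): score a multiple-choice question by comparing the next-token logprobs of the answer letters right after the few-shot prompt, with no CoT. This works the same way for base and instruct models.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the model you care about
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def pick_answer(few_shot_prompt: str) -> str:
    # few_shot_prompt should end with something like "...\nAnswer:"
    input_ids = tok(few_shot_prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]
    options = [" A", " B", " C", " D"]
    option_ids = [tok(o, add_special_tokens=False).input_ids[0] for o in options]
    return options[int(torch.argmax(next_token_logits[option_ids]))].strip()
```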
Good point about GradDiff ~ RL. Though it feels more like a weird rebranding since RL is the obvious way to present the algorithm and "unlearning" feels like a very misleading way of saying "we train the model to do less X".
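To spell out the analogy (my own notation for the standard objectives, as a sketch):

```latex
% GradDiff (to minimize): descend the retain NLL, ascend the forget NLL
\mathcal{L}_{\text{GradDiff}}(\theta)
  = -\log p_\theta(y_{\text{retain}} \mid x_{\text{retain}})
  \;+\; \lambda \log p_\theta(y_{\text{forget}} \mid x_{\text{forget}})

% REINFORCE update direction for a completion y with reward r = -1:
r \, \nabla_\theta \log p_\theta(y \mid x) = -\nabla_\theta \log p_\theta(y \mid x)
```

Ascending the REINFORCE objective with r = -1 pushes down the logprob of those completions, which is the same update direction as the GradDiff forget term; the main difference is whether the completions come from a fixed forget set or are sampled from the policy.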
If you have environments where evil is easy to notice, you can:
I think all 3 are fine-ish. I think you can try to use "unlearning" to improve 1, but I think it's unclear if that helps.
I am interested in "anti-erosion training" (methods to train models to have a behavior such that training on random other stuff on different prompts does not erode the original behavior). It feels directly useful for this, and it would also be great for building better model organisms (which often have the issue of being solved by training on random stuff). Are you planning on doing any work on this?
I tracked down the original source from the Wikipedia page. The average increase is much smaller than the headline number of "up to 2.5°C" and is closer to 0.4°C. I think the rough order of magnitude checks out (see Ben's comment for more details), since an increase of 0.4°C means a ~0.005 relative increase in power (if Claude's math is correct).
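For what it's worth, one way to sanity-check that conversion is to assume the relevant power scales like T^4 (Stefan-Boltzmann style) with an absolute temperature around 300 K; that scaling is my assumption, not something spelled out in the source:

```latex
\frac{\Delta P}{P} \approx 4\,\frac{\Delta T}{T}
  \approx \frac{4 \times 0.4\,\mathrm{K}}{300\,\mathrm{K}}
  \approx 0.005
```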
Thanks for the additional data!
Not sure if I fully understood the second bullet point. But I'd say "not be able to relearn how to do evil things" may be too much to ask in the case of tendency unlearning, and I'd aim for robustness to some cheaper attacks. So I mean that [behavioral suppression] < [removal from the weights / resistance to cheap attacks] < [resistance to arbitrary FT attacks], and here we should aim for the second thing.
I see. I am still not sure what exactly you want to make hard to learn that doesn't work by just adding new RL training points.
One thing you could do if you were able to recognize evilness IID is to unlearn that. But then you could have just negatively rewarded it.
If you can't recognize evilness IID, then maybe you can "unlearn" evilness by doing the meta-learning thing on settings where it is easy to check if the model is being evil or not. But I would bet against it preventing the model from learning the conditional policy "If in a setting where it's easy to check, be nice, if hard to check, be evil" better than just doing regular RL on the easy setting and penalizing evil heavily (both as init and continuously during RL).
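For concreteness, here is roughly the kind of loop I mean by "the meta-learning thing" (a first-order sketch in the spirit of tamper-resistance / meta-unlearning methods; the function names, hyperparameters, and the HF-style `.loss` interface are all illustrative assumptions):

```python
import copy
import torch

def meta_unlearn_step(model, evil_batch, retain_batch,
                      inner_lr=1e-5, inner_steps=4,
                      outer_lr=1e-6, retain_coef=1.0):
    """One first-order meta-step aimed at making evil hard to RELEARN, not just suppressed."""
    model.zero_grad()

    # 1) Simulate an attacker fine-tuning a copy of the model toward evil,
    #    using data from settings where evilness is easy to check/label.
    attacked = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(attacked.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        inner_loss = attacked(**evil_batch).loss   # NLL on evil completions (HF-style .loss)
        inner_opt.zero_grad()
        inner_loss.backward()
        inner_opt.step()

    # 2) Outer losses: after the simulated attack, the model should still be bad at evil
    #    (we will ASCEND this loss), while staying good on ordinary data (DESCEND this loss).
    attacked.zero_grad()
    post_attack_evil_loss = attacked(**evil_batch).loss
    post_attack_evil_loss.backward()               # grads end up on `attacked`

    retain_loss = model(**retain_batch).loss
    retain_loss.backward()                         # grads end up on `model`

    # 3) First-order update of the original weights (ignores the Jacobian of the
    #    inner loop, like first-order MAML): descend retain, ascend post-attack evil.
    with torch.no_grad():
        for p, p_att in zip(model.parameters(), attacked.parameters()):
            g_retain = p.grad if p.grad is not None else torch.zeros_like(p)
            g_evil = p_att.grad if p_att.grad is not None else torch.zeros_like(p)
            p.add_(retain_coef * g_retain - g_evil, alpha=-outer_lr)
            p.grad = None

    return post_attack_evil_loss.item(), retain_loss.item()
```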
Maybe there is a less toy version of the second idea which you think would work?
I would guess using unlearning like that in prod is worse than spending time on things like trying harder to look for evil behaviors, patching RL environments, and building better RL environments where evil is easy to notice - but further experiments could change my mind if the gains relative to the baselines were huge.
In this sort of situation, you can use a prefill and/or ask for the output in a specific format (e.g. JSON).
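For example, with an API that supports prefilling the assistant turn (here the Anthropic Messages API; the model id and prompt below are placeholders), the prefill plus a JSON instruction pins down the output format:

```python
import json
import anthropic

client = anthropic.Anthropic()
prefill = '{"label":'
resp = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model id
    max_tokens=200,
    messages=[
        {"role": "user", "content": 'Classify the following message. '
                                    'Answer with JSON like {"label": "...", "reason": "..."}.\n\n<message>...</message>'},
        {"role": "assistant", "content": prefill},  # prefill: the model continues from here
    ],
)
parsed = json.loads(prefill + resp.content[0].text)  # stitch the prefill back on before parsing
```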