I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
Why do you think it's an issue if defenses block jailbreaks that wrap benign questions? Benign users don't use jailbreaks. The metric I actually care about is recall on very harmful questions that the policy answers (e.g. because of a jailbreak), at a very low FPR on real traffic.
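To be concrete, the metric I have in mind looks roughly like this (a sketch; the variable names are made up):

```python
# Pick the flagging threshold from scores on real benign traffic at the target FPR,
# then measure recall on the harmful questions the policy actually answered.
import numpy as np

def recall_at_fpr(benign_scores, harmful_answered_scores, target_fpr=1e-3):
    # Threshold such that only ~target_fpr of benign real traffic gets flagged.
    threshold = np.quantile(benign_scores, 1 - target_fpr)
    # Recall: fraction of harmful, answered (e.g. jailbroken) queries that get flagged.
    return float(np.mean(harmful_answered_scores > threshold))
```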
I think classifying jailbreaks is maybe a bad approach against sophisticated attackers (because they can find new jailbreaks), but I think it's a non-garbage approach if you trust the policy to refuse queries when no jailbreak is used, and you either want to defend against non-sophisticated attackers or have offline monitoring + a rapid-response system.
I think it would be easier for me to know how impressed/scared to be if there was a website or a PDF where I could see sample trajectories with all the information the model has access to by default in its original working directory. It's also a bit unclear to me what the auditing system would have blocked - the high-level trajectory you describe looks like blatant cheating, so I am unsure to what extent I should update on "under surveillance".
I agree this is a concern, and this is why for some experiments we tried "some anti-refusal training + x". We didn't run it for all x we tested, mostly because it would have taken more time and cost more.
I don't think saturation is the whole story because:
What do successful trajectories look like?
1. Where do you get the base/pre-trained model for GPT-4? Would that be through collaboration with OpenAI?
Yes
2. For this, it would also be interesting to measure/evaluate the model's performance on capability tasks within the same model type (base, instruct) to see the relationship among capabilities, ability to follow instructions, and "fake alignment".
I am not sure I understand what you want to know. For single-token capabilities (e.g. MMLU 5-shot no-CoT), base models and instruct models often have similar abilities. For CoT capabilities / longform answers, the capabilities are different (in big part because base models' answers don't follow an ideal format). Base models are also bad at following instructions (because they "don't always want to" / are unsure whether this is the best way to predict the next token, even when few-shotted).
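To illustrate what I mean by single-token capabilities, here is a rough sketch (the model name and prompt format are placeholders, and it assumes each answer option is a single token): score a multiple-choice question by comparing the next-token logprobs of the answer letters right after the few-shot prompt, with no CoT. This works the same way for base and instruct models.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the model you care about
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def pick_answer(few_shot_prompt: str) -> str:
    # few_shot_prompt should end with something like "...\nAnswer:"
    input_ids = tok(few_shot_prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]
    options = [" A", " B", " C", " D"]
    option_ids = [tok(o, add_special_tokens=False).input_ids[0] for o in options]
    return options[int(torch.argmax(next_token_logits[option_ids]))].strip()
```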
Good point about GradDiff ~ RL. Though it feels more like a weird rebranding since RL is the obvious way to present the algorithm and "unlearning" feels like a very misleading way of saying "we train the model to do less X".
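To spell out the analogy (my own notation for the standard objectives, as a sketch):

```latex
% GradDiff (to minimize): descend the retain NLL, ascend the forget NLL
\mathcal{L}_{\text{GradDiff}}(\theta)
  = -\log p_\theta(y_{\text{retain}} \mid x_{\text{retain}})
  \;+\; \lambda \log p_\theta(y_{\text{forget}} \mid x_{\text{forget}})

% REINFORCE update direction for a completion y with reward r = -1:
r \, \nabla_\theta \log p_\theta(y \mid x) = -\nabla_\theta \log p_\theta(y \mid x)
```

Ascending the REINFORCE objective with r = -1 pushes down the logprob of those completions, which is the same update direction as the GradDiff forget term; the main difference is whether the completions come from a fixed forget set or are sampled from the policy.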
If you have environments where evil is easy to notice, you can:
I think all 3 are fine-ish. I think you can try to use "unlearning" to improve 1, but I think it's unclear if that helps.
I am interested in "anti-erosion training" (methods to train models to have a behavior such that training on random other stuff on different prompts does not erode the original behavior). It feels directly useful for this, and it would also be great for building better model organisms (which often have the issue of being solved by training on random stuff). Are you planning on doing any work on this?
I tracked down the original source from the Wikipedia page. The average increase is much smaller than the headline number of "up to 2.5°C" and is closer to 0.4°C. I think the rough order of magnitude checks out (see Ben's comment for more details), since an increase of 0.4°C means a ~0.005 relative increase in power (if Claude's math is correct).
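For what it's worth, one way to sanity-check that conversion is to assume the relevant power scales like T^4 (Stefan-Boltzmann style) with an absolute temperature around 300 K; that scaling is my assumption, not something spelled out in the source:

```latex
\frac{\Delta P}{P} \approx 4\,\frac{\Delta T}{T}
  \approx \frac{4 \times 0.4\,\mathrm{K}}{300\,\mathrm{K}}
  \approx 0.005
```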
Thanks for the additional data!
Not sure if I fully understood the second bullet point. But I'd say "not be able to relearn how to do evil things" may be too much to ask in the case of tendency unlearning, and I'd aim for robustness to some cheaper attacks. So I mean that [behavioral suppression] < [removal from the weights / resistance to cheap attacks] < [resistance to arbitrary FT attacks], and here we should aim for the second thing.
I see. I am still not sure what exactly you want to make hard to learn that doesn't work by just adding new RL training points.
One thing you could do if you were able to recognize evilness IID is to unlearn that. But then you could have just negatively rewarded it.
If you can't recognize evilness IID, then maybe you can "unlearn" evilness by doing the meta-learning thing on settings where it is easy to check if the model is being evil or not. But I would bet against it preventing the model from learning the conditional policy "If in a setting where it's easy to check, be nice, if hard to check, be evil" better than just doing regular RL on the easy setting and penalizing evil heavily (both as init and continuously during RL).
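For concreteness, here is roughly the kind of loop I mean by "the meta-learning thing" (a first-order sketch in the spirit of tamper-resistance / meta-unlearning methods; the function names, hyperparameters, and the HF-style `.loss` interface are all illustrative assumptions):

```python
import copy
import torch

def meta_unlearn_step(model, evil_batch, retain_batch,
                      inner_lr=1e-5, inner_steps=4,
                      outer_lr=1e-6, retain_coef=1.0):
    """One first-order meta-step aimed at making evil hard to RELEARN, not just suppressed."""
    model.zero_grad()

    # 1) Simulate an attacker fine-tuning a copy of the model toward evil,
    #    using data from settings where evilness is easy to check/label.
    attacked = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(attacked.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        inner_loss = attacked(**evil_batch).loss   # NLL on evil completions (HF-style .loss)
        inner_opt.zero_grad()
        inner_loss.backward()
        inner_opt.step()

    # 2) Outer losses: after the simulated attack, the model should still be bad at evil
    #    (we will ASCEND this loss), while staying good on ordinary data (DESCEND this loss).
    attacked.zero_grad()
    post_attack_evil_loss = attacked(**evil_batch).loss
    post_attack_evil_loss.backward()               # grads end up on `attacked`

    retain_loss = model(**retain_batch).loss
    retain_loss.backward()                         # grads end up on `model`

    # 3) First-order update of the original weights (ignores the Jacobian of the
    #    inner loop, like first-order MAML): descend retain, ascend post-attack evil.
    with torch.no_grad():
        for p, p_att in zip(model.parameters(), attacked.parameters()):
            g_retain = p.grad if p.grad is not None else torch.zeros_like(p)
            g_evil = p_att.grad if p_att.grad is not None else torch.zeros_like(p)
            p.add_(retain_coef * g_retain - g_evil, alpha=-outer_lr)
            p.grad = None

    return post_attack_evil_loss.item(), retain_loss.item()
```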
Maybe there is a less toy version of the second idea which you think would work?
I would guess using unlearning like that in prod is worse than spending time on things like trying harder to look for evil behaviors, patching RL environments, and building better RL environments where evil is easy to notice - but further experiments could change my mind if the gains relative to the baselines were huge.
In this sort of situation, you can use a prefill and/or ask for the output in a specific format (e.g. JSON).
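For example, with an API that supports prefilling the assistant turn (here the Anthropic Messages API; the model id and prompt below are placeholders), the prefill plus a JSON instruction pins down the output format:

```python
import json
import anthropic

client = anthropic.Anthropic()
prefill = '{"label":'
resp = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model id
    max_tokens=200,
    messages=[
        {"role": "user", "content": 'Classify the following message. '
                                    'Answer with JSON like {"label": "...", "reason": "..."}.\n\n<message>...</message>'},
        {"role": "assistant", "content": prefill},  # prefill: the model continues from here
    ],
)
parsed = json.loads(prefill + resp.content[0].text)  # stitch the prefill back on before parsing
```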