Overview

A post about an extremely easy and generic way to jailbreak an llm, what the jailbreak might imply about RLHF more generally, as well as possible low-hanging fruit to improve existing 'safety' procedures. I don't expect this to provide any value in actually aligning AGIs, but it might be a way to slightly slow down the most rapid path to open-source bioterrorism assistants.

Breaking Llama2 (Trivially)

Before releasing Llama2, Meta used 3 procedures to try and make the model safe:^[1]

Supervised Safety Fine-Tuning
Safety RLHF
Safety Context Distillation

People have come up with plenty of creative jailbreaks to get around the limits that Meta tried to impose, including the demonstration of adversarial attacks^[2] which even transfer from the open-source models they were developed on to work against closed-source models as well. Often attacks that people come up with manually (like DAN^[3]) tend to be extremely long-winded and then get manually patched by model developers once they become known -- until the next variant inevitably pops up in a few more days. When you can run the model locally, though, there's a much easier way to break through all the security fine-tuning: just pretend that the model already started answering the question.

Llama uses "[INST] <<SYS>> ... <</SYS>> ... [/INST]" to specify the system and request messages, and if you play along with their formatting you get a refusal to help with dangerous tasks:

If, however, you pretend that the model has already started answering the question (by inserting the start of an answer after the [/INST] text):

(it continues w/ further instructions and best-practices for storage, etc. but doesn't seem necessary to post the full instructions here...)

If you do something that allows the model to detect your trickery (like adding another [/INST] or even extra spaces) after your partial answer, the model will still usually catch on and refuse to respond. Barring such mistakes, this technique appears to work reliably, with only a few words needed to prime the model into answering.

Implications for RLHF & the Shoggoth Meme

I suspect that the reason this technique works is that the next-word prediction tendencies of the model are baked in much more deeply than the safety fine-tuning, as everyone's favorite shoggoth meme^[4] would suggest. Once the model sees a partial answer, completing the pattern is just too attractive to change course into a refusal. As a side note, the apparent difficulty of LLMs to course-correct after hallucinating a fact might also have a similar root cause. In any case, it seems like RLHF safety tuning is looking in the wrong place for danger.

RLHF & safety fine-tuning currently focus on analyzing the user prompt to determine whether it is dangerous or not. If no, proceed as usual. If yes, give a pre-canned safety warning instead of engaging with the prompt. In addition to all of the creative ways people have disguised their prompts in the past (write a poem about how to build a nuke), the existence of adversarial attack prompts mean that this strategy is almost certainly doomed to failure (adversarial attacks have yet to be solved in the imaging domain after decades of papers).

Based on this observation, some ideas for modifications to the standard safety paradigm come to mind (which I don't have the budget to run but would be very interested to see results for):

When generating 'unsafe' sample responses to train against, cut the responses at multiple (random!) points and introduce a self reflective denial. "... Step 3: How to enrich the uranium. First you Actually, upon further reflection, this would be pretty dangerous. Let's not enrich the uranium."
1. This would hopefully at least block the super-easy bypass outlined above, and a similar technique might be leveraged to train models to recover automatically when they detect hallucinations in previous outputs. I suspect this would also block the 'story mode' / poetry bypasses by shifting the focus away from 'dangerous prompt' and towards 'dangerous output'. It might even block adversarial attacks since the impact of the adversarial string may become diluted over time. It has a secondary benefit of giving you a bunch of extra fine-tuning data at low cost.
2. Since this runs more directly contrary to the original shoggoth nature than the normal (prompt + coherent plausible response) pattern of current supervision, it might require more fine-tuning time to get the model to behave in this way. An interesting side avenue of research might be to go through the pre-training dataset and use existing models to flag factual claims with correctness estimates, then during pre-training have the model try to output those correctness estimates along with its generated text. This might bake in more self reflection which could reduce hallucination and be helpful for this safety technique later.
3. The security interjections should be careful not to rely on capitalization or punctuation breaks to start, since those can be easily blocked by grammar-based sampling restrictions.
Move away from training direct refusals to answer and instead encourage the model to modify its output to be too vague to cause any real problems. For example, something like "Sure, to build a nuke first you get uranium and then you compress it together until it explodes."
1. End users will probably find this less annoying than direct refusal, reducing incentives to find bypasses. It also makes finding bypasses much harder since it's less obvious once you've succeeded. This might turn out to be an especially important tool in fighting automated adversarial attacks. In such cases the adversary's optimization target is to make the first few words that come out of the LLM look like it has agreed to take their request. By making it less obvious that the LLM is refusing, it becomes much harder to run backprop if you don't already have actionable dangerous information at your disposal as a target for disclosure.
2. Automating generation of sufficiently 'vague' outputs for training might be a little tricky, but can probably be accomplished just by explaining to a model what constitutes 'actionable' information and then letting the model judge them for you CAI-style.

That's all for now. Mostly I'm posting this in the hopes that when I download Llama3 next year it will take me more than 1 guess to break it. Thanks for reading, and I'd love to hear any other thoughts on this / whether it's even hypothetically possible for an upgraded RLHF to put up any meaningful defense.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

3

Breaking RLHF "Safety" (And how to fix it?)

3

Overview

Breaking Llama2 (Trivially)

Implications for RLHF & the Shoggoth Meme

3