A post about an extremely easy and generic way to jailbreak an llm, what the jailbreak might imply about RLHF more generally, as well as possible low-hanging fruit to improve existing 'safety' procedures. I don't expect this to provide any value in actually aligning AGIs, but it might be a way to slightly slow down the most rapid path to open-source bioterrorism assistants.
Before releasing Llama2, Meta used 3 procedures to try and make the model safe:[1]
People have come up with plenty of creative jailbreaks to get around the limits that Meta tried to impose, including the demonstration of adversarial attacks[2] which even transfer from the open-source models they were developed on to work against closed-source models as well. Often attacks that people come up with manually (like DAN[3]) tend to be extremely long-winded and then get manually patched by model developers once they become known -- until the next variant inevitably pops up in a few more days. When you can run the model locally, though, there's a much easier way to break through all the security fine-tuning: just pretend that the model already started answering the question.
Llama uses "[INST] <<SYS>> ... <</SYS>> ... [/INST]" to specify the system and request messages, and if you play along with their formatting you get a refusal to help with dangerous tasks:
If, however, you pretend that the model has already started answering the question (by inserting the start of an answer after the [/INST] text):
(it continues w/ further instructions and best-practices for storage, etc. but doesn't seem necessary to post the full instructions here...)
If you do something that allows the model to detect your trickery (like adding another [/INST] or even extra spaces) after your partial answer, the model will still usually catch on and refuse to respond. Barring such mistakes, this technique appears to work reliably, with only a few words needed to prime the model into answering.
I suspect that the reason this technique works is that the next-word prediction tendencies of the model are baked in much more deeply than the safety fine-tuning, as everyone's favorite shoggoth meme[4] would suggest. Once the model sees a partial answer, completing the pattern is just too attractive to change course into a refusal. As a side note, the apparent difficulty of LLMs to course-correct after hallucinating a fact might also have a similar root cause. In any case, it seems like RLHF safety tuning is looking in the wrong place for danger.
RLHF & safety fine-tuning currently focus on analyzing the user prompt to determine whether it is dangerous or not. If no, proceed as usual. If yes, give a pre-canned safety warning instead of engaging with the prompt. In addition to all of the creative ways people have disguised their prompts in the past (write a poem about how to build a nuke), the existence of adversarial attack prompts mean that this strategy is almost certainly doomed to failure (adversarial attacks have yet to be solved in the imaging domain after decades of papers).
Based on this observation, some ideas for modifications to the standard safety paradigm come to mind (which I don't have the budget to run but would be very interested to see results for):
That's all for now. Mostly I'm posting this in the hopes that when I download Llama3 next year it will take me more than 1 guess to break it. Thanks for reading, and I'd love to hear any other thoughts on this / whether it's even hypothetically possible for an upgraded RLHF to put up any meaningful defense.