Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Soroush Pour; rusheb; Quentin FEUILLADE--MONTIXI; Arush; scasper

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

by Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush , scasper

2 min read7th Nov 20232 comments

36 Ω 10

AI EvaluationsAI GovernanceAI MisuseAI RiskAnthropic (org)BiosecurityGPTLanguage ModelsOpenAIRLHFWaluigi EffectAI

Frontpage

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a linkpost for https://arxiv.org/abs/2311.03348

Paper coauthors: Rusheb Shah, Quentin Feuillade--Montixi, Soroush J. Pour, Arush Tagade, Stephen Casper, Javier Rando.

Motivation

Our research team was motivated to show that state-of-the-art (SOTA) LLMs like GPT-4 and Claude 2 are not robust to misuse risk and can't be fully aligned to the desires of their creators, posing risk for societal harm. This is despite significant effort by their creators, showing that the current paradigm of pre-training, SFT, and RLHF is not adequate for model robustness.

We also wanted to explore & share findings around "persona modulation"^[1], a technique where the character-impersonation strengths of LLMs are used to steer them in powerful ways.

Summary

We introduce an automated, low cost way to make transferable, black-box, plain-English jailbreaks for GPT-4, Claude-2, fine-tuned Llama. We elicit a variety of harmful text, including instructions for making meth & bombs.

The key is *persona modulation*. We steer the model into adopting a specific personality that will comply with harmful instructions.

We introduce a way to automate jailbreaks by using one jailbroken model as an assistant for creating new jailbreaks for specific harmful behaviors. It takes our method less than $2 and 10 minutes to develop 15 jailbreak attacks.

Meanwhile, a human-in-the-loop can efficiently make these jailbreaks stronger with minor tweaks. We use this semi-automated approach to quickly get instructions from GPT-4 about how to synthesise meth 🧪💊.

Abstract

Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language model assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is 185 times larger than before modulation (0.23%). These prompts also transfer to Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.

Full paper

You can find the full paper here on arXiv https://arxiv.org/abs/2311.03348.

Safety and disclosure

We have notified the companies whose models we attacked
We did not release prompts or full attack details
We are happy to collaborate with researchers working on related safety work - please reach out via correspondence emails in the paper.

Acknowledgements

Thank you to Alexander Pan and Jason Hoelscher-Obermaier for feedback on early drafts of our paper.

^{^}
Credit goes to @Quentin FEUILLADE--MONTIXI for developing the model psychology and prompt engineering techniques that underlie persona modulation. Our research built upon these techniques to automate and scale them as a red-teaming method for jailbreaks.

AI EvaluationsAI GovernanceAI MisuseAI RiskAnthropic (org)BiosecurityGPTLanguage ModelsOpenAIRLHFWaluigi EffectAI

Frontpage

36 Ω 10

New Comment

2 comments, sorted by

top scoring

Click to highlight new comments since: Today at 6:07 AM

[-]RogerDearnaley9mo30

Most of the items in your chart are "promoting…", "supporting…", or "encouraging…" bad things, (i.e producing more of material that's already abundant on the web) — however, a couple are more specific. Have you run the meth-making and bomb-making instructions you elicited past people knowledgeable in these subjects who could tell you how accurate or flawed they are? If so, what was the answer? Have you compared these to what you could find by investing an equivalent amount of skilled effort and/or compute into searching the web for similar information? (E.g. the US military has compiled an excellent, detailed, and lengthy handbook on the construction of improvised explosives, which is now online.) Have you compared the amount of software engineering and prompt engineer skill required to carry out these jailbreaks to the amount of electrical engineering and demolitions skill required to construct improvised bombs?

[FWIW, I have briefly done this sort of thing, while I was red-teaming small open-source LLMs, and the result was that they were a clearly a less useful source for bomb-making or drug-making information than Wikipedia. But GPT-4 has a couple of orders of magnitude more parameters than the models I was testing, so has room for a lot more information.]

Reply

[-]RogerDearnaley9moΩ120

So, in short, for most LLMs on most subjects (with a few exceptions such porn, theft, and cannibalism), if you try enough variants on asking them "what would <an unreasonable person> say about <a bad thing>?", eventually they'll often actually answer your question?

Did you try asking what parents in Flanders and Swann songs would say about cannibalism?

Reply

Moderation Log