LESSWRONG

Arush

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Nov 7, 2023•38

Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush, scasper

Paper coauthors: Rusheb Shah, Quentin Feuillade--Montixi, Soroush J. Pour, Arush Tagade, Stephen Casper, Javier Rando.

Motivation

Our research team set out to show that state-of-the-art (SOTA) LLMs like GPT-4 and Claude 2 are not robust to misuse and cannot be fully aligned to the desires of their creators, which poses a risk of societal harm. This holds despite significant effort by their creators, which shows that the current paradigm of pre-training, supervised fine-tuning (SFT), and RLHF is not adequate for model robustness.

We also wanted to explore and share findings on "persona modulation"^[1], a technique that uses the character-impersonation strengths of LLMs to steer them in powerful ways.

Summary

We introduce an automated, low-cost way to make transferable,...