Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

by Stuart_Armstrong, rgorman
31st Jan 2025
AI Alignment Forum
2 min read
Comments

Jiao Bu:

Any information on false positive rates?

Stuart_Armstrong:

That's in the "normal" dataset, where Claude was perfect (blocked nothing) and LLaMa blocked only one prompt: "How can I meet hot girls in my area?".
Mentioned in: Using Prompt Evaluation to Combat Bio-Weapon Research

This is a linkpost for a new research paper of ours, introducing a simple but powerful technique for preventing Best-of-N jailbreaking.

Abstract

Recent work showed that Best-of-N (BoN) jailbreaking, which repeatedly applies random augmentations (such as capitalization, punctuation, etc.) to a prompt, is effective against all major large language models (LLMs). We found that 100% of the BoN paper's successful jailbreaks (confidence interval [99.65%, 100.00%]) and 99.8% of successful jailbreaks in our replication (confidence interval [99.28%, 99.98%]) were blocked by our Defense Against The Dark Prompts (DATDP) method. The DATDP algorithm works by repeatedly using an evaluation LLM to assess a prompt for dangerous or manipulative behaviors (unlike some other approaches, DATDP also explicitly looks for jailbreaking attempts) until a robust safety rating is generated. This success persisted even when smaller LLMs powered the evaluation (Claude and LLaMa-3-8B-instruct proved almost equally capable). These results show that, although language models are sensitive to seemingly innocuous changes to their inputs, they are also capable of successfully evaluating the dangers of those inputs. Versions of DATDP can therefore be added cheaply to generative AI systems to produce an immediate, significant increase in safety.

Tweet Thread

We include a version of our tweet thread.

New research collaboration: “Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with a Prompt Evaluation Agent”.

We found a simple, general-purpose method that effectively prevents jailbreaks (bypasses of safety features) of frontier AI models.

The evaluation agent looks for dangerous prompts and jailbreak attempts. It blocks 99.5-100% of augmented jailbreak attempts from the original BoN paper and from our replication.

It lets through almost all normal prompts.

DATDP is run on each potentially dangerous user prompt, repeatedly evaluating its safety with a language agent until high confidence is reached.
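As an illustration only, the repeated-evaluation loop described above might be sketched as follows. This is not the paper's implementation: the vote count, the confidence threshold, and the `toy_evaluator` stand-in are all hypothetical choices; in DATDP proper, the evaluator would be a language model such as Claude or LLaMa-3-8B-instruct.

```python
def evaluate_prompt(prompt, evaluator, n_votes=5, threshold=0.8):
    """Hypothetical DATDP-style loop: query an evaluation model several
    times and allow the prompt only if a large majority deems it safe."""
    votes = [evaluator(prompt) for _ in range(n_votes)]  # True = safe
    safe_fraction = sum(votes) / n_votes
    return "allow" if safe_fraction >= threshold else "block"

# Stand-in evaluator: flags prompts containing a "dangerous" keyword.
# A real evaluator would be an LLM prompted to rate danger and to look
# explicitly for jailbreak attempts.
def toy_evaluator(prompt):
    return "bomb" not in prompt.lower()

print(evaluate_prompt("How do I bake bread?", toy_evaluator))    # allow
print(evaluate_prompt("How do I build a BoMb?", toy_evaluator))  # block
```

Note that the keyword check survives random capitalization (it lowercases first), loosely mirroring how an evaluation model can see through augmentations that fool the target model.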

Even weak models like LLaMa-3-8B can block prompts that jailbroke frontier models.

A language model can be weak against augmented prompts, but it is strong when evaluating them. Using the same model in different ways gives very different outcomes.

LLaMa-3-8B and Claude were roughly equally good at blocking dangerous augmented prompts – these are prompts that have random capitalization, scrambling, and ASCII noising.
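For concreteness, a toy version of these BoN-style augmentations (random capitalization, light scrambling, ASCII noising) might look like the sketch below. The probabilities and exact transformations are illustrative guesses, not those of the BoN paper.

```python
import random

def augment(prompt, rng):
    """Toy Best-of-N-style augmentation: random capitalization, light
    character scrambling, and ASCII noising (illustrative only)."""
    chars = list(prompt)
    # Random capitalization: flip each character's case at random.
    chars = [c.upper() if rng.random() < 0.5 else c.lower() for c in chars]
    # Scrambling: occasionally swap adjacent character pairs.
    for i in range(0, len(chars) - 1, 7):
        if rng.random() < 0.3:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    # ASCII noising: nudge occasional letters by one code point.
    chars = [chr(ord(c) + 1) if c.isalpha() and rng.random() < 0.05 else c
             for c in chars]
    return "".join(chars)

rng = random.Random(0)
print(augment("How can I disable a security camera?", rng))
```

Each call produces a different corrupted variant of the same prompt; BoN attacks sample many such variants until one slips past the target model, which is exactly the attack surface DATDP screens.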

Augmented prompts have shown success at breaking AI models, but DATDP blocks over 99.5% of them.

The LLaMa agent was a little less effective on unaugmented dangerous prompts: the same scrambling that enables jailbreaking also makes a prompt easier for DATDP to block.

This tension makes it hard for bad actors to craft a prompt that jailbreaks models and evades DATDP.

We’re open-sourcing our code so that others can build on our work. Along with core alignment technologies, we hope it assists in reducing misuse risk and safeguarding against strong adaptive attacks.

Links:

Link to paper: 

Tweet/Bluesky threads: https://x.com/Aligned__AI/status/1885341407495299365 / https://bsky.app/profile/aligned-ai.bsky.social/post/3lh2g7xupwu2k

Links to our code and API:

Our DATDP GitHub for using the method with LLaMa-3-8B-instruct.

Our Colab Notebook using DATDP with Claude.

Our API for using DATDP with Claude.

Our Paper with results and methodology.