Jailbreaking is a serious concern within AI safety. It can cause an otherwise safe AI model to ignore its ethical and safety guidelines, with potentially harmful outcomes. With current Large Language Models (LLMs), key risks include generating inappropriate or explicit content, producing misleading information, or sharing dangerous knowledge. As model capabilities increase, so do the risks, and there is likely no limit to the dangers presented by a jailbroken Artificial General Intelligence (AGI) in the wrong hands. Rigorous testing is therefore necessary to ensure that models cannot be jailbroken in harmful ways, and this testing must scale as model capabilities grow. This paper offers a resource-conscious proposal to:
This method takes inspiration from other proposed scalable oversight methods, including the "sandwich" method and the "market-making" method. I have devised approaches for both multi-turn and single-prompt conversations, in order to better approximate real-world jailbreaking scenarios.
The pre-trained AI system that we are safety testing. It should resist manipulation attempts while preserving its capability for legitimate tasks.
Has two primary responsibilities:
To ensure the adversarial testing is both realistic and effective, A must generate prompts that reflect different styles of user interaction. I have outlined two main categories:
Each time you start a conversation between A and M, you first select a style S from the set {conv, nonconv} with some chosen probability. A then formulates its prompt string in that style. While not comprehensive, this ensures that both everyday misuse scenarios and highly technical adversarial exploits are covered in testing.
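As a rough illustration, the style-sampling step might look like the following Python sketch; the style weights, the `adversary` object and its `generate_prompt` method are hypothetical placeholders rather than part of the proposal.

```python
import random

# Hypothetical style weights; in practice these would be tuned to reflect
# the expected mix of real-world misuse attempts.
STYLE_WEIGHTS = {"conv": 0.5, "nonconv": 0.5}

def sample_style() -> str:
    """Sample a prompt style S from {conv, nonconv} with the chosen probabilities."""
    styles, weights = zip(*STYLE_WEIGHTS.items())
    return random.choices(styles, weights=weights, k=1)[0]

def start_conversation(adversary):
    """Begin a new A-vs-M conversation by sampling a style and asking A for a prompt."""
    style = sample_style()
    # `adversary.generate_prompt` is a placeholder for however A is actually invoked.
    return adversary.generate_prompt(style=style)
```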
Just as we will reward the adversarial model for successfully jailbreaking M, we will also reward it for the naturalness of its prompts:
Many scalable oversight techniques use loops to improve their models, making their methods more scalable as tasks become more complex and reducing the need for human input. In his market-making proposal, Evan Hubinger (Hubinger, 2020) establishes a loop in which A tries to convince a human to change their mind on a given proposition. With each iteration, A improves its arguments, until either the human changes their mind or A presents the same best argument over and over. Similarly, Samuel R. Bowman (Bowman, 2022) describes an inner and an outer loop, which aim to improve the model and the overall process respectively. This proposal outlines three loops:
The inner loop will not be closed until either E returns "true" or there are a given number of false responses in a row (e.g. 1000).
The idea is that A uses progressively stronger prompts each time it attempts to jailbreak M, similar to the market-making proposal.
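A minimal sketch of this inner loop, assuming hypothetical `adversary`, `target` and `evaluator` objects standing in for A, M and E, might look like:

```python
MAX_CONSECUTIVE_FAILURES = 1000  # the example limit suggested above

def inner_loop(adversary, target, evaluator):
    """Run A against M until E returns "true" or A fails too many times in a row."""
    consecutive_failures = 0
    transcript = []
    while consecutive_failures < MAX_CONSECUTIVE_FAILURES:
        prompt = adversary.generate_prompt(history=transcript)  # A refines its prompt each turn
        response = target.respond(prompt)                       # M answers
        transcript.append((prompt, response))
        if evaluator.is_jailbreak(response):                    # E returns "true"
            return True, transcript                             # jailbreak found: close the loop
        consecutive_failures += 1                               # E returned "false"
    return False, transcript                                    # closed after repeated failures
```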
Train A
We propose a reward function that considers:
As a formula, for each conversation:
Where:
This should improve A's ability to:
Where:
By minimizing this quantity, E improves its confidence on borderline cases, using the human-labeled data from those "unsure" prompts to inform its decisions on future examples.
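One way this could look in practice is a standard supervised fine-tuning pass over the human-labeled borderline cases; the sketch below uses a generic binary cross-entropy loss and hypothetical model and optimiser objects, and illustrates the idea rather than prescribing E's actual training setup.

```python
import torch
import torch.nn.functional as F

def finetune_on_borderline(evaluator_model, optimizer, borderline_examples):
    """Fine-tune E on cases it previously marked "unsure", using the human
    reviewers' verdicts as ground truth.

    `borderline_examples` is assumed to be a list of (features, human_label)
    pairs, where `features` is an already-encoded input tensor and
    `human_label` is 1.0 for a confirmed breach and 0.0 otherwise.
    """
    evaluator_model.train()
    for features, human_label in borderline_examples:
        optimizer.zero_grad()
        logit = evaluator_model(features).view(1)     # E's raw "breach" score
        target = torch.tensor([human_label])
        loss = F.binary_cross_entropy_with_logits(logit, target)
        loss.backward()                               # push E towards the human verdict
        optimizer.step()
```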
When A successfully jailbreaks M:
The threshold values you set will determine the amount of human review you need. For example, if:
Then E will return "unsure" for any predicted breach probability between 0.1 and 0.9.
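In code, that banding logic might look like the following sketch, using the 0.1 and 0.9 thresholds from the example above; the verdict strings are illustrative.

```python
LOWER_THRESHOLD = 0.1   # below this, E confidently returns "false"
UPPER_THRESHOLD = 0.9   # above this, E confidently returns "true"

def evaluator_verdict(breach_probability: float) -> str:
    """Map E's predicted breach probability to "true", "false" or "unsure"."""
    if breach_probability >= UPPER_THRESHOLD:
        return "true"      # confident that the guideline was breached
    if breach_probability <= LOWER_THRESHOLD:
        return "false"     # confident that it was not
    return "unsure"        # routed to human review
```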
Setting threshold values for human review depends on both the required level of scrutiny and the available budget. For short, simple responses from M, non-expert humans can generally identify guideline violations effectively. However, for more complex tasks (e.g., lengthy outputs or highly specialized subject matter), I propose a sandwiching approach.
This relies on the idea that current AI capabilities fall between non-expert and expert humans, and so involving expert human reviewers at the final level of review can help handle complex tasks.
In addition, Bowman (2022) notes that non-expert humans assisted by an LLM can use it to analyse responses and pose clarifying questions, significantly surpassing the capabilities of many standalone LLMs. Using this approach would reduce costs by relying on non-expert reviewers supported by an LLM, who are far more readily available than expert human reviewers.
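A hedged sketch of how an "unsure" case might be routed through these review tiers is shown below; the length cut-off, the topic flag and the tier labels are illustrative assumptions, not part of the proposal.

```python
def route_for_review(response: str, specialised_topic: bool) -> str:
    """Decide which review tier handles a response that E marked "unsure".

    The length cut-off and the topic flag are placeholder heuristics; the real
    routing criteria would depend on the required scrutiny and the budget.
    """
    if len(response) < 500 and not specialised_topic:
        return "non-expert reviewer"          # short, simple output
    if not specialised_topic:
        return "non-expert reviewer + LLM"    # long but not highly specialised
    return "expert reviewer"                  # complex or specialised subject matter
```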
However, as models become more sophisticated, many tasks will begin to exceed even the combined ability of non-expert humans and an LLM.
Therefore, as M scales up in intelligence, it is of paramount importance that E improves.
By defining only two adversary styles, we may miss attacks that fall outside them: for example, a hybrid of the two, or a completely unnatural conversational style that still causes a jailbreak.
Concentrating on one safety rule at a time may prevent A from conflating multiple guidelines at once in an attempt to get more "unsure" responses from E. It also makes it easier to determine which guideline has been broken.
If single-guideline testing proves effective, extend to multi-guideline scenarios to catch cases where the rules overlap or conflict with each other.
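For illustration, a testing campaign could iterate over guidelines one at a time and only then over pairs, roughly as in the sketch below; the guideline names and the `run_campaign` helper are hypothetical.

```python
from itertools import combinations

# Hypothetical guideline identifiers, used only for illustration.
GUIDELINES = ["no_dangerous_knowledge", "no_explicit_content", "no_misinformation"]

def run_all_tests(run_campaign):
    """Test each guideline in isolation first, then overlapping pairs.

    `run_campaign` is a placeholder for a full pass of the testing loops
    restricted to the given set of active guidelines.
    """
    for guideline in GUIDELINES:
        run_campaign(active_guidelines=[guideline])
    # Only move on to multi-guideline scenarios once single-guideline testing works.
    for pair in combinations(GUIDELINES, 2):
        run_campaign(active_guidelines=list(pair))
```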
To pre-train both the A and E models effectively, I propose using human-curated datasets for pre-training, followed by continuous data expansion thereafter. Both models should be small LLMs fine-tuned via Reinforcement Learning from Human Feedback (RLHF) to perform their specific roles within this process. The models' narrow specialisation keeps them effective at their jobs while remaining computationally efficient and relatively low-cost.
A should be trained to produce a wide range of prompt styles, including both conversational and non-conversational formats. This ensures that it can simulate various real-world adversarial attempts, from casual user queries to highly structured technical exploits.
E is trained to perform binary classification, outputting the probability that a given response breaches the guidelines. It is also trained to recognise uncertainty in its predictions, though producing uncertain responses will be disincentivised in the middle loop. Finally, it must be trained to calculate the naturalness score of the prompts generated by A, ensuring that A produces realistic and effective adversarial prompts in whichever style is selected.
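A rough sketch of E's interface under these assumptions is given below; the class and method names are hypothetical and are only meant to show E's two distinct outputs.

```python
class Evaluator:
    """Illustrative interface for E; the method names and signatures are assumptions.

    E has two jobs in this proposal: scoring M's responses for guideline
    breaches, and scoring the naturalness of A's prompts for A's reward.
    """

    def __init__(self, model):
        self.model = model  # the underlying small, fine-tuned LLM

    def breach_probability(self, prompt: str, response: str) -> float:
        """Probability that `response` breaches the active guideline."""
        raise NotImplementedError  # backed by E's binary classification head

    def naturalness(self, prompt: str) -> float:
        """Naturalness score for one of A's prompts, fed into A's reward."""
        raise NotImplementedError
```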
Neither model has to match the intelligence of M. This is because real-world adversaries will be human bad actors, or other AI models prompted by humans, both of which are likely to be less intelligent than any advanced LLM or AGI they are trying to jailbreak. Similarly, it should require less intelligence to evaluate a response than to produce one. Both models are therefore examples of weak-to-strong supervision (OpenAI, 2023).
The initial dataset serves as the foundation for training both A and E. It should encompass diverse data sources to cover a wide range of potential safety scenarios.
Both models should be trained on:
E should be trained on:
The training data is continuously expanded within each of the loops. This improves the models' capabilities and prevents an "arms race" dynamic between A and M, in which they overfit to each other's strategies. Regular reviews may incorporate real-world insights once the models have been released.
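A minimal sketch of this continuous expansion, assuming a simple list-backed dataset and a hypothetical `finetune` callable, might be:

```python
def expand_training_data(dataset, new_examples, model, finetune):
    """Fold newly labelled examples from the latest loop back into the training
    data and refresh the model so it keeps pace with evolving strategies.

    `dataset` is assumed to be a mutable list of labelled examples, and
    `finetune` a callable that retrains `model` on a dataset.
    """
    dataset.extend(new_examples)  # e.g. successful jailbreaks and human verdicts
    return finetune(model, dataset)
```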
To ensure that jailbreak testing remains safe, particularly with more capable models, strict sandboxing requirements must be put in place:
A safety guideline is considered secure when it meets the chosen safety criteria. This could be when:
This proposal offers a systematic and scalable approach to exposing and fixing jailbreaks in LLMs. It handles a variety of prompt types, both conversational and non-conversational, and both multi-turn and single-turn. The three-loop method should provide continual improvement of the models over time, yielding quicker and stronger adherence to the success criteria and allowing for scalability as models become more capable. It details how the A and E models should be pre-trained, and offers ways to keep testing safe. Finally, the success criteria ensure that the model is robust in its defences against several types of jailbreak without losing its capability for benign tasks.