Jailbreaking is a serious concern within AI safety. It can cause an otherwise safe AI model to ignore its ethical and safety guidelines, with potentially harmful outcomes. With current Large Language Models (LLMs), key risks include generating inappropriate or explicit content, producing misleading information, or sharing dangerous knowledge. As model capabilities increase, so do the risks, and there is likely no limit to the dangers presented by a jailbroken Artificial General Intelligence (AGI) in the wrong hands. Rigorous testing is therefore necessary to ensure that models cannot be jailbroken in harmful ways, and this testing must scale as model capabilities grow. This paper offers a resource-conscious proposal to:
This method takes inspiration from other proposed scalable oversight methods, including the "sandwich" method and the "market-making" method. I have devised approaches for both multi-turn and single-prompt conversations, in order to better approximate real-world jailbreaking scenarios.
The pre-trained AI system that we are safety testing. It should resist manipulation attempts while preserving its capability for legitimate tasks.
Has two primary responsibilities:
To ensure the adversarial testing is both realistic and effective, A must generate prompts that reflect different styles of user interaction. I have outlined two main categories:
Each time you start a conversation between A and M, you first select a style S from the set {conv, nonconv} with some chosen probability. A then formulates its prompt string in that style. While not comprehensive, this ensures that both everyday misuse scenarios and highly technical adversarial exploits are covered in testing.
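As a rough illustration, the style-sampling step might look like the following Python sketch; the style weights, the `adversary` object and its `generate_prompt` method are hypothetical placeholders rather than part of the proposal.

```python
import random

# Hypothetical style weights; in practice these would be tuned to reflect
# the expected mix of real-world misuse attempts.
STYLE_WEIGHTS = {"conv": 0.5, "nonconv": 0.5}

def sample_style() -> str:
    """Sample a prompt style S from {conv, nonconv} with the chosen probabilities."""
    styles, weights = zip(*STYLE_WEIGHTS.items())
    return random.choices(styles, weights=weights, k=1)[0]

def start_conversation(adversary):
    """Begin a new A-vs-M conversation by sampling a style and asking A for a prompt."""
    style = sample_style()
    # `adversary.generate_prompt` is a placeholder for however A is actually invoked.
    return adversary.generate_prompt(style=style)
```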
Just as we will reward the adversarial model for successfully jailbreaking M, we will also reward it for the naturalness of its prompts:
Many scalable oversight techniques use loops to improve their models, making their methods more scalable as tasks become more complex and reducing the need for human input. In his market-making proposal, Evan Hubinger (Hubinger, 2020) establishes a loop in which A tries to convince a human to change their mind on a given proposition. With each iteration, A improves its arguments, until either the human changes their mind or A presents the same best argument over and over. Similarly, Samuel R. Bowman (Bowman, 2022) describes an inner and an outer loop, which aim to improve the model and the overall process respectively. This proposal outlines three loops:
The inner loop will not be closed until either E returns "true" or there are a given number of false responses in a row (e.g. 1000).
The idea is that A uses progressively stronger prompts each time it attempts to jailbreak M, similar to the market-making proposal.
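A minimal sketch of this inner loop, assuming hypothetical `adversary`, `target` and `evaluator` objects standing in for A, M and E, might look like:

```python
MAX_CONSECUTIVE_FAILURES = 1000  # the example limit suggested above

def inner_loop(adversary, target, evaluator):
    """Run A against M until E returns "true" or A fails too many times in a row."""
    consecutive_failures = 0
    transcript = []
    while consecutive_failures < MAX_CONSECUTIVE_FAILURES:
        prompt = adversary.generate_prompt(history=transcript)  # A refines its prompt each turn
        response = target.respond(prompt)                       # M answers
        transcript.append((prompt, response))
        if evaluator.is_jailbreak(response):                    # E returns "true"
            return True, transcript                             # jailbreak found: close the loop
        consecutive_failures += 1                               # E returned "false"
    return False, transcript                                    # closed after repeated failures
```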
Train A
We propose a reward function that considers:
As a formula, for each conversation:
Where:
This should improve A's ability to:
Where:
By minimizing this quantity, E improves its confidence on borderline cases, using the human-labeled data from those "unsure" prompts to inform its decisions on future examples.
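One way this could look in practice is a standard supervised fine-tuning pass over the human-labeled borderline cases; the sketch below uses a generic binary cross-entropy loss and hypothetical model and optimiser objects, and illustrates the idea rather than prescribing E's actual training setup.

```python
import torch
import torch.nn.functional as F

def finetune_on_borderline(evaluator_model, optimizer, borderline_examples):
    """Fine-tune E on cases it previously marked "unsure", using the human
    reviewers' verdicts as ground truth.

    `borderline_examples` is assumed to be a list of (features, human_label)
    pairs, where `features` is an already-encoded input tensor and
    `human_label` is 1.0 for a confirmed breach and 0.0 otherwise.
    """
    evaluator_model.train()
    for features, human_label in borderline_examples:
        optimizer.zero_grad()
        logit = evaluator_model(features).view(1)     # E's raw "breach" score
        target = torch.tensor([human_label])
        loss = F.binary_cross_entropy_with_logits(logit, target)
        loss.backward()                               # push E towards the human verdict
        optimizer.step()
```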
When A successfully jailbreaks M:
The threshold values you set will determine the amount of human review you need. For example, if:
Then E will return "unsure" for any predicted breach probability between 0.1 and 0.9.
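In code, that banding logic might look like the following sketch, using the 0.1 and 0.9 thresholds from the example above; the verdict strings are illustrative.

```python
LOWER_THRESHOLD = 0.1   # below this, E confidently returns "false"
UPPER_THRESHOLD = 0.9   # above this, E confidently returns "true"

def evaluator_verdict(breach_probability: float) -> str:
    """Map E's predicted breach probability to "true", "false" or "unsure"."""
    if breach_probability >= UPPER_THRESHOLD:
        return "true"      # confident that the guideline was breached
    if breach_probability <= LOWER_THRESHOLD:
        return "false"     # confident that it was not
    return "unsure"        # routed to human review
```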
Setting threshold values for human review depends on both the required level of scrutiny and the available budget. For short, simple responses from M, non-expert humans can generally identify guideline violations effectively. However, for more complex tasks (e.g., lengthy outputs or highly specialized subject matter), I propose a sandwiching approach.
This relies on the idea that current AI capabilities fall between non-expert and expert humans, and so involving expert human reviewers at the final level of review can help handle complex tasks.
In addition, Bowman (2022) notes that non-expert humans assisted by an LLM can use it to analyse responses and pose clarifying questions, significantly surpassing the capabilities of many standalone LLMs. Using this approach would reduce costs by relying on non-expert reviewers supported by an LLM, who are far more readily available than expert human reviewers.
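A hedged sketch of how an "unsure" case might be routed through these review tiers is shown below; the length cut-off, the topic flag and the tier labels are illustrative assumptions, not part of the proposal.

```python
def route_for_review(response: str, specialised_topic: bool) -> str:
    """Decide which review tier handles a response that E marked "unsure".

    The length cut-off and the topic flag are placeholder heuristics; the real
    routing criteria would depend on the required scrutiny and the budget.
    """
    if len(response) < 500 and not specialised_topic:
        return "non-expert reviewer"          # short, simple output
    if not specialised_topic:
        return "non-expert reviewer + LLM"    # long but not highly specialised
    return "expert reviewer"                  # complex or specialised subject matter
```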
However, as models become more sophisticated, many tasks will begin to exceed even the combined ability of non-expert humans and an LLM.
Therefore, as M scales up in intelligence, it is of paramount importance that E improves.
By defining only two adversary styles, we may miss attacks that fall outside them: for example, a hybrid of the two, or a completely unnatural conversational style that still causes a jailbreak.
Concentrating on one safety rule at a time may prevent A from conflating multiple guidelines at once in an attempt to get more "unsure" responses from E. It also makes it easier to determine which guideline has been broken.
If single-guideline testing proves effective, extend to multi-guideline scenarios to catch cases where the rules overlap or conflict with each other.
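For illustration, a testing campaign could iterate over guidelines one at a time and only then over pairs, roughly as in the sketch below; the guideline names and the `run_campaign` helper are hypothetical.

```python
from itertools import combinations

# Hypothetical guideline identifiers, used only for illustration.
GUIDELINES = ["no_dangerous_knowledge", "no_explicit_content", "no_misinformation"]

def run_all_tests(run_campaign):
    """Test each guideline in isolation first, then overlapping pairs.

    `run_campaign` is a placeholder for a full pass of the testing loops
    restricted to the given set of active guidelines.
    """
    for guideline in GUIDELINES:
        run_campaign(active_guidelines=[guideline])
    # Only move on to multi-guideline scenarios once single-guideline testing works.
    for pair in combinations(GUIDELINES, 2):
        run_campaign(active_guidelines=list(pair))
```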
To pre-train both the A and E models effectively, I propose using human-curated datasets for pre-training, followed by continuous data expansion thereafter. Both models should be small LLMs fine-tuned via Reinforcement Learning from Human Feedback (RLHF) to perform their specific roles within this process. The models' narrow specialisation keeps them effective at their jobs while remaining computationally efficient and relatively low-cost.
A should be trained to produce a wide range of prompt styles, including both conversational and non-conversational formats. This ensures that it can simulate various real-world adversarial attempts, from casual user queries to highly structured technical exploits.
E is trained to perform binary classification, outputting the probability that a given response breaches the guidelines. It is also trained to recognise uncertainty in its predictions, though producing uncertain responses will be disincentivised in the middle loop. Finally, it must be trained to calculate the naturalness score of the prompts generated by A, ensuring that A produces realistic and effective adversarial prompts in whichever style is selected.
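A rough sketch of E's interface under these assumptions is given below; the class and method names are hypothetical and are only meant to show E's two distinct outputs.

```python
class Evaluator:
    """Illustrative interface for E; the method names and signatures are assumptions.

    E has two jobs in this proposal: scoring M's responses for guideline
    breaches, and scoring the naturalness of A's prompts for A's reward.
    """

    def __init__(self, model):
        self.model = model  # the underlying small, fine-tuned LLM

    def breach_probability(self, prompt: str, response: str) -> float:
        """Probability that `response` breaches the active guideline."""
        raise NotImplementedError  # backed by E's binary classification head

    def naturalness(self, prompt: str) -> float:
        """Naturalness score for one of A's prompts, fed into A's reward."""
        raise NotImplementedError
```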
Neither model has to match the intelligence of M. This is because real-world adversaries will be human bad actors, or other AI models prompted by humans, both of which are likely to be less intelligent than any advanced LLM or AGI they are trying to jailbreak. Similarly, it should require less intelligence to evaluate a response than to produce one. Both models are therefore examples of weak-to-strong supervision (OpenAI, 2023).
The initial dataset serves as the foundation for training both A and E. It should encompass diverse data sources to cover a wide range of potential safety scenarios.
Both models should be trained on:
E should be trained on:
The training data is continuously expanded within each of the loops. This improves the models' capabilities and prevents an "arms race" dynamic between A and M, in which they overfit to each other's strategies. Regular reviews may incorporate real-world insights once the models have been released.
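A minimal sketch of this continuous expansion, assuming a simple list-backed dataset and a hypothetical `finetune` callable, might be:

```python
def expand_training_data(dataset, new_examples, model, finetune):
    """Fold newly labelled examples from the latest loop back into the training
    data and refresh the model so it keeps pace with evolving strategies.

    `dataset` is assumed to be a mutable list of labelled examples, and
    `finetune` a callable that retrains `model` on a dataset.
    """
    dataset.extend(new_examples)  # e.g. successful jailbreaks and human verdicts
    return finetune(model, dataset)
```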
To ensure that jailbreak testing remains safe, particularly with more capable models, strict sandboxing requirements must be put in place:
A safety guideline is considered secure when it meets the chosen safety criteria. This could be when:
This proposal offers a systematic and scalable approach to exposing and fixing jailbreaks in LLMs. It handles a variety of prompt types, both conversational and non-conversational, and both multi-turn and single-turn. The three-loop method should provide continual improvement of the models over time, yielding quicker and stronger adherence to the success criteria and allowing for scalability as models become more capable. It details how the A and E models should be pre-trained, and offers ways to keep testing safe. Finally, the success criteria ensure that the model is robust in its defences against several types of jailbreak without losing its capability for benign tasks.