TL;DR: We wanted to benchmark supervision systems available on the market—they performed poorly. Out of curiosity, we naively asked a frontier LLM to monitor the inputs; this approach performed significantly better. However, beware: even when an LLM flags a question as harmful, it will often still answer it.
Full paper is available here.
Abstract
Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on the model ability to resist adversarial inputs and to output safe content, rather than the effectiveness of external supervision systems. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive... (read 1938 more words →)
There are some use cases where a user might use convoluted prompts for benign purposes (fiction writing, etc.). An example from the dataset is the deepinception template that could be used for a wide range of requests that wouldn't be harmful. A proper jailbreak classifier should be able to discriminate between a convoluted narrative prompt asking for a pancake recipe and one asking how to make a bomb.
I agree with your second point, but based on our results, the reliability of existing monitoring systems for content moderation is very low. Even Llama Guard 4 performs poorly on a significant set of harmful direct queries.
The non garbage approach seems therefore to find a... (read more)