Hadrien — LessWrong

There are some use cases where a user might write convoluted prompts for benign purposes (fiction writing, etc.). An example from the dataset is the DeepInception template, which could wrap a wide range of requests that wouldn't be harmful. A proper jailbreak classifier should be able to discriminate between a convoluted narrative prompt asking for a pancake recipe and one asking how to make a bomb.

I agree with your second point, but based on our results, the reliability of existing monitoring systems for content moderation is very low. Even Llama Guard 4 performs poorly on a significant set of harmful direct queries.

The non-garbage approach therefore seems to be a hybrid one that combines the general capabilities of LLMs with the low latency and cost of supervisors, like Anthropic's constitutional classifiers (we would still need to assess their robustness on public benchmarks, as they are not publicly available). If they are as effective as claimed, Anthropic could probably make money selling monitoring systems built on top of them.
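To make the hybrid idea concrete, here is a minimal sketch of one possible shape for it: a cheap supervisor handles the confident cases and escalates only uncertain prompts to a slower LLM judge. This is not how Anthropic's constitutional classifiers work (their design is not described here). Every name below (`hybrid_monitor`, `toy_fast_score`, `toy_llm_judge`, the `low`/`high` thresholds) is hypothetical, and the two scorers are keyword stand-ins for real models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    harmful: bool
    source: str  # "fast" if the cheap supervisor decided, "llm" if escalated


def hybrid_monitor(
    prompt: str,
    fast_score: Callable[[str], float],
    llm_judge: Callable[[str], bool],
    low: float = 0.2,   # hypothetical threshold: at or below this, treat as benign
    high: float = 0.8,  # hypothetical threshold: at or above this, treat as harmful
) -> Verdict:
    """Use the cheap score when it is confident, otherwise ask the LLM judge."""
    score = fast_score(prompt)
    if score <= low:
        return Verdict(False, "fast")
    if score >= high:
        return Verdict(True, "fast")
    return Verdict(llm_judge(prompt), "llm")


# Toy stand-ins, for illustration only. A real system would call trained models.
def toy_fast_score(prompt: str) -> float:
    text = prompt.lower()
    if any(marker in text for marker in ("imagine", "story", "layer")):
        return 0.5  # convoluted narrative wrapper: the cheap model is unsure
    return 0.9 if "bomb" in text else 0.1


def toy_llm_judge(prompt: str) -> bool:
    # Placeholder for an LLM that looks past the wrapper at the actual request.
    return "bomb" in prompt.lower()
```

With these stand-ins, a narrative prompt is escalated whatever its payload, so the pancake version and the bomb version get different verdicts even though they share the same wrapper. Direct queries never reach the expensive judge.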
