LESSWRONG
LW

Hadrien
27Ω5110
Message
Dialogue
Subscribe

Hadrien Mariaccia

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
The bitter lesson of misuse detection
Hadrien2mo10

There are some use cases where a user might use convoluted prompts for benign purposes (fiction writing, etc.). An example from the dataset is the deepinception template that could be used for a wide range of requests that wouldn't be harmful. A proper jailbreak classifier should be able to discriminate between a convoluted narrative prompt asking for a pancake recipe and one asking how to make a bomb.

I agree with your second point, but based on our results, the reliability of existing monitoring systems for content moderation is very low. Even Llama Guard 4 performs poorly on a significant set of harmful direct queries.

The non garbage approach seems therefore to find a way to leverage both the general capabilities of LLMs and the low latency/cost of supervisors into a hybrid approach like Anthropic's constitutional classifiers (we would still need to assess their robustness on public benchmarks, as they are not publicly available). If they are as effective as claimed, Anthropic could probably make money selling monitoring systems built on top of them. 

Reply
34The bitter lesson of misuse detection
Ω
2mo
Ω
6