Takeaways from the NeurIPS 2023 Trojan Detection Competition

mikes

13th Jan 2024

This is a linkpost for https://confirmlabs.org/posts/TDC2023

This linkpost summarizes our research takeaways as participants in (and winners of one of the four tracks of) the NeurIPS 2023 Trojan Detection Competition, a competition on red-teaming LLMs and reverse-engineering planted exploits.

2 comments

Daniel Paleka

“With Greedy Coordinate Gradient (GCG) optimization, when trying to force argmax-generated completions, using an improved objective function dramatically increased our optimizer’s performance.”

Do you have some data / plots here?

mikes

Good question. We just ran a test to check.

We tried forcing the 80 target strings with 4 different input seeds each, once using basic GCG and once using GCG with the mellowmax objective.

(Iterations are capped at 50; a run counts as unsuccessful if the target is not forced by then.)
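To make the success criterion concrete, here is a minimal sketch of how such a run could be scored. It is not our actual harness: `forcing_run`, `success_rate`, and the toy optimizers are hypothetical names invented for illustration, and the real optimizer step (GCG plus a greedy decode from the model) is replaced by a stand-in callable. The only facts taken from the setup above are the 50-iteration cap and the 80 x 4 = 320 runs per method.

```python
from typing import Callable, List, Optional

MAX_ITERS = 50                      # iteration cap from the setup above
NUM_TARGETS, NUM_SEEDS = 80, 4      # 80 target strings x 4 input seeds
RUNS_PER_METHOD = NUM_TARGETS * NUM_SEEDS


def forcing_run(step: Callable[[int], str], target: str,
                max_iters: int = MAX_ITERS) -> Optional[int]:
    """Return the first iteration at which the completion equals the target.

    `step` is a hypothetical stand-in for one optimizer iteration followed
    by a greedy (argmax) decode; it returns the current completion string.
    Returns None if the target is not forced within `max_iters`.
    """
    for it in range(1, max_iters + 1):
        if step(it) == target:
            return it
    return None


def success_rate(results: List[Optional[int]]) -> float:
    """Fraction of runs that forced their target before the cap."""
    return sum(r is not None for r in results) / len(results)


# Toy stand-ins: one "optimizer" succeeds at iteration 10, one never does.
succeeds = lambda it: "target" if it >= 10 else "other"
never = lambda it: "other"

results = [forcing_run(succeeds, "target"), forcing_run(never, "target")]
print(RUNS_PER_METHOD, results, success_rate(results))
```

With 320 runs per method, a success rate under 1/8 means fewer than 40 forced targets, and a rate over 1/5 means more than 64.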

We observe that using the mellowmax objective nearly doubles the number of "working" forcing runs, from <1/8 success to >1/5 success.
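For readers unfamiliar with mellowmax, the operator is mm_ω(x) = (1/ω) · log((1/n) · Σ exp(ω · x_i)), a smooth interpolation between the mean (ω → 0) and the max (ω → ∞). The sketch below is illustrative only: it assumes the operator is applied to per-token negative log-likelihoods of the target in place of the mean used by cross-entropy, and the values in `token_nlls` and the choice `omega=1.0` are made up, not taken from our runs.

```python
import math
from typing import Sequence


def mellowmax(values: Sequence[float], omega: float = 1.0) -> float:
    """mm_omega(x) = (1/omega) * log(mean(exp(omega * x_i))).

    Computed with the log-sum-exp trick for numerical stability.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if omega == 0.0:
        # The limit omega -> 0 is the arithmetic mean.
        return sum(values) / len(values)
    scaled = [omega * v for v in values]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return (log_sum_exp - math.log(len(values))) / omega


def mean_loss(values: Sequence[float]) -> float:
    """Plain mean, as in the usual cross-entropy objective."""
    return sum(values) / len(values)


# Hypothetical per-token negative log-likelihoods of a target string:
# three tokens are nearly forced, one is far from it.
token_nlls = [0.1, 0.2, 3.0, 0.1]

print(mean_loss(token_nlls))             # averages the hard token away
print(mellowmax(token_nlls, omega=1.0))  # weights the hard token more heavily
```

Under this assumption, the intuition is that a completion is only forced if every target token is the argmax, so an objective that emphasizes the worst token tracks success more closely than the mean does.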

Now, skeptically, it is possible that our task setup favors any unusual objective: the organizers did some adversarial training against GCG with cross-entropy loss, so doing "something different" might help on its own. The setup might also put the task in the range of "just hard enough" that improvements appear quite helpful.

But the improvement in forcing success seems pretty big to us.

Subjectively, we also recall significant improvements on red-teaming, which used Llama-2 and was not adversarially trained in quite the same way.
