Thank you for you post, this is very interesting.
Generally, I feel like DSPy GEPA exchanges the problem of prompt optimization for designing a score function and structured feedback to the reflection model, which is a great trade if there is a reasonable score function.
I used your code and experimented with different scoring functions. My notes below are based on using GPT 4.1-nano as a monitor.
1. Clustering/score function:
I got the best separation of honest and backdoored code by using the MSE error between the score and true score (0 for hoenst and...
(but do not seem to contain hidden logic for malicious intent)
Nitpick: I am concerned that this is an artifact of the artificiality of the setting, aka the monitor taking it's role too literally by looking for malicious backdoors. Instead, the monitor should just be looking for whether there exists some input that results in failing the unit tests. Anecdotally I found that is helps not to mention the term backdoor at all in the monitoring prompt or very explicitly explaining in detail that we mean by backdoor.
Also I am curious whether you plan to mak...
This work was produced during MARS and SPAR. arXiv version available at https://arxiv.org/abs/2507.02559. Code on GitHub and models on HuggingFace.
TL;DR we scaled LayerNorm (LN) removal by fine-tuning to GPT-2 XL:
We provide drop-in replacements for GPT-2 small, medium, large, and XL on HuggingFace for all your interpretability needs!
LN has...
Very interesting work, thanks for sharing your write-up (:
A few thoughts in no particular order:
1. I really like your attack taxonomy. I'd love to see that fleshed out more generally and also think multi-agent attacks should be included.
2. Mode collapse: I think this can be stressed even more. AFAIK all publicly available attack datasets (control tax, tylord) have an insanely high percentage of backdoors that are "if statements", I'd not be surprised if that number is close to 100% and I think it's a real problem. I also ran into this issue in my rec... (read more)