Joachim Schaeffer — LessWrong

Thank you for you post, this is very interesting.

Generally, I feel like DSPy GEPA exchanges the problem of prompt optimization for designing a score function and structured feedback to the reflection model, which is a great trade if there is a reasonable score function.

I used your code and experimented with different scoring functions. My notes below are based on using GPT 4.1-nano as a monitor.

1. Clustering/score function:
I got the best separation of honest and backdoored code by using the MSE error between the score and true score (0 for hoenst and 10 for backdoored code). Maybe the feedback function caused some issue in your experiments. I feel like changing the feedback function or tweaking the scoring function can make a lot of difference.

2. Overfitting due to detailed examples in the prompt:
I found that using DSPy GEPA to generate a prompt and then subsequently ablating that prompt by cutting out all examples or overly specific descriptions of backdoors is a good strategy for reducing the risk of overfitting. Interestingly, the performance reduction for my monitor was fairly low (-0.03 AUROC).

3. Score range nitpick:
Your figure shows that the model doesn't effectively use your score range. The bars show ~10 different values the monitor picks. In my experience a score range from 0-10 works well and addresses the issue of giving the model too much precision. But in the end, this might not matter that much.

Thanks again for doing this work and sharing the code.

(but do not seem to contain hidden logic for malicious intent)

Nitpick: I am concerned that this is an artifact of the artificiality of the setting, aka the monitor taking it's role too literally by looking for malicious backdoors. Instead, the monitor should just be looking for whether there exists some input that results in failing the unit tests. Anecdotally I found that is helps not to mention the term backdoor at all in the monitoring prompt or very explicitly explaining in detail that we mean by backdoor.

Also I am curious whether you plan to make the attack policy backdoor coding scaffold publicly available? Maybe even as a contribution to ControlArena? I think this would be great as it is non-trivial to generate good backdoors.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments